1 Install R and RStudio

Make sure you install both R and RStudio for this workshop.

  1. Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows)

  2. Download and install RStudio from https://rstudio.com/products/rstudio/download/#download

If you already have R and RStudio installed on your computer, make sure your R version is 4.0 or higher by entering sessionInfo() in your console.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.2  magrittr_1.5    tools_4.0.2     htmltools_0.5.0
##  [5] yaml_2.2.1      stringi_1.5.3   rmarkdown_2.4.7 knitr_1.30     
##  [9] stringr_1.4.0   xfun_0.18       digest_0.6.26   rlang_0.4.8    
## [13] evaluate_0.14
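
Alternatively, you can check the version programmatically. Here is a minimal check using base R’s getRversion(); it should return TRUE on a suitable installation:

# returns TRUE if the installed R version is at least 4.0
getRversion() >= "4.0"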

2 Install and Load Libraries

For this workshop, we will be using three R packages. Make sure you install these packages before proceeding.

install.packages("tidytext")
install.packages("tidyverse")
install.packages("rvest")

Once the packages are installed, you can load them using library().

library(tidytext)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding

3 Scrape Amazon Reviews

We will be scraping Amazon reviews to build a corpus. In addition to the text of each review, we will be scraping the rating (number of stars out of 5 stars) of each review.

3.1 Scrape one page

To scrape an individual page of reviews, first find the product whose reviews you want to scrape. I chose the AmazonBasics Pre-sharpened Wood Cased #2 HB Pencils. Scroll all the way down, past the reviews, and click on See all reviews. This will direct you to the first page of reviews. Each page contains a total of 10 reviews. Click on Next page and note how the url changes. Go back to the first page, copy the url, and create a url variable with it.

# set url to desired review website
url <- "https://www.amazon.com/AmazonBasics-Pre-sharpened-Wood-Cased-Pencils/product-reviews/B071JM699P/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1"

We can now read the html in from this url using read_html.

# download website
amazon_reviews <- read_html(url)

Inspecting the page source of our Amazon review page, we see that the node we want has class="review-text-content". Our next step is to parse the html code for these nodes.

# get nodes for review text (all 10)
review_text <- amazon_reviews %>%
  html_nodes(".review-text-content") %>%
  html_text()

We will also parse the html for the stars, which have class="a-icon-alt". This selector also matches three extra ratings at the top of the page, so we need to keep only the last 10 ratings using tail().

# get nodes for rating (last 10)
review_rate <- amazon_reviews %>%
  html_nodes(".a-icon-alt") %>%
  html_text() %>%
  tail(10)
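
As a quick sanity check (a sketch; the exact texts depend on the page you scraped), both vectors should have length 10 so that each review text lines up with its rating:

# both should be 10
length(review_text)
length(review_rate)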

We can now combine the text and the rate (i.e., stars) in one data frame.

# build data frame
pencil_reviews <- data.frame(text = review_text,
                             rate = review_rate)

# inspect data
glimpse(pencil_reviews)
## Rows: 10
## Columns: 2
## $ text <chr> "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  Great price for order decen…
## $ rate <chr> "4.0 out of 5 stars", "1.0 out of 5 stars", "3.0 out of 5 stars"…

3.2 Scrape multiple pages

Since Amazon review pages include the page number in the url (i.e., pageNumber=1), you can scrape multiple pages by changing the page number and repeating the steps above for each page.

# do the same as above, but with 200 pages
# start with an empty data frame
pencil_reviews <- data.frame()

# base url has everything but the last number, which indicates the page number
base_url <- "https://www.amazon.com/AmazonBasics-Pre-sharpened-Wood-Cased-Pencils/product-reviews/B071JM699P/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber="

# create a for loop for page number
for (i in 1:200){
  # add base url to page number
 url <- paste0(base_url, i)
 
 # read in html file
 amazon_reviews <- read_html(url)
 
 # get nodes
 review_text <- amazon_reviews %>%
   html_nodes(".review-text-content") %>%
   html_text()
 
 review_rate <- amazon_reviews %>%
   html_nodes(".a-icon-alt") %>%
   html_text() %>%
   tail(10)
 
 # bind rows of this page data frame with the rest of the data
 pencil_reviews <- bind_rows(pencil_reviews,
                             data.frame(text = review_text,
                                        rate = review_rate)
                             )
}
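
If you run the loop yourself, it is worth saving the result so you do not have to scrape again. A minimal sketch using readr’s write_csv(), assuming a data/ folder already exists in your working directory (the file name matches the one read in below):

# save the scraped reviews for later sessions
write_csv(pencil_reviews, "data/pencil_reviews.csv")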

4 Clean Data

The for loop above takes a couple of minutes to run, so here’s how to read in the pre-scraped and saved data instead.

# read data in (so we don't have to run the for loop above during the workshop)
pencil_reviews <- read_csv("data/pencil_reviews.csv")
## Parsed with column specification:
## cols(
##   text = col_character(),
##   rate = col_character()
## )

Always inspect your data.

# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  Great price for order decen…
## $ rate <chr> "4.0 out of 5 stars", "1.0 out of 5 stars", "3.0 out of 5 stars"…
head(pencil_reviews)
## # A tibble: 6 x 2
##   text                                                           rate           
##   <chr>                                                          <chr>          
## 1 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  Great price for order de… 4.0 out of 5 s…
## 2 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  As a teacher I need penc… 1.0 out of 5 s…
## 3 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  I got these pre-sharpene… 3.0 out of 5 s…
## 4 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  I was so excited to buy … 1.0 out of 5 s…
## 5 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  These are not great.  Th… 3.0 out of 5 s…
## 6 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  These pencils are hard t… 1.0 out of 5 s…

We need to remove line breaks (\n).

# replace \n with a space
pencil_reviews %>%
  mutate(text = gsub("\\n", " ", text)) %>%
  head()
## # A tibble: 6 x 2
##   text                                                           rate           
##   <chr>                                                          <chr>          
## 1 "                       Great price for order decent pencils … 4.0 out of 5 s…
## 2 "                       As a teacher I need pencils to last, … 1.0 out of 5 s…
## 3 "                       I got these pre-sharpened pencils for… 3.0 out of 5 s…
## 4 "                       I was so excited to buy these for my … 1.0 out of 5 s…
## 5 "                       These are not great.  The lead breaks… 3.0 out of 5 s…
## 6 "                       These pencils are hard to sharpen, th… 1.0 out of 5 s…
# replace \n as before, then trim the remaining leading white space
pencil_reviews %>%
  mutate(text = gsub("\\n", " ", text)) %>%
  mutate(text = trimws(text)) %>%
  head()
## # A tibble: 6 x 2
##   text                                                           rate           
##   <chr>                                                          <chr>          
## 1 Great price for order decent pencils although I'm a pencil sn… 4.0 out of 5 s…
## 2 As a teacher I need pencils to last, the first pencil I sharp… 1.0 out of 5 s…
## 3 I got these pre-sharpened pencils for convenience, because 50… 3.0 out of 5 s…
## 4 I was so excited to buy these for my classroom, but I would n… 1.0 out of 5 s…
## 5 These are not great.  The lead breaks constantly.  I always c… 3.0 out of 5 s…
## 6 These pencils are hard to sharpen, they break easy, and they … 1.0 out of 5 s…
# looks good, overwrite data
pencil_reviews <- pencil_reviews %>%
  mutate(text = gsub("\\n", " ", text)) %>%
  mutate(text = trimws(text))
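
A quick check that no line breaks are left (a sketch; it should return FALSE):

# are there any newline characters left in the text?
any(grepl("\n", pencil_reviews$text, fixed = TRUE))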

Let’s remove "out of 5 stars" from rate.

# replace "out of 5 stars" with nothing
pencil_reviews %>%
  mutate(rate = sub("out of 5 stars", "", rate)) %>%
  select(rate) %>%
  head()
## # A tibble: 6 x 1
##   rate  
##   <chr> 
## 1 "4.0 "
## 2 "1.0 "
## 3 "3.0 "
## 4 "1.0 "
## 5 "3.0 "
## 6 "1.0 "
# looks good, overwrite data
pencil_reviews <- pencil_reviews %>%
  mutate(rate = sub("out of 5 stars", "", rate))

# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "Great price for order decent pencils although I'm a pencil snob…
## $ rate <chr> "4.0 ", "1.0 ", "3.0 ", "1.0 ", "3.0 ", "1.0 ", "1.0 ", "3.0 ", …

The variable rate is a chr type (character or string). We need it to be a numeric data type.

# mutate rate to be a number
pencil_reviews <- pencil_reviews %>%
  mutate(rate = parse_number(rate))

# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "Great price for order decent pencils although I'm a pencil snob…
## $ rate <dbl> 4, 1, 3, 1, 3, 1, 1, 3, 5, 5, 4, 1, 1, 5, 5, 5, 5, 1, 5, 3, 5, 5…

We can now compute summary statistics, such as the mean of rate:

mean(pencil_reviews$rate)
## [1] 4.408
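
Other summaries work the same way; for example, a quick sketch using dplyr’s summarise() to get the number of reviews, the mean, and the median rating in one step:

# summarise the ratings
pencil_reviews %>%
  summarise(n_reviews = n(),
            mean_rate = mean(rate),
            median_rate = median(rate))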

5 Tokenize the Text

Before we tokenize each review text, we need to create a unique review ID to calculate range later.

# add new review_id column
pencil_reviews <- pencil_reviews %>%
  mutate(review_id = row_number())

Now we can tokenize the text while keeping track of which words belong to which review.

# tokenize words
pencil_reviews_tokenized <- pencil_reviews %>%
  unnest_tokens(word, text)

# inspect data
pencil_reviews_tokenized %>%
  head()
## # A tibble: 6 x 3
##    rate review_id word   
##   <dbl>     <int> <chr>  
## 1     4         1 great  
## 2     4         1 price  
## 3     4         1 for    
## 4     4         1 order  
## 5     4         1 decent 
## 6     4         1 pencils

In addition to individual word tokenization, unnest_tokens() offers a number of tokenization formats, including n-grams. Here’s how to get bigrams.

# tokenize bigrams
pencil_reviews_bigrams <- pencil_reviews %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2)

# inspect data
pencil_reviews_bigrams %>%
  head()
## # A tibble: 6 x 3
##    rate review_id ngram       
##   <dbl>     <int> <chr>       
## 1     1         2 as a        
## 2     1         2 a teacher   
## 3     1         2 teacher i   
## 4     1         2 i need      
## 5     1         2 need pencils
## 6     1         2 pencils to

We will be working with individual words from this point on.

5.1 Remove Stop Words

Stop words are words that are very common in a language but might not carry a lot of meaning, like function words. Stop word lists often include pronouns, modals, and frequent adverbs as well.

# check stop_words data frame
stop_words %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 728 x 2
##    word        n
##    <chr>   <int>
##  1 down        4
##  2 would       4
##  3 a           3
##  4 about       3
##  5 above       3
##  6 after       3
##  7 again       3
##  8 against     3
##  9 all         3
## 10 an          3
## # … with 718 more rows
# the smallest lexicon is snowball
stop_words %>%
  count(lexicon)
## # A tibble: 3 x 2
##   lexicon      n
##   <chr>    <int>
## 1 onix       404
## 2 SMART      571
## 3 snowball   174

Let’s filter the stop words to keep only words from the snowball lexicon.

my_stop_words <- stop_words %>%
  filter(lexicon == "snowball")

We now use this filtered data frame with anti_join() to keep only the words that are not in the stop words list.

# remove stop words from pencil reviews tokenized
pencil_reviews_clean <- pencil_reviews_tokenized %>%
  anti_join(my_stop_words)
## Joining, by = "word"
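
To see how much the stop word filter removed, we can compare the number of rows before and after (a quick sketch):

# number of tokens dropped by the stop word filter
nrow(pencil_reviews_tokenized) - nrow(pencil_reviews_clean)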

6 Count Tokens per Rating

With our data clean, we can start counting words.

# most frequent tokens per rate
pencil_reviews_clean %>%
  count(word, rate) %>%
  arrange(-n)
## # A tibble: 5,719 x 3
##    word       rate     n
##    <chr>     <dbl> <int>
##  1 pencils       5   845
##  2 great         5   519
##  3 sharpened     5   339
##  4 good          5   324
##  5 school        5   252
##  6 pencil        5   203
##  7 quality       5   190
##  8 price         5   181
##  9 pencils       1   159
## 10 pre           5   153
## # … with 5,709 more rows

Plotting the data makes it easier to compare frequent tokens across different ratings.

pencil_reviews_clean %>%
  count(word, rate) %>%
  group_by(rate) %>%
  top_n(10) %>%
  ggplot(aes(x = n, 
             y = reorder_within(word, n, rate))) +
  geom_col() +
  facet_wrap(~rate, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by n

Issues: the data are unbalanced across ratings, and raw counts don’t tell us which words are important to each category.

How many tokens per rate?

# count number of tokens (i.e., rows) per rate
pencil_reviews_clean %>%
  count(rate)
## # A tibble: 5 x 2
##    rate     n
##   <dbl> <int>
## 1     1  2870
## 2     2  1163
## 3     3  1594
## 4     4  2124
## 5     5 15186

We can normalize the counts (here, per 1,000 tokens in each rating).

# get total word count per rate, rename n so it's total instead
rate_word_count <- pencil_reviews_clean %>%
  count(rate) %>%
  rename(total = n)

# get count for each word per rate
word_count_per_rate <- pencil_reviews_clean %>%
  count(word, rate)

# normalize individual word count
# first merge two counts
word_count_normalized <- left_join(word_count_per_rate,
                                   rate_word_count)
## Joining, by = "rate"
# create a new normalized count column, with n divided by total
word_count_normalized <- word_count_normalized %>%
  mutate(norm_n = (n/total)*1000)

Plotting it again, this time by normalized frequency instead of raw counts.

word_count_normalized %>%
  group_by(rate) %>%
  top_n(10) %>%
  ggplot(aes(x = norm_n, 
             y = reorder_within(word, norm_n, rate))) +
  geom_col() +
  facet_wrap(~rate, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by norm_n

We can calculate term frequency inverse document frequency (tf-idf) instead of normalized frequency. The goal in using tf-idf is to decrease the weight for commonly used words (i.e., words used across all documents) and increase the weight for words that are heavily used in one document but rare in the others. Here, each rating (1 to 5 stars) is treated as a document.

# calculate tf-idf based on n, providing the word column and the category col
word_tf_idf <- word_count_normalized  %>%
  bind_tf_idf(word, rate, n)

# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 8
##   word    rate     n total norm_n        tf   idf   tf_idf
##   <chr>  <dbl> <int> <int>  <dbl>     <dbl> <dbl>    <dbl>
## 1 0          3     1  1594 0.627  0.000627  1.61  0.00101 
## 2 0,1011     5     1 15186 0.0659 0.0000659 1.61  0.000106
## 3 0.08       5     1 15186 0.0659 0.0000659 1.61  0.000106
## 4 1          1     3  2870 1.05   0.00105   0.223 0.000233
## 5 1          2     2  1163 1.72   0.00172   0.223 0.000384
## 6 1          3     2  1594 1.25   0.00125   0.223 0.000280
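
To make the weighting concrete, here is a small sketch that recomputes tf and idf by hand for one word and compares them with the columns produced by bind_tf_idf(). It assumes, as the output above suggests, that tf is the word’s count divided by the total tokens in that rating, and that idf is the natural log of the number of ratings divided by the number of ratings the word appears in:

# sanity check: recompute tf and idf by hand for the word "1"
n_ratings <- n_distinct(word_count_per_rate$rate)

word_tf_idf %>%
  filter(word == "1") %>%
  # after filtering, the number of rows is the number of ratings containing "1"
  mutate(tf_manual     = n / total,
         idf_manual    = log(n_ratings / n()),
         tf_idf_manual = tf_manual * idf_manual) %>%
  select(rate, tf, tf_manual, idf, idf_manual, tf_idf, tf_idf_manual)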

We can also add range (the number of different reviews a word appears in) to decide which words to keep and to understand tf-idf a little better.

# calculate range per word
word_range <- pencil_reviews_clean %>%
  distinct(word, review_id) %>%
  count(word) %>%
  rename(range = n)

# add range to data frame with left_join
word_tf_idf <- left_join(word_tf_idf, word_range)
## Joining, by = "word"
# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 9
##   word    rate     n total norm_n        tf   idf   tf_idf range
##   <chr>  <dbl> <int> <int>  <dbl>     <dbl> <dbl>    <dbl> <int>
## 1 0          3     1  1594 0.627  0.000627  1.61  0.00101      1
## 2 0,1011     5     1 15186 0.0659 0.0000659 1.61  0.000106     1
## 3 0.08       5     1 15186 0.0659 0.0000659 1.61  0.000106     1
## 4 1          1     3  2870 1.05   0.00105   0.223 0.000233    14
## 5 1          2     2  1163 1.72   0.00172   0.223 0.000384    14
## 6 1          3     2  1594 1.25   0.00125   0.223 0.000280    14
# what's the mean range?
mean(word_tf_idf$range)
## [1] 13.74121

Plotting it again, this time by tf-idf, keeping only words that appear in more than 20 reviews.

word_tf_idf %>%
  filter(range > 20) %>%
  group_by(rate) %>%
  top_n(n = 10, wt = tf_idf) %>%
  ggplot(aes(x = tf_idf, 
             y = reorder_within(word, tf_idf, rate))) +
  geom_col() +
  facet_wrap(~rate, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")