1 Install R and RStudio

Make sure you install both R and RStudio for this workshop.

  1. Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows)

  2. Download and install RStudio from https://rstudio.com/products/rstudio/download/#download

If you already have R and RStudio installed on your computer, make sure your R version is 4.0 or higher by entering sessionInfo() in your console.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.2  magrittr_1.5    tools_4.0.2     htmltools_0.5.0
##  [5] yaml_2.2.1      stringi_1.5.3   rmarkdown_2.4.7 knitr_1.30     
##  [9] stringr_1.4.0   xfun_0.18       digest_0.6.26   rlang_0.4.8    
## [13] evaluate_0.14
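
Alternatively, you can check the version programmatically. Here is a minimal check using base R’s getRversion(); it should return TRUE on a suitable installation:

# returns TRUE if the installed R version is at least 4.0
getRversion() >= "4.0"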

2 Install and Load Libraries

For this workshop, we will be using three R packages. Make sure you install these packages before proceeding.

install.packages("tidytext")
install.packages("tidyverse")
install.packages("rvest")

Once the packages are installed, you can load them using library().

library(tidytext)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding

3 Scrape Amazon Reviews

We will be scraping Amazon reviews to build a corpus. In addition to the text of each review, we will be scraping the rating (number of stars out of 5 stars) of each review.

3.1 Scrape one page

To scrape an individual page of reviews, first find the product whose reviews you want to scrape. I chose the AmazonBasics Pre-sharpened Wood Cased #2 HB Pencils. Scroll all the way down, past the reviews, and click on See all reviews. This will direct you to the first page of reviews. Each page contains a total of 10 reviews. Click on Next page and note how the url changes. Go back to the first page, copy the url, and create a url variable with it.

# set url to desired review website
url <- "https://www.amazon.com/AmazonBasics-Pre-sharpened-Wood-Cased-Pencils/product-reviews/B071JM699P/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1"

We can now read the html in from this url using read_html.

# download website
amazon_reviews <- read_html(url)

Inspecting the page source of our Amazon review page, we see that the node we want has class="review-text-content". Our next step is to parse the html code for these nodes.

# get nodes for review text (all 10)
review_text <- amazon_reviews %>%
  html_nodes(".review-text-content") %>%
  html_text()

We will also parse the html for the stars, which have class="a-icon-alt". This selector also matches three extra ratings at the top of the page, so we need to keep only the last 10 ratings using tail().

# get nodes for rating (last 10)
review_rate <- amazon_reviews %>%
  html_nodes(".a-icon-alt") %>%
  html_text() %>%
  tail(10)
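
As a quick sanity check (a sketch; the exact texts depend on the page you scraped), both vectors should have length 10 so that each review text lines up with its rating:

# both should be 10
length(review_text)
length(review_rate)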

We can now combine the text and the rate (i.e., stars) in one data frame.

# build data frame
pencil_reviews <- data.frame(text = review_text,
                             rate = review_rate)

# inspect data
glimpse(pencil_reviews)
## Rows: 10
## Columns: 2
## $ text <chr> "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  Great price for order decen…
## $ rate <chr> "4.0 out of 5 stars", "1.0 out of 5 stars", "3.0 out of 5 stars"…

3.2 Scrape multiple pages

Since Amazon review pages include the page number in the url (i.e., pageNumber=1), you can scrape multiple pages by changing the page number and repeating the steps above for each page.

# do the same as above, but with 200 pages
# start with an empty data frame
pencil_reviews <- data.frame()

# base url has everything but the last number, which indicates the page number
base_url <- "https://www.amazon.com/AmazonBasics-Pre-sharpened-Wood-Cased-Pencils/product-reviews/B071JM699P/ref=cm_cr_getr_d_paging_btm_prev_1?ie=UTF8&reviewerType=all_reviews&pageNumber="

# create a for loop for page number
for (i in 1:200){
  # add base url to page number
 url <- paste0(base_url, i)
 
 # read in html file
 amazon_reviews <- read_html(url)
 
 # get nodes
 review_text <- amazon_reviews %>%
   html_nodes(".review-text-content") %>%
   html_text()
 
 review_rate <- amazon_reviews %>%
   html_nodes(".a-icon-alt") %>%
   html_text() %>%
   tail(10)
 
 # bind rows of this page data frame with the rest of the data
 pencil_reviews <- bind_rows(pencil_reviews,
                             data.frame(text = review_text,
                                        rate = review_rate)
                             )
}
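
If you run the loop yourself, it is worth saving the result so you do not have to scrape again. A minimal sketch using readr’s write_csv(), assuming a data/ folder already exists in your working directory (the file name matches the one read in below):

# save the scraped reviews for later sessions
write_csv(pencil_reviews, "data/pencil_reviews.csv")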

4 Clean Data

The for loop above takes a couple of minutes to run, so here’s how to read in the pre-scraped and saved data instead.

# read data in (so we don't have to run the for loop above during the workshop)
pencil_reviews <- read_csv("data/pencil_reviews.csv")
## Parsed with column specification:
## cols(
##   text = col_character(),
##   rate = col_character()
## )

Always inspect your data.

# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  Great price for order decen…
## $ rate <chr> "4.0 out of 5 stars", "1.0 out of 5 stars", "3.0 out of 5 stars"…
head(pencil_reviews)
## # A tibble: 6 x 2
##   text                                                           rate           
##   <chr>                                                          <chr>          
## 1 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  Great price for order de… 4.0 out of 5 s…
## 2 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  As a teacher I need penc… 1.0 out of 5 s…
## 3 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  I got these pre-sharpene… 3.0 out of 5 s…
## 4 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  I was so excited to buy … 1.0 out of 5 s…
## 5 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  These are not great.  Th… 3.0 out of 5 s…
## 6 "\n\n\n\n\n\n\n\n\n\n  \n  \n    \n  These pencils are hard t… 1.0 out of 5 s…

We need to remove line breaks (\n).

# replace \n with a space
pencil_reviews %>%
  mutate(text = gsub("\\n", " ", text)) %>%
  head()
## # A tibble: 6 x 2
##   text                                                           rate           
##   <chr>                                                          <chr>          
## 1 "                       Great price for order decent pencils … 4.0 out of 5 s…
## 2 "                       As a teacher I need pencils to last, … 1.0 out of 5 s…
## 3 "                       I got these pre-sharpened pencils for… 3.0 out of 5 s…
## 4 "                       I was so excited to buy these for my … 1.0 out of 5 s…
## 5 "                       These are not great.  The lead breaks… 3.0 out of 5 s…
## 6 "                       These pencils are hard to sharpen, th… 1.0 out of 5 s…
# replace \n as before, then trim the remaining leading white space
pencil_reviews %>%
  mutate(text = gsub("\\n", " ", text)) %>%
  mutate(text = trimws(text)) %>%
  head()
## # A tibble: 6 x 2
##   text                                                           rate           
##   <chr>                                                          <chr>          
## 1 Great price for order decent pencils although I'm a pencil sn… 4.0 out of 5 s…
## 2 As a teacher I need pencils to last, the first pencil I sharp… 1.0 out of 5 s…
## 3 I got these pre-sharpened pencils for convenience, because 50… 3.0 out of 5 s…
## 4 I was so excited to buy these for my classroom, but I would n… 1.0 out of 5 s…
## 5 These are not great.  The lead breaks constantly.  I always c… 3.0 out of 5 s…
## 6 These pencils are hard to sharpen, they break easy, and they … 1.0 out of 5 s…
# looks good, overwrite data
pencil_reviews <- pencil_reviews %>%
  mutate(text = gsub("\\n", " ", text)) %>%
  mutate(text = trimws(text))
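
A quick check that no line breaks are left (a sketch; it should return FALSE):

# are there any newline characters left in the text?
any(grepl("\n", pencil_reviews$text, fixed = TRUE))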

Let’s remove "out of 5 stars" from rate.

# replace "out of 5 stars" with nothing
pencil_reviews %>%
  mutate(rate = sub("out of 5 stars", "", rate)) %>%
  select(rate) %>%
  head()
## # A tibble: 6 x 1
##   rate  
##   <chr> 
## 1 "4.0 "
## 2 "1.0 "
## 3 "3.0 "
## 4 "1.0 "
## 5 "3.0 "
## 6 "1.0 "
# looks good, overwrite data
pencil_reviews <- pencil_reviews %>%
  mutate(rate = sub("out of 5 stars", "", rate))

# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "Great price for order decent pencils although I'm a pencil snob…
## $ rate <chr> "4.0 ", "1.0 ", "3.0 ", "1.0 ", "3.0 ", "1.0 ", "1.0 ", "3.0 ", …

The variable rate is a chr type (character or string). We need it to be a numeric data type.

# mutate rate to be a number
pencil_reviews <- pencil_reviews %>%
  mutate(rate = parse_number(rate))

# inspect data
glimpse(pencil_reviews)
## Rows: 2,000
## Columns: 2
## $ text <chr> "Great price for order decent pencils although I'm a pencil snob…
## $ rate <dbl> 4, 1, 3, 1, 3, 1, 1, 3, 5, 5, 4, 1, 1, 5, 5, 5, 5, 1, 5, 3, 5, 5…

We can now compute summary statistics, such as the mean of rate:

mean(pencil_reviews$rate)
## [1] 4.408
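
Other summaries work the same way; for example, a quick sketch using dplyr’s summarise() to get the number of reviews, the mean, and the median rating in one step:

# summarise the ratings
pencil_reviews %>%
  summarise(n_reviews = n(),
            mean_rate = mean(rate),
            median_rate = median(rate))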

5 Tokenize the Text

Before we tokenize each review text, we need to create a unique review ID to calculate range later.

# add new review_id column
pencil_reviews <- pencil_reviews %>%
  mutate(review_id = row_number())

Now we can tokenize the text while keeping track of which words belong to which review.

# tokenize words
pencil_reviews_tokenized <- pencil_reviews %>%
  unnest_tokens(word, text)

# inspect data
pencil_reviews_tokenized %>%
  head()
## # A tibble: 6 x 3
##    rate review_id word   
##   <dbl>     <int> <chr>  
## 1     4         1 great  
## 2     4         1 price  
## 3     4         1 for    
## 4     4         1 order  
## 5     4         1 decent 
## 6     4         1 pencils

In addition to individual word tokenization, unnest_tokens() offers a number of tokenization formats, including n-grams. Here’s how to get bigrams.

# tokenize bigrams
pencil_reviews_bigrams <- pencil_reviews %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2)

# inspect data
pencil_reviews_bigrams %>%
  head()
## # A tibble: 6 x 3
##    rate review_id ngram       
##   <dbl>     <int> <chr>       
## 1     1         2 as a        
## 2     1         2 a teacher   
## 3     1         2 teacher i   
## 4     1         2 i need      
## 5     1         2 need pencils
## 6     1         2 pencils to

We will be working with individual words from this point on.

5.1 Remove Stop Words

Stop words are words that are very common in a language but might not carry a lot of meaning, like function words. Stop word lists often include pronouns, modals, and frequent adverbs as well.

# check stop_words data frame
stop_words %>%
  count(word) %>%
  arrange(-n)
## # A tibble: 728 x 2
##    word        n
##    <chr>   <int>
##  1 down        4
##  2 would       4
##  3 a           3
##  4 about       3
##  5 above       3
##  6 after       3
##  7 again       3
##  8 against     3
##  9 all         3
## 10 an          3
## # … with 718 more rows
# the smallest lexicon is snowball
stop_words %>%
  count(lexicon)
## # A tibble: 3 x 2
##   lexicon      n
##   <chr>    <int>
## 1 onix       404
## 2 SMART      571
## 3 snowball   174

Let’s filter the stop words to keep only words from the snowball lexicon.

my_stop_words <- stop_words %>%
  filter(lexicon == "snowball")

We now use this filtered data frame with anti_join() to keep only the words that are not in the stop words list.

# remove stop words from pencil reviews tokenized
pencil_reviews_clean <- pencil_reviews_tokenized %>%
  anti_join(my_stop_words)
## Joining, by = "word"
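
To see how much the stop word filter removed, we can compare the number of rows before and after (a quick sketch):

# number of tokens dropped by the stop word filter
nrow(pencil_reviews_tokenized) - nrow(pencil_reviews_clean)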

6 Count Tokens per Rating

With our data clean, we can start counting words.

# most frequent tokens per rate
pencil_reviews_clean %>%
  count(word, rate) %>%
  arrange(-n)
## # A tibble: 5,719 x 3
##    word       rate     n
##    <chr>     <dbl> <int>
##  1 pencils       5   845
##  2 great         5   519
##  3 sharpened     5   339
##  4 good          5   324
##  5 school        5   252
##  6 pencil        5   203
##  7 quality       5   190
##  8 price         5   181
##  9 pencils       1   159
## 10 pre           5   153
## # … with 5,709 more rows

Plotting the data makes it easier to compare frequent tokens across different ratings.

pencil_reviews_clean %>%
  count(word, rate) %>%
  group_by(rate) %>%
  top_n(10) %>%
  ggplot(aes(x = n, 
             y = reorder_within(word, n, rate))) +
  geom_col() +
  facet_wrap(~rate, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by n

Issues: the data are unbalanced across ratings, and raw counts don’t tell us which words are important to each category.

How many tokens per rate?

# count number of tokens (i.e., rows) per rate
pencil_reviews_clean %>%
  count(rate)
## # A tibble: 5 x 2
##    rate     n
##   <dbl> <int>
## 1     1  2870
## 2     2  1163
## 3     3  1594
## 4     4  2124
## 5     5 15186

We can normalize the counts (here, per 1,000 tokens in each rating).

# get total word count per rate, rename n so it's total instead
rate_word_count <- pencil_reviews_clean %>%
  count(rate) %>%
  rename(total = n)

# get count for each word per rate
word_count_per_rate <- pencil_reviews_clean %>%
  count(word, rate)

# normalize individual word count
# first merge two counts
word_count_normalized <- left_join(word_count_per_rate,
                                   rate_word_count)
## Joining, by = "rate"
# create a new normalized count column, with n divided by total
word_count_normalized <- word_count_normalized %>%
  mutate(norm_n = (n/total)*1000)

Plotting it again, this time by normalized frequency instead of raw counts.

word_count_normalized %>%
  group_by(rate) %>%
  top_n(10) %>%
  ggplot(aes(x = norm_n, 
             y = reorder_within(word, norm_n, rate))) +
  geom_col() +
  facet_wrap(~rate, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")
## Selecting by norm_n

We can calculate term frequency inverse document frequency (tf-idf) instead of normalized frequency. The goal in using tf-idf is to decrease the weight for commonly used words (i.e., words used across all documents) and increase the weight for words that are heavily used in one document but rare in the others. Here, each rating (1 to 5 stars) is treated as a document.

# calculate tf-idf based on n, providing the word column and the category col
word_tf_idf <- word_count_normalized  %>%
  bind_tf_idf(word, rate, n)

# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 8
##   word    rate     n total norm_n        tf   idf   tf_idf
##   <chr>  <dbl> <int> <int>  <dbl>     <dbl> <dbl>    <dbl>
## 1 0          3     1  1594 0.627  0.000627  1.61  0.00101 
## 2 0,1011     5     1 15186 0.0659 0.0000659 1.61  0.000106
## 3 0.08       5     1 15186 0.0659 0.0000659 1.61  0.000106
## 4 1          1     3  2870 1.05   0.00105   0.223 0.000233
## 5 1          2     2  1163 1.72   0.00172   0.223 0.000384
## 6 1          3     2  1594 1.25   0.00125   0.223 0.000280
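
To make the weighting concrete, here is a small sketch that recomputes tf and idf by hand for one word and compares them with the columns produced by bind_tf_idf(). It assumes, as the output above suggests, that tf is the word’s count divided by the total tokens in that rating, and that idf is the natural log of the number of ratings divided by the number of ratings the word appears in:

# sanity check: recompute tf and idf by hand for the word "1"
n_ratings <- n_distinct(word_count_per_rate$rate)

word_tf_idf %>%
  filter(word == "1") %>%
  # after filtering, the number of rows is the number of ratings containing "1"
  mutate(tf_manual     = n / total,
         idf_manual    = log(n_ratings / n()),
         tf_idf_manual = tf_manual * idf_manual) %>%
  select(rate, tf, tf_manual, idf, idf_manual, tf_idf, tf_idf_manual)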

We can also add range (the number of different reviews a word appears in) to decide which words to keep and to understand tf-idf a little better.

# calculate range per word
word_range <- pencil_reviews_clean %>%
  distinct(word, review_id) %>%
  count(word) %>%
  rename(range = n)

# add range to data frame with left_join
word_tf_idf <- left_join(word_tf_idf, word_range)
## Joining, by = "word"
# inspect data
word_tf_idf %>%
  head()
## # A tibble: 6 x 9
##   word    rate     n total norm_n        tf   idf   tf_idf range
##   <chr>  <dbl> <int> <int>  <dbl>     <dbl> <dbl>    <dbl> <int>
## 1 0          3     1  1594 0.627  0.000627  1.61  0.00101      1
## 2 0,1011     5     1 15186 0.0659 0.0000659 1.61  0.000106     1
## 3 0.08       5     1 15186 0.0659 0.0000659 1.61  0.000106     1
## 4 1          1     3  2870 1.05   0.00105   0.223 0.000233    14
## 5 1          2     2  1163 1.72   0.00172   0.223 0.000384    14
## 6 1          3     2  1594 1.25   0.00125   0.223 0.000280    14
# what's the mean range?
mean(word_tf_idf$range)
## [1] 13.74121

Plotting it again, this time by tf-idf, keeping only words that appear in more than 20 reviews.

word_tf_idf %>%
  filter(range > 20) %>%
  group_by(rate) %>%
  top_n(n = 10, wt = tf_idf) %>%
  ggplot(aes(x = tf_idf, 
             y = reorder_within(word, tf_idf, rate))) +
  geom_col() +
  facet_wrap(~rate, scales = "free_y") +
  scale_y_reordered() +
  labs(y = "")