Before we start

This workshop is part of the LAEL Research Bazaar, which is a celebration of the golden jubilee of the Graduate Program in Applied Linguistics and Language Studies (LAEL), at the Pontifical Catholic University of São Paulo (PUCSP), Brazil.

If you arrived at this page before August 7, 2020, you can register for the synchronous Zoom session. If you arrived here at a later date, you can watch the recorded session on Facebook.

Installing R and RStudio

If you are running your R code on your computer, you need to install both R and RStudio. Alternatively, you can create a free account at http://rstudio.cloud and run your R code in the cloud. Either way, we will be using the same IDE (i.e., RStudio).

What’s an IDE? IDE stands for integrated development environment, and its goal is to facilitate coding by integrating a text editor, a console and other tools into one window.

I’ve never installed R and RStudio on my computer OR I’m not sure I have R and RStudio installed on my computer

  1. Download and install R from https://cran.r-project.org (if you are a Windows user, first determine whether you are running the 32-bit or the 64-bit version of Windows)
  2. Download and install RStudio from https://rstudio.com/products/rstudio/download/#download

I already have R and RStudio installed

  1. Open RStudio
  2. Check your R version by entering sessionInfo() in your console (see the example after this list).
  3. The latest R release was on June 22, 2020 (R version 4.0.2, “Taking Off Again”). If your R version is older than the most recent release, please follow step 1 in the previous section to update R.
  4. Check your RStudio version; if your version is older than version 1.3.x, please follow step 2 in the previous section to update RStudio.
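
For example, sessionInfo() prints your R version on its first line. A shorter check is R.version.string, which on an up-to-date installation (as of this writing) returns:

# check just the R version
R.version.string
## [1] "R version 4.0.2 (2020-06-22)"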

How often should I update R and RStudio? Always make sure that you have the latest versions of R, RStudio, and the packages you’re using in your code, so that you don’t run into bugs caused by older versions installed on your computer.

Jenny Bryan summarizes the importance of keeping your system up to date: “You will always eventually have a reason that you must update. So you can either do that very infrequently, suffer with old versions in the middle, and experience great pain at update. Or admit that maintaining your system is a normal ongoing activity, and do it more often.”


You can ensure your packages are also up to date by clicking on “Tools” in the RStudio top menu bar and selecting “Check for Package Updates…”
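
If you prefer the console, base R’s update.packages() does the same thing; the ask = FALSE argument skips the per-package confirmation prompts:

# update all installed packages without prompting for each one
update.packages(ask = FALSE)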

A word on IDEs and Installing Packages

We are using RStudio as our IDE for this workshop. You can either download and install R and RStudio on your computer (for instructions on how to do so, see the “Before we start” section) or create a free account at http://rstudio.cloud and run your R code in the cloud.

Please ensure you have the latest versions of R and RStudio; otherwise, some packages we are using for this workshop will not install correctly.

When you open RStudio, your console (i.e., where you run commands) shows up on the left. The character > in your console indicates it’s ready to receive a command. Type your command where you see > and press ENTER (a.k.a. RETURN).
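
For example, entering a simple expression at the prompt and pressing ENTER prints its result:

# the console evaluates your command and prints the result
1 + 1
## [1] 2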

We will start by installing the packages that we will use in today’s workshop. In your console, enter the following:

install.packages("tidyverse")
install.packages("tidymodels")
install.packages("textrecipes")
install.packages("skimr")
install.packages("glmnet")
install.packages("randomForest")

Overview

Here is the general workflow for our machine learning project:

  1. Pre-Process Data: import, inspect and tidy data (skimr and tidyverse); transform it (textrecipes and tidymodels)

  2. Train Models: train the model with training data (tidyverse, glmnet, and randomForest)

  3. Inspect and Evaluate Models: inspect trained model, predict test data, get accuracy rates (tidymodels)

1. Pre-Process Data

1.1. Import and Tidy

For this workshop we are going to build a binary classifier with Twitter data. The dataset we are using today comprises tweets posted by Donald Trump and Hillary Clinton in 2016, during the US Presidential Election.

We are going to start by downloading the data from Kaggle.

First, we load tidyverse.

# load tidyverse
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Always check where your data is. You can do that with the dir() function, passing the name of the folder where you believe your data should be.

# check if data is in the expected folder
dir("data")
## [1] "tweets.csv"

After ensuring you have the path (i.e., folder name) to your data, load your data with the read_csv() function (note that this function is different from read.csv()).

# load data
tweets <- read_csv("data/tweets.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   id = col_double(),
##   is_retweet = col_logical(),
##   time = col_datetime(format = ""),
##   in_reply_to_status_id = col_double(),
##   in_reply_to_user_id = col_double(),
##   is_quote_status = col_logical(),
##   retweet_count = col_double(),
##   favorite_count = col_double(),
##   longitude = col_double(),
##   latitude = col_double(),
##   truncated = col_logical()
## )
## See spec(...) for full column specifications.

Always inspect your data. Use head() to check the first six rows of data.

# check first 6 rows of data
head(tweets)
## # A tibble: 6 x 28
##        id handle text  is_retweet original_author time               
##     <dbl> <chr>  <chr> <lgl>      <chr>           <dttm>             
## 1 7.81e17 Hilla… "The… FALSE      <NA>            2016-09-28 00:22:34
## 2 7.81e17 Hilla… "Las… TRUE       timkaine        2016-09-27 23:45:00
## 3 7.81e17 Hilla… "Cou… TRUE       POTUS           2016-09-27 23:26:40
## 4 7.81e17 Hilla… "If … FALSE      <NA>            2016-09-27 23:08:41
## 5 7.81e17 Hilla… "Bot… FALSE      <NA>            2016-09-27 22:30:27
## 6 7.81e17 realD… "Joi… FALSE      <NA>            2016-09-27 22:13:24
## # … with 22 more variables: in_reply_to_screen_name <chr>,
## #   in_reply_to_status_id <dbl>, in_reply_to_user_id <dbl>,
## #   is_quote_status <lgl>, lang <chr>, retweet_count <dbl>,
## #   favorite_count <dbl>, longitude <dbl>, latitude <dbl>, place_id <chr>,
## #   place_full_name <chr>, place_name <chr>, place_type <chr>,
## #   place_country_code <chr>, place_country <chr>,
## #   place_contained_within <chr>, place_attributes <chr>,
## #   place_bounding_box <chr>, source_url <chr>, truncated <lgl>,
## #   entities <chr>, extended_entities <chr>

I also like to use the skim() function from the skimr library.

# inspect data
library(skimr)
skim(tweets)
Data summary

Name                    tweets
Number of rows          6444
Number of columns       28
Column type frequency:
  character             17
  logical               3
  numeric               7
  POSIXct               1
Group variables         None

Variable type: character

skim_variable            n_missing  complete_rate  min   max  empty  n_unique  whitespace
handle                           0           1.00   14    15      0         2           0
text                             0           1.00   14   148      0      6434           0
original_author               5722           0.11    3    15      0       278           0
in_reply_to_screen_name       6236           0.03    3    15      0        11           0
lang                             0           1.00    2     3      0         8           0
place_id                      6240           0.03   16    16      0        85           0
place_full_name               6240           0.03    8    45      0        85           0
place_name                    6240           0.03    4    45      0        85           0
place_type                    6240           0.03    3    12      0         5           0
place_country_code            6240           0.03    2     2      0         2           0
place_country                 6240           0.03   13    14      0         2           0
place_contained_within        6240           0.03    2     2      0         1           0
place_attributes              6240           0.03    2     2      0         1           0
place_bounding_box            6240           0.03  185   259      0        85           0
source_url                       0           1.00   18    44      0         8           0
entities                         0           1.00   64  1531      0      4632           0
extended_entities             5096           0.21  613  3581      0      1348           0

Variable type: logical

skim_variable    n_missing  complete_rate  mean  count
is_retweet               0              1  0.11  FAL: 5722, TRU: 722
is_quote_status          0              1  0.03  FAL: 6234, TRU: 210
truncated                0              1  0.01  FAL: 6404, TRU: 40

Variable type: numeric

skim_variable          n_missing  complete_rate           mean            sd             p0            p25            p50            p75           p100  hist
id                             0           1.00   7.413582e+17  2.697548e+16   6.842170e+17   7.226281e+17   7.464104e+17   7.616921e+17   7.809256e+17  ▃▃▆▇▇
in_reply_to_status_id       6242           0.03   7.654779e+17  1.756255e+16   7.217700e+17   7.634849e+17   7.755059e+17   7.767838e+17   7.808319e+17  ▁▂▁▁▇
in_reply_to_user_id         6236           0.03   1.282416e+09  2.622182e+08   2.176524e+07   1.339836e+09   1.339836e+09   1.339836e+09   1.536792e+09  ▁▁▁▁▇
retweet_count                  0           1.00   4.396180e+03  8.162690e+03   1.230000e+02   1.457500e+03   2.825000e+03   5.403500e+03   4.901800e+05  ▇▁▁▁▁
favorite_count                 0           1.00   1.165068e+04  1.499807e+04   2.740000e+02   3.866250e+03   7.696500e+03   1.511825e+04   6.603840e+05  ▇▁▁▁▁
longitude                   6432           0.00  -8.529000e+01  1.606000e+01  -1.184100e+02  -9.033000e+01  -7.470000e+01  -7.389000e+01  -7.388000e+01  ▂▁▁▂▇
latitude                    6432           0.00   3.931000e+01  3.120000e+00   3.345000e+01   3.930000e+01   4.071000e+01   4.077000e+01   4.199000e+01  ▂▁▁▁▇

Variable type: POSIXct

skim_variable  n_missing  complete_rate  min                  max                  median               n_unique
time                   0              1  2016-01-05 03:36:53  2016-09-28 00:22:34  2016-06-24 18:31:27      6443

Let’s check our labels (remember we are building a binary classifier).

# who's tweeting?
tweets %>%
  count(handle) 
## # A tibble: 2 x 2
##   handle              n
##   <chr>           <int>
## 1 HillaryClinton   3226
## 2 realDonaldTrump  3218
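
The two labels are almost perfectly balanced (3226 vs. 3218 tweets), which is convenient for a binary classifier. A quick sketch to see this as proportions:

# proportion of tweets per label
tweets %>%
  count(handle) %>%
  mutate(prop = n / sum(n))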

We can also check other variables, including the period during which these tweets were posted.

# when were they tweeting?
min(tweets$time)
## [1] "2016-01-05 03:36:53 UTC"
max(tweets$time)
## [1] "2016-09-28 00:22:34 UTC"

In this workshop we will focus on the text only, so we’ll keep just the label (i.e., handle, which identifies Trump or Clinton) and the text.

# tidy data to variables of interest
clean_tweets <- tweets %>%
  select(handle, text)

Here’s what the first six rows look like now.

head(clean_tweets)
## # A tibble: 6 x 2
##   handle         text                                                           
##   <chr>          <chr>                                                          
## 1 HillaryClinton "The question in this election: Who can put the plans into act…
## 2 HillaryClinton "Last night, Donald Trump said not paying taxes was \"smart.\"…
## 3 HillaryClinton "Couldn't be more proud of @HillaryClinton. Her vision and com…
## 4 HillaryClinton "If we stand together, there's nothing we can't do. \n\nMake s…
## 5 HillaryClinton "Both candidates were asked about how they'd confront racial i…
## 6 realDonaldTru… "Join me for a 3pm rally - tomorrow at the Mid-America Center …

Here’s an overview of the data we are going to be working with.

skim(clean_tweets)
Data summary

Name                    clean_tweets
Number of rows          6444
Number of columns       2
Column type frequency:
  character             2
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
handle                 0              1   14   15      0         2           0
text                   0              1   14  148      0      6434           0

1.2. Sample and Transform

Now that we have our data ready with our label (i.e. handle) and the text we want to use in our classifier, let’s load the tidymodels library for some magic.

library(tidymodels)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom     0.7.0      ✓ recipes   0.1.13
## ✓ dials     0.0.8      ✓ rsample   0.0.7 
## ✓ infer     0.5.3      ✓ tune      0.1.1 
## ✓ modeldata 0.0.2      ✓ workflows 0.1.3 
## ✓ parsnip   0.1.3      ✓ yardstick 0.0.7
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()

Since we will be randomizing and dividing our data into two data sets (i.e. training and test), we need to set a seed to make our results reproducible. That ensures that every time you run your code, you get the same set of tweets for each of your data sets. You can choose any number you want.

# set seed
set.seed(42)

We’ll divide our tweets into two data sets. For the training data, we will get 80% of the tweets. For the test data, we will get 20% of the tweets.

# rsample creates a randomized training and test split of the original data
# first, determine the distribution of your split
data_split <- rsample::initial_split(data = clean_tweets,
                                     prop = 0.80)
data_split
## <Analysis/Assess/Total>
## <5156/1288/6444>

Once we have our data split determined (the output above, <Analysis/Assess/Total>, tells us we have 5156 training rows, 1288 test rows, and 6444 rows in total), we can create our two data sets using the training() and testing() functions.

# second, create each data set
train_data <- training(data_split)
test_data <- testing(data_split)

We need to “translate” our text into numeric variables, and the textrecipes package makes this process much easier. Our data transformation “recipe” first tokenizes the text (step_tokenize()). Then we filter the tokens, keeping only those that occur at least 50 times (step_tokenfilter()). The final step computes the term frequency of each remaining token in each tweet (step_tf()).

# now we create our "recipe" (i.e., feature engineering)
library(textrecipes)
my_rec <- recipe(handle ~ text, data = train_data) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, min_times = 50) %>%
  step_tf(text) %>%
  prep()

Now that we have our recipe ready, we can transform our training and testing data.

# apply pre processing to data
prepped_train_data <- juice(my_rec)
prepped_test_data <- bake(my_rec, test_data)

Take a look at both data sets.

head(prepped_train_data)
## # A tibble: 6 x 101
##   handle tf_text_a tf_text_about tf_text_again tf_text_all tf_text_america
##   <fct>      <dbl>         <dbl>         <dbl>       <dbl>           <dbl>
## 1 Hilla…         0             0             0           0               0
## 2 Hilla…         0             0             0           0               0
## 3 Hilla…         0             0             0           0               0
## 4 Hilla…         1             1             0           0               0
## 5 realD…         1             0             0           0               1
## 6 Hilla…         0             0             0           0               0
## # … with 95 more variables: tf_text_american <dbl>, tf_text_amp <dbl>,
## #   tf_text_an <dbl>, tf_text_and <dbl>, tf_text_are <dbl>, tf_text_as <dbl>,
## #   tf_text_at <dbl>, tf_text_back <dbl>, tf_text_be <dbl>, tf_text_been <dbl>,
## #   tf_text_big <dbl>, tf_text_but <dbl>, tf_text_by <dbl>,
## #   tf_text_campaign <dbl>, tf_text_can <dbl>, tf_text_clinton <dbl>,
## #   tf_text_country <dbl>, tf_text_crooked <dbl>, tf_text_cruz <dbl>,
## #   tf_text_do <dbl>, tf_text_donald <dbl>, tf_text_for <dbl>,
## #   tf_text_from <dbl>, tf_text_get <dbl>, tf_text_going <dbl>,
## #   tf_text_great <dbl>, tf_text_has <dbl>, tf_text_have <dbl>,
## #   tf_text_he <dbl>, tf_text_her <dbl>, tf_text_hillary <dbl>,
## #   tf_text_his <dbl>, tf_text_how <dbl>, tf_text_https <dbl>, tf_text_i <dbl>,
## #   tf_text_if <dbl>, tf_text_in <dbl>, tf_text_is <dbl>, tf_text_it <dbl>,
## #   tf_text_just <dbl>, tf_text_like <dbl>, tf_text_make <dbl>,
## #   tf_text_makeamericagreatagain <dbl>, tf_text_many <dbl>, tf_text_me <dbl>,
## #   tf_text_more <dbl>, tf_text_my <dbl>, tf_text_never <dbl>,
## #   tf_text_new <dbl>, tf_text_no <dbl>, tf_text_not <dbl>, tf_text_now <dbl>,
## #   tf_text_of <dbl>, tf_text_on <dbl>, tf_text_one <dbl>, tf_text_only <dbl>,
## #   tf_text_or <dbl>, tf_text_our <dbl>, tf_text_out <dbl>,
## #   tf_text_people <dbl>, tf_text_potus <dbl>, tf_text_president <dbl>,
## #   tf_text_realdonaldtrump <dbl>, tf_text_she <dbl>, tf_text_should <dbl>,
## #   tf_text_so <dbl>, tf_text_t.co <dbl>, tf_text_than <dbl>,
## #   tf_text_thank <dbl>, tf_text_that <dbl>, tf_text_the <dbl>,
## #   tf_text_their <dbl>, tf_text_them <dbl>, tf_text_they <dbl>,
## #   tf_text_this <dbl>, tf_text_time <dbl>, tf_text_to <dbl>,
## #   tf_text_today <dbl>, tf_text_trump <dbl>, `tf_text_trump's` <dbl>,
## #   tf_text_trump2016 <dbl>, tf_text_up <dbl>, tf_text_us <dbl>,
## #   tf_text_very <dbl>, tf_text_vote <dbl>, tf_text_was <dbl>,
## #   tf_text_we <dbl>, tf_text_what <dbl>, tf_text_when <dbl>,
## #   tf_text_who <dbl>, tf_text_will <dbl>, tf_text_with <dbl>,
## #   tf_text_would <dbl>, tf_text_you <dbl>, tf_text_your <dbl>
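
You can inspect prepped_test_data the same way. As a quick sanity check (a minimal sketch), both data sets should end up with the same set of engineered feature columns:

# sanity check: train and test data have the same feature columns
ncol(prepped_train_data)
ncol(prepped_test_data)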

Now we are ready to train our models.

2. Train Models

We are training two models in this workshop, each with its own engine: a logistic regression using glmnet (Lasso and Elastic-Net Regularized Generalized Linear Models) and a random forest using randomForest (Breiman and Cutler’s Random Forests for Classification and Regression).

2.1. Logistic Regression

First we define a logistic regression model with the glmnet engine. The parameters mixture and penalty are hyperparameters that would normally be tuned to optimize the model (tuning will not be covered in this workshop); mixture = 0 corresponds to ridge (L2) regularization, and penalty sets the amount of regularization.

# define model GLMNET
glmnet_model <- logistic_reg(mixture = 0, penalty = 0.1) %>%
  set_engine("glmnet")

Then we train the model, defining our response variable (i.e., our two labels in handle) and our predictor variables (i.e., all of the other columns, hence the .). Our data is our prepared training data.

# train model
my_glm_model <- glmnet_model %>%
  fit(handle ~ ., data = prepped_train_data)

It’s that simple. Let’s train our second model before moving on to evaluating and predicting with this first model.

2.2. Random Forests

The process is similar for this model. We first define a random forest model with the randomForest engine in classification mode (our response variable is categorical). The number of trees is a hyperparameter.

# define model random forests
rndm_frst_model <- rand_forest(trees = 100, mode = "classification") %>%
  set_engine("randomForest")

As before, we train the model by defining our response variable (handle) and our predictor variables (all of the other columns, hence the .), with our prepared training data.

# train model
my_rndm_frst_model <- rndm_frst_model %>%
  fit(handle ~ ., data = prepped_train_data)

We have our two models trained.

3. Inspect and Evaluate models

3.1. Logistic Regression

For our logistic regression model, we can look at the coefficients of each feature (in this case, each individual token is a feature). I like the broom library because its tidy() function transforms models into nice (i.e., tidy) tables (i.e., tibbles). We will filter the model coefficients to keep only those from the last step of the regularization path (i.e., step 100). Let’s also remove the intercept and clean up the feature names to keep only the tokens themselves.

library(broom)
my_coefs <- my_glm_model$fit %>%
  tidy() %>%
  filter(step == 100) %>%
  filter(term != "(Intercept)") %>%
  mutate(term = gsub('tf_text_', '', term))

We will visualize the top 20 features (i.e., the tokens with the highest absolute coefficients) for each label. The reference label is assigned alphabetically (HillaryClinton = 0, realDonaldTrump = 1). As such, positive coefficients belong to features that increase the probability of a tweet being classified as authored by Trump.

# visualize coefs
my_coefs %>%
  group_by(estimate > 0) %>%
  top_n(20, abs(estimate)) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col() +
  coord_flip() +
  labs(
    x = NULL,
    title = "Features with highest (absolute) coefficients"
  )

Some performance measures that can be extracted from your model are accuracy (is the predicted label the same as the truth?), precision (how many of the predicted positives were actually true positives, i.e., TP / (TP + FP)), and recall (how many of the true positives were actually detected as positives by the model, i.e., TP / (TP + FN)).

# evaluate model
my_glm_model %>%
  predict(new_data = prepped_test_data) %>%
  mutate(truth = prepped_test_data$handle) %>%
  accuracy(truth, .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.836
my_glm_model %>%
  predict(new_data = prepped_test_data) %>%
  mutate(truth = prepped_test_data$handle) %>%
  precision(truth, .pred_class)
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.814
my_glm_model %>%
  predict(new_data = prepped_test_data) %>%
  mutate(truth = prepped_test_data$handle) %>%
  recall(truth, .pred_class)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.869
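
If you want the raw counts behind these three measures, yardstick (loaded with tidymodels) also provides a confusion matrix via conf_mat(); a minimal sketch:

# cross-tabulate predicted labels against the truth
my_glm_model %>%
  predict(new_data = prepped_test_data) %>%
  mutate(truth = prepped_test_data$handle) %>%
  conf_mat(truth, .pred_class)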

I also create a data frame with the truth, the prediction, and the probability for each label, for a more in-depth analysis of errors.

We start with predicting just the labels for the test data.

# create prediction labels for logistic regression model
handle_glm_prediction <- my_glm_model %>%
  predict(new_data = prepped_test_data)

head(handle_glm_prediction)
## # A tibble: 6 x 1
##   .pred_class    
##   <fct>          
## 1 HillaryClinton 
## 2 HillaryClinton 
## 3 HillaryClinton 
## 4 realDonaldTrump
## 5 HillaryClinton 
## 6 HillaryClinton

We then predict probabilities instead.

# create probabilities for each label
probs_glm_prediction <- my_glm_model %>%
  predict(new_data = prepped_test_data,
          type = "prob")

head(probs_glm_prediction)
## # A tibble: 6 x 2
##   .pred_HillaryClinton .pred_realDonaldTrump
##                  <dbl>                 <dbl>
## 1                0.762                0.238 
## 2                0.916                0.0838
## 3                0.708                0.292 
## 4                0.268                0.732 
## 5                0.704                0.296 
## 6                0.686                0.314

Finally, we combine the predicted labels and predicted probabilities with the original testing data frame.

# combine everything
model_predictions <- bind_cols(test_data,
                               handle_glm_prediction,
                               probs_glm_prediction) %>%
  mutate(accurate = (handle == .pred_class))

head(model_predictions)
## # A tibble: 6 x 6
##   handle   text           .pred_class .pred_HillaryCl… .pred_realDonal… accurate
##   <chr>    <chr>          <fct>                  <dbl>            <dbl> <lgl>   
## 1 Hillary… "Last night, … HillaryCli…            0.762           0.238  TRUE    
## 2 Hillary… "When Donald … HillaryCli…            0.916           0.0838 TRUE    
## 3 Hillary… "3) Has Trump… HillaryCli…            0.708           0.292  TRUE    
## 4 realDon… "Great aftern… realDonald…            0.268           0.732  TRUE    
## 5 Hillary… "It's #Nation… HillaryCli…            0.704           0.296  TRUE    
## 6 Hillary… "When you wor… HillaryCli…            0.686           0.314  TRUE
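
With this data frame in hand, error analysis is a filter away. For example, to inspect the misclassified tweets:

# inspect the tweets the model got wrong
model_predictions %>%
  filter(!accurate) %>%
  select(handle, .pred_class, text) %>%
  head()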

3.2. Random Forests

For performance measures, the process to evaluate the random forest model is the same.

# evaluate model
my_rndm_frst_model %>%
  predict(new_data = prepped_test_data) %>%
  mutate(truth = prepped_test_data$handle) %>%
  accuracy(truth, .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.858
my_rndm_frst_model %>%
  predict(new_data = prepped_test_data) %>%
  mutate(truth = prepped_test_data$handle) %>%
  precision(truth, .pred_class)
## # A tibble: 1 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.832
my_rndm_frst_model %>%
  predict(new_data = prepped_test_data) %>%
  mutate(truth = prepped_test_data$handle) %>%
  recall(truth, .pred_class)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.894

This model does not provide coefficients for each feature. Instead, it creates a number of decision trees (in our case, 100), each with a decision threshold at every node. If the answer to “does this feature surpass the threshold?” is no, you move to the daughter node on the left; if it is yes, you go to the daughter node on the right.

We can get the total number of nodes in each of the 100 trees the model created.

# ndbigtree gives us the number of nodes in the tree
my_rndm_frst_model$fit$forest$ndbigtree
##   [1] 1957 1809 1921 1903 1841 1887 1775 1887 1839 1775 1865 1949 1929 1875 1977
##  [16] 1835 1731 1737 1835 1955 1787 1881 1823 1873 1881 1767 1729 1901 1717 1829
##  [31] 1831 1785 1767 1891 1957 1965 1723 1941 1937 1943 1889 1703 1767 1861 1867
##  [46] 1933 1891 1869 1791 1797 1869 1871 1911 1847 1841 1869 2017 1937 1821 1887
##  [61] 1753 1829 1867 1815 1743 1881 1719 1791 1743 1955 1945 1819 1899 1919 1887
##  [76] 1817 1763 1757 1755 1827 1865 1785 1865 1957 1903 1821 1779 1779 1881 1909
##  [91] 1743 1981 1817 2001 1945 1847 1887 1957 1891 1955

Let’s take a look at the first 10 nodes in the last tree.

# check individual trees
randomForest::getTree(my_rndm_frst_model$fit, k = 100, labelVar = TRUE) %>%
  head(n = 10) %>%
  mutate(`split var` = gsub("tf_text_", "", `split var`))
##    left daughter right daughter       split var split point status prediction
## 1              2              3              be         0.5      1       <NA>
## 2              4              5          donald         0.5      1       <NA>
## 3              6              7               i         0.5      1       <NA>
## 4              8              9              we         0.5      1       <NA>
## 5             10             11              he         0.5      1       <NA>
## 6             12             13            t.co         0.5      1       <NA>
## 7             14             15              at         0.5      1       <NA>
## 8             16             17         hillary         0.5      1       <NA>
## 9             18             19             all         0.5      1       <NA>
## 10            20             21 realdonaldtrump         0.5      1       <NA>

We can create a data frame of prediction information, like we did for the first model. We follow the same steps. First we create the label predictions.

# create prediction labels
handle_rndm_frst_prediction <- my_rndm_frst_model %>%
  predict(new_data = prepped_test_data)

head(handle_rndm_frst_prediction)
## # A tibble: 6 x 1
##   .pred_class    
##   <fct>          
## 1 HillaryClinton 
## 2 HillaryClinton 
## 3 HillaryClinton 
## 4 realDonaldTrump
## 5 HillaryClinton 
## 6 HillaryClinton

Then the probability predictions.

# create probabilities for each label
probs_rndm_frst_prediction <- my_rndm_frst_model %>%
  predict(new_data = prepped_test_data,
          type = "prob")

head(probs_rndm_frst_prediction)
## # A tibble: 6 x 2
##   .pred_HillaryClinton .pred_realDonaldTrump
##                  <dbl>                 <dbl>
## 1                 0.91                  0.09
## 2                 0.87                  0.13
## 3                 0.72                  0.28
## 4                 0.09                  0.91
## 5                 0.88                  0.12
## 6                 0.88                  0.12

Then we put everything together with the original test data frame.

# combine everything
model_predictions <- bind_cols(test_data,
                               handle_rndm_frst_prediction,
                               probs_rndm_frst_prediction) %>%
  mutate(accurate = (handle == .pred_class))

head(model_predictions)
## # A tibble: 6 x 6
##   handle   text           .pred_class .pred_HillaryCl… .pred_realDonal… accurate
##   <chr>    <chr>          <fct>                  <dbl>            <dbl> <lgl>   
## 1 Hillary… "Last night, … HillaryCli…             0.91             0.09 TRUE    
## 2 Hillary… "When Donald … HillaryCli…             0.87             0.13 TRUE    
## 3 Hillary… "3) Has Trump… HillaryCli…             0.72             0.28 TRUE    
## 4 realDon… "Great aftern… realDonald…             0.09             0.91 TRUE    
## 5 Hillary… "It's #Nation… HillaryCli…             0.88             0.12 TRUE    
## 6 Hillary… "When you wor… HillaryCli…             0.88             0.12 TRUE

That’s all I had planned for this workshop. To review, these are the basic steps we covered in today’s workshop:

  1. Pre-Process Data: import, inspect and tidy data (skimr and tidyverse); transform it (textrecipes and tidymodels)

  2. Train Models: train the model with training data (tidyverse, glmnet, and randomForest)

  3. Inspect and Evaluate Models: inspect trained model, predict test data, get accuracy rates (tidymodels)

With a modular workflow, you can now make changes to your code in step 1 and run all the other steps as they are.

Ideas on How to Change your Recipe
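
As a starting point, here is a minimal sketch of one possible change: weighting tokens by tf-idf instead of raw term frequency, and capping the vocabulary size with max_tokens (both step_tfidf() and max_tokens come from textrecipes; the 500-token cutoff is an arbitrary choice to experiment with):

# a variation on our recipe: tf-idf weights and a capped vocabulary
my_rec_tfidf <- recipe(handle ~ text, data = train_data) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  prep()

# re-create the prepared data sets, then re-run steps 2 and 3 as they are
prepped_train_data <- juice(my_rec_tfidf)
prepped_test_data <- bake(my_rec_tfidf, test_data)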

List of Resources

About Me

My name is Adriana Picoral and I’m an assistant professor of data science in the School of Information at the University of Arizona. I’m also the founder of the R-Ladies Tucson chapter.