This workshop is part of the LAEL Research Bazaar, which is a celebration of the golden jubilee of the Graduate Program in Applied Linguistics and Language Studies (LAEL), at the Pontifical Catholic University of São Paulo (PUCSP), Brazil.
If you arrived at this page before August 7, 2020, you can register for the Zoom synchronous session. If you arrived here at a later date, you can watch the recorded session on Facebook.
If you are running your R code on your computer, you need to install both R and RStudio. Alternatively, you can create a free account at http://rstudio.cloud and run your R code in the cloud. Either way, we will be using the same IDE (i.e., RStudio).
What’s an IDE? IDE stands for integrated development environment, and its goal is to facilitate coding by integrating a text editor, a console and other tools into one window.
To check which version of R you are running, type

sessionInfo()

in your console.

How often should I update R and RStudio? Always make sure that you have the latest versions of R, RStudio, and the packages you’re using in your code, to ensure you are not running into bugs caused by having older versions installed on your computer.
When asked about updating, Jenny Bryan summarized the importance of maintaining your system: “You will always eventually have a reason that you must update. So you can either do that very infrequently, suffer with old versions in the middle, and experience great pain at update. Or admit that maintaining your system is a normal ongoing activity, and do it more often.”
You can also keep your packages up to date by clicking on “Tools” on your RStudio top menu bar and selecting “Check for Package Updates…”
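If you prefer to check versions from the console, here is a small base-R sketch (the version threshold below is just an example, not a requirement of this workshop):

```r
# check which version of R you are currently running
getRversion()

# warn if R is older than some required version (example threshold)
required <- "4.0.0"
if (getRversion() < required) {
  warning("Please update R to at least version ", required)
}

# check the installed version of a specific package (stats ships with R)
packageVersion("stats")
```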
We are using RStudio as our IDE for this workshop. You can either download and install R and RStudio on your computer (for instructions on how to do so, see the “Before we start” section) or create a free account at http://rstudio.cloud and run your R code in the cloud.
Please ensure you have the latest versions of R and RStudio; otherwise some packages we are using for this workshop will not install correctly.
When you open RStudio, here’s what you see:
Your console (i.e., where you run commands) will show up on the left. The character > in your console indicates it’s ready to receive a command. Type your command where you see > and press ENTER (a.k.a. RETURN).
We will start by installing the packages that we will use in today’s workshop. In your console, enter the following.
install.packages("tidyverse")
install.packages("tidymodels")
install.packages("textrecipes")
install.packages("skimr")
install.packages("glmnet")
install.packages("randomForest")
Here is the general workflow for our machine learning project:
1. Pre-Process Data: import, inspect and tidy data (skimr and tidyverse); transform it (textrecipes and tidymodels)
2. Train Models: train the models with the training data (tidyverse, glmnet, and randomForest)
3. Inspect and Evaluate Models: inspect the trained models, predict the test data, get accuracy rates (tidymodels)
For this workshop we are going to build a binary classifier with Twitter data. The dataset we are using today comprises tweets by Donald Trump and Hillary Clinton during the 2016 US Presidential Election.
We start by downloading the data from Kaggle.
First, we load tidyverse.
# load tidyverse
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Always check where your data is. You can do that with the dir() function and the name of the folder where you believe your data should be.
# check if data is in the expected folder
dir("data")
## [1] "tweets.csv"
After ensuring you have the path (i.e., folder name) to your data, load your data with the read_csv() function (note that this function is different from base R’s read.csv()).
# load data
tweets <- read_csv("data/tweets.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## id = col_double(),
## is_retweet = col_logical(),
## time = col_datetime(format = ""),
## in_reply_to_status_id = col_double(),
## in_reply_to_user_id = col_double(),
## is_quote_status = col_logical(),
## retweet_count = col_double(),
## favorite_count = col_double(),
## longitude = col_double(),
## latitude = col_double(),
## truncated = col_logical()
## )
## See spec(...) for full column specifications.
Always inspect your data. Use head() to check the first six rows of the data.
# check first 6 rows of data
head(tweets)
## # A tibble: 6 x 28
## id handle text is_retweet original_author time
## <dbl> <chr> <chr> <lgl> <chr> <dttm>
## 1 7.81e17 Hilla… "The… FALSE <NA> 2016-09-28 00:22:34
## 2 7.81e17 Hilla… "Las… TRUE timkaine 2016-09-27 23:45:00
## 3 7.81e17 Hilla… "Cou… TRUE POTUS 2016-09-27 23:26:40
## 4 7.81e17 Hilla… "If … FALSE <NA> 2016-09-27 23:08:41
## 5 7.81e17 Hilla… "Bot… FALSE <NA> 2016-09-27 22:30:27
## 6 7.81e17 realD… "Joi… FALSE <NA> 2016-09-27 22:13:24
## # … with 22 more variables: in_reply_to_screen_name <chr>,
## # in_reply_to_status_id <dbl>, in_reply_to_user_id <dbl>,
## # is_quote_status <lgl>, lang <chr>, retweet_count <dbl>,
## # favorite_count <dbl>, longitude <dbl>, latitude <dbl>, place_id <chr>,
## # place_full_name <chr>, place_name <chr>, place_type <chr>,
## # place_country_code <chr>, place_country <chr>,
## # place_contained_within <chr>, place_attributes <chr>,
## # place_bounding_box <chr>, source_url <chr>, truncated <lgl>,
## # entities <chr>, extended_entities <chr>
I also like to use the skim() function from the skimr library.
# inspect data
library(skimr)
skim(tweets)
Name | tweets |
Number of rows | 6444 |
Number of columns | 28 |
_______________________ | |
Column type frequency: | |
character | 17 |
logical | 3 |
numeric | 7 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
handle | 0 | 1.00 | 14 | 15 | 0 | 2 | 0 |
text | 0 | 1.00 | 14 | 148 | 0 | 6434 | 0 |
original_author | 5722 | 0.11 | 3 | 15 | 0 | 278 | 0 |
in_reply_to_screen_name | 6236 | 0.03 | 3 | 15 | 0 | 11 | 0 |
lang | 0 | 1.00 | 2 | 3 | 0 | 8 | 0 |
place_id | 6240 | 0.03 | 16 | 16 | 0 | 85 | 0 |
place_full_name | 6240 | 0.03 | 8 | 45 | 0 | 85 | 0 |
place_name | 6240 | 0.03 | 4 | 45 | 0 | 85 | 0 |
place_type | 6240 | 0.03 | 3 | 12 | 0 | 5 | 0 |
place_country_code | 6240 | 0.03 | 2 | 2 | 0 | 2 | 0 |
place_country | 6240 | 0.03 | 13 | 14 | 0 | 2 | 0 |
place_contained_within | 6240 | 0.03 | 2 | 2 | 0 | 1 | 0 |
place_attributes | 6240 | 0.03 | 2 | 2 | 0 | 1 | 0 |
place_bounding_box | 6240 | 0.03 | 185 | 259 | 0 | 85 | 0 |
source_url | 0 | 1.00 | 18 | 44 | 0 | 8 | 0 |
entities | 0 | 1.00 | 64 | 1531 | 0 | 4632 | 0 |
extended_entities | 5096 | 0.21 | 613 | 3581 | 0 | 1348 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
is_retweet | 0 | 1 | 0.11 | FAL: 5722, TRU: 722 |
is_quote_status | 0 | 1 | 0.03 | FAL: 6234, TRU: 210 |
truncated | 0 | 1 | 0.01 | FAL: 6404, TRU: 40 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 7.413582e+17 | 2.697548e+16 | 6.842170e+17 | 7.226281e+17 | 7.464104e+17 | 7.616921e+17 | 7.809256e+17 | ▃▃▆▇▇ |
in_reply_to_status_id | 6242 | 0.03 | 7.654779e+17 | 1.756255e+16 | 7.217700e+17 | 7.634849e+17 | 7.755059e+17 | 7.767838e+17 | 7.808319e+17 | ▁▂▁▁▇ |
in_reply_to_user_id | 6236 | 0.03 | 1.282416e+09 | 2.622182e+08 | 2.176524e+07 | 1.339836e+09 | 1.339836e+09 | 1.339836e+09 | 1.536792e+09 | ▁▁▁▁▇ |
retweet_count | 0 | 1.00 | 4.396180e+03 | 8.162690e+03 | 1.230000e+02 | 1.457500e+03 | 2.825000e+03 | 5.403500e+03 | 4.901800e+05 | ▇▁▁▁▁ |
favorite_count | 0 | 1.00 | 1.165068e+04 | 1.499807e+04 | 2.740000e+02 | 3.866250e+03 | 7.696500e+03 | 1.511825e+04 | 6.603840e+05 | ▇▁▁▁▁ |
longitude | 6432 | 0.00 | -8.529000e+01 | 1.606000e+01 | -1.184100e+02 | -9.033000e+01 | -7.470000e+01 | -7.389000e+01 | -7.388000e+01 | ▂▁▁▂▇ |
latitude | 6432 | 0.00 | 3.931000e+01 | 3.120000e+00 | 3.345000e+01 | 3.930000e+01 | 4.071000e+01 | 4.077000e+01 | 4.199000e+01 | ▂▁▁▁▇ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
time | 0 | 1 | 2016-01-05 03:36:53 | 2016-09-28 00:22:34 | 2016-06-24 18:31:27 | 6443 |
Let’s check our labels (remember we are building a binary classifier).
# who's tweeting?
tweets %>%
count(handle)
## # A tibble: 2 x 2
## handle n
## <chr> <int>
## 1 HillaryClinton 3226
## 2 realDonaldTrump 3218
We can also check other variables, including the period during which these tweets were sent.
# when were they tweeting?
min(tweets$time)
## [1] "2016-01-05 03:36:53 UTC"
max(tweets$time)
## [1] "2016-09-28 00:22:34 UTC"
In this workshop we will focus on the text only, so we’ll keep the label (i.e., Trump and Clinton) and the text.
# tidy data to variables of interest
clean_tweets <- tweets %>%
select(handle, text)
Here’s what the first six rows look like now.
head(clean_tweets)
## # A tibble: 6 x 2
## handle text
## <chr> <chr>
## 1 HillaryClinton "The question in this election: Who can put the plans into act…
## 2 HillaryClinton "Last night, Donald Trump said not paying taxes was \"smart.\"…
## 3 HillaryClinton "Couldn't be more proud of @HillaryClinton. Her vision and com…
## 4 HillaryClinton "If we stand together, there's nothing we can't do. \n\nMake s…
## 5 HillaryClinton "Both candidates were asked about how they'd confront racial i…
## 6 realDonaldTru… "Join me for a 3pm rally - tomorrow at the Mid-America Center …
Here’s an overview of the data we are going to be working with.
skim(clean_tweets)
Name | clean_tweets |
Number of rows | 6444 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
character | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
handle | 0 | 1 | 14 | 15 | 0 | 2 | 0 |
text | 0 | 1 | 14 | 148 | 0 | 6434 | 0 |
Now that we have our data ready with our label (i.e. handle) and the text we want to use in our classifier, let’s load the tidymodels
library for some magic.
library(tidymodels)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom 0.7.0 ✓ recipes 0.1.13
## ✓ dials 0.0.8 ✓ rsample 0.0.7
## ✓ infer 0.5.3 ✓ tune 0.1.1
## ✓ modeldata 0.0.2 ✓ workflows 0.1.3
## ✓ parsnip 0.1.3 ✓ yardstick 0.0.7
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
Since we will be randomizing and dividing our data into two data sets (i.e. training and test), we need to set a seed to make our results reproducible. That ensures that every time you run your code, you get the same set of tweets for each of your data sets. You can choose any number you want.
# set seed
set.seed(42)
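To see what the seed buys us, here is a quick base-R illustration: the same seed always produces the same “random” sample.

```r
# same seed, same sample
set.seed(42)
first_draw <- sample(1:100, 5)

set.seed(42)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)
## [1] TRUE
```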
We’ll divide our tweets into two data sets. For the training data, we will get 80% of the tweets. For the test data, we will get 20% of the tweets.
# rsample creates a randomized training and test split of the original data
# first, determine the distribution of your split
data_split <- rsample::initial_split(data = clean_tweets,
prop = 0.80)
data_split
## <Analysis/Assess/Total>
## <5156/1288/6444>
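The numbers above come from rsample, but the arithmetic of an 80/20 split is easy to reproduce in base R (a rough sketch; rsample handles the bookkeeping for us and may round the split slightly differently):

```r
# a rough base-R sketch of an 80/20 split over 6444 rows
n <- 6444
train_idx <- sample(n, size = floor(0.8 * n))

length(train_idx)      # rows that would go to the training set
## [1] 5155
n - length(train_idx)  # rows left for the test set
## [1] 1289
```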
Once we have our data split determined, we can create our two data sets using the training() and testing() functions.
# second, create each data set
train_data <- training(data_split)
test_data <- testing(data_split)
We need to “translate” our text into numeric variables. The textrecipes package makes this process much easier. Our “recipe” for the data transformation first tokenizes the text (step_tokenize()). Then we filter our tokens, keeping only those with a minimum frequency of 50 (step_tokenfilter()). Our final step is to get the term frequency of each token in each tweet (step_tf()).
# now we create our "recipe" (i.e., feature engineering)
library(textrecipes)
my_rec <- recipe(handle ~ text, data = train_data) %>%
step_tokenize(text) %>%
step_tokenfilter(text, min_times = 50) %>%
step_tf(text) %>%
prep()
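To build some intuition for what step_tokenize() and step_tf() are doing, here is a minimal base-R sketch (just an illustration; it is not how textrecipes implements these steps):

```r
# one toy "tweet"
tweet <- "Make America great again great"

# tokenize: lowercase the text and split it on whitespace
tokens <- strsplit(tolower(tweet), "\\s+")[[1]]
tokens
## [1] "make"    "america" "great"   "again"   "great"

# term frequency: count how many times each token occurs
table(tokens)
## tokens
##   again america   great    make
##       1       1       2       1
```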
Now that we have our recipe ready, we can transform our training and testing data.
# apply pre processing to data
prepped_train_data <- juice(my_rec)
prepped_test_data <- bake(my_rec, test_data)
Take a look at both data sets.
head(prepped_train_data)
## # A tibble: 6 x 101
## handle tf_text_a tf_text_about tf_text_again tf_text_all tf_text_america
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hilla… 0 0 0 0 0
## 2 Hilla… 0 0 0 0 0
## 3 Hilla… 0 0 0 0 0
## 4 Hilla… 1 1 0 0 0
## 5 realD… 1 0 0 0 1
## 6 Hilla… 0 0 0 0 0
## # … with 95 more variables: tf_text_american <dbl>, tf_text_amp <dbl>,
## # tf_text_an <dbl>, tf_text_and <dbl>, tf_text_are <dbl>, tf_text_as <dbl>,
## # tf_text_at <dbl>, tf_text_back <dbl>, tf_text_be <dbl>, tf_text_been <dbl>,
## # tf_text_big <dbl>, tf_text_but <dbl>, tf_text_by <dbl>,
## # tf_text_campaign <dbl>, tf_text_can <dbl>, tf_text_clinton <dbl>,
## # tf_text_country <dbl>, tf_text_crooked <dbl>, tf_text_cruz <dbl>,
## # tf_text_do <dbl>, tf_text_donald <dbl>, tf_text_for <dbl>,
## # tf_text_from <dbl>, tf_text_get <dbl>, tf_text_going <dbl>,
## # tf_text_great <dbl>, tf_text_has <dbl>, tf_text_have <dbl>,
## # tf_text_he <dbl>, tf_text_her <dbl>, tf_text_hillary <dbl>,
## # tf_text_his <dbl>, tf_text_how <dbl>, tf_text_https <dbl>, tf_text_i <dbl>,
## # tf_text_if <dbl>, tf_text_in <dbl>, tf_text_is <dbl>, tf_text_it <dbl>,
## # tf_text_just <dbl>, tf_text_like <dbl>, tf_text_make <dbl>,
## # tf_text_makeamericagreatagain <dbl>, tf_text_many <dbl>, tf_text_me <dbl>,
## # tf_text_more <dbl>, tf_text_my <dbl>, tf_text_never <dbl>,
## # tf_text_new <dbl>, tf_text_no <dbl>, tf_text_not <dbl>, tf_text_now <dbl>,
## # tf_text_of <dbl>, tf_text_on <dbl>, tf_text_one <dbl>, tf_text_only <dbl>,
## # tf_text_or <dbl>, tf_text_our <dbl>, tf_text_out <dbl>,
## # tf_text_people <dbl>, tf_text_potus <dbl>, tf_text_president <dbl>,
## # tf_text_realdonaldtrump <dbl>, tf_text_she <dbl>, tf_text_should <dbl>,
## # tf_text_so <dbl>, tf_text_t.co <dbl>, tf_text_than <dbl>,
## # tf_text_thank <dbl>, tf_text_that <dbl>, tf_text_the <dbl>,
## # tf_text_their <dbl>, tf_text_them <dbl>, tf_text_they <dbl>,
## # tf_text_this <dbl>, tf_text_time <dbl>, tf_text_to <dbl>,
## # tf_text_today <dbl>, tf_text_trump <dbl>, `tf_text_trump's` <dbl>,
## # tf_text_trump2016 <dbl>, tf_text_up <dbl>, tf_text_us <dbl>,
## # tf_text_very <dbl>, tf_text_vote <dbl>, tf_text_was <dbl>,
## # tf_text_we <dbl>, tf_text_what <dbl>, tf_text_when <dbl>,
## # tf_text_who <dbl>, tf_text_will <dbl>, tf_text_with <dbl>,
## # tf_text_would <dbl>, tf_text_you <dbl>, tf_text_your <dbl>
Now we are ready to train our models.
We are training two models in this workshop, with two engines: logistic regression, using the glmnet (Lasso and Elastic-Net Regularized Generalized Linear Models) package, and random forest, using the randomForest (Breiman and Cutler’s Random Forests for Classification and Regression) package.
First we define a logistic regression model with the glmnet engine. The parameters mixture and penalty are hyperparameters (mixture = 0 corresponds to ridge regularization, while mixture = 1 would be lasso; penalty sets the amount of regularization), which need to be tuned for model optimization (not covered in this workshop).
# define model GLMNET
glmnet_model <- logistic_reg(mixture = 0, penalty = 0.1) %>%
set_engine("glmnet")
Then we train the model, defining our response variable (i.e., our two labels in handle) and our predictor variables (i.e., all of the other columns, hence the .). Our data is our prepared training data.
# train model
my_glm_model <- glmnet_model %>%
fit(handle ~ ., data = prepped_train_data)
It’s that simple. Let’s train our second model before moving on to evaluating and predicting with this first model.
Our process is similar for this model. We first define a random forest model with the randomForest engine in classification mode (our response variable is categorical). The number of trees is a hyperparameter.
# define model random forests
rndm_frst_model <- rand_forest(trees = 100, mode = "classification") %>%
set_engine("randomForest")
As before, we train the model, defining our response variable (handle) and our predictor variables (all of the other columns, hence the .), with our prepared training data.
# train model
my_rndm_frst_model <- rndm_frst_model %>%
fit(handle ~ ., data = prepped_train_data)
We have our two models trained.
For our logistic regression model, we can look at the coefficient of each feature (in this case, each individual token is a feature). I like the broom library because its tidy() function transforms models into nice (i.e., tidy) tables (i.e., tibbles). We will filter our model coefficients to get only those from the last step (i.e., step 100). Let’s also remove the intercept and clean up the feature names to keep only the tokens themselves.
library(broom)
my_coefs <- my_glm_model$fit %>%
tidy() %>%
filter(step == 100) %>%
filter(term != "(Intercept)") %>%
mutate(term = gsub('tf_text_', '', term))
We will visualize the top 20 features (i.e., tokens with highest absolute coefficients) for each label. Label reference is assigned by alphabetical order (HillaryClinton = 0, realDonaldTrump = 1). As such, positive coefficients are those features that increase the probability of a tweet being classified as authored by Trump.
# visualize coefs
my_coefs %>%
group_by(estimate > 0) %>%
top_n(20, abs(estimate)) %>%
ungroup() %>%
ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
geom_col() +
coord_flip() +
labs(
x = NULL,
title = "Features with highest (absolute) coefficients"
)
Some performance measures that can be extracted from your model are accuracy (is the predicted label the same as the truth?), precision (how many predicted positives were actually true positives?), and recall (how many true positives were actually detected as positives by the prediction?).
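As a sanity check on those definitions, here is a small base-R example computing the three measures by hand on made-up labels (treating “A” as the positive class):

```r
# made-up truth and predictions for six observations
truth <- c("A", "A", "A", "B", "B", "B")
pred  <- c("A", "A", "B", "B", "B", "A")

tp <- sum(pred == "A" & truth == "A")  # true positives
fp <- sum(pred == "A" & truth == "B")  # false positives
fn <- sum(pred == "B" & truth == "A")  # false negatives

mean(pred == truth)  # accuracy: 4 of 6 predictions match the truth
## [1] 0.6666667
tp / (tp + fp)       # precision: 2 of 3 predicted positives are true
## [1] 0.6666667
tp / (tp + fn)       # recall: 2 of 3 true positives were found
## [1] 0.6666667
```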
# evaluate model
my_glm_model %>%
predict(new_data = prepped_test_data) %>%
mutate(truth = prepped_test_data$handle) %>%
accuracy(truth, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.836
my_glm_model %>%
predict(new_data = prepped_test_data) %>%
mutate(truth = prepped_test_data$handle) %>%
precision(truth, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.814
my_glm_model %>%
predict(new_data = prepped_test_data) %>%
mutate(truth = prepped_test_data$handle) %>%
recall(truth, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.869
I also like to create a data frame with the truth, the prediction, and the probability for each label, for a more in-depth analysis of errors.
We start with predicting just the labels for the test data.
# create prediction labels for logistic regression model
handle_glm_prediction <- my_glm_model %>%
predict(new_data = prepped_test_data)
head(handle_glm_prediction)
## # A tibble: 6 x 1
## .pred_class
## <fct>
## 1 HillaryClinton
## 2 HillaryClinton
## 3 HillaryClinton
## 4 realDonaldTrump
## 5 HillaryClinton
## 6 HillaryClinton
We then predict probabilities instead.
# create probabilities for each label
probs_glm_prediction <- my_glm_model %>%
predict(new_data = prepped_test_data,
type = "prob")
head(probs_glm_prediction)
## # A tibble: 6 x 2
## .pred_HillaryClinton .pred_realDonaldTrump
## <dbl> <dbl>
## 1 0.762 0.238
## 2 0.916 0.0838
## 3 0.708 0.292
## 4 0.268 0.732
## 5 0.704 0.296
## 6 0.686 0.314
Finally, we combine the predicted labels and predicted probabilities with the original testing data frame.
# combine everything
model_predictions <- bind_cols(test_data,
handle_glm_prediction,
probs_glm_prediction) %>%
mutate(accurate = (handle == .pred_class))
head(model_predictions)
## # A tibble: 6 x 6
## handle text .pred_class .pred_HillaryCl… .pred_realDonal… accurate
## <chr> <chr> <fct> <dbl> <dbl> <lgl>
## 1 Hillary… "Last night, … HillaryCli… 0.762 0.238 TRUE
## 2 Hillary… "When Donald … HillaryCli… 0.916 0.0838 TRUE
## 3 Hillary… "3) Has Trump… HillaryCli… 0.708 0.292 TRUE
## 4 realDon… "Great aftern… realDonald… 0.268 0.732 TRUE
## 5 Hillary… "It's #Nation… HillaryCli… 0.704 0.296 TRUE
## 6 Hillary… "When you wor… HillaryCli… 0.686 0.314 TRUE
The process to evaluate the random forest model is the same for performance measures.
# evaluate model
my_rndm_frst_model %>%
predict(new_data = prepped_test_data) %>%
mutate(truth = prepped_test_data$handle) %>%
accuracy(truth, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.858
my_rndm_frst_model %>%
predict(new_data = prepped_test_data) %>%
mutate(truth = prepped_test_data$handle) %>%
precision(truth, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.832
my_rndm_frst_model %>%
predict(new_data = prepped_test_data) %>%
mutate(truth = prepped_test_data$handle) %>%
recall(truth, .pred_class)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.894
This model does not provide coefficients for each feature. Instead, it creates a number of decision trees (in our case, 100), each with a decision threshold at every node. If the answer to “does this feature surpass the threshold?” is no, you move to the daughter node on the left; if it is yes, you go to the daughter node on the right.
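The routing rule at a single node can be sketched in a couple of lines of base R (route() is a made-up helper for illustration; a real forest chains thousands of such decisions across its trees):

```r
# toy sketch of how one decision-tree node routes an observation
route <- function(term_frequency, split_point = 0.5) {
  if (term_frequency <= split_point) "left daughter" else "right daughter"
}

route(0)  # the tweet does not contain the token: go left
## [1] "left daughter"
route(2)  # the tweet contains the token twice: go right
## [1] "right daughter"
```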
We can get the total number of nodes in each of the 100 trees the model created.
# ndbigtree gives us the number of nodes in the tree
my_rndm_frst_model$fit$forest$ndbigtree
## [1] 1957 1809 1921 1903 1841 1887 1775 1887 1839 1775 1865 1949 1929 1875 1977
## [16] 1835 1731 1737 1835 1955 1787 1881 1823 1873 1881 1767 1729 1901 1717 1829
## [31] 1831 1785 1767 1891 1957 1965 1723 1941 1937 1943 1889 1703 1767 1861 1867
## [46] 1933 1891 1869 1791 1797 1869 1871 1911 1847 1841 1869 2017 1937 1821 1887
## [61] 1753 1829 1867 1815 1743 1881 1719 1791 1743 1955 1945 1819 1899 1919 1887
## [76] 1817 1763 1757 1755 1827 1865 1785 1865 1957 1903 1821 1779 1779 1881 1909
## [91] 1743 1981 1817 2001 1945 1847 1887 1957 1891 1955
Let’s take a look at the first 10 nodes in the last tree.
# check individual trees
randomForest::getTree(my_rndm_frst_model$fit, k = 100,
labelVar = TRUE) %>%
head(n = 10) %>%
mutate(`split var` = gsub("tf_text_", "", `split var`))
## left daughter right daughter split var split point status prediction
## 1 2 3 be 0.5 1 <NA>
## 2 4 5 donald 0.5 1 <NA>
## 3 6 7 i 0.5 1 <NA>
## 4 8 9 we 0.5 1 <NA>
## 5 10 11 he 0.5 1 <NA>
## 6 12 13 t.co 0.5 1 <NA>
## 7 14 15 at 0.5 1 <NA>
## 8 16 17 hillary 0.5 1 <NA>
## 9 18 19 all 0.5 1 <NA>
## 10 20 21 realdonaldtrump 0.5 1 <NA>
We can create a data frame of prediction information, like we did for the first model. We follow the same steps. First we create the label predictions.
# create prediction labels
handle_rndm_frst_prediction <- my_rndm_frst_model %>%
predict(new_data = prepped_test_data)
head(handle_rndm_frst_prediction)
## # A tibble: 6 x 1
## .pred_class
## <fct>
## 1 HillaryClinton
## 2 HillaryClinton
## 3 HillaryClinton
## 4 realDonaldTrump
## 5 HillaryClinton
## 6 HillaryClinton
Then the probability predictions.
# create probabilities for each label
probs_rndm_frst_prediction <- my_rndm_frst_model %>%
predict(new_data = prepped_test_data,
type = "prob")
head(probs_rndm_frst_prediction)
## # A tibble: 6 x 2
## .pred_HillaryClinton .pred_realDonaldTrump
## <dbl> <dbl>
## 1 0.91 0.09
## 2 0.87 0.13
## 3 0.72 0.28
## 4 0.09 0.91
## 5 0.88 0.12
## 6 0.88 0.12
Then we put everything together with the original test data frame.
# combine everything
model_predictions <- bind_cols(test_data,
handle_rndm_frst_prediction,
probs_rndm_frst_prediction) %>%
mutate(accurate = (handle == .pred_class))
head(model_predictions)
## # A tibble: 6 x 6
## handle text .pred_class .pred_HillaryCl… .pred_realDonal… accurate
## <chr> <chr> <fct> <dbl> <dbl> <lgl>
## 1 Hillary… "Last night, … HillaryCli… 0.91 0.09 TRUE
## 2 Hillary… "When Donald … HillaryCli… 0.87 0.13 TRUE
## 3 Hillary… "3) Has Trump… HillaryCli… 0.72 0.28 TRUE
## 4 realDon… "Great aftern… realDonald… 0.09 0.91 TRUE
## 5 Hillary… "It's #Nation… HillaryCli… 0.88 0.12 TRUE
## 6 Hillary… "When you wor… HillaryCli… 0.88 0.12 TRUE
That’s all I had planned for this workshop. To review, these are the basic steps we covered in today’s workshop:
1. Pre-Process Data: import, inspect and tidy data (skimr and tidyverse); transform it (textrecipes and tidymodels)
2. Train Models: train the models with the training data (tidyverse, glmnet, and randomForest)
3. Inspect and Evaluate Models: inspect the trained models, predict the test data, get accuracy rates (tidymodels)
With a modular workflow, you can now make changes to your code in step 1 and run all the other steps as they are.
Try replacing step_tf() with step_tfidf() or step_sequence_onehot().
Try adding step_stem() and/or step_stopwords().
Try adding step_ngram().
My name is Adriana Picoral and I’m an assistant professor of data science in the School of Information at the University of Arizona. I’m also the founder of the R-Ladies Tucson chapter.