If you are running your R code on your computer, you need to install both R and RStudio. Alternatively, you can create a free account at http://rstudio.cloud and run your R code in the cloud. Either way, we will be using the same IDE (i.e., RStudio).
What’s an IDE? IDE stands for integrated development environment, and its goal is to facilitate coding by integrating a text editor, a console and other tools into one window.
You can check which version of R (and which packages) you are running by typing
sessionInfo()
in your console.
How often should I update R and RStudio? Always make sure you have the latest version of R, RStudio, and the packages you’re using in your code, to ensure you are not running into bugs caused by older versions installed on your computer.
When asked about this, Jenny Bryan summarized the importance of keeping your system up to date: “You will always eventually have a reason that you must update. So you can either do that very infrequently, suffer with old versions in the middle, and experience great pain at update. Or admit that maintaining your system is a normal ongoing activity, and do it more often.”
You can ensure your packages are also up to date by clicking on “Tools” in the RStudio top menu bar and selecting “Check for Package Updates…”
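If you prefer the console, base R’s update.packages() does the same thing; a minimal sketch (ask = FALSE skips the per-package confirmation prompts):
# update all installed packages from the console
update.packages(ask = FALSE)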
We are using RStudio as our IDE for this workshop. You can either download and install R and RStudio on your computer (for instructions on how to do so, see the “Before we start” section) or create a free account at http://rstudio.cloud and run your R code in the cloud.
Please ensure you have the latest version of R and RStudio; otherwise, some packages we are using for this workshop will not install correctly.
When you open RStudio, here’s what you see:
Your console (i.e., where you run commands) will show up on the left. The > character in your console indicates it is ready to receive a command. Type your command after the > and press ENTER (a.k.a. RETURN).
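For example, type a simple expression at the prompt and press ENTER to run it:
# type this at the > prompt and press ENTER
1 + 1
## [1] 2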
We will start by installing the packages that we will use in today’s workshop. In your console, enter the following.
install.packages("rtweet")
install.packages("tidyverse")
install.packages("stringi")
install.packages("tidytext")
Here is the general workflow for this workshop:
Download tweets using the Twitter API with the rtweet package
Inspect and clean tweets
Extract emojis from tweets
Count tokens that co-occur with emojis using the tidytext package
Demonstrate annotation with the spacyr package
The first step is to load the rtweet library.
# load rtweet
library(rtweet)
In this workshop, we will download public users’ timelines. I chose two famous people with the most followers on Twitter. I also set n to 3,200 tweets, which is the maximum number of tweets you can download without a developer key. Once you run the get_timeline() function below, your browser should pop up an authentication request, so make sure you are logged in to your Twitter account.
# get timelines
tweets <- get_timeline(c("rihanna", "katyperry"), n = 3200)
If you were timed out or unable to download tweets, you can read in a file I prepared for this workshop so you can follow along with the remaining steps.
# alternate: read in prepared tweets
tweets <- readRDS("tweets.rds")
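If your own download did succeed and you want a snapshot you can reuse in later sessions, saveRDS() writes the data frame to disk in the same format this file uses:
# save your own copy of the tweets for later sessions
saveRDS(tweets, "tweets.rds")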
Load the tidyverse library.
# load tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.1
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x purrr::flatten() masks rtweet::flatten()
## x dplyr::lag() masks stats::lag()
Check how many tweets we retrieved per user.
# count tweets by user
tweets %>%
  count(screen_name)
## # A tibble: 2 x 2
## screen_name n
## <chr> <int>
## 1 katyperry 3190
## 2 rihanna 3176
Check the date range of the tweets.
# get min and max of dates tweets were created by users
tweets %>%
  group_by(screen_name) %>%
  summarise(begin = min(created_at),
            end = max(created_at))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## screen_name begin end
## <chr> <dttm> <dttm>
## 1 katyperry 2017-02-13 05:29:21 2020-09-25 00:58:42
## 2 rihanna 2013-02-11 11:29:45 2020-09-25 18:09:51
We are not interested in retweets, just original tweets, so we filter them out.
original_tweets <- tweets %>%
  filter(!is_retweet)
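As a quick sanity check, we can reuse count() to see how many original tweets remain per user after dropping retweets (your numbers will differ depending on when you downloaded the timelines):
# sanity check: original tweets remaining per user
original_tweets %>%
  count(screen_name)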
First, we select some columns from our tweet data so the data frame is more manageable.
original_tweets_selected <- original_tweets %>%
  select(status_id, screen_name, text, source,
         favorite_count, retweet_count)
Then we extract anything that is not ASCII from each tweet. We need to load the stringi library first.
library(stringi)
Then we use the stri_match_all() function to match anything that is not ASCII. That will generate a list of non-ASCII tokens.
original_tweets_selected$code <- stri_match_all(original_tweets_selected$text,
                                                regex = '[^[:ascii:]]')
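To see what this produces, here is a minimal example on a made-up string, where the heart is the only non-ASCII character; stri_match_all() returns a list with one matrix of matches per input string:
# each element of the result is a matrix of matches for one string
stri_match_all("I love R ❤", regex = '[^[:ascii:]]')
## [[1]]
##      [,1]
## [1,] "❤"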
We unnest the list of symbols, so we can match them with our emoji dictionary.
original_tweets_unnested <- original_tweets_selected %>%
  unnest(code)
Finally, we can combine our tweets with our emoji dictionary.
original_tweets_unnested <- left_join(original_tweets_unnested,
                                      emojis)
## Joining, by = "code"
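The emojis object here is the emoji dictionary data frame that loads with rtweet, mapping each emoji code point to a plain-text description (the code and description columns used below). Assuming the rtweet version used in this workshop, you can inspect it directly:
# preview the emoji dictionary that comes with rtweet
head(emojis)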
Any row with no description had no match with any emoji in our dictionary. We can clean up our data by removing these rows.
original_tweet_emojis <- original_tweets_unnested %>%
  filter(!is.na(description))
Now we can count symbols (i.e., emojis) per user.
# which emojis do users use the most?
original_tweet_emojis %>%
  filter(!is.na(description)) %>%
  group_by(screen_name, code) %>%
  summarise(n = n()) %>%
  group_by(screen_name) %>%
  top_n(5) %>%
  arrange(-n)
## `summarise()` regrouping output by 'screen_name' (override with `.groups` argument)
## Selecting by n
## # A tibble: 10 x 3
## # Groups: screen_name [2]
## screen_name code[,1] n
## <chr> <chr> <int>
## 1 katyperry ❤ 251
## 2 katyperry ❗ 174
## 3 katyperry 👁 147
## 4 katyperry ✨ 140
## 5 katyperry ♀ 126
## 6 rihanna 💪 35
## 7 rihanna 🎈 24
## 8 rihanna 🙏 24
## 9 rihanna ♀ 20
## 10 rihanna 💋 15
The problem is that one of the users uses many more emojis than the other. So we need to calculate the total emoji count per user in order to compute the percentage each emoji represents for that user.
# which emojis do users use the most?
original_tweet_emojis %>%
  filter(!is.na(description)) %>%
  group_by(screen_name, code) %>%
  summarise(n = n()) %>%
  mutate(total = sum(n),
         percent = n/total) %>%
  arrange(-percent)
## `summarise()` regrouping output by 'screen_name' (override with `.groups` argument)
## # A tibble: 586 x 5
## # Groups: screen_name [2]
## screen_name code[,1] n total percent
## <chr> <chr> <int> <int> <dbl>
## 1 rihanna 💪 35 419 0.0835
## 2 rihanna 🎈 24 419 0.0573
## 3 rihanna 🙏 24 419 0.0573
## 4 katyperry ❤ 251 4420 0.0568
## 5 rihanna ♀ 20 419 0.0477
## 6 katyperry ❗ 174 4420 0.0394
## 7 rihanna 💋 15 419 0.0358
## 8 rihanna 🙌 14 419 0.0334
## 9 katyperry 👁 147 4420 0.0333
## 10 katyperry ✨ 140 4420 0.0317
## # … with 576 more rows
We can also look at which words most often co-occur with emojis. Let’s load the tidytext library to look at word collocation.
library(tidytext)
We start by tokenizing our tweets.
# tokenize original tweets
tweet_emojis_tokenized <- original_tweet_emojis %>%
  unnest_tokens(word, text)
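unnest_tokens() splits the text column into one lowercase word per row, keeping the other columns. A quick way to inspect the result (the column selection here is just for readability):
# inspect the tokenized data: one row per word per tweet
tweet_emojis_tokenized %>%
  select(status_id, screen_name, code, word) %>%
  head()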
Then we can calculate which words co-occur with which emojis by counting code and word combinations per user.
# most common words with emojis
tweet_emojis_tokenized %>%
  count(screen_name, code, word) %>%
  arrange(-n)
## # A tibble: 53,345 x 4
## screen_name code[,1] word n
## <chr> <chr> <chr> <int>
## 1 katyperry ❤ https 191
## 2 katyperry ❤ t.co 191
## 3 katyperry ❗ https 185
## 4 katyperry ❗ t.co 185
## 5 katyperry ❤ you 172
## 6 katyperry 👁 https 171
## 7 katyperry 👁 t.co 171
## 8 katyperry ✨ https 132
## 9 katyperry ✨ t.co 132
## 10 katyperry ❤ the 131
## # … with 53,335 more rows
The most common tokens are not words, so we have some cleaning up to do. We create a list of tokens to remove, adding stopwords to the list as well.
# clean up words
to_remove <- c("https", "t.co", "amp",
               stopwords::stopwords())
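If you are curious what stopwords::stopwords() returns, it defaults to the English Snowball list; a quick peek:
# peek at the default (English) stopword list
head(stopwords::stopwords())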
To filter things out, you can do an anti_join(), but I find negating %in% more straightforward.
# create a "not in" function
`%notin%` <- Negate(`%in%`)
# filter out any words that are in our list of tokens to remove
tweets_tokenized_clean <- tweet_emojis_tokenized %>%
  filter(word %notin% to_remove)
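For reference, here is what the anti_join() version would look like, assuming we first wrap the removal list in a one-column data frame so the join has a word column to match on:
# equivalent clean-up with anti_join()
tweets_tokenized_clean <- tweet_emojis_tokenized %>%
  anti_join(tibble(word = to_remove), by = "word")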
With our clean data set, we can count words that co-occur with emojis.
# most common words with emojis
tweets_tokenized_clean %>%
  count(screen_name, code, word) %>%
  arrange(-n)
## # A tibble: 40,818 x 4
## screen_name code[,1] word n
## <chr> <chr> <chr> <int>
## 1 katyperry ❤ americanidol 64
## 2 katyperry 👏 americanidol 55
## 3 katyperry ♀ americanidol 47
## 4 katyperry ❤ love 46
## 5 katyperry ❗ witnessthetour 44
## 6 katyperry 👁 witness 40
## 7 katyperry ✨ americanidol 36
## 8 katyperry 👁 witnessthetour 36
## 9 katyperry 👏 hey 33
## 10 katyperry 👁 now 31
## # … with 40,808 more rows
Let’s plot these results.
# plot
tweets_tokenized_clean %>%
  count(screen_name, code, description, word) %>%
  group_by(screen_name) %>%
  arrange(-n) %>%
  top_n(20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_label(aes(label = description),
             position = "jitter",
             size = 2) +
  coord_flip() +
  facet_wrap(~screen_name, scales = "free") +
  xlab("")
## Selecting by n
For this part of the workshop, you need Python installed on your computer. You then need to install spaCy (e.g., sudo pip install -U spacy) and download the model for tagging (e.g., python -m spacy download en_core_web_sm).
Once you have your computer set up, you can install the spacyr library.
install.packages("spacyr")
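If you would rather not manage the Python side yourself, spacyr can also set up its own self-contained spaCy installation (it creates a dedicated conda environment and downloads the English model):
# alternative: let spacyr install spaCy for you
spacyr::spacy_install()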
Then, we need to load spacyr.
library(spacyr)
##
## Attaching package: 'spacyr'
## The following object is masked from 'package:rtweet':
##
## get_tokens
Once spacyr has successfully found your Python and spaCy installation, you can annotate the tweets.
annotated_text <- spacy_parse(original_tweet_emojis$text,
                              dependency = FALSE)
## Finding a python executable with spaCy installed...
## spaCy (language model: en_core_web_sm) is installed in /usr/local/bin/python3
## successfully initialized (spaCy Version: 2.3.2, language model: en_core_web_sm)
## (python options: type = "python_executable", value = "/usr/local/bin/python3")
To combine our annotation with our original data frame, we need to add a doc_id column to the original data frame, matching the text1, text2, … naming that spacy_parse() gives its documents.
# add doc_id to original dataframe
original_tweet_emojis$doc_id <- paste0("text", c(1:nrow(original_tweet_emojis)))
# combine data frames
annotated_text <- left_join(annotated_text,
                            original_tweet_emojis)
## Joining, by = "doc_id"
We can now count words by their part of speech, such as VERB.
# most common verbs that co-occur with emojis
# emojis are often tagged as verbs
# keep tokens that contain at least one ASCII character (i.e., words);
# note that [:ascii:] only works inside brackets, with perl = TRUE
annotated_text %>%
  filter(pos == "VERB") %>%
  filter(grepl('[[:ascii:]]', token, perl = TRUE)) %>%
  count(code, lemma) %>%
  arrange(-n) %>%
  head(10)
## code lemma n
## 1 ❤ can 31
## 2 \U0001f481 can 21
## 3 ♀ can 20
## 4 ❤ see 20
## 5 ❤ americanidol 20
## 6 ❤ come 18
## 7 ♥ ’ 17
## 8 \u2728 see 17
## 9 \U0001f441 can 16
## 10 ♀ make 15
Or adjectives.
# most common adjectives that co-occur with emojis
# keep tokens that contain at least one ASCII character (i.e., words)
annotated_text %>%
  filter(pos == "ADJ") %>%
  filter(grepl('[[:ascii:]]', token, perl = TRUE)) %>%
  count(code, lemma) %>%
  arrange(-n) %>%
  head(10)
## code lemma n
## 1 ❤ big 12
## 2 \U0001f33c first 9
## 3 ❤ beautiful 8
## 4 \U0001f447 https://t.co/twric5zqb2 8
## 5 \U0001f6a8 available 8
## 6 \u2728 little 6
## 7 ❤ grateful 6
## 8 ❤ good 6
## 9 ❤ available 6
## 10 \U0001f3b6 right 6
We can use the lead() function to build bigrams. Note that lead() simply pairs each token with the next row’s token, so a handful of bigrams will span tweet boundaries.
# get bigrams
annotated_text$bigram <- paste(annotated_text$token, lead(annotated_text$token))
And look at bigrams starting with an adjective that co-occur with emojis.
# most common bigrams starting with an adjective
# that co-occur with emojis
annotated_text %>%
  filter(pos == "ADJ") %>%
  filter(grepl('[[:ascii:]]', token, perl = TRUE)) %>%
  count(code, bigram) %>%
  arrange(-n) %>%
  head(10)
## code bigram n
## 1 \U0001f447 https://t.co/twRIC5zQB2 JOIN 7
## 2 \u2b07 awake this 5
## 3 \u2b07 crucial day 5
## 4 \U0001f44f @lizzo and 5
## 5 \U0001f481 fam and 5
## 6 ✔ Incredible voice 4
## 7 ✔ cool dude 4
## 8 ❤ available now 4
## 9 ❤ same time 4
## 10 \U0001f39f newest \U0001f1fa 4
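Once we are done annotating, we can shut down the background spaCy process that spacyr started:
# close the spaCy connection when finished
spacy_finalize()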