Module 6 Intro to Tidyverse

6.1 Before Module #6

Read Advice to Young (and Old) Programmers: A Conversation with Hadley Wickham by Philip Waggoner (2,599 words, 10 minutes)

6.2 What are R Packages?

An R package contains functions, and it might contain data. There are a lot of R packages out there (check the Comprehensive R Archive Network, i.e., CRAN, for a full list). That is one of the beautiful things about R: anyone can create an R package to share their code.

6.3 Installing Packages

The function to install packages in R is install.packages(). In this course we will be working extensively with tidyverse, a collection of R packages designed for data science.

Open your RStudio. In your console, enter the following to install tidyverse (this may take a while).

install.packages("tidyverse")

You need to install a package only once (but remember to check for new versions and to keep your packages updated). However, with every new R session, you need to load the packages you are going to use with the library() function (strictly speaking, a library is the folder on your computer where installed packages are stored).

library(tidyverse)

Note that when calling the install.packages() function you need to enter the package name between quotation marks (e.g., "tidyverse"). When you call the library() function, you don't need quotation marks (e.g., tidyverse).
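Because installation only has to happen once, a script can check whether a package is already present before installing it. The sketch below is one possible way to do that. Note that install_if_missing() is a hypothetical helper name, not part of R or tidyverse; it relies on base R's requireNamespace().

```r
# Hypothetical helper (not part of base R or tidyverse):
# install a package only when it is not already installed.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing.
install_if_missing("stats")
```

In this course you could call install_if_missing("tidyverse") once and then run library(tidyverse) at the start of each session.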

6.4 Before You Load your Data

Although we are working within an R project, which sets the working directory automatically for you, it’s good practice to check what folder you are working from by calling the getwd() function.

getwd()
## [1] "/Users/adriana/Desktop/ESOC214/Spring 2021/ESOC_214_Spring_2021"

You can list the contents of your working directory by using the dir() function.

dir()

We are going to create a data folder in our project, to keep things organized. Today we will be working with data on COVID-19 World Vaccination Progress. I cleaned up this data set already (no need for data tidying for now).

You can now list the contents of your data folder by passing the folder name, as a string, to the dir() function.

dir("data")
##  [1] "clean_beer_awards.csv"                
##  [2] "country_vaccinations.csv"             
##  [3] "elnino.csv"                           
##  [4] "GlobalLandTemperaturesByCountry.csv"  
##  [5] "GlobalLandTemperaturesByMajorCity.csv"
##  [6] "groundhog_day.csv"                    
##  [7] "nfl_salary.xlsx"                      
##  [8] "olympic_history_athlete_events.csv"   
##  [9] "olympic_history_noc_regions.csv"      
## [10] "passwords.csv"                        
## [11] "president_county_candidate.csv"       
## [12] "spotify_songs_clean.csv"              
## [13] "spotify_songs.csv"                    
## [14] "tweets.tsv"                           
## [15] "us_avg_tuition.xlsx"                  
## [16] "women_in_labor_force.csv"
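If the data folder does not exist yet, it can also be created from the console. The following is a minimal sketch that works inside a temporary folder so it does not touch your real project. The project_dir and data_dir names are placeholders; in your own project you would most likely just run dir.create("data").

```r
# Stand-in for your project folder (a temporary directory).
project_dir <- tempfile("project_")
dir.create(project_dir)

# Create the data folder only if it is not there yet.
data_dir <- file.path(project_dir, "data")
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}

# Add an empty placeholder file, then list the folder contents.
file.create(file.path(data_dir, "country_vaccinations.csv"))
dir(data_dir)
```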

6.5 What’s our question again?

The Kaggle page on COVID-19 World Vaccination Progress lists the following questions:

  • Which country is using what vaccine?
  • In which country the vaccination program is more advanced?
  • Which country has vaccinated more people per day? (in terms of per hundred)

6.6 Load Data with Tidyverse

We will use the read_csv() function from the readr package (which is part of tidyverse) to read the data in. Be careful: base R has a similarly named function, read.csv(). We want the one with the underscore (i.e., read_csv()).

country_vaccinations <- read_csv("data/country_vaccinations.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   iso_code = col_character(),
##   date = col_date(format = ""),
##   total_vaccinations = col_double(),
##   people_vaccinated = col_double(),
##   people_fully_vaccinated = col_double(),
##   daily_vaccinations_raw = col_double(),
##   daily_vaccinations = col_double(),
##   total_vaccinations_per_hundred = col_double(),
##   people_vaccinated_per_hundred = col_double(),
##   people_fully_vaccinated_per_hundred = col_double(),
##   daily_vaccinations_per_million = col_double(),
##   vaccines = col_character(),
##   source_name = col_character(),
##   source_website = col_character()
## )
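The column specification printed above is readr's guess at the type of each column. As a hedged sketch of how the two readers differ, and of how a specification could be stated explicitly, the example below writes a tiny made-up CSV to a temporary file (the three rows are invented for illustration, not taken from the real data set) and reads it both ways.

```r
library(readr)

# A tiny made-up CSV, written to a temporary file for illustration.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("country,date,total_vaccinations",
             "Argentina,2020-12-29,700",
             "Argentina,2020-12-30,",
             "Austria,2021-01-08,30000"),
           toy_path)

# read_csv() returns a tibble; col_types states the types explicitly
# instead of letting readr guess them.
toy_tidy <- read_csv(
  toy_path,
  col_types = cols(
    country = col_character(),
    date = col_date(format = ""),
    total_vaccinations = col_double()
  )
)

# read.csv() from base R returns a plain data frame.
toy_base <- read.csv(toy_path)

class(toy_tidy$date)   # the date column is parsed as a Date
class(toy_base)        # a plain data.frame, not a tibble
```

The empty field in the second row is read as NA, which is how missing values show up in the real data as well.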

CHALLENGE

Reading messages and warnings: R often prints messages and warnings in red (these are not always errors). What information did you get when loading your data?

6.7 Inspect Your Data

As with any other programming language, there are multiple ways of doing anything. As such, there are multiple ways of inspecting your data in R. Here are some of my favorite ways of inspecting my data:

# get an overview of the data frame
glimpse(country_vaccinations)
## Rows: 1,816
## Columns: 15
## $ country                             <chr> "Argentina", "Argentina", "Argenti…
## $ iso_code                            <chr> "ARG", "ARG", "ARG", "ARG", "ARG",…
## $ date                                <date> 2020-12-29, 2020-12-30, 2020-12-3…
## $ total_vaccinations                  <dbl> 700, NA, 32013, NA, NA, NA, 39599,…
## $ people_vaccinated                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ people_fully_vaccinated             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ daily_vaccinations_raw              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ daily_vaccinations                  <dbl> NA, 15656, 15656, 11070, 8776, 740…
## $ total_vaccinations_per_hundred      <dbl> 0.00, NA, 0.07, NA, NA, NA, 0.09, …
## $ people_vaccinated_per_hundred       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ people_fully_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ daily_vaccinations_per_million      <dbl> NA, 346, 346, 245, 194, 164, 143, …
## $ vaccines                            <chr> "Sputnik V", "Sputnik V", "Sputnik…
## $ source_name                         <chr> "Ministry of Health", "Ministry of…
## $ source_website                      <chr> "http://datos.salud.gob.ar/dataset…
summary(country_vaccinations)
##    country            iso_code              date            total_vaccinations
##  Length:1816        Length:1816        Min.   :2020-12-13   Min.   :       0  
##  Class :character   Class :character   1st Qu.:2021-01-05   1st Qu.:   19315  
##  Mode  :character   Mode  :character   Median :2021-01-14   Median :   92706  
##                                        Mean   :2021-01-12   Mean   :  831690  
##                                        3rd Qu.:2021-01-22   3rd Qu.:  422864  
##                                        Max.   :2021-01-30   Max.   :29577902  
##                                                             NA's   :595       
##  people_vaccinated  people_fully_vaccinated daily_vaccinations_raw
##  Min.   :       0   Min.   :      2         Min.   :      0       
##  1st Qu.:   24773   1st Qu.:   3317         1st Qu.:   1802       
##  Median :  112986   Median :  11670         Median :   8923       
##  Mean   :  819164   Mean   : 199311         Mean   :  59342       
##  3rd Qu.:  487814   3rd Qu.: 107978         3rd Qu.:  44013       
##  Max.   :24064165   Max.   :5259693         Max.   :1693241       
##  NA's   :868        NA's   :1348            NA's   :814           
##  daily_vaccinations total_vaccinations_per_hundred
##  Min.   :      1    Min.   : 0.000                
##  1st Qu.:   1510    1st Qu.: 0.330                
##  Median :   5776    Median : 1.120                
##  Mean   :  49187    Mean   : 3.416                
##  3rd Qu.:  27360    3rd Qu.: 2.850                
##  Max.   :1291416    Max.   :54.690                
##  NA's   :68         NA's   :595                   
##  people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
##  Min.   : 0.000                Min.   : 0.0000                    
##  1st Qu.: 0.360                1st Qu.: 0.0400                    
##  Median : 1.350                Median : 0.1500                    
##  Mean   : 3.482                Mean   : 0.7789                    
##  3rd Qu.: 3.033                3rd Qu.: 0.6725                    
##  Max.   :38.190                Max.   :19.9700                    
##  NA's   :868                   NA's   :1348                       
##  daily_vaccinations_per_million   vaccines         source_name       
##  Min.   :    0.0                Length:1816        Length:1816       
##  1st Qu.:  287.0                Class :character   Class :character  
##  Median :  747.5                Mode  :character   Mode  :character  
##  Mean   : 1738.1                                                     
##  3rd Qu.: 1354.2                                                     
##  Max.   :30869.0                                                     
##  NA's   :68                                                          
##  source_website    
##  Length:1816       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
# get variable names
colnames(country_vaccinations)
##  [1] "country"                             "iso_code"                           
##  [3] "date"                                "total_vaccinations"                 
##  [5] "people_vaccinated"                   "people_fully_vaccinated"            
##  [7] "daily_vaccinations_raw"              "daily_vaccinations"                 
##  [9] "total_vaccinations_per_hundred"      "people_vaccinated_per_hundred"      
## [11] "people_fully_vaccinated_per_hundred" "daily_vaccinations_per_million"     
## [13] "vaccines"                            "source_name"                        
## [15] "source_website"
names(country_vaccinations)
##  [1] "country"                             "iso_code"                           
##  [3] "date"                                "total_vaccinations"                 
##  [5] "people_vaccinated"                   "people_fully_vaccinated"            
##  [7] "daily_vaccinations_raw"              "daily_vaccinations"                 
##  [9] "total_vaccinations_per_hundred"      "people_vaccinated_per_hundred"      
## [11] "people_fully_vaccinated_per_hundred" "daily_vaccinations_per_million"     
## [13] "vaccines"                            "source_name"                        
## [15] "source_website"
# check how many countries (categorical variable)
unique(country_vaccinations$country)
##  [1] "Argentina"            "Austria"              "Bahrain"             
##  [4] "Belgium"              "Bermuda"              "Brazil"              
##  [7] "Bulgaria"             "Canada"               "Chile"               
## [10] "China"                "Costa Rica"           "Croatia"             
## [13] "Cyprus"               "Czechia"              "Denmark"             
## [16] "Ecuador"              "England"              "Estonia"             
## [19] "Finland"              "France"               "Germany"             
## [22] "Gibraltar"            "Greece"               "Hungary"             
## [25] "Iceland"              "India"                "Indonesia"           
## [28] "Ireland"              "Isle of Man"          "Israel"              
## [31] "Italy"                "Kuwait"               "Latvia"              
## [34] "Lithuania"            "Luxembourg"           "Malta"               
## [37] "Mexico"               "Myanmar"              "Netherlands"         
## [40] "Northern Cyprus"      "Northern Ireland"     "Norway"              
## [43] "Oman"                 "Panama"               "Poland"              
## [46] "Portugal"             "Romania"              "Russia"              
## [49] "Saudi Arabia"         "Scotland"             "Serbia"              
## [52] "Seychelles"           "Singapore"            "Slovakia"            
## [55] "Slovenia"             "Spain"                "Sri Lanka"           
## [58] "Sweden"               "Switzerland"          "Turkey"              
## [61] "United Arab Emirates" "United Kingdom"       "United States"       
## [64] "Wales"
# check vaccines (categorical variable)
unique(country_vaccinations$vaccines)
##  [1] "Sputnik V"                            
##  [2] "Pfizer/BioNTech"                      
##  [3] "Pfizer/BioNTech, Sinopharm"           
##  [4] "Moderna, Pfizer/BioNTech"             
##  [5] "Oxford/AstraZeneca, Sinovac"          
##  [6] "CNBG, Sinovac"                        
##  [7] "Oxford/AstraZeneca, Pfizer/BioNTech"  
##  [8] "Covaxin, Oxford/AstraZeneca"          
##  [9] "Sinovac"                              
## [10] "Oxford/AstraZeneca"                   
## [11] "Pfizer/BioNTech, Sinovac"             
## [12] "Pfizer/BioNTech, Sinopharm, Sputnik V"
## [13] "Oxford/AstraZeneca, Sinopharm"

CHALLENGE

Which variables are numeric? Which are categorical?

daily_vaccinations_raw: daily change in the total number of doses administered. It is only calculated for consecutive days. This is a raw measure provided for data checks and transparency, but we strongly recommend that any analysis on daily vaccination rates be conducted using daily_vaccinations instead.

There might be inconsistencies between the daily and total figures (and not only for Romania). The data set is built by the main aggregator from data collected from national agencies, so the figures collected per day may be subsequently corrected (from alternative sources) when the total is calculated, or the other way around. In any case, I will refine my cleaning.

6.8 The Pipe

We will be using the package dplyr (which is also part of tidyverse) to do an exploratory analysis of our data.

The operator you will use most with dplyr is %>% (called the pipe). It comes from the magrittr package and is available once tidyverse is loaded. The pipe allows you to “pipe” (or redirect) objects into functions. (Hint: in RStudio, use ctrl+shift+m or cmd+shift+m as a shortcut for typing %>%.)

Here’s how to pipe the country_vaccinations object into the summary() function:

# get an overview of the data frame
country_vaccinations %>% 
  summary()
##    country            iso_code              date            total_vaccinations
##  Length:1816        Length:1816        Min.   :2020-12-13   Min.   :       0  
##  Class :character   Class :character   1st Qu.:2021-01-05   1st Qu.:   19315  
##  Mode  :character   Mode  :character   Median :2021-01-14   Median :   92706  
##                                        Mean   :2021-01-12   Mean   :  831690  
##                                        3rd Qu.:2021-01-22   3rd Qu.:  422864  
##                                        Max.   :2021-01-30   Max.   :29577902  
##                                                             NA's   :595       
##  people_vaccinated  people_fully_vaccinated daily_vaccinations_raw
##  Min.   :       0   Min.   :      2         Min.   :      0       
##  1st Qu.:   24773   1st Qu.:   3317         1st Qu.:   1802       
##  Median :  112986   Median :  11670         Median :   8923       
##  Mean   :  819164   Mean   : 199311         Mean   :  59342       
##  3rd Qu.:  487814   3rd Qu.: 107978         3rd Qu.:  44013       
##  Max.   :24064165   Max.   :5259693         Max.   :1693241       
##  NA's   :868        NA's   :1348            NA's   :814           
##  daily_vaccinations total_vaccinations_per_hundred
##  Min.   :      1    Min.   : 0.000                
##  1st Qu.:   1510    1st Qu.: 0.330                
##  Median :   5776    Median : 1.120                
##  Mean   :  49187    Mean   : 3.416                
##  3rd Qu.:  27360    3rd Qu.: 2.850                
##  Max.   :1291416    Max.   :54.690                
##  NA's   :68         NA's   :595                   
##  people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
##  Min.   : 0.000                Min.   : 0.0000                    
##  1st Qu.: 0.360                1st Qu.: 0.0400                    
##  Median : 1.350                Median : 0.1500                    
##  Mean   : 3.482                Mean   : 0.7789                    
##  3rd Qu.: 3.033                3rd Qu.: 0.6725                    
##  Max.   :38.190                Max.   :19.9700                    
##  NA's   :868                   NA's   :1348                       
##  daily_vaccinations_per_million   vaccines         source_name       
##  Min.   :    0.0                Length:1816        Length:1816       
##  1st Qu.:  287.0                Class :character   Class :character  
##  Median :  747.5                Mode  :character   Mode  :character  
##  Mean   : 1738.1                                                     
##  3rd Qu.: 1354.2                                                     
##  Max.   :30869.0                                                     
##  NA's   :68                                                          
##  source_website    
##  Length:1816       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The pipe also allows us to chain several functions: the output of each step becomes the input of the next.
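To make that concrete, a piped expression is just another way of writing a nested function call. The sketch below uses a small made-up numeric vector (doses) rather than the vaccination data.

```r
library(dplyr)  # makes %>% available

doses <- c(4, 9, 16)  # made-up numbers for illustration

# Nested call: read from the inside out.
nested <- sum(sqrt(doses))

# Piped call: read from left to right, top to bottom.
piped <- doses %>%
  sqrt() %>%
  sum()

# Both spellings compute the same value.
identical(nested, piped)
```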

Let’s start by selecting one column in our data.

country_vaccinations %>% 
  select(vaccines)
## # A tibble: 1,816 x 1
##    vaccines 
##    <chr>    
##  1 Sputnik V
##  2 Sputnik V
##  3 Sputnik V
##  4 Sputnik V
##  5 Sputnik V
##  6 Sputnik V
##  7 Sputnik V
##  8 Sputnik V
##  9 Sputnik V
## 10 Sputnik V
## # … with 1,806 more rows

Now let’s add another pipe to get unique values in this column.

country_vaccinations %>% 
  select(vaccines) %>%
  unique()
## # A tibble: 13 x 1
##    vaccines                             
##    <chr>                                
##  1 Sputnik V                            
##  2 Pfizer/BioNTech                      
##  3 Pfizer/BioNTech, Sinopharm           
##  4 Moderna, Pfizer/BioNTech             
##  5 Oxford/AstraZeneca, Sinovac          
##  6 CNBG, Sinovac                        
##  7 Oxford/AstraZeneca, Pfizer/BioNTech  
##  8 Covaxin, Oxford/AstraZeneca          
##  9 Sinovac                              
## 10 Oxford/AstraZeneca                   
## 11 Pfizer/BioNTech, Sinovac             
## 12 Pfizer/BioNTech, Sinopharm, Sputnik V
## 13 Oxford/AstraZeneca, Sinopharm
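dplyr also has its own function for this job, distinct(), which keeps the unique rows for the columns you name. Here is a minimal sketch on a made-up tibble (the toy object and its rows are invented for illustration):

```r
library(dplyr)

toy <- tibble(
  country = c("Argentina", "Argentina", "Austria", "Bahrain"),
  vaccines = c("Sputnik V", "Sputnik V", "Pfizer/BioNTech",
               "Pfizer/BioNTech, Sinopharm")
)

# select() followed by unique() ...
with_unique <- toy %>%
  select(vaccines) %>%
  unique()

# ... and distinct() give the same unique values.
with_distinct <- toy %>%
  distinct(vaccines)

nrow(with_distinct)
```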

6.9 Counting Categorical Variables

One of the functions I use most when exploring my data is count(), which you can combine with %>%.

country_vaccinations %>% 
  count(vaccines)
## # A tibble: 13 x 2
##    vaccines                                  n
##    <chr>                                 <int>
##  1 CNBG, Sinovac                            44
##  2 Covaxin, Oxford/AstraZeneca              16
##  3 Moderna, Pfizer/BioNTech                429
##  4 Oxford/AstraZeneca                        5
##  5 Oxford/AstraZeneca, Pfizer/BioNTech     226
##  6 Oxford/AstraZeneca, Sinopharm            21
##  7 Oxford/AstraZeneca, Sinovac              15
##  8 Pfizer/BioNTech                         864
##  9 Pfizer/BioNTech, Sinopharm               65
## 10 Pfizer/BioNTech, Sinopharm, Sputnik V    22
## 11 Pfizer/BioNTech, Sinovac                  9
## 12 Sinovac                                  37
## 13 Sputnik V                                63

You can do the same by adding group_by() to your pipeline.

country_vaccinations %>% 
  group_by(vaccines) %>%
  count()
## # A tibble: 13 x 2
## # Groups:   vaccines [13]
##    vaccines                                  n
##    <chr>                                 <int>
##  1 CNBG, Sinovac                            44
##  2 Covaxin, Oxford/AstraZeneca              16
##  3 Moderna, Pfizer/BioNTech                429
##  4 Oxford/AstraZeneca                        5
##  5 Oxford/AstraZeneca, Pfizer/BioNTech     226
##  6 Oxford/AstraZeneca, Sinopharm            21
##  7 Oxford/AstraZeneca, Sinovac              15
##  8 Pfizer/BioNTech                         864
##  9 Pfizer/BioNTech, Sinopharm               65
## 10 Pfizer/BioNTech, Sinopharm, Sputnik V    22
## 11 Pfizer/BioNTech, Sinovac                  9
## 12 Sinovac                                  37
## 13 Sputnik V                                63

And instead of count() we can use the summarise() and n() functions.

country_vaccinations %>% 
  group_by(vaccines) %>%
  summarise(total = n())
## # A tibble: 13 x 2
##    vaccines                              total
##    <chr>                                 <int>
##  1 CNBG, Sinovac                            44
##  2 Covaxin, Oxford/AstraZeneca              16
##  3 Moderna, Pfizer/BioNTech                429
##  4 Oxford/AstraZeneca                        5
##  5 Oxford/AstraZeneca, Pfizer/BioNTech     226
##  6 Oxford/AstraZeneca, Sinopharm            21
##  7 Oxford/AstraZeneca, Sinovac              15
##  8 Pfizer/BioNTech                         864
##  9 Pfizer/BioNTech, Sinopharm               65
## 10 Pfizer/BioNTech, Sinopharm, Sputnik V    22
## 11 Pfizer/BioNTech, Sinovac                  9
## 12 Sinovac                                  37
## 13 Sputnik V                                63

CHALLENGE

This last way of counting categorical variables (with summarise() and n()) outputs a data frame that is slightly different from the previous two. What’s the difference?

6.10 Arrange

Tables are easier to read when they are arranged in some logical order. In the case of counts, we usually arrange by the count itself (e.g., n or total).

country_vaccinations %>% 
  group_by(vaccines) %>%
  summarise(total = n()) %>%
  arrange(total)
## # A tibble: 13 x 2
##    vaccines                              total
##    <chr>                                 <int>
##  1 Oxford/AstraZeneca                        5
##  2 Pfizer/BioNTech, Sinovac                  9
##  3 Oxford/AstraZeneca, Sinovac              15
##  4 Covaxin, Oxford/AstraZeneca              16
##  5 Oxford/AstraZeneca, Sinopharm            21
##  6 Pfizer/BioNTech, Sinopharm, Sputnik V    22
##  7 Sinovac                                  37
##  8 CNBG, Sinovac                            44
##  9 Sputnik V                                63
## 10 Pfizer/BioNTech, Sinopharm               65
## 11 Oxford/AstraZeneca, Pfizer/BioNTech     226
## 12 Moderna, Pfizer/BioNTech                429
## 13 Pfizer/BioNTech                         864

The default order for arrange() is ascending. For a numeric column, we can reverse that by adding a minus sign (i.e., -) in front of the variable in arrange().

country_vaccinations %>% 
  group_by(vaccines) %>%
  summarise(total = n()) %>%
  arrange(-total)
## # A tibble: 13 x 2
##    vaccines                              total
##    <chr>                                 <int>
##  1 Pfizer/BioNTech                         864
##  2 Moderna, Pfizer/BioNTech                429
##  3 Oxford/AstraZeneca, Pfizer/BioNTech     226
##  4 Pfizer/BioNTech, Sinopharm               65
##  5 Sputnik V                                63
##  6 CNBG, Sinovac                            44
##  7 Sinovac                                  37
##  8 Pfizer/BioNTech, Sinopharm, Sputnik V    22
##  9 Oxford/AstraZeneca, Sinopharm            21
## 10 Covaxin, Oxford/AstraZeneca              16
## 11 Oxford/AstraZeneca, Sinovac              15
## 12 Pfizer/BioNTech, Sinovac                  9
## 13 Oxford/AstraZeneca                        5
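Two other spellings are worth knowing. desc() is dplyr's helper for descending order and, unlike the minus sign, it also works on text columns. count() has a sort argument that orders the counts from largest to smallest. A sketch on made-up data (the toy tibble is invented for illustration):

```r
library(dplyr)

toy <- tibble(
  vaccines = c("Sputnik V", "Pfizer/BioNTech", "Pfizer/BioNTech",
               "Sinovac", "Pfizer/BioNTech", "Sinovac")
)

# Option 1: summarise, then arrange in descending order with desc().
by_desc <- toy %>%
  group_by(vaccines) %>%
  summarise(total = n()) %>%
  arrange(desc(total))

# Option 2: count() with sort = TRUE does the same in one step
# (the count column is called n instead of total).
by_count <- toy %>%
  count(vaccines, sort = TRUE)

by_count
```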

6.11 group_by + summarise

The combination of the group_by() and summarise() functions is very powerful. In addition to using the n() function to count how many rows there are for each category in our categorical variable, we can use other functions, such as sum() and mean(), with numeric (i.e., quantitative) variables.
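As a hedged illustration of the pattern, the sketch below summarises a small made-up tibble (the toy object, its values, and the output column names are invented for this example). Note that na.rm = TRUE goes inside the summary function, so missing values are dropped before the sum or mean is computed.

```r
library(dplyr)

toy <- tibble(
  country = c("Argentina", "Argentina", "Argentina",
              "Austria", "Austria"),
  daily_vaccinations = c(NA, 100, 300, 50, 150)
)

toy_summary <- toy %>%
  group_by(country) %>%
  summarise(
    total_days = n(),                                    # rows per country
    doses_sum = sum(daily_vaccinations, na.rm = TRUE),   # NA is ignored
    doses_mean = mean(daily_vaccinations, na.rm = TRUE)  # NA is ignored
  )

toy_summary
```

Without na.rm = TRUE, the sum and mean for Argentina would both be NA, because one of its rows has a missing value.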

CHALLENGE

Take a moment to revisit the question we want to answer.

  • What do we want to find out?

  • How can we answer our question with this data?

  • What function (e.g., sum(), max(), mean()) do we use to answer our question? With what variables/columns?

Complete the code below.

country_vaccinations %>% 
  group_by(country) %>%
  summarise(total_days = n(),
            total_per_hundred = ____(____, na.rm = TRUE))

Example of output that you might want to get to answer our question:

## # A tibble: 64 x 3
##    country   total_days total_per_hundred
##    <chr>          <int>             <dbl>
##  1 Argentina         33              0.81
##  2 Austria           21              2.19
##  3 Bahrain           39             10.0 
##  4 Belgium           33              2.45
##  5 Bermuda           14              4.71
##  6 Brazil            15              0.94
##  7 Bulgaria          33              0.59
##  8 Canada            22              2.48
##  9 Chile             38              0.35
## 10 China             44              1.58
## # … with 54 more rows

CHALLENGE add arrange() to the code block.

6.12 group_by + filter

The output above contains a lot of countries. We can keep only the observations from the United States by using the filter() function:

country_vaccinations %>% 
  filter(country == "United States") %>%
  count(vaccines)
## # A tibble: 1 x 2
##   vaccines                     n
##   <chr>                    <int>
## 1 Moderna, Pfizer/BioNTech    42
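filter() accepts any logical condition, so several countries can be kept at once. One common way (a sketch on made-up rows, not the course solution) is the %in% operator from base R, which tests whether each value appears in a vector:

```r
library(dplyr)

toy <- tibble(
  country = c("Israel", "United States", "Chile", "Israel"),
  daily_vaccinations = c(1000, 2000, 300, 1500)
)

# Keep rows whose country appears in the vector on the right-hand side.
two_countries <- toy %>%
  filter(country %in% c("Israel", "United States"))

# The same filter written with | (or).
same_rows <- toy %>%
  filter(country == "Israel" | country == "United States")

nrow(two_countries)
```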

CHALLENGE Add a filter() to your solution from the previous challenge.

Example of output that you might want to get:

## # A tibble: 2 x 3
##   country       total_days total_per_hundred
##   <chr>              <int>             <dbl>
## 1 Israel                43             54.7 
## 2 United States         42              8.94

6.13 Example of Plotting

For fun, here’s an example of plotting (we will be working extensively with plotting in the future).

country_vaccinations %>%
  filter(country == "United States" |
           country == "Israel") %>%
  ggplot(aes(x = date, 
             y = daily_vaccinations,
             color = country)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
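The warning says that ggplot2 dropped rows where daily_vaccinations is missing. If you would rather remove those rows yourself before plotting, one option is to filter on is.na() first. The sketch below uses made-up rows (the toy tibble is invented for illustration):

```r
library(dplyr)
library(ggplot2)

toy <- tibble(
  country = c("Israel", "Israel", "United States", "United States"),
  date = as.Date(c("2021-01-01", "2021-01-02",
                   "2021-01-01", "2021-01-02")),
  daily_vaccinations = c(NA, 1500, 2000, 2500)
)

# Drop rows with a missing y value before they reach ggplot().
plot_data <- toy %>%
  filter(!is.na(daily_vaccinations))

# Build the plot; run print(vaccination_plot) to draw it.
vaccination_plot <- plot_data %>%
  ggplot(aes(x = date,
             y = daily_vaccinations,
             color = country)) +
  geom_point()

nrow(plot_data)
```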