Chapter 3 Intro to R Part II

3.1 Installing and using packages in R

There are a lot of R packages out there (check the Comprehensive R Archive Network, i.e., CRAN, for a full list). That is one of the beautiful things about R: anyone can create an R package to share their code (check out the workshop on how to create your own R package later this week).

Open your RStudio (if you haven’t attended the first part of this workshop, please check the “Before we start” section for instructions on how to download and/or update R and RStudio).

The function to install packages in R is install.packages(). We will be working with the tidyverse today, a collection of R packages carefully designed for data science.

Let’s install tidyverse (this may take a while).
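In the console, run:

```r
# install the package from CRAN; you only need to do this once per machine
install.packages("tidyverse")
```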

You need to install any package only once (remember to check for new package versions and to keep your packages updated). However, with every new R session, you need to load the packages you are going to use by using the library() function.
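This is the call that produces the startup message below:

```r
# load the tidyverse for this session (no quotation marks needed)
library(tidyverse)
```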

## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Note that when calling the install.packages() function you need to enter the package name between quotation marks (e.g., “tidyverse”). When you call the library() function, you don’t use quotation marks (e.g., tidyverse).

3.2 Acquire data

First, we will download the data we are going to be using today.

  1. Go to https://www.kaggle.com/neuromusic/avocado-prices/data and click on the “Download (2 MB)” button
  2. Find the zip file (avocado-prices.zip) you downloaded and unzip it
  3. Inside your project folder, add a “new folder” called “data”
  4. Move, copy, or upload the data file (avocado.csv) to the “data” folder

3.3 Load data in R

For this workshop, we will be using data from Kaggle. In the previous section, you created a data folder in your project folder, which should contain the avocado.csv data file.

Although we are working within an R project, which sets the working directory automatically for you, it’s good practice to check what folder you are working from by calling the getwd() function.
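For example:

```r
# print the current working directory
getwd()
```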

## [1] "/Users/adriana/Desktop/workshops/resbaz_2020/intro_to_R"

You can also list the contents of your data folder by using the dir() function.
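Assuming the data folder sits inside your working directory, as set up in the previous section:

```r
# list the files inside the "data" folder
dir("data")
```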

## [1] "avocado.csv"

We will use the read_csv() function from the readr package (which is part of tidyverse) to read data in.
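The rest of this chapter refers to the resulting object as avocado_data, so we read the file into that name:

```r
# read the avocado data in; read_csv() guesses a type for each column
# and reports the column specification it used
avocado_data <- read_csv("data/avocado.csv")
```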

## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Date = col_date(format = ""),
##   AveragePrice = col_double(),
##   `Total Volume` = col_double(),
##   `4046` = col_double(),
##   `4225` = col_double(),
##   `4770` = col_double(),
##   `Total Bags` = col_double(),
##   `Small Bags` = col_double(),
##   `Large Bags` = col_double(),
##   `XLarge Bags` = col_double(),
##   type = col_character(),
##   year = col_double(),
##   region = col_character()
## )

CHALLENGE

Reading warnings: R often prints out warnings in red (these are not always errors). What information did you get when loading your data?

3.4 Inspect your data

Now, let’s inspect our data frame. As usual, there are multiple ways of inspecting your data.

Here’s one of my favorites:
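It is glimpse() from dplyr. The output below also includes a summary() call and unique() on the two character columns:

```r
# compact, transposed preview of the data frame
glimpse(avocado_data)

# quick descriptive statistics for every column
summary(avocado_data)

# unique values of the two categorical variables
unique(avocado_data$type)
unique(avocado_data$region)
```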

## Rows: 18,249
## Columns: 14
## $ X1             <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Date           <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015-12-06, 2015-…
## $ AveragePrice   <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, …
## $ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 559…
## $ `4046`         <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 136…
## $ `4225`         <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 480…
## $ `4770`         <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80.0…
## $ `Total Bags`   <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91, …
## $ `Small Bags`   <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47, …
## $ `Large Bags`   <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05, …
## $ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
## $ type           <chr> "conventional", "conventional", "conventional", "conve…
## $ year           <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, …
## $ region         <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Alb…
##        X1             Date             AveragePrice    Total Volume     
##  Min.   : 0.00   Min.   :2015-01-04   Min.   :0.440   Min.   :      85  
##  1st Qu.:10.00   1st Qu.:2015-10-25   1st Qu.:1.100   1st Qu.:   10839  
##  Median :24.00   Median :2016-08-14   Median :1.370   Median :  107377  
##  Mean   :24.23   Mean   :2016-08-13   Mean   :1.406   Mean   :  850644  
##  3rd Qu.:38.00   3rd Qu.:2017-06-04   3rd Qu.:1.660   3rd Qu.:  432962  
##  Max.   :52.00   Max.   :2018-03-25   Max.   :3.250   Max.   :62505647  
##       4046               4225               4770           Total Bags      
##  Min.   :       0   Min.   :       0   Min.   :      0   Min.   :       0  
##  1st Qu.:     854   1st Qu.:    3009   1st Qu.:      0   1st Qu.:    5089  
##  Median :    8645   Median :   29061   Median :    185   Median :   39744  
##  Mean   :  293008   Mean   :  295155   Mean   :  22840   Mean   :  239639  
##  3rd Qu.:  111020   3rd Qu.:  150207   3rd Qu.:   6243   3rd Qu.:  110783  
##  Max.   :22743616   Max.   :20470573   Max.   :2546439   Max.   :19373134  
##    Small Bags         Large Bags       XLarge Bags           type          
##  Min.   :       0   Min.   :      0   Min.   :     0.0   Length:18249      
##  1st Qu.:    2849   1st Qu.:    127   1st Qu.:     0.0   Class :character  
##  Median :   26363   Median :   2648   Median :     0.0   Mode  :character  
##  Mean   :  182195   Mean   :  54338   Mean   :  3106.4                     
##  3rd Qu.:   83338   3rd Qu.:  22029   3rd Qu.:   132.5                     
##  Max.   :13384587   Max.   :5719097   Max.   :551693.7                     
##       year         region         
##  Min.   :2015   Length:18249      
##  1st Qu.:2015   Class :character  
##  Median :2016   Mode  :character  
##  Mean   :2016                     
##  3rd Qu.:2017                     
##  Max.   :2018
## [1] "conventional" "organic"
##  [1] "Albany"              "Atlanta"             "BaltimoreWashington"
##  [4] "Boise"               "Boston"              "BuffaloRochester"   
##  [7] "California"          "Charlotte"           "Chicago"            
## [10] "CincinnatiDayton"    "Columbus"            "DallasFtWorth"      
## [13] "Denver"              "Detroit"             "GrandRapids"        
## [16] "GreatLakes"          "HarrisburgScranton"  "HartfordSpringfield"
## [19] "Houston"             "Indianapolis"        "Jacksonville"       
## [22] "LasVegas"            "LosAngeles"          "Louisville"         
## [25] "MiamiFtLauderdale"   "Midsouth"            "Nashville"          
## [28] "NewOrleansMobile"    "NewYork"             "Northeast"          
## [31] "NorthernNewEngland"  "Orlando"             "Philadelphia"       
## [34] "PhoenixTucson"       "Pittsburgh"          "Plains"             
## [37] "Portland"            "RaleighGreensboro"   "RichmondNorfolk"    
## [40] "Roanoke"             "Sacramento"          "SanDiego"           
## [43] "SanFrancisco"        "Seattle"             "SouthCarolina"      
## [46] "SouthCentral"        "Southeast"           "Spokane"            
## [49] "StLouis"             "Syracuse"            "Tampa"              
## [52] "TotalUS"             "West"                "WestTexNewMexico"

Do you know any other ways of checking your data?


CHALLENGE

Which variables are numeric? Which are categorical?

What functions do you remember from Part I? Run these functions (e.g., mean()) on your numeric variables.

3.5 Explore your data

We will be using the package dplyr (which is also part of tidyverse) to do an exploratory analysis of our data.

The most used dplyr function is %>% (called the pipe). The pipe allows you to “pipe” (or redirect) objects into functions. (hint: use Ctrl+Shift+M on Windows or Cmd+Shift+M on macOS as a shortcut for typing %>%)

Here’s how to pipe the avocado_data object into the summary() function:
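This is equivalent to calling summary(avocado_data) directly:

```r
# pipe the data frame into summary()
avocado_data %>%
  summary()
```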

##        X1             Date             AveragePrice    Total Volume     
##  Min.   : 0.00   Min.   :2015-01-04   Min.   :0.440   Min.   :      85  
##  1st Qu.:10.00   1st Qu.:2015-10-25   1st Qu.:1.100   1st Qu.:   10839  
##  Median :24.00   Median :2016-08-14   Median :1.370   Median :  107377  
##  Mean   :24.23   Mean   :2016-08-13   Mean   :1.406   Mean   :  850644  
##  3rd Qu.:38.00   3rd Qu.:2017-06-04   3rd Qu.:1.660   3rd Qu.:  432962  
##  Max.   :52.00   Max.   :2018-03-25   Max.   :3.250   Max.   :62505647  
##       4046               4225               4770           Total Bags      
##  Min.   :       0   Min.   :       0   Min.   :      0   Min.   :       0  
##  1st Qu.:     854   1st Qu.:    3009   1st Qu.:      0   1st Qu.:    5089  
##  Median :    8645   Median :   29061   Median :    185   Median :   39744  
##  Mean   :  293008   Mean   :  295155   Mean   :  22840   Mean   :  239639  
##  3rd Qu.:  111020   3rd Qu.:  150207   3rd Qu.:   6243   3rd Qu.:  110783  
##  Max.   :22743616   Max.   :20470573   Max.   :2546439   Max.   :19373134  
##    Small Bags         Large Bags       XLarge Bags           type          
##  Min.   :       0   Min.   :      0   Min.   :     0.0   Length:18249      
##  1st Qu.:    2849   1st Qu.:    127   1st Qu.:     0.0   Class :character  
##  Median :   26363   Median :   2648   Median :     0.0   Mode  :character  
##  Mean   :  182195   Mean   :  54338   Mean   :  3106.4                     
##  3rd Qu.:   83338   3rd Qu.:  22029   3rd Qu.:   132.5                     
##  Max.   :13384587   Max.   :5719097   Max.   :551693.7                     
##       year         region         
##  Min.   :2015   Length:18249      
##  1st Qu.:2015   Class :character  
##  Median :2016   Mode  :character  
##  Mean   :2016                     
##  3rd Qu.:2017                     
##  Max.   :2018

The pipe allows us to apply multiple functions to the same object.

Let’s start by selecting one column in our data.
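We use the select() verb from dplyr:

```r
# keep only the "type" column
avocado_data %>%
  select(type)
```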

## # A tibble: 18,249 x 1
##    type        
##    <chr>       
##  1 conventional
##  2 conventional
##  3 conventional
##  4 conventional
##  5 conventional
##  6 conventional
##  7 conventional
##  8 conventional
##  9 conventional
## 10 conventional
## # … with 18,239 more rows

Now let’s add another pipe to get unique values in this column.
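One way to do it (dplyr’s distinct() would give the same result here):

```r
# select the column, then keep only unique rows
avocado_data %>%
  select(type) %>%
  unique()
```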

## # A tibble: 2 x 1
##   type        
##   <chr>       
## 1 conventional
## 2 organic

3.5.1 Group by + count

One of the most useful pipe combinations is group_by() and count().
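Here, avocado_data is the data frame we read in earlier:

```r
# count the number of rows for each avocado type
avocado_data %>%
  group_by(type) %>%
  count()
```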

## # A tibble: 2 x 2
## # Groups:   type [2]
##   type             n
##   <chr>        <int>
## 1 conventional  9126
## 2 organic       9123

We can add more variables to the group_by() function.
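For example, grouping by both region and type:

```r
# count rows for each region and type combination
avocado_data %>%
  group_by(region, type) %>%
  count()
```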

## # A tibble: 108 x 3
## # Groups:   region, type [108]
##    region              type             n
##    <chr>               <chr>        <int>
##  1 Albany              conventional   169
##  2 Albany              organic        169
##  3 Atlanta             conventional   169
##  4 Atlanta             organic        169
##  5 BaltimoreWashington conventional   169
##  6 BaltimoreWashington organic        169
##  7 Boise               conventional   169
##  8 Boise               organic        169
##  9 Boston              conventional   169
## 10 Boston              organic        169
## # … with 98 more rows

3.5.2 Group by + summarise

We can also use the summarise() function after group_by(). Inside summarise() you can use other functions such as sum().
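For example, summing Total Volume per region and type (note the backticks around the column name, which contains a space):

```r
# total volume sold per region and avocado type
avocado_data %>%
  group_by(region, type) %>%
  summarise(total_volume = sum(`Total Volume`))
```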

## # A tibble: 108 x 3
## # Groups:   region [54]
##    region              type         total_volume
##    <chr>               <chr>               <dbl>
##  1 Albany              conventional    15700611.
##  2 Albany              organic           367188.
##  3 Atlanta             conventional    86661392.
##  4 Atlanta             organic          1943727.
##  5 BaltimoreWashington conventional   130745575.
##  6 BaltimoreWashington organic          3968344.
##  7 Boise               conventional    14000540.
##  8 Boise               organic           412647.
##  9 Boston              conventional    94900438.
## 10 Boston              organic          2373547.
## # … with 98 more rows

Let’s add another pipe and arrange the results by total_volume.
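Since the output below has one row per region, we group by region only this time:

```r
# total volume per region, sorted from smallest to largest
avocado_data %>%
  group_by(region) %>%
  summarise(total_volume = sum(`Total Volume`)) %>%
  arrange(total_volume)
```

Use arrange(desc(total_volume)) to sort from largest to smallest instead.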

## # A tibble: 54 x 2
##    region           total_volume
##    <chr>                   <dbl>
##  1 Syracuse            10942668.
##  2 Boise               14413188.
##  3 Spokane             15565275.
##  4 Albany              16067800.
##  5 Louisville          16097002.
##  6 Pittsburgh          18806346.
##  7 BuffaloRochester    22962470.
##  8 Roanoke             25042011.
##  9 Jacksonville        28790005.
## 10 Columbus            29993361.
## # … with 44 more rows

3.6 Mutate

You can use mutate() to add a new column to your data.
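Here we first summarise per region and type, then use mutate() to add a per-region total. After summarise(), the result is still grouped by region, so sum() operates within each region:

```r
# per-type volume plus a per-region total column
avocado_data %>%
  group_by(region, type) %>%
  summarise(total_type_volume = sum(`Total Volume`)) %>%
  mutate(total_volume = sum(total_type_volume))
```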

## # A tibble: 108 x 4
## # Groups:   region [54]
##    region              type         total_type_volume total_volume
##    <chr>               <chr>                    <dbl>        <dbl>
##  1 Albany              conventional         15700611.    16067800.
##  2 Albany              organic                367188.    16067800.
##  3 Atlanta             conventional         86661392.    88605119.
##  4 Atlanta             organic               1943727.    88605119.
##  5 BaltimoreWashington conventional        130745575.   134713919.
##  6 BaltimoreWashington organic               3968344.   134713919.
##  7 Boise               conventional         14000540.    14413188.
##  8 Boise               organic                412647.    14413188.
##  9 Boston              conventional         94900438.    97273985.
## 10 Boston              organic               2373547.    97273985.
## # … with 98 more rows

CHALLENGE

Use group_by(), summarise(), and mutate() to print out the volume percentage of conventional and organic avocado types per region. (hint: add a new variable inside mutate() that is the result of total_type_volume divided by total_volume)

We’ve just been printing our results to the console. Let’s save the results as a new data frame. When assigning your group_by() results to a new object, make sure to add ungroup() as the last pipe (this will save you headaches in the future).
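A sketch of what that assignment could look like (the type_percentage column follows the challenge above; the column names are choices, not requirements):

```r
# save the grouped summary as a regular, ungrouped data frame
volume_type_region <- avocado_data %>%
  group_by(region, type) %>%
  summarise(total_type_volume = sum(`Total Volume`)) %>%
  mutate(total_volume = sum(total_type_volume),
         type_percentage = total_type_volume / total_volume) %>%
  ungroup()
```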

CHALLENGE #1

  1. Inspect your new data frame.

  2. Based on volume_type_region, calculate the mean and standard deviation of type percentage per type (i.e., conventional vs. organic). Question: What is the percentage of conventional vs. organic avocados sold in each region?

CHALLENGE #2

Calculate the average avocado price for each type in each region.

3.7 Filter your data

One of the “regions” in our data is TotalUS, which is not really a specific region but the sum of all the other regions. To calculate averages per year, for example, we need to filter out the TotalUS region.
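The filter() verb keeps only the rows that match a condition:

```r
# keep every region except the US-wide total
avocado_data %>%
  filter(region != "TotalUS")
```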

## # A tibble: 17,911 x 14
##       X1 Date       AveragePrice `Total Volume` `4046` `4225` `4770`
##    <dbl> <date>            <dbl>          <dbl>  <dbl>  <dbl>  <dbl>
##  1     0 2015-12-27         1.33         64237.  1037. 5.45e4   48.2
##  2     1 2015-12-20         1.35         54877.   674. 4.46e4   58.3
##  3     2 2015-12-13         0.93        118220.   795. 1.09e5  130. 
##  4     3 2015-12-06         1.08         78992.  1132  7.20e4   72.6
##  5     4 2015-11-29         1.28         51040.   941. 4.38e4   75.8
##  6     5 2015-11-22         1.26         55980.  1184. 4.81e4   43.6
##  7     6 2015-11-15         0.99         83454.  1369. 7.37e4   93.3
##  8     7 2015-11-08         0.98        109428.   704. 1.02e5   80  
##  9     8 2015-11-01         1.02         99811.  1022. 8.73e4   85.3
## 10     9 2015-10-25         1.07         74339.   842. 6.48e4  113  
## # … with 17,901 more rows, and 7 more variables: `Total Bags` <dbl>, `Small
## #   Bags` <dbl>, `Large Bags` <dbl>, `XLarge Bags` <dbl>, type <chr>,
## #   year <dbl>, region <chr>

CHALLENGE

Calculate the mean type percentage per year. Remember to filter out the TotalUS region. Question: What is the percentage of conventional vs. organic avocados sold each year?

3.8 Conditionally mutate your data

You can use the if_else() function inside mutate() to create a new variable that is conditional on an existing variable in your data frame.

Let’s create a new column in our data frame, indicating whether the average price per avocado is higher than $1.50.
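One way to do it, overwriting avocado_data with the extra column (the name expensive and the 0/1 coding are choices, not requirements):

```r
# flag rows where the average price exceeds $1.50
avocado_data <- avocado_data %>%
  mutate(expensive = if_else(AveragePrice > 1.50, 1, 0))
```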

Now we can calculate the percentage of expensive avocados by using the mean() function on our new expensive variable.
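Assuming the 0/1 expensive column created above, this works because the mean of a 0/1 variable is the proportion of 1s:

```r
# proportion of observations with AveragePrice above $1.50
avocado_data %>%
  pull(expensive) %>%
  mean()
```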

## [1] 0.3787605

CHALLENGE

Calculate the percentage of expensive avocados per region.

3.9 Pivot your data

The numerical column names refer to price lookup (PLU) codes:

4046: small Hass

4225: large Hass

4770: extra large Hass

First, let’s subset our data to remove the Total Volume column.
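A sketch, saving the result under a new name (avocado_slice is a made-up name; use whatever you prefer):

```r
# drop the Total Volume column, keeping everything else
avocado_slice <- avocado_data %>%
  select(-`Total Volume`)
```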

Inspect your data.

Now, let’s pivot our data frame.
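A sketch using pivot_longer() from tidyr; the new column names plu_code and volume are choices, not requirements:

```r
# gather the three PLU columns into long format:
# one row per date/region/type/PLU combination
avocado_data %>%
  select(-`Total Volume`) %>%
  pivot_longer(cols = c(`4046`, `4225`, `4770`),
               names_to = "plu_code",
               values_to = "volume")
```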

CHALLENGE

Summarize the volume of each avocado type per region. Which region buys the largest proportion of large avocados?