Chapter 3 Intro to R Part II
3.1 Installing and using packages in R
There are a lot of R packages out there (check the Comprehensive R Archive Network, i.e., CRAN, for a full list). That is one of the beautiful things about R: anyone can create an R package to share their code (check out the workshop on how to create your own R package later this week).
Open your RStudio (if you haven’t attended the first part of this workshop, please check the “Before we start” section for instructions on how to download and/or update R and RStudio).
The function to install packages in R is install.packages(). We will be working with the tidyverse today, a collection of R packages carefully designed for data science.
Let’s install tidyverse (this may take a while).
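The install command looks like this (note the quotation marks around the package name):

```r
# Install the tidyverse collection from CRAN (you only need to do this once)
install.packages("tidyverse")
```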
You need to install any package only once (remember to check for new package versions and to keep your packages updated). However, with every new R session, you need to load the packages you are going to use with the library() function.
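Loading tidyverse produces the startup message shown below:

```r
# Load tidyverse at the start of every R session
library(tidyverse)
```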
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Note that when calling the install.packages() function you need to enter the package name between quotation marks (e.g., "tidyverse"). When you call the library() function, you don't use quotation marks (e.g., tidyverse).
3.2 Acquire data
First, we will download the data we are going to be using today.
- Go to https://www.kaggle.com/neuromusic/avocado-prices/data and click on the “Download (2 MB)” button
- Find the zip file (avocado-prices.zip) you downloaded and unzip it
- Inside your project folder (at the same level as your .Rproj file), add a new folder called "data"
- Move, copy, or upload the data file (avocado.csv) to the “data” folder
3.3 Load data in R
For this workshop, we will be using data from kaggle. In the previous section, you created a data folder in your project folder, which should contain the avocado.csv data file.
Although we are working within an R project, which sets the working directory automatically for you, it's good practice to check which folder you are working from by calling the getwd() function.
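For example:

```r
# Print the current working directory
getwd()
```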
## [1] "/Users/adriana/Desktop/workshops/resbaz_2020/intro_to_R"
You can also list the contents of your data folder by using the dir() function.
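For example:

```r
# List the files inside the data folder
dir("data")
```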
## [1] "avocado.csv"
We will use the read_csv() function from the readr package (which is part of tidyverse) to read the data in.
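A likely read command, which produced the parsing messages below (the object name avocado_data is used throughout the rest of this chapter; the data/avocado.csv path assumes the folder you created above):

```r
# Read the avocado data into a tibble called avocado_data
avocado_data <- read_csv("data/avocado.csv")
```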
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## Date = col_date(format = ""),
## AveragePrice = col_double(),
## `Total Volume` = col_double(),
## `4046` = col_double(),
## `4225` = col_double(),
## `4770` = col_double(),
## `Total Bags` = col_double(),
## `Small Bags` = col_double(),
## `Large Bags` = col_double(),
## `XLarge Bags` = col_double(),
## type = col_character(),
## year = col_double(),
## region = col_character()
## )
CHALLENGE
Reading warnings - R often prints out warnings in red (these are not always errors). What information did you get when loading your data?
3.4 Inspect your data
Now, let’s inspect our dataframe. As usual, there are multiple ways of inspecting your data.
Here’s one of my favorites:
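The Rows/Columns output below comes from glimpse():

```r
# Show every column, its type, and its first few values
glimpse(avocado_data)
```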
## Rows: 18,249
## Columns: 14
## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ Date <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015-12-06, 2015-…
## $ AveragePrice <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, …
## $ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 559…
## $ `4046` <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 136…
## $ `4225` <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 480…
## $ `4770` <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80.0…
## $ `Total Bags` <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91, …
## $ `Small Bags` <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47, …
## $ `Large Bags` <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05, …
## $ `XLarge Bags` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
## $ type <chr> "conventional", "conventional", "conventional", "conve…
## $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, …
## $ region <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Alb…
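Another favorite is summary(), which produces the descriptive statistics below:

```r
# Minimum, quartiles, mean, and maximum for each numeric column
summary(avocado_data)
```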
## X1 Date AveragePrice Total Volume
## Min. : 0.00 Min. :2015-01-04 Min. :0.440 Min. : 85
## 1st Qu.:10.00 1st Qu.:2015-10-25 1st Qu.:1.100 1st Qu.: 10839
## Median :24.00 Median :2016-08-14 Median :1.370 Median : 107377
## Mean :24.23 Mean :2016-08-13 Mean :1.406 Mean : 850644
## 3rd Qu.:38.00 3rd Qu.:2017-06-04 3rd Qu.:1.660 3rd Qu.: 432962
## Max. :52.00 Max. :2018-03-25 Max. :3.250 Max. :62505647
## 4046 4225 4770 Total Bags
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 854 1st Qu.: 3009 1st Qu.: 0 1st Qu.: 5089
## Median : 8645 Median : 29061 Median : 185 Median : 39744
## Mean : 293008 Mean : 295155 Mean : 22840 Mean : 239639
## 3rd Qu.: 111020 3rd Qu.: 150207 3rd Qu.: 6243 3rd Qu.: 110783
## Max. :22743616 Max. :20470573 Max. :2546439 Max. :19373134
## Small Bags Large Bags XLarge Bags type
## Min. : 0 Min. : 0 Min. : 0.0 Length:18249
## 1st Qu.: 2849 1st Qu.: 127 1st Qu.: 0.0 Class :character
## Median : 26363 Median : 2648 Median : 0.0 Mode :character
## Mean : 182195 Mean : 54338 Mean : 3106.4
## 3rd Qu.: 83338 3rd Qu.: 22029 3rd Qu.: 132.5
## Max. :13384587 Max. :5719097 Max. :551693.7
## year region
## Min. :2015 Length:18249
## 1st Qu.:2015 Class :character
## Median :2016 Mode :character
## Mean :2016
## 3rd Qu.:2017
## Max. :2018
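For categorical columns, one option is unique():

```r
# List the unique values of the type column
unique(avocado_data$type)
```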
## [1] "conventional" "organic"
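And the same for the regions:

```r
# List the unique values of the region column
unique(avocado_data$region)
```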
## [1] "Albany" "Atlanta" "BaltimoreWashington"
## [4] "Boise" "Boston" "BuffaloRochester"
## [7] "California" "Charlotte" "Chicago"
## [10] "CincinnatiDayton" "Columbus" "DallasFtWorth"
## [13] "Denver" "Detroit" "GrandRapids"
## [16] "GreatLakes" "HarrisburgScranton" "HartfordSpringfield"
## [19] "Houston" "Indianapolis" "Jacksonville"
## [22] "LasVegas" "LosAngeles" "Louisville"
## [25] "MiamiFtLauderdale" "Midsouth" "Nashville"
## [28] "NewOrleansMobile" "NewYork" "Northeast"
## [31] "NorthernNewEngland" "Orlando" "Philadelphia"
## [34] "PhoenixTucson" "Pittsburgh" "Plains"
## [37] "Portland" "RaleighGreensboro" "RichmondNorfolk"
## [40] "Roanoke" "Sacramento" "SanDiego"
## [43] "SanFrancisco" "Seattle" "SouthCarolina"
## [46] "SouthCentral" "Southeast" "Spokane"
## [49] "StLouis" "Syracuse" "Tampa"
## [52] "TotalUS" "West" "WestTexNewMexico"
Do you know any other ways of checking your data?
CHALLENGE
Which variables are numeric? Which are categorical?
What functions do you remember from Part I? Run these functions (e.g., mean()) on your numeric variables.
3.5 Explore your data
We will be using the package dplyr (which is also part of tidyverse) to do an exploratory analysis of our data.
One of the most used operators when working with dplyr is %>% (called the pipe). The pipe allows you to "pipe" (or redirect) objects into functions. (Hint: use ctrl+shift+m or cmd+shift+m as a shortcut for typing %>%.)
Here's how to pipe the avocado_data object into the summary() function:
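```r
# Equivalent to summary(avocado_data)
avocado_data %>%
  summary()
```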
## X1 Date AveragePrice Total Volume
## Min. : 0.00 Min. :2015-01-04 Min. :0.440 Min. : 85
## 1st Qu.:10.00 1st Qu.:2015-10-25 1st Qu.:1.100 1st Qu.: 10839
## Median :24.00 Median :2016-08-14 Median :1.370 Median : 107377
## Mean :24.23 Mean :2016-08-13 Mean :1.406 Mean : 850644
## 3rd Qu.:38.00 3rd Qu.:2017-06-04 3rd Qu.:1.660 3rd Qu.: 432962
## Max. :52.00 Max. :2018-03-25 Max. :3.250 Max. :62505647
## 4046 4225 4770 Total Bags
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 854 1st Qu.: 3009 1st Qu.: 0 1st Qu.: 5089
## Median : 8645 Median : 29061 Median : 185 Median : 39744
## Mean : 293008 Mean : 295155 Mean : 22840 Mean : 239639
## 3rd Qu.: 111020 3rd Qu.: 150207 3rd Qu.: 6243 3rd Qu.: 110783
## Max. :22743616 Max. :20470573 Max. :2546439 Max. :19373134
## Small Bags Large Bags XLarge Bags type
## Min. : 0 Min. : 0 Min. : 0.0 Length:18249
## 1st Qu.: 2849 1st Qu.: 127 1st Qu.: 0.0 Class :character
## Median : 26363 Median : 2648 Median : 0.0 Mode :character
## Mean : 182195 Mean : 54338 Mean : 3106.4
## 3rd Qu.: 83338 3rd Qu.: 22029 3rd Qu.: 132.5
## Max. :13384587 Max. :5719097 Max. :551693.7
## year region
## Min. :2015 Length:18249
## 1st Qu.:2015 Class :character
## Median :2016 Mode :character
## Mean :2016
## 3rd Qu.:2017
## Max. :2018
The pipe allows us to apply multiple functions to the same object.
Let’s start by selecting one column in our data.
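A sketch using dplyr's select(), which matches the one-column tibble below:

```r
# Keep only the type column
avocado_data %>%
  select(type)
```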
## # A tibble: 18,249 x 1
## type
## <chr>
## 1 conventional
## 2 conventional
## 3 conventional
## 4 conventional
## 5 conventional
## 6 conventional
## 7 conventional
## 8 conventional
## 9 conventional
## 10 conventional
## # … with 18,239 more rows
Now let’s add another pipe to get unique values in this column.
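One way to do this is with distinct() (unique() on the selected column would also work):

```r
# Keep only the unique rows of the type column
avocado_data %>%
  select(type) %>%
  distinct()
```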
## # A tibble: 2 x 1
## type
## <chr>
## 1 conventional
## 2 organic
3.5.1 Group by + count
One of the most useful pipe combinations is group_by() and count().
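For example, counting rows per avocado type (the grouped output below suggests count() was called on a grouped tibble):

```r
# Count the number of rows for each avocado type
avocado_data %>%
  group_by(type) %>%
  count()
```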
## # A tibble: 2 x 2
## # Groups: type [2]
## type n
## <chr> <int>
## 1 conventional 9126
## 2 organic 9123
We can add more variables to the group_by() function.
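For example, grouping by region and type:

```r
# Count rows for every region/type combination
avocado_data %>%
  group_by(region, type) %>%
  count()
```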
## # A tibble: 108 x 3
## # Groups: region, type [108]
## region type n
## <chr> <chr> <int>
## 1 Albany conventional 169
## 2 Albany organic 169
## 3 Atlanta conventional 169
## 4 Atlanta organic 169
## 5 BaltimoreWashington conventional 169
## 6 BaltimoreWashington organic 169
## 7 Boise conventional 169
## 8 Boise organic 169
## 9 Boston conventional 169
## 10 Boston organic 169
## # … with 98 more rows
3.5.2 Group by + summarise
We can also use the summarise() function after group_by(). Inside summarise() you can use other functions, such as sum().
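A likely version of the chunk that produced the output below (the column name total_volume comes from that output; Total Volume needs backticks because of the space in its name):

```r
# Total volume sold per region and type
avocado_data %>%
  group_by(region, type) %>%
  summarise(total_volume = sum(`Total Volume`))
```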
## # A tibble: 108 x 3
## # Groups: region [54]
## region type total_volume
## <chr> <chr> <dbl>
## 1 Albany conventional 15700611.
## 2 Albany organic 367188.
## 3 Atlanta conventional 86661392.
## 4 Atlanta organic 1943727.
## 5 BaltimoreWashington conventional 130745575.
## 6 BaltimoreWashington organic 3968344.
## 7 Boise conventional 14000540.
## 8 Boise organic 412647.
## 9 Boston conventional 94900438.
## 10 Boston organic 2373547.
## # … with 98 more rows
Let's add another pipe and arrange the results by total_volume.
avocado_data %>%
group_by(region) %>%
summarise(total_volume = sum(`Total Volume`)) %>%
arrange(total_volume)
## # A tibble: 54 x 2
## region total_volume
## <chr> <dbl>
## 1 Syracuse 10942668.
## 2 Boise 14413188.
## 3 Spokane 15565275.
## 4 Albany 16067800.
## 5 Louisville 16097002.
## 6 Pittsburgh 18806346.
## 7 BuffaloRochester 22962470.
## 8 Roanoke 25042011.
## 9 Jacksonville 28790005.
## 10 Columbus 29993361.
## # … with 44 more rows
3.6 Mutate
You can use mutate() to add a new column to your data.
avocado_data %>%
group_by(region, type) %>%
summarise(total_type_volume = sum(`Total Volume`)) %>%
mutate(total_volume = sum(total_type_volume))
## # A tibble: 108 x 4
## # Groups: region [54]
## region type total_type_volume total_volume
## <chr> <chr> <dbl> <dbl>
## 1 Albany conventional 15700611. 16067800.
## 2 Albany organic 367188. 16067800.
## 3 Atlanta conventional 86661392. 88605119.
## 4 Atlanta organic 1943727. 88605119.
## 5 BaltimoreWashington conventional 130745575. 134713919.
## 6 BaltimoreWashington organic 3968344. 134713919.
## 7 Boise conventional 14000540. 14413188.
## 8 Boise organic 412647. 14413188.
## 9 Boston conventional 94900438. 97273985.
## 10 Boston organic 2373547. 97273985.
## # … with 98 more rows
CHALLENGE
Use group_by(), summarise(), and mutate() to print out the volume percentage of conventional and organic avocado types per region. (Hint: add a new variable inside mutate() that is the result of total_type_volume divided by total_volume.)
We've just been printing our results to the console. Let's save the results as a new data frame. When assigning your group_by() results to a new object, make sure to add ungroup() as the last pipe (this will save you headaches in the future).
volume_type_region <- avocado_data %>%
group_by(region, type) %>%
summarise(total_type_volume = sum(`Total Volume`)) %>%
mutate(total_volume = sum(total_type_volume),
type_percentage = total_type_volume/total_volume) %>%
ungroup()
CHALLENGE #1
Inspect your new data frame.
- Based on volume_type_region, calculate the mean and standard deviation of the type percentage per type (i.e., conventional vs. organic). Question: What is the percentage of conventional vs. organic avocados sold in each region?
CHALLENGE #2
Calculate the average avocado price for each type in each region.
3.7 Filter your data
One of the "regions" in our data is TotalUS, which is not really a specific region, but the sum of all the other regions. To calculate averages per year, for example, we need to filter out the TotalUS region.
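A sketch using filter() (the != comparison keeps every row whose region is not TotalUS):

```r
# Drop the TotalUS rows
avocado_data %>%
  filter(region != "TotalUS")
```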
## # A tibble: 17,911 x 14
## X1 Date AveragePrice `Total Volume` `4046` `4225` `4770`
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2015-12-27 1.33 64237. 1037. 5.45e4 48.2
## 2 1 2015-12-20 1.35 54877. 674. 4.46e4 58.3
## 3 2 2015-12-13 0.93 118220. 795. 1.09e5 130.
## 4 3 2015-12-06 1.08 78992. 1132 7.20e4 72.6
## 5 4 2015-11-29 1.28 51040. 941. 4.38e4 75.8
## 6 5 2015-11-22 1.26 55980. 1184. 4.81e4 43.6
## 7 6 2015-11-15 0.99 83454. 1369. 7.37e4 93.3
## 8 7 2015-11-08 0.98 109428. 704. 1.02e5 80
## 9 8 2015-11-01 1.02 99811. 1022. 8.73e4 85.3
## 10 9 2015-10-25 1.07 74339. 842. 6.48e4 113
## # … with 17,901 more rows, and 7 more variables: `Total Bags` <dbl>, `Small
## # Bags` <dbl>, `Large Bags` <dbl>, `XLarge Bags` <dbl>, type <chr>,
## # year <dbl>, region <chr>
CHALLENGE
Calculate the average type percentage per year. Remember to filter out the TotalUS region. Question: What is the percentage of conventional vs. organic avocados sold each year?
3.8 Conditionally mutate your data
You can use the if_else() function inside mutate() to create a new variable that is conditional on an existing variable in your data frame.
Let’s create a new column in our data frame, indicating whether the average price per avocado is higher than $1.50.
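A sketch, assuming the new column is called expensive (the name used in the next step) and codes the condition as 1/0 so we can average it later:

```r
# Flag rows where the average price is above $1.50
avocado_data <- avocado_data %>%
  mutate(expensive = if_else(AveragePrice > 1.50, 1, 0))
```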
Now we can calculate the percentage of expensive avocados by using the mean() function on our new expensive variable.
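The single number printed below suggests mean() was called directly on the column:

```r
# The mean of a 0/1 variable is the proportion of 1s
mean(avocado_data$expensive)
```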
## [1] 0.3787605
CHALLENGE
Calculate the percentage of expensive avocados per region.
3.9 Pivot your data
The numerical column names refer to price lookup codes:
- 4046: small Hass
- 4225: large Hass
- 4770: extra large Hass
First, let's subset our data to remove the Total Volume column.
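One way to do this is with select() and a minus sign (the object name avocado_data_v2 matches the pivot chunk below):

```r
# Drop Total Volume, keeping the per-size columns for pivoting
avocado_data_v2 <- avocado_data %>%
  select(-`Total Volume`)
```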
Inspect your data.
Now, let’s pivot our data frame.
avocado_data_longer <- avocado_data_v2 %>%
pivot_longer(cols = c(`4046`, `4225`, `4770`),
names_to = "lookup_code")
CHALLENGE
Summarize volume of each avocado type per region. Which region buys a larger portion of large avocados?