1 Workshop prep

1.1 Install R and RStudio

Make sure you install both R and RStudio for this workshop.

  1. Download and install R from https://cran.r-project.org (If you are a Windows user, first determine if you are running the 32 or the 64 bit version)

  2. Download and install RStudio from https://rstudio.com/products/rstudio/download/#download

If you have R and RStudio already installed in your computer, make sure your R version is greater than 4.0 by entering sessionInfo() in your console.

sessionInfo()

1.2 Install and load libraries

For this workshop, we will be using two R packages. Make sure you install these packages before proceeding.

install.packages("tidyverse")
install.packages("effects")

Once the packages are installed, you can load them using library().

library(tidyverse)
library(effects)

2 Read and inspect data

We will be using the Variable that data from Tagliamonte’s book “Variationist Sociolinguistics” [@tagliamonte2012variationist] since these data are public and we can check our results with the results presented in the book. I highly recommend reading chapter 5 (Quantitative Analysis) for a better understanding of the data.

We have two data files: one that contains the tokens or individual observations of the language structure (i.e., verb complement clauses) and another with the extra-linguistic or social variables for each individual in the data.

# read data in
that_data <- read_csv("data/compl_that_data.csv")
individual_data <- read_csv("data/individual_data.csv")
# inspect data
that_data
dep_var dep_var_numeric verb_expanded verb matrix_subj add_elm sub_subj int_mat tense indiv context
that_expressed 0 Know Know Other Pronoun Something Other Some Past acork …we knew THAT a lot of um, of- of er flyers hadn ’t
that_expressed 0 Point out OTHER Other Pronoun Something Other Some Past acork in the music lesson she was always um, pointing out
that_omitted 1 Know Know First Person Singular Something Personal Pronoun Some Present acork I know [¬Ø] they couldn ’t
that_omitted 1 Suppose OTHER First Person Singular Something Personal Pronoun Some Present acork …I suppose [¬Ø] it ’s too late now.

Note that these data frames are tidy, meaning they show the following characteristics:

Here’s what the participant data frame looks like.

# inspect data
individual_data
indiv sex age_3way age occ edu
acork F Older (66+) 76 White-Collar Up to minimum age
bhamilton M Older (66+) 91 Blue-Collar Up to minimum age
cbeale M Younger (under 35) 30 Blue-Collar Up to minimum age
cbeckett M Younger (under 35) 29 White-Collar More than minimum age
cgiles M Younger (under 35) 20 White-Collar More than minimum age
conniebeale F Middle (36-65) 48 Blue-Collar Up to minimum age
dburns M Middle (36-65) 60 White-Collar More than minimum age
ddavis M Younger (under 35) 19 Blue-Collar Up to minimum age
dwallis M Older (66+) 72 Blue-Collar Up to minimum age
eburritt F Older (66+) 82 White-Collar Up to minimum age
echapman F Middle (36-65) 55 White-Collar More than minimum age
ggreen F Middle (36-65) 45 White-Collar Up to minimum age
hbutton M Older (66+) 68 Blue-Collar Up to minimum age
hstainton M Middle (36-65) 59 Blue-Collar Up to minimum age
jtweddle M Older (66+) 78 White-Collar Up to minimum age
kyoung F Younger (under 35) 30 White-Collar Up to minimum age
lmcgrath F Younger (under 35) 27 White-Collar Up to minimum age
mjohnstone M Middle (36-65) 53 White-Collar More than minimum age
mlondry F Middle (36-65) 62 Blue-Collar Up to minimum age
mpeters F Older (66+) 81 White-Collar Up to minimum age
mtoovey M Middle (36-65) 40 White-Collar More than minimum age
nbond F Younger (under 35) 34 White-Collar More than minimum age
ocavell M Middle (36-65) 40 White-Collar More than minimum age
rbeale M Middle (36-65) 52 Blue-Collar Up to minimum age
rcotton F Younger (under 35) 34 White-Collar Up to minimum age
rjones M Middle (36-65) 50 White-Collar Up to minimum age
rmitchell M Younger (under 35) 20 Blue-Collar Up to minimum age
sboggin F Younger (under 35) 23 Blue-Collar More than minimum age
sevans F Older (66+) 69 White-Collar More than minimum age
swatkins M Middle (36-65) 37 White-Collar More than minimum age
vpriestly M Older (66+) 84 Blue-Collar Up to minimum age
wevans M Older (66+) 72 White-Collar Up to minimum age

We can get an idea of distribution of age across sex from our tidy participant data.

individual_data %>%
  count(sex, age_3way) %>%
  pivot_wider(names_from = sex,
              values_from = n)
age_3way F M
Middle (36-65) 4 8
Older (66+) 4 6
Younger (under 35) 5 5

You can do this with any two variables.

individual_data %>%
  count(edu, occ) %>%
  pivot_wider(names_from = edu,
              values_from = n)
occ More than minimum age Up to minimum age
Blue-Collar 1 11
White-Collar 10 10

This type of cross count helps you identify any correlation between variables. If we had all observations in two pairs, e.g., Blue-Collar x Up to minimum age and White-Collar x More than minimum age, that would mean that the education and occupation are measuring the exact same thing (they correlate perfectly) are are thus basically the same variable. In the case of highly correlated variables, you choose just one of the variables to enter in your regression.

3 Data analysis by social variables

We will start with the easiest grouping of the data. Rate of that omission by participant.

3.1 Calculating percentages of that omission per participant

We first create an object that holds the rate of omission for each individual. Note that many people will have this as their “raw” data – this is not ideal, since you can always calculate counts and percentages from tidy data (i.e., data that contains one observation per row and one variable per column) but the inverse is not true.

percent_omit_indiv <- that_data %>%
  group_by(indiv, dep_var) %>%
  summarise(n = n()) %>%
  mutate(total = sum(n),
         percent = n/total) %>%
  filter(dep_var == "that_omitted") %>%
  select(-dep_var) %>%
  left_join(individual_data)
# inspect data
percent_omit_indiv
indiv n total percent sex age_3way age occ edu
acork 21 32 0.6562500 F Older (66+) 76 White-Collar Up to minimum age
bhamilton 26 28 0.9285714 M Older (66+) 91 Blue-Collar Up to minimum age
cbeale 13 19 0.6842105 M Younger (under 35) 30 Blue-Collar Up to minimum age
cbeckett 51 62 0.8225806 M Younger (under 35) 29 White-Collar More than minimum age
cgiles 48 52 0.9230769 M Younger (under 35) 20 White-Collar More than minimum age
conniebeale 57 66 0.8636364 F Middle (36-65) 48 Blue-Collar Up to minimum age
dburns 125 148 0.8445946 M Middle (36-65) 60 White-Collar More than minimum age
ddavis 54 57 0.9473684 M Younger (under 35) 19 Blue-Collar Up to minimum age
dwallis 17 22 0.7727273 M Older (66+) 72 Blue-Collar Up to minimum age
eburritt 38 44 0.8636364 F Older (66+) 82 White-Collar Up to minimum age
echapman 105 134 0.7835821 F Middle (36-65) 55 White-Collar More than minimum age
ggreen 75 86 0.8720930 F Middle (36-65) 45 White-Collar Up to minimum age
hbutton 32 44 0.7272727 M Older (66+) 68 Blue-Collar Up to minimum age
hstainton 38 41 0.9268293 M Middle (36-65) 59 Blue-Collar Up to minimum age
jtweddle 12 15 0.8000000 M Older (66+) 78 White-Collar Up to minimum age
kyoung 70 100 0.7000000 F Younger (under 35) 30 White-Collar Up to minimum age
lmcgrath 39 44 0.8863636 F Younger (under 35) 27 White-Collar Up to minimum age
mjohnstone 8 15 0.5333333 M Middle (36-65) 53 White-Collar More than minimum age
mlondry 38 48 0.7916667 F Middle (36-65) 62 Blue-Collar Up to minimum age
mpeters 49 55 0.8909091 F Older (66+) 81 White-Collar Up to minimum age
mtoovey 47 60 0.7833333 M Middle (36-65) 40 White-Collar More than minimum age
nbond 128 167 0.7664671 F Younger (under 35) 34 White-Collar More than minimum age
ocavell 23 35 0.6571429 M Middle (36-65) 40 White-Collar More than minimum age
rbeale 4 5 0.8000000 M Middle (36-65) 52 Blue-Collar Up to minimum age
rcotton 100 123 0.8130081 F Younger (under 35) 34 White-Collar Up to minimum age
rjones 43 67 0.6417910 M Middle (36-65) 50 White-Collar Up to minimum age
rmitchell 49 51 0.9607843 M Younger (under 35) 20 Blue-Collar Up to minimum age
sboggin 24 24 1.0000000 F Younger (under 35) 23 Blue-Collar More than minimum age
sevans 56 107 0.5233645 F Older (66+) 69 White-Collar More than minimum age
swatkins 23 32 0.7187500 M Middle (36-65) 37 White-Collar More than minimum age
vpriestly 15 15 1.0000000 M Older (66+) 84 Blue-Collar Up to minimum age
wevans 49 74 0.6621622 M Older (66+) 72 White-Collar Up to minimum age

We are now ready to visualize our results (which are just descriptive).

3.2 Visualizing that omission across extra-linguistic groups

Let’s start a bar plot with rate of that omission across occupation.

percent_omit_indiv %>%
  group_by(occ) %>%
  summarise(mean_omission = mean(percent)) %>%
  ggplot(aes(x = occ,
             y = mean_omission)) +
  geom_col() +
  geom_label(aes(label = round(mean_omission, 
                               digits = 2)))

We can add another categorical variable, and draw a line plot. Let’s visualize the rate of that omission across sex and occupation.

percent_omit_indiv %>%
  group_by(sex, occ) %>%
  summarise(mean_omission = mean(percent)) %>%
  ggplot(aes(x = occ,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_line() +
  geom_point() 

As we can see in the resulting plot (above), the lines are parallel, meaning the effect of sex across occupation levels seems to be the same (no interaction between sex and occupation).

Let’s visualize the rate of that omission across sex and education.

percent_omit_indiv %>%
  group_by(sex, edu) %>%
  summarise(mean_omission = mean(percent)) %>%
  mutate(edu = factor(edu,
                      levels = c("Up to minimum age",
                                 "More than minimum age"))) %>%
  ggplot(aes(x = edu,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_line() +
  geom_point() 

The lines in the plot above are not parallel (they cross), which means that the effect of sex across education is not the same (there seems to be an interaction between sex and education).

Let’s visualize sex and age group for that omission rate.

percent_omit_indiv %>%
  group_by(sex, age_3way) %>%
  summarise(mean_omission = mean(percent)) %>%
  mutate(age_3way = factor(age_3way,
                           levels = c("Younger (under 35)",
                                      "Middle (36-65)",
                                      "Older (66+)"))) %>%
  ggplot(aes(x = age_3way,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_line() +
  geom_point() 

Based on the plot above, do you think there’s an interaction between sex and age?

We can complicate things by visualizing all of these four categorical variables.

percent_omit_indiv %>%
  group_by(sex, age_3way, occ, edu) %>%
  summarise(mean_omission = mean(percent)) %>%
  mutate(age_3way = factor(age_3way,
                           levels = c("Younger (under 35)",
                                      "Middle (36-65)",
                                      "Older (66+)")),
         edu = factor(edu,
                      levels = c("Up to minimum age",
                                 "More than minimum age"))) %>%
  ggplot(aes(x = age_3way,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_line() +
  geom_point() +
  facet_wrap(~occ+edu, ncol = 2)

These exploratory visualizations are useful for deciding what variables to include in our linear regression, and what type of interactions we should be checking for.

3.3 Linear regression

We can run linear regression on the that omission rate per participant data by using our percentage of omission as our dependent variable.

percent_omit_indiv %>%
  lm(formula = percent ~ sex + age_3way + occ + edu + edu:sex) %>%
  summary()
## 
## Call:
## lm(formula = percent ~ sex + age_3way + occ + edu + edu:sex, 
##     data = .)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.219488 -0.071606  0.003609  0.091543  0.156771 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.82215    0.07709  10.665 8.61e-11 ***
## sexM                        0.04067    0.07765   0.524   0.6050    
## age_3wayOlder (66+)         0.01872    0.05702   0.328   0.7454    
## age_3wayYounger (under 35)  0.07919    0.05112   1.549   0.1340    
## occWhite-Collar            -0.13076    0.05637  -2.320   0.0288 *  
## eduUp to minimum age        0.06220    0.06983   0.891   0.3816    
## sexM:eduUp to minimum age  -0.10051    0.10527  -0.955   0.3488    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1155 on 25 degrees of freedom
## Multiple R-squared:  0.301,  Adjusted R-squared:  0.1332 
## F-statistic: 1.794 on 6 and 25 DF,  p-value: 0.1412

How do we interpret the results above? What factors are significant? Is the interaction between education and sex significant?

percent_omit_indiv %>%
  lm(formula = percent ~ occ) %>%
  summary()
## 
## Call:
## lm(formula = percent ~ occ, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23376 -0.09439  0.02634  0.08907  0.16595 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.86692    0.03279  26.440   <2e-16 ***
## occWhite-Collar -0.10980    0.04147  -2.647   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1136 on 30 degrees of freedom
## Multiple R-squared:  0.1894, Adjusted R-squared:  0.1624 
## F-statistic: 7.009 on 1 and 30 DF,  p-value: 0.0128

How does variance explain change across the two models?

Let’s save our linear regression model to an object, so we can visualize the effect of occupation.

model_1 <- percent_omit_indiv %>%
  lm(formula = percent ~ occ)

effect("occ", model_1) %>%
  data.frame() %>%
  ggplot(aes(x = occ,
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2)))

Let’s move on to the analysis of the linguistic variables in our data.

4 Data analysis by linguistic variables (version 1)

4.1 Visualizing percentages of that omission across linguistic variables

It is harder to decide how to group variables that are linguistic (compared to calculating that omission per participant, for example). We can usually do that with one linguistic variable. Let’s check the distribution of that omission rate across the lexical verb in the matrix clause.

that_data %>%
  group_by(verb, dep_var) %>%
  summarise(n = n()) %>%
  mutate(total = sum(n),
         percent = n/total) %>%
  filter(dep_var == "that_omitted") %>%
  ggplot(aes(x = reorder(verb, percent),
             y = percent)) +
  geom_col() +
  geom_label(aes(label = round(percent, digits = 2))) +
  labs(x = "")

It seems think is the verb with the highest rate of that omission. How do we know these differences are actually significant? The easiest way to do so, in my opinion, is to run logistic regression with the binary variable (that omitted vs. that expressed) as the dependent variable.

5 Data analysis by linguistic variables (version 2)

Here’s how to run the logistic regression. We need to use a numeric 0 and 1 (i.e., binary) variable as the dependent variable. We will enter in the model the factors that Tagliamonte discusses in her book.

model_2 <- that_data %>%
  glm(formula = dep_var_numeric ~ verb + matrix_subj + add_elm + sub_subj + int_mat + tense,
        family = "binomial")

summary(model_2)
## 
## Call:
## glm(formula = dep_var_numeric ~ verb + matrix_subj + add_elm + 
##     sub_subj + int_mat + tense, family = "binomial", data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8343   0.1907   0.2516   0.6431   1.7678  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                0.8351     0.2603   3.208 0.001334 ** 
## verbOTHER                 -0.1508     0.2248  -0.671 0.502380    
## verbSay                    0.9343     0.2637   3.543 0.000395 ***
## verbThink                  1.8151     0.2657   6.831 8.43e-12 ***
## matrix_subjNoun Phrase    -1.2830     0.1788  -7.175 7.23e-13 ***
## matrix_subjOther Pronoun  -1.3420     0.1614  -8.314  < 2e-16 ***
## add_elmSomething          -0.6042     0.1434  -4.214 2.51e-05 ***
## sub_subjPersonal Pronoun   0.5615     0.1504   3.735 0.000188 ***
## int_matSome               -0.6269     0.1503  -4.170 3.04e-05 ***
## tensePresent               0.7868     0.1418   5.547 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1929.2  on 1871  degrees of freedom
## Residual deviance: 1415.3  on 1862  degrees of freedom
## AIC: 1435.3
## 
## Number of Fisher Scoring iterations: 6

What factor levels are significant? What are these factor levels significant against?

Note that the output for the estimates is in log odds, which can be easily converted to probabilities. But let’s visualize the effects instead, using the effect() function which converts log odds to probabilities for us.

Here’s the effect of verb on that omission.

effect("verb", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(verb, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2)))  +
  labs(x = "lexical verb in matrix clause")

How does the plot above compare with the first bar plot on verbs and that omission?

Let’s look at the matrix subject type.

effect("matrix_subj", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(matrix_subj, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2)))  +
  labs(x = "matrix subject type")

How about the complement clause subject?

effect("sub_subj", model_2) %>%
  data.frame() %>%
  ggplot(aes(x = reorder(sub_subj, fit),
             y = fit)) +
  geom_errorbar(aes(ymin = lower,
                    ymax = upper),
                width = .2) +
  geom_label(aes(label = round(fit, digits = 2)))  +
  labs(x = "complement clause subject")

6 Short survey

Please take 2-3 minutes to fill out a short survey on what you thought of this workshop.

7 Other resources