1 Introduction

The first quantitative examinations of linguistic variation, and of its constraints, date back over 50 years in the field of variationist sociolinguistics (Grieve 2012; Labov et al. 1968). The search for a statistical method that accounts for theoretical concerns in language variation and change is not a new pursuit (Sankoff and Labov 1979). Early on, it was established that, to reveal the underlying grammar of language use, a robust probabilistic model with ordered rules was more adequate than categorical or discrete descriptions, raw frequencies, simple proportions, or distributions of use by social groups. In fact, “important generalizations of well-known types of categorical relationships between constraints and variables were found to be expressible only in quantitative terms” (Sankoff and Labov 1979, 191).

Linguistic data tends to be unevenly distributed even in controlled experimental conditions, and even more so in more naturalistic settings of data collection (e.g., sociolinguistic interviews and corpus building), which renders statistical tests such as ANOVA less reliable than originally intended (Sankoff and Labov 1979). The solution is maximum likelihood estimation, which is more robust to poorly distributed data but requires more computationally demanding methods (Sankoff and Labov 1979). Logistic regression is one such maximum-likelihood method. This is not to say that all linguistic inquiry should avoid methods such as ANOVA or linear regression. There are plenty of instances where the dependent variable under investigation is already a continuous quantitative variable (e.g., vowel shifts, reaction times), and there is no need to force the data into a derived quantitative measure (e.g., percentages of use) to be able to run linear regression.

The field of language variation and change, however, concerns itself with discrete language choices, which translate more naturally into a discrete dependent variable (usually binary, coded as zero for one of the choices and one for the other) (Grieve 2012), which is exactly what we need to fit logistic regression models. The statistical method adopted since the 1970s in language variation and change has thus been multivariate logistic regression with sum contrasts (Sankoff 1982; Sankoff and Labov 1979). The goal of this method is to establish what is said (or not said), in what contexts, and with what type and frequency of co-occurrence (in terms of phonological or syntactic environments) (Sankoff 1982).
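
For instance, a binary dependent variable of this kind can be coded in R with a simple conditional. This is only a minimal sketch with hypothetical column and variant names, not the data set analyzed later:

# hypothetical data frame d with a column variant holding one of two values;
# code one variant as 1 and the other as 0
d$dep_var_numeric <- ifelse(d$variant == "variant_a", 1, 0)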

2 Formalization

The formalization of this statistical method in language variation and change has been proposed as the following (Sankoff 1982):

\[\begin{align*} log\frac{P}{1-P} = \mu+a*x_1+b*x_2+... \end{align*}\]

where:

  * \(P\) is the probability of one of the two variants (so \(\frac{P}{1-P}\) is its odds),
  * \(\mu\) is the overall (average) tendency of the variable across all contexts, and
  * \(a\), \(b\), … are the effects of the contextual factors \(x_1\), \(x_2\), …

3 Variable Rule in R

Basically, what is meant by variable rule analysis is the evaluation of “the effects of multiple factors on a binary linguistic ‘choice’ – the presence or absence of an element, or any phenomenon treated as an alternation between two variants” (Johnson 2009, 359). More recently, scholars have shifted to more modern statistical software environments, such as R, which requires a broader understanding of statistics and of software programming, since these environments are not single-purpose programs like VARBRUL and GoldVarb (Johnson 2009; Tagliamonte and D’Arcy 2017).

There are three main things to consider when running multivariate logistic regression in R for the purpose of the study of language variation:

  1. Variable rule analysis, as seen in the formalization above, outputs \(\mu\) as the average tendency pertinent to all contexts. In other words, our intercept (i.e., probability estimate when all factor levels are zero) needs to be the average probability of our dependent variable overall. For this, we need to change the default from treatment contrasts to sum contrasts (Johnson 2009).

  2. While variable rule analysis presents results as probabilities (values between 0 and 1), R outputs regression results as logits (also known as log-odds) (Johnson 2009). Remember our formula:

\[\begin{align*} logit(P) = log\frac{P}{1-P} \end{align*}\]

where \(P\) is the probability of the variant of interest.

The logit of a probability is the logarithm of its odds. Working on the log-odds scale makes it easier to deal with very small probabilities: multiplying one small probability by another yields an even smaller number (with lots and lots of zeros after the decimal point), which could cause problems for early computers with limited numeric precision. On the log-odds scale, values are added rather than multiplied, which is also an easier operation.

We can easily convert log-odds back into probabilities with the following formula:

\[\begin{align*} probability = \frac{exp({logit})}{1 + exp({logit})} \end{align*}\]

  3. Finally, we need to multiply the coefficients from our regression output by the contrast matrix (and add the intercept) to get estimates for all factor group levels.
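
Before the case study, here is a minimal sketch of these three steps put together, assuming a hypothetical data frame d with a binary dependent variable y and a factor group (the actual data and models come later):

# 1. switch the factor group from the default treatment contrasts to sum contrasts
contrasts(d$group) <- contr.sum

# fit a logistic regression of the binary variable y by the factor group
m <- glm(y ~ group, data = d, family = binomial)

# 3. multiply the non-intercept coefficients by the contrast matrix and add the intercept
logit_estimates <- coef(m)[1] + contrasts(d$group) %*% coef(m)[-1]

# 2. convert the logit estimates back into probabilities
plogis(logit_estimates)  # same as exp(logit)/(1 + exp(logit))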

4 Linear vs. logistic regression

Both linear and logistic regression are commonly used in quantitative language analysis. The general goal of regression is to build a model where a response or dependent variable (often represented as \(y\)) can be predicted or estimated (often represented as \(\hat{y}\)) according to a number of independent variables (or factor groups).

In linear regression, a linear (i.e., straight, with no curving) line is fit to the data. Figure 4.1 shows an example of data with \(y\) and \(x\) variables and a regression line of \(y ~ x\) (i.e., \(y\) by \(x\)). In language use, continuous numeric variables are usually normalized frequencies of linguistic structures.

Figure 4.1: Scatter plot of a numeric continuous variable y by another numeric continuous variable x.
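
A minimal sketch of how a straight regression line like the one in Figure 4.1 is fit in R, using simulated data in place of real normalized frequencies:

# simulate a continuous predictor x and a continuous response y
set.seed(42)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 2)

# fit a linear regression of y by x and inspect the fitted line
linear_fit <- lm(y ~ x)
summary(linear_fit)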

In logistic regression, a binomial regression line (i.e., curved) is fit to the data. Figure 4.2 shows an example of data with a binary \(y\) variable and a continuous numeric \(x\) variable, and a regression line of \(y ~ x\) (i.e., \(y\) by \(x\)) that takes into account the binary nature of the response variable (hence it is a logistic regression line). Binary response variables in language use are usually the occurrence (or not) of some variant.

Figure 4.2: Scatter plot of a binary variable y (0 for one choice, 1 for the other) by a numeric continuous variable x.
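
And the logistic counterpart of Figure 4.2, again with simulated data (a binary response generated from a continuous predictor):

# simulate a binary response y whose probability depends on x
set.seed(42)
x <- runif(200, 0, 10)
y <- rbinom(200, size = 1, prob = plogis(-3 + 0.8 * x))

# fit a logistic regression of y by x
logistic_fit <- glm(y ~ x, family = binomial)
summary(logistic_fit)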

In the next section, I will use a case study to demonstrate first how to aggregate data to run linear regression with a numeric continuous variable. This method is similar to what one would do to run ANOVA on language data, and it is commonly used in fields like corpus linguistics. Another example of a continuous numeric variable is reaction time, commonly used in experimental linguistics.

I do believe that, in order to understand why logistic regression is preferred for certain types of linguistic data, as argued by Sankoff and Labov (1979), a good understanding of what linear regression does and how its output is interpreted is essential. I will then analyze the same data, but instead of aggregating it into normalized frequencies of use, I will use each observation of the linguistic construction (i.e., what is commonly known in language variation and change as a token) as an individual data point, with a binary dependent variable.

5 Case Study (Variable that)

For the purposes of this explanation, I will use the Variable that data from Tagliamonte’s book “Variationist Sociolinguistics” (Tagliamonte 2012) since these data are public and we can check our results with the results presented in the book. I highly recommend reading chapter 5 (Quantitative Analysis) for a better understanding of the data.

Here’s a summary of what the data looks like, with 1872 observations (i.e., tokens) and 17 columns (i.e., variables or factor groups).

## Rows: 1,872
## Columns: 17
## $ dep_var         <chr> "that_expressed", "that_expressed", "that_expressed",…
## $ main_subj       <chr> "4", "4", "3", "1", "i", "i", "z", "z", "n", "1", "1"…
## $ verbs_3         <chr> "Know", "Know", "Point out", "Be & adjective", "Be & …
## $ verbs_1         <chr> "Know", "Know", "OTHER", "OTHER", "OTHER", "OTHER", "…
## $ matrix_subj     <chr> "Other Pronoun", "Other Pronoun", "Other Pronoun", "F…
## $ add_elm         <chr> "Something", "Something", "Something", "Something", "…
## $ sub_subj        <chr> "Other", "Other", "Other", "Personal Pronoun", "Perso…
## $ int_mat         <chr> "Some", "Some", "Some", "None", "None", "Some", "Some…
## $ tense           <chr> "Past", "Present", "Past", "Past", "Past", "Past", "P…
## $ indiv           <chr> "acork", "acork", "acork", "acork", "acork", "acork",…
## $ sex             <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
## $ age_3way        <chr> "Older (66+)", "Older (66+)", "Older (66+)", "Older (…
## $ age             <dbl> 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 7…
## $ occ             <chr> "White-Collar", "White-Collar", "White-Collar", "Whit…
## $ edu             <chr> "Up to minimum age", "Up to minimum age", "Up to mini…
## $ context         <chr> "...we knew THAT a lot of um, of- of er flyers hadn '…
## $ dep_var_numeric <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,…

5.1 Linear regression with normalized frequencies

I will start this case study with a common method for creating a continuous numeric dependent variable: calculating normalized frequencies of use. Frequencies are normalized by participant, i.e., the frequency of each variant (that omitted and that expressed) is divided by the total number of tokens per participant. I multiply this proportion by 100 because that is the order of magnitude of the total token counts for the participants with the largest number of tokens.

# create that_per_indiv (variable that data per individual) by first
# counting how many that_omission and how many that_expressed by individual
# then combine total token count per individual creating a new total column
# then create with mutate a new norm_freq column with normalized frequency by 100
# then combine that with information about each individual (sex, age group, age,
# occupation and education)
that_per_indiv <- that_data %>%
  count(dep_var, indiv) %>%
  left_join(that_data %>%
              count(indiv) %>%
              rename(total = n)) %>%
  mutate(norm_freq = (n/total)*100) %>%
  left_join(that_data %>%
              distinct(indiv,
                       sex,
                       age_3way,
                       age,
                       occ,
                       edu))

When collapsing data, we have to ensure we have all the data we need. Each individual should have two rows: one observation (or row) for the percentage of that_omitted and another for the percentage of that_expressed. Zeros are usually missing from the aggregation we did above, because there are no rows for a variant certain individuals never use. We have to ensure the zeros are represented as well (zero omission is as important as 100% omission when it comes to analyzing your data).

# each individual should have two rows (one for that_omitted and another for that_expressed)
# count rows by individual, show lowest n first
that_per_indiv %>% 
  count(indiv) %>% 
  arrange(n)
## # A tibble: 32 x 2
##    indiv           n
##    <chr>       <int>
##  1 sboggin         1
##  2 vpriestly       1
##  3 acork           2
##  4 bhamilton       2
##  5 cbeale          2
##  6 cbeckett        2
##  7 cgiles          2
##  8 conniebeale     2
##  9 dburns          2
## 10 ddavis          2
## # … with 22 more rows
# inspect two individuals with just one row each
that_per_indiv %>% 
  filter(indiv %in% c("sboggin", "vpriestly"))
Table 5.1: Individuals with 100% of one variant.
dep_var       indiv      n   total  norm_freq  sex  age_3way            age  occ          edu
that_omitted  sboggin    24  24     100        F    Younger (under 35)  23   Blue-Collar  More than minimum age
that_omitted  vpriestly  15  15     100        M    Older (66+)         84   Blue-Collar  Up to minimum age

As suspected, and shown in Table 5.1, two individuals have 100% of that omission. Since we will focus only on that omission, we could proceed with the data frame as is. I like to ensure my data frame is complete before proceeding, just in case I decide to change my analysis in the future and focus on that expressed instead (you really never know, so better to be safe). Besides, zeros are data too! I hope I convinced you to do this next step to complete your aggregate data.

# run the same aggregation as before but add complete() and mutate() to
# complete data with missing combinations of individual and variant
that_per_indiv <- that_data %>%
  count(dep_var, indiv) %>%
  complete(dep_var, indiv) %>%
  mutate(n = replace_na(n, 0)) %>%
  left_join(that_data %>%
              count(indiv) %>%
              rename(total = n)) %>%
  mutate(norm_freq = (n/total)*100) %>%
  left_join(that_data %>%
              distinct(indiv,
                       sex,
                       age_3way,
                       age,
                       occ,
                       edu))
# inspect same two individuals to verify rows have been completed
that_per_indiv %>% 
  filter(indiv %in% c("sboggin", "vpriestly"))
Table 5.2: Individuals with complete data for both variants.
dep_var         indiv      n   total  norm_freq  sex  age_3way            age  occ          edu
that_expressed  sboggin    0   24     0          F    Younger (under 35)  23   Blue-Collar  More than minimum age
that_expressed  vpriestly  0   15     0          M    Older (66+)         84   Blue-Collar  Up to minimum age
that_omitted    sboggin    24  24     100        F    Younger (under 35)  23   Blue-Collar  More than minimum age
that_omitted    vpriestly  15  15     100        M    Older (66+)         84   Blue-Collar  Up to minimum age

Table 5.2 shows the same individuals as shown in Table 5.1, but completed with the variants that have zero frequency in the data.

Here’s what the resulting data looks like. Note that we went from the original data frame with 1872 observations (i.e., tokens) to an aggregate data frame by individual, with 32 observations for omission of the that complementizer. By aggregating the data by individual to create a numeric dependent variable, we are decreasing our statistical power and losing the linguistic context of each individual occurrence of the complement clause.

Let’s visualize the data by individual, ordered by normalized frequency of that omission. I will map occupation to color, to check whether occupation groups differ in terms of normalized frequency of that omission.

# start with aggregate data and then
# filter to keep just that omitted and then
# start a plot with y mapped to individual (reordered by normalized frequency),
# x mapped to normalized frequency, and color mapped to occupation
# add geometry to actually draw the plot, with a point (geom_point) for each data point
that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  ggplot(aes(y = reorder(indiv, norm_freq),
             x = norm_freq,
             color = occ)) +
  geom_point(size = 3) +
  theme_linedraw() +
  labs(y = "",
       x = "normalized frequency (percentage of that omission)",
       color = "Occupation Type",
       title = "Individual that omission by occupation") +
  scale_color_manual(values = c("#56B4E9", "#E69F00"))
Plot showing each individual's normalized frequency of that omission color-coded by occupation type.

Figure 5.1: Plot showing each individual’s normalized frequency of that omission color-coded by occupation type.

Figure 5.1 shows that the individuals with the lowest normalized frequency of omission are white-collar workers, while the individuals with the highest normalized frequency of omission are blue-collar workers.

Let’s try to plot as many variables as possible, to visualize any patterns in the aggregate data. We can map the following variables to different plot elements: age, sex, and occupation. Our numeric dependent variable is still normalized frequency. And we can use the size of the data points to indicate how many tokens each individual has in our data.

# start with aggregate data and then
# filter to keep just that omitted and then
# start a plot with y mapped to normalized frequency, x mapped to age, color to sex
# and size to total number of tokens (remember each data point is an individual)
# split plot by occupation
that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  ggplot(aes(x = age,
             y = norm_freq,
             color = sex,
             size = total)) +
  geom_point() +
  facet_wrap(~occ) +
  theme_linedraw() + 
  geom_smooth(method = "lm",
              formula = y ~ x,
              se = FALSE) +
  labs(y = "normalized frequency (percentage of that omission)",
       title = "Omission of that complementizer per individual") +
  scale_color_manual(values = c("#000000", "#009E73"))
Plot showing each individual's normalized frequency of that omission across individual age, color-coded by sex, split by occupation.

Figure 5.2: Plot showing each individual’s normalized frequency of that omission across individual age, color-coded by sex, split by occupation.

Figure 5.2 shows plots split by occupation, with normalized frequency by participant age. Each data point is an individual, and the size of the point indicates how many tokens we have for that participant (i.e., the larger the point, the more tokens that participant produced). Note that some participants have a very low number of tokens, which can be a problem for any type of statistical test you run. Later we will add individual as a random effect in our logistic regression modeling, which at least takes individual variation into account. Going back to Figure 5.2, color indicates sex (i.e., female and male). Regression lines are also shown, color-coded by sex. It seems that for white-collar workers, the normalized frequency of that omission goes down as age increases. There are only three female participants who are blue-collar workers, and they follow the same negative relationship between that omission and age. The same is not true for male blue-collar workers, for whom the regression line is flat (or slightly positive).

We can also visualize means by sex, occupation, and age group, which is especially useful when interpreting results from linear regression (the estimates will approximate the means of your numeric dependent variable and the differences between means).

that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  group_by(age_3way, sex, occ) %>%
  summarise(mean_omission = mean(norm_freq)) %>%
  mutate(age_3way = factor(age_3way,
                           levels = c("Younger (under 35)",
                                      "Middle (36-65)",
                                      "Older (66+)"))) %>%
  ggplot(aes(x = age_3way,
             y = mean_omission,
             color = sex,
             group = sex)) +
  geom_point(size = 3) +
  geom_line() +
  geom_label(aes(label = round(mean_omission, digits = 2)),
             show.legend = FALSE) +
  facet_wrap(~occ) +
  theme_linedraw() +
  theme(legend.position = "bottom") +
  labs(y = "normalized frequency (percentage of that omission)",
       x = "",
       title = "Omission of that complementizer across age groups") +
  scale_color_manual(values = c("#000000", "#009E73"))

Figure 5.3: Plot showing the mean of normalized frequency of that omission across age groups, color-coded by sex and split by occupation.

Figure 5.3 shows plots split by occupation with means of normalized frequencies for that omission aggregated by age group and sex. This type of visualization aggregates the data even more, but it is useful when interpreting linear regression results, which are shown below.

# start with aggregated data by individual and then
# filter to keep just that omitted normalized frequencies (our dependent variable)
# and then reorder age group so that the order makes sense and then
# run linear regression of normalized frequency (of that omission) by age group,
# sex and occupation and then output summary
lin_reg_model <- that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  mutate(age_3way = factor(age_3way,
                           levels = c("Younger (under 35)",
                                      "Middle (36-65)",
                                      "Older (66+)"))) %>%
  lm(formula = norm_freq ~ age_3way + sex + occ) %>%
  summary() 

lin_reg_model
## 
## Call:
## lm(formula = norm_freq ~ age_3way + sex + occ, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.781  -6.750   1.688   7.588  15.895 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              92.981      5.186  17.928   <2e-16 ***
## age_3wayMiddle (36-65)   -7.092      4.928  -1.439   0.1616    
## age_3wayOlder (66+)      -6.558      5.091  -1.288   0.2087    
## sexM                     -2.318      4.268  -0.543   0.5915    
## occWhite-Collar         -11.305      4.296  -2.631   0.0139 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.34 on 27 degrees of freedom
## Multiple R-squared:  0.2722, Adjusted R-squared:  0.1644 
## F-statistic: 2.525 on 4 and 27 DF,  p-value: 0.0641

The linear regression model above explains about 16% of the variation in the data (adjusted R-squared). The estimated normalized frequency of that omission for younger (under 35) female blue-collar participants (i.e., our intercept) is 92.98. The average difference in mean normalized frequency of that omission between younger (under 35) and middle-aged (36-65) participants is -7.09 (holding everything else constant), but the difference is not significant. The average difference between younger (under 35) and older (66+) participants is -6.56 (holding everything else constant), also not significant. The average difference between female and male participants is -2.32 (holding everything else constant), again not significant. The only significant difference is between blue- and white-collar workers, with white-collar workers’ mean normalized frequency of that omission being 11.31 points lower than blue-collar workers’ (holding everything else constant).

As seen in Figure 5.3, there are interactions across independent variables (the lines are not parallel), but with only 32 data points we do not have enough statistical power to model interactions.
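
For reference only, an interaction between age group and occupation would be specified in the model formula with * (which expands into both main effects plus their interaction). This is just a sketch of the syntax, not a model these 32 data points can support:

# sketch of interaction syntax only -- not enough data points to fit this reliably
that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  lm(formula = norm_freq ~ age_3way * occ) %>%
  summary()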

As discussed in these results, the contrasts for all independent variables are set to treatment contrasts, which is the default in R. That is the default because, in many research designs, there is a control group serving as the reference and the other levels serve as different treatments. The contrasts just change the format of the output; they do not change the model itself. You need to know what your contrasts are so you can interpret the output correctly. We can check the contrasts with the code below.

that_per_indiv %>%
  mutate(sex = factor(sex)) %>%
  pull(sex) %>%
  contrasts()
##   M
## F 0
## M 1
that_per_indiv %>%
  mutate(age_3way = factor(age_3way,
                      levels = c("Younger (under 35)",
                                      "Middle (36-65)",
                                      "Older (66+)"))) %>%
  pull(age_3way) %>%
  contrasts()
##                    Middle (36-65) Older (66+)
## Younger (under 35)              0           0
## Middle (36-65)                  1           0
## Older (66+)                     0           1
that_per_indiv %>%
  mutate(occ = factor(occ)) %>%
  pull(occ) %>%
  contrasts()
##              White-Collar
## Blue-Collar             0
## White-Collar            1

As seen above, the reference level (or control group, represented by the intercept) is the level that has zeros across the contrasts (i.e., F, Younger (under 35), Blue-Collar). The other estimates are calculated by adding the intercept estimate to the coefficient of each factor level that has a 1 in the contrasts. Here’s the formula for our linear regression:

\[\begin{align*} \hat{y} = \beta_0 + age\_middle*\beta_1 + age\_older*\beta_2 + sex*\beta_3 + occ*\beta_4 \end{align*}\]

We have our coefficients (i.e., \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\beta_3\), and \(\beta_4\)), so we can replace them in our formula.

\[\begin{align*} \hat{y} = 92.98 + age\_middle*-7.09 + age\_older*-6.56 + sex*-2.32+ occ*-11.31 \end{align*}\]

To calculate the estimate (i.e., \(\hat{y}\)) for the intercept (younger, female, blue-collar), we have the following:

\[\begin{align*} \hat{y} = 92.98 + 0*-7.09 + 0*-6.56 + 0*-2.32+ 0*-11.31 \end{align*}\]

\[\begin{align*} \hat{y} = 92.98 + 0 + 0 + 0 + 0 \end{align*}\]

\[\begin{align*} \hat{y} = 92.98 \end{align*}\]

To calculate the estimate (i.e., \(\hat{y}\)) for younger, male, blue-collar, we have the following (note that the only variable that changes from 0 to 1 is sex):

\[\begin{align*} \hat{y} = 92.98 + 0*-7.09 + 0*-6.56 + 1*-2.32+ 0*-11.31 \end{align*}\]

\[\begin{align*} \hat{y} = 92.98 + 1*-2.32 \end{align*}\]

\[\begin{align*} \hat{y} = 92.98 -2.32 \end{align*}\]

\[\begin{align*} \hat{y} = 90.66 \end{align*}\]

To calculate the estimate (i.e., \(\hat{y}\)) for younger, male, white-collar, we have the following (note that the variables that are set to 1 are sex and occ):

\[\begin{align*} \hat{y} = 92.98 + 0*-7.09 + 0*-6.56 + 1*-2.32+ 1*-11.31 \end{align*}\]

\[\begin{align*} \hat{y} = 92.98 + 1*-2.32+ 1*-11.31 \end{align*}\]

\[\begin{align*} \hat{y} = 92.98 -2.32 -11.31 \end{align*}\]

\[\begin{align*} \hat{y} = 79.36 \end{align*}\]

And so on. The calculations above are just illustrative, since only occupation is a significant factor group. Let’s create a plot and run a linear regression model with only occupation as the independent variable.

For the sake of demonstrating different visualization techniques in R, Figure 5.4 shows a box plot with median and mean of normalized frequencies of that omission across occupation type.

that_per_indiv.sum <- that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  group_by(occ) %>%
  summarise(mean = mean(norm_freq),
            median = median(norm_freq)) %>%
  pivot_longer(cols = c(mean, median),
               names_to = "type",
               values_to = "norm_freq") %>%
  mutate(type = factor(type,
                       levels = c("median", "mean")))


that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  ggplot(aes(x = occ,
             y = norm_freq)) +
  geom_boxplot() +
  geom_label(data = that_per_indiv.sum,
             aes(label = round(norm_freq, digits = 2),
                 fill = type)) +
  theme_linedraw() +
  labs(y = "normalized frequency (percentage of that omission)",
       x = "",
       title = "Omission of that complementizer across occupation",
       fill = "") +
  scale_fill_manual(values = c("#DDDDDD", "#999999"))

Figure 5.4: Box plot showing the median and mean of normalized frequency of that omission across occupation.

Note that the means in Figure 5.4 above match the estimates in the linear regression results below. The intercept (i.e., Blue-Collar) matches exactly what we have for the Blue-Collar mean in Figure 5.4.

lin_reg_model_occ <- that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  lm(formula = norm_freq ~ occ) %>%
  summary() 

lin_reg_model_occ
## 
## Call:
## lm(formula = norm_freq ~ occ, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.376  -9.439   2.634   8.907  16.596 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       86.692      3.279  26.440   <2e-16 ***
## occWhite-Collar  -10.980      4.147  -2.647   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.36 on 30 degrees of freedom
## Multiple R-squared:  0.1894, Adjusted R-squared:  0.1624 
## F-statistic: 7.009 on 1 and 30 DF,  p-value: 0.0128

The estimate for White-Collar is -10.98 relative to the intercept (i.e., 86.69 - 10.98 = 75.71), which again matches the mean for White-Collar in Figure 5.4.

These means can be calculated automatically in R (instead of by hand, as demonstrated thus far) by using the contrast matrix and the coefficients from the linear regression model. Here’s an example of how to calculate estimated means for all factor group levels for our last linear regression model:

# pull out the contrasts from occupation factor group
contrasts_occ <- that_per_indiv %>%
  mutate(occ = factor(occ)) %>%
  pull(occ) %>%
  contrasts() 

# inspect contrasts for occupation
contrasts_occ
##              White-Collar
## Blue-Collar             0
## White-Collar            1
# pull out coefficients from linear regression model
coeffs_occ <- lin_reg_model_occ$coefficients %>%
  as.data.frame() %>%
  pull(Estimate)

# inspect coefficients for linear regression model
coeffs_occ
## [1]  86.69222 -10.98003
# first coefficient is our intercept
intercept_estimate <- coeffs_occ[1]

# all estimates are the coefficients multiplied by the contrasts,
# added to our intercept
all_estimates <- (coeffs_occ*contrasts_occ) + intercept_estimate 
# print out estimates in a nice table (knitr package required)
all_estimates %>%
  as.data.frame() %>%
  rename(estimate = `White-Collar`)
Table 5.3: Linear regression estimates for Blue-Collar and White-Collar
estimate
Blue-Collar 86.69
White-Collar 75.71

Table 5.3 shows the linear regression estimates for the two levels in the occupation factor group (i.e., Blue-Collar and White-Collar). Note that these estimates match our calculations so far.

As mentioned in the introduction, it does not make much sense to have an output based on treatment contrasts, since we do not have a control group and treatment groups in variationist analyses. Blue-Collar was set as our reference (i.e., our supposed control group) in our last linear regression model because it comes before White-Collar in alphabetical order. R needs a reference level, and by default it uses alphabetical order to determine it. It is more useful for our purposes to have the intercept be the mean for the entire data set and then have estimates that tell us how far from the whole group each factor level is.
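
As an aside, if we did want to keep treatment contrasts but simply pick a different reference level, relevel() would do that (this is only an illustration; below we switch to sum contrasts instead):

# illustration only: make White-Collar the reference level instead of relying
# on alphabetical order
occ_releveled <- relevel(factor(that_per_indiv$occ), ref = "White-Collar")
contrasts(occ_releveled)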

Let’s change our occupation contrasts from default (i.e., treatment contrast) to sum contrast and see what the output would be like.

# first, occupation needs to be a factor
that_per_indiv <- that_per_indiv %>%
  mutate(occ = factor(occ))

# reminder of what contrasts look like before changing to sum contrasts
contrasts(that_per_indiv$occ)
##              White-Collar
## Blue-Collar             0
## White-Collar            1
# change to sum contrasts (other options: contr.treatment, contr.poly, contr.helmert)
contrasts(that_per_indiv$occ) <- contr.sum 

# here's what the occupation contrasts look like after the contrast change
contrasts(that_per_indiv$occ)
##              [,1]
## Blue-Collar     1
## White-Collar   -1

Everything else is kept the same. The only difference is how we deal with the output.

We first re-run the same linear regression, but now with the new contrasts.

# run linear regression again, exactly the same way as before
lin_reg_model_occ_2 <- that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  lm(formula = norm_freq ~ occ) %>%
  summary() 

lin_reg_model_occ_2
## 
## Call:
## lm(formula = norm_freq ~ occ, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.376  -9.439   2.634   8.907  16.596 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   81.202      2.074  39.158   <2e-16 ***
## occ1           5.490      2.074   2.647   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.36 on 30 degrees of freedom
## Multiple R-squared:  0.1894, Adjusted R-squared:  0.1624 
## F-statistic: 7.009 on 1 and 30 DF,  p-value: 0.0128

As seen above, the output looks slightly different. Our intercept (i.e., 81.2) is now our overall mean. In other words, taken all together, this community tends to omit the that complementizer 81.2% of the time it uses complement clauses.

Our only estimate in the results above is now called occ1 (as opposed to White-Collar, as before) because it represents the distance of each group from the overall mean (positive for Blue-Collar, coded 1, and negative for White-Collar, coded -1, as per our new contrasts). Note that this estimate (i.e., 5.49) is exactly half of our previous estimate with treatment contrasts.

Thus, to calculate our means for Blue-Collar and White-Collar as shown in Figure 5.4 with our new output, we need to take into account our new contrasts (i.e., 1 for Blue-Collar and -1 for White-Collar). The formula is thus as follows:

\[\begin{align*} \hat{y} = \mu + blue\_collar*blue\_collar\_contrast*\beta_1 + white\_collar*white\_collar\_contrast*\beta_1 \end{align*}\]

We know our contrasts to be 1 for Blue-Collar and -1 for White-Collar, so we can update the formula:

\[\begin{align*} \hat{y} = \mu + blue\_collar*(1)*\beta_1 + white\_collar*(-1)*\beta_1 \end{align*}\]

\[\begin{align*} \hat{y} = \mu + blue\_collar*\beta_1 - white\_collar*\beta_1 \end{align*}\]

Our estimate (i.e., \(\hat{y}\)) for Blue-Collar (i.e., Blue-Collar is set to 1 and White-Collar is set to 0) is calculated as follows:

\[\begin{align*} \hat{y} = \mu + 1*\beta_1 - 0*\beta_1 \end{align*}\]

\[\begin{align*} \hat{y} = \mu + 1*\beta_1 \end{align*}\]

Our \(\beta_1\) as per our linear regression model is 5.49:

\[\begin{align*} \hat{y} = \mu + 1*5.49 \end{align*}\]

And our group mean (i.e., our intercept or our \(\mu\) in this case) is 81.2:

\[\begin{align*} \hat{y} = 81.2 + 5.49 \end{align*}\]

\[\begin{align*} \hat{y} = 86.69 \end{align*}\]

Which is the same mean as we got for Blue-Collar as before (it should not change, of course, because we do have the same data and the same model).

Here’s how to calculate the estimate (i.e., \(\hat{y}\)) for White-Collar:

\[\begin{align*} \hat{y} = \mu - 1*5.49 \end{align*}\]

\[\begin{align*} \hat{y} = 81.2- 5.49 \end{align*}\]

\[\begin{align*} \hat{y} = 75.71 \end{align*}\]

And here’s how to make the same calculations in R:

# pull contrasts for occupation
contrasts_occ_2 <- that_per_indiv %>%
  pull(occ) %>%
  contrasts() 

# inspect occupation contrasts
contrasts_occ_2
##              [,1]
## Blue-Collar     1
## White-Collar   -1
# pull coefficients for our linear regression model
coeffs_occ_2 <- lin_reg_model_occ_2$coefficients %>%
  as.data.frame() %>%
  pull(Estimate)

# inspect coefficients
coeffs_occ_2
## [1] 81.202209  5.490016
# our intercept is our first coefficient and is also our group estimate
intercept_estimate_2 <- coeffs_occ_2[1]

# all estimates are calculated by multiplying our second coefficient by
# our contrasts and adding our intercept
all_estimates_2 <- (coeffs_occ_2[2]*contrasts_occ_2) + intercept_estimate_2  
# print out estimates
all_estimates_2 %>%
  as.data.frame() %>%
  rename(estimate = V1)
Table 5.4: Linear regression estimates (with contrasts set for sum contrasts) for Blue-Collar and White-Collar
estimate
Blue-Collar 86.69
White-Collar 75.71

Table 5.4 shows the estimates from our linear regression model with sum contrasts for the two levels in the occupation factor group (i.e., Blue-Collar and White-Collar). These estimates match what we had before (again, it should match, it is the same analysis).

Our linear regression with sum contrasts gives us our group mean as our intercept, which we can then add to our original box plot:

that_per_indiv %>%
  filter(dep_var == "that_omitted") %>%
  ggplot(aes(x = occ,
             y = norm_freq)) +
  geom_boxplot() +
  geom_label(data = that_per_indiv.sum,
             aes(label = round(norm_freq, digits = 2),
                 fill = type)) +
  geom_hline(yintercept=intercept_estimate_2, linetype="dashed", color = "red") +
  theme_linedraw() +
  labs(y = "normalized frequency (percentage of that omission)",
       x = "",
       title = "Omission of that complementizer across occupation",
       fill = "") +
  scale_fill_manual(values = c("#DDDDDD", "#999999"))

Figure 5.5: Box plot showing the median and mean of normalized frequency of that omission across occupation. The red horizontal dashed line is the group mean.

As mentioned at the top of this section, linear regression modeling requires a continuous numeric dependent variable, which in turn requires data aggregation in the form of percentage of use by individual (or by another variable, such as lexical verb). This process of aggregation flattens the data, which means we lose the complete linguistic context of each individual token.

In the next section, I will demonstrate how we can analyze the same data with no data aggregation, by using logistic regression with a binary dependent variable for each occurrence of a complement clause in the data.

5.2 Logistic regression with individual token occurrences

We will start by loading some libraries (you might need to install these with the install.packages() function). We will run a mixed-effects logistic regression model, since we have multiple data points from the same participants. We will be running a model with random intercepts only (as opposed to random slopes). The reason we need the packages below is that we need to define random effects (in addition to our fixed effects).

library(lme4)
library(lmerTest)
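
As a quick note on the lme4 formula syntax (shown only as commented formula sketches; the model we actually fit below uses the first form): a random intercept lets each individual have their own baseline rate, while a random slope would additionally let the effect of a within-individual predictor vary by individual.

# random intercept only (what we fit below): each individual gets their own baseline
# dep_var_numeric ~ occ + (1 | indiv)

# random intercept plus random slope (not used here): the effect of a
# within-individual predictor such as tense could also vary by individual
# dep_var_numeric ~ occ + tense + (1 + tense | indiv)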

No data aggregation is needed, so we will use our data as is (i.e., each row is an observation of a complement clause being used, a.k.a. a token).

# run a generalized linear mixed model (binomial) with individual as a random intercept
log_reg_model <- that_data %>%
  glmer(formula = dep_var_numeric ~ occ + (1|indiv),
        family = "binomial") %>%
  summary()

# inspect model
log_reg_model
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: dep_var_numeric ~ occ + (1 | indiv)
##    Data: .
## 
##      AIC      BIC   logLik deviance df.resid 
##   1862.3   1878.9   -928.1   1856.3     1869 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6456  0.2833  0.4418  0.5289  0.9015 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  indiv  (Intercept) 0.3394   0.5826  
## Number of obs: 1872, groups:  indiv, 32
## 
## Fixed effects:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       2.0546     0.2398   8.570  < 2e-16 ***
## occWhite-Collar  -0.8196     0.2820  -2.907  0.00365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## occWht-Cllr -0.846

Note that the output above (for our logistic regression model) looks quite different from what we had in the previous section for linear regression.

As mentioned in section 3, item 2, we need to convert logit estimates to probabilities. We can see that the estimate is higher for the intercept (we are using the default treatment contrasts here, so our intercept is Blue-Collar) than for White-Collar, but since most of us are used to reading probabilities rather than logits, let’s convert these estimates to probabilities.

First, we create a function to convert logit to probability.

# function that takes in a logit and converts it to a probability
logit2prob <- function(logit) {
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}
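
A quick sanity check of the function (base R’s plogis() does the same conversion):

# a logit of zero corresponds to a probability of 0.5 (even odds)
logit2prob(0)
## [1] 0.5

# matches the built-in inverse-logit function
plogis(0)
## [1] 0.5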

The same process of pulling out contrasts and coefficients from our model is done here too.

# pull contrasts for occupation
contrasts_occ_3 <- that_data %>%
  mutate(occ = factor(occ)) %>%
  pull(occ) %>%
  contrasts() 

# inspect occupation contrasts
contrasts_occ_3
##              White-Collar
## Blue-Collar             0
## White-Collar            1
# pull coefficients for our logistic regression model
coeffs_occ_3 <- log_reg_model$coefficients %>%
  as.data.frame() %>%
  pull(Estimate)

# inspect coefficients
coeffs_occ_3
## [1]  2.0546217 -0.8195944
# our intercept is our first coefficient (the estimate for the reference level, Blue-Collar)
intercept_estimate_3 <- coeffs_occ_3[1]

# all estimates are the coefficients multiplied by the contrasts,
# added to our intercept
all_estimates_logit <- (coeffs_occ_3*contrasts_occ_3) + intercept_estimate_3

# we have an extra step which is to convert logit to probability
all_estimates_prob <- logit2prob(all_estimates_logit)
# print out estimates
data.frame(all_estimates_logit,
           all_estimates_prob) %>%
  rename(logit =    White.Collar,
         probability = White.Collar.1)
##                 logit probability
## Blue-Collar  2.054622   0.8864138
## White-Collar 1.235027   0.7746973
Table 5.5: Logistic regression estimates for Blue-Collar and White-Collar
logit probability
Blue-Collar 2.05 0.89
White-Collar 1.24 0.77

Table 5.5 shows logit estimates and their conversion to probabilities, which are very similar to what we got before with the linear regression modeling. The differences are that 1) we are using all individual instances of complement clause occurrences (as opposed to aggregate data) and 2) we were able to add individual as a random intercept to our modeling, so we are taking individual differences into account (one individual’s behavior does not determine the results for the whole group).

It is important to note that we have to calculate the estimates on the logit scale before transforming them into probabilities. Also, make sure you calculate estimates by factor group level the right way (by multiplying contrasts by coefficients and adding the intercept). Do not convert or visualize the raw coefficients as they are, because that can be misleading.

To demonstrate the problem of reporting only the raw estimates output by the regression summary, I will convert the individual coefficients to probabilities and then change the order of the factor group levels to show how different the results appear to be (they are in fact the same if you multiply the contrasts by the coefficients and add the intercept). Again, reporting results like this is very misleading.

So, in summary, here’s what not to do:

# don't do this (report and convert raw estimates to probabilities)
blue_collar_as_intercept_logit <- c(log_reg_model$coefficients[1,1],
                                    log_reg_model$coefficients[2,1])

blue_collar_as_intercept_prob <- c(logit2prob(log_reg_model$coefficients[1,1]),
                              logit2prob(log_reg_model$coefficients[2,1]))

# let's change the order of factor group levels (make White-Collar the "control group")
that_data <- that_data %>%
  mutate(occ = factor(occ,
                      levels = c("White-Collar",
                                 "Blue-Collar")))

# run regression again with new order of levels (exactly the same as before)
log_reg_model_2 <- that_data %>%
  glmer(formula = dep_var_numeric ~ occ + (1|indiv),
        family = "binomial") %>%
  summary()

# inspect results
log_reg_model_2
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: dep_var_numeric ~ occ + (1 | indiv)
##    Data: .
## 
##      AIC      BIC   logLik deviance df.resid 
##   1862.3   1878.9   -928.1   1856.3     1869 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6456  0.2833  0.4418  0.5289  0.9015 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  indiv  (Intercept) 0.3394   0.5826  
## Number of obs: 1872, groups:  indiv, 32
## 
## Fixed effects:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.2350     0.1505   8.207 2.27e-16 ***
## occBlue-Collar   0.8196     0.2820   2.907  0.00365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## occBlu-Cllr -0.526
# don't do this (report raw logits and convert raw estimates to probabilities)
# put the second estimate first so the rows align with blue_collar_as_intercept
white_collar_as_intercept_logit <- c(log_reg_model_2$coefficients[2,1],
                                    log_reg_model_2$coefficients[1,1])

white_collar_as_intercept_prob <- c(logit2prob(log_reg_model_2$coefficients[2,1]),
                                    logit2prob(log_reg_model_2$coefficients[1,1]))
# print both results together
data.frame(factor_level = c("Blue-Collar", "White-Collar"),
           blue_collar_as_intercept_logit,
           blue_collar_as_intercept_prob,
           white_collar_as_intercept_logit,
           white_collar_as_intercept_prob)
Table 5.6: Raw estimate outputs (in logit and converted to probabilities) from logistic regression modeling, with each level as the intercept (in two different runs).
factor_level  blue_collar_as_intercept_logit  blue_collar_as_intercept_prob  white_collar_as_intercept_logit  white_collar_as_intercept_prob
Blue-Collar    2.05                            0.89                           0.82                             0.69
White-Collar  -0.82                            0.31                           1.24                             0.77

Table 5.6 shows how different the raw estimates look depending on which level is used as the reference in the output. In fact, the two results are the opposite of each other when looking only at raw estimates: with Blue-Collar as the intercept, Blue-Collar is the level with higher that omission; with White-Collar as the intercept, White-Collar is the level with higher that omission.

Let me repeat myself here. The results in Table 5.6 are very wrong because I am reporting raw logits and probabilities based on these raw logits. The “results” are completely different depending on which factor level is the reference. Don’t do this.

What you need to do instead is to calculate estimates by multiplying contrasts by coefficients and adding the intercept (in other words: follow the formula).
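
In general terms, the estimate for a given factor level is the intercept plus that level’s contrast codes multiplied by the corresponding coefficients (which is exactly what the code below does with the contrast matrix):

\[\begin{align*} \hat{y}_{level} = \beta_0 + c_{level,1}*\beta_1 + c_{level,2}*\beta_2 + ... \end{align*}\]

where \(c_{level,i}\) is the entry in the contrast matrix for that level and coefficient \(\beta_i\).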

Let’s see what changing the contrasts from the default treatment contrasts to sum contrasts (i.e., the intercept is the estimate for the entire group) does for our analysis.

# change contrasts to sum contrasts
contrasts(that_data$occ) <- contr.sum

# run same logistic regression model
log_reg_model_3 <- that_data %>%
  glmer(formula = dep_var_numeric ~ occ + (1|indiv),
        family = "binomial") %>%
  summary()

# inspect model
log_reg_model_3
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: dep_var_numeric ~ occ + (1 | indiv)
##    Data: .
## 
##      AIC      BIC   logLik deviance df.resid 
##   1862.3   1878.9   -928.1   1856.3     1869 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6456  0.2833  0.4418  0.5289  0.9015 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  indiv  (Intercept) 0.3394   0.5826  
## Number of obs: 1872, groups:  indiv, 32
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.6448     0.1421  11.576  < 2e-16 ***
## occ1         -0.4098     0.1410  -2.906  0.00366 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## occ1 -0.435

Once again our raw coefficients look very different from previous runs, but the results should be similar. Coefficients will look different across regression models depending on the format of your dependent variable and on how your contrasts are set (i.e., what you are using as the reference for the coefficients). In the previous section we ran linear regression, which had a continuous numeric dependent variable of normalized frequencies. This time around we are running logistic regression with a binary (i.e., 0 or 1) dependent variable, and the coefficients are output as logits. In the end, however, results should be quite similar, because we are looking at the same data and the same independent variable (i.e., occupation). Having individual tokens as our data points (instead of aggregate data per participant) allows us to take individual behavior into account in our logistic regression (i.e., individual is set as a random intercept). What we need to do to get the actual estimates for each level in our occupation factor group is to multiply the contrasts by the coefficients and add the intercept. We will also convert logits to probabilities.

# pull contrasts for occupation
contrasts_occ_4 <- that_data %>%
  pull(occ) %>%
  contrasts() 

# inspect occupation contrasts
contrasts_occ_4
##              [,1]
## White-Collar    1
## Blue-Collar    -1
# pull coefficients for our logistic regression model
coeffs_occ_4 <- log_reg_model_3$coefficients %>%
  as.data.frame() %>%
  pull(Estimate)

# inspect coefficients
coeffs_occ_4
## [1]  1.6448309 -0.4098068
# our intercept is our first coefficient and is also our group estimate
intercept_estimate_4 <- coeffs_occ_4[1]

# all estimates are calculated by multiplying our second coefficient by
# our contrasts and adding our intercept
all_estimates_logit_2 <- (coeffs_occ_4[2]*contrasts_occ_4) + intercept_estimate_4 

# convert logit to probability
all_estimates_prob_2 <- logit2prob(all_estimates_logit_2)

# build output table with all estimates
last_results <- data.frame(logit = all_estimates_logit_2,
           probability = all_estimates_prob_2) %>%
  bind_rows(data.frame(logit = intercept_estimate_4,
           probability = logit2prob(intercept_estimate_4))) %>%
  mutate(factor_level = c("White-Collar", 
                          "Blue-Collar",
                          "Everyone")) %>%
  select(factor_level, logit, probability)
# print our estimates in a nice table
last_results 
Table 5.7: Correct estimates from logistic regression with sum contrasts, showing the logit and probability estimates for each factor group level and for the group as a whole.
factor_level logit probability
White-Collar 1.24 0.77
Blue-Collar 2.05 0.89
Everyone 1.64 0.84

Table 5.7 shows that the results for the individual factor group levels are the same as before, when we did the same calculation of estimates by multiplying contrasts by coefficients and adding the intercept (as they should be). With sum contrasts we can also add to our table information about our intercept, which is the overall probability of that omission for the community as a whole.

Finally, to calculate factor weights, which are traditionally reported in language variation and change, you multiply your contrasts by the coefficients that are not the intercept (this time without adding the intercept).

# factor weights are calculated by multiplying our second coefficient by our contrasts 
# the second coefficient is the one that is not the intercept
factor_weights_logit <- coeffs_occ_4[2]*contrasts_occ_4

# convert logit to probabilities
factor_weights_prob <- logit2prob(factor_weights_logit)
# print out a nice table
factor_weights_prob %>%
  as.data.frame() %>%
  rename(FW = V1)
Table 5.8: Factor weights for the two levels of the occupation factor group.
FW
White-Collar 0.4
Blue-Collar 0.6

Table 5.8 displays the factor weights calculated from our logistic regression run with sum contrasts. The factor weights represent the distance of each factor group level from the group average. In a random binary situation (e.g., a coin flip) chance is .5 (or 50/50, because heads and tails are equally likely). We know that language choice is not random, so sum contrasts allow us to calculate the group estimate and use it as our adjusted “chance” baseline. The factor weights reveal how far from this group baseline (our “new” 50/50) the different levels are.
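
To make the link between the regression output and the factor weights concrete: the occ1 coefficient is -0.41, and our sum contrasts are 1 for White-Collar and -1 for Blue-Collar, so:

\[\begin{align*} FW_{white\_collar} = \frac{exp(-0.41)}{1 + exp(-0.41)} \approx 0.40 \qquad FW_{blue\_collar} = \frac{exp(0.41)}{1 + exp(0.41)} \approx 0.60 \end{align*}\]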

Language variation is rarely examined with only one factor group, however. So in the next section I will demonstrate a complete analysis.

5.3 Factor Weights for Linguistic Variables

Here’s a final demonstration of how to run a complete multivariate mixed-effects logistic regression outputting factor weights. I am going to include in the model the variables in Tagliamonte’s book (Tagliamonte 2012, 39:126), since the goal here is to replicate in R the results Tagliamonte got in Goldvarb. Results will be slightly different because I am running individual as a random effect.

The first step is to convert all the factor groups to be added to our regression model (i.e., verbs_1, matrix_subj, add_elm, sub_subj, int_mat, tense – see Tagliamonte’s book (Tagliamonte 2012, 39:126) for an explanation of these) into actual factors set to sum contrasts.

# all variables should be set as actual factors -- use factor()
that_data <- that_data %>%
  mutate(verbs_1 = factor(verbs_1),
         matrix_subj = factor(matrix_subj),
         add_elm = factor(add_elm),
         sub_subj = factor(sub_subj),
         tense = factor(tense),
         int_mat = factor(int_mat))

# let's check contrasts before we change them
contrasts(that_data$verbs_1)
##       OTHER Say Think
## Know      0   0     0
## OTHER     1   0     0
## Say       0   1     0
## Think     0   0     1
contrasts(that_data$matrix_subj)
##                       Noun Phrase Other Pronoun
## First Person Singular           0             0
## Noun Phrase                     1             0
## Other Pronoun                   0             1
contrasts(that_data$add_elm)
##           Something
## Nothing           0
## Something         1
contrasts(that_data$sub_subj)
##                  Personal Pronoun
## Other                           0
## Personal Pronoun                1
contrasts(that_data$tense)
##         Present
## Past          0
## Present       1
contrasts(that_data$int_mat)
##      Some
## None    0
## Some    1
# now we set factors to be in sum contrasts
contrasts(that_data$verbs_1) <- contr.sum
contrasts(that_data$matrix_subj) <- contr.sum
contrasts(that_data$add_elm) <- contr.sum
contrasts(that_data$sub_subj) <- contr.sum
contrasts(that_data$tense) <- contr.sum
contrasts(that_data$int_mat) <- contr.sum

# check contrasts after the change
contrasts(that_data$verbs_1)
##       [,1] [,2] [,3]
## Know     1    0    0
## OTHER    0    1    0
## Say      0    0    1
## Think   -1   -1   -1
contrasts(that_data$matrix_subj)
##                       [,1] [,2]
## First Person Singular    1    0
## Noun Phrase              0    1
## Other Pronoun           -1   -1
contrasts(that_data$add_elm)
##           [,1]
## Nothing      1
## Something   -1
contrasts(that_data$sub_subj)
##                  [,1]
## Other               1
## Personal Pronoun   -1
contrasts(that_data$tense)
##         [,1]
## Past       1
## Present   -1
contrasts(that_data$int_mat)
##      [,1]
## None    1
## Some   -1

You will notice that the contrasts for factor groups with more than two levels are a bit more complex than the contrasts for factor groups with only two levels. We will see later how to multiply our contrasts by our coefficients in these cases, which is slightly different from what we did before (the principle is the same).

# run a logistic regression model with the linguistic factor groups
ling_log_reg_model <- that_data %>%
  glmer(formula = dep_var_numeric ~ verbs_1 + matrix_subj + add_elm + sub_subj + int_mat + tense + (1|indiv),
        family = "binomial") %>%
  summary()

# inspect model
ling_log_reg_model
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: dep_var_numeric ~ verbs_1 + matrix_subj + add_elm + sub_subj +  
##     int_mat + tense + (1 | indiv)
##    Data: .
## 
##      AIC      BIC   logLik deviance df.resid 
##   1356.5   1417.4   -667.2   1334.5     1861 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.5748  0.0906  0.1868  0.3937  3.1500 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  indiv  (Intercept) 0.6785   0.8237  
## Number of obs: 1872, groups:  indiv, 32
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.74887    0.18817   3.980 6.90e-05 ***
## verbs_11     -0.64083    0.17641  -3.633 0.000280 ***
## verbs_12     -0.76820    0.11301  -6.798 1.06e-11 ***
## verbs_13      0.26155    0.15662   1.670 0.094937 .  
## matrix_subj1  0.98165    0.10450   9.394  < 2e-16 ***
## matrix_subj2 -0.45499    0.11768  -3.866 0.000110 ***
## add_elm1      0.45078    0.08273   5.449 5.07e-08 ***
## sub_subj1    -0.28990    0.08039  -3.606 0.000311 ***
## int_mat1      0.50881    0.08861   5.742 9.35e-09 ***
## tense1       -0.38528    0.07757  -4.967 6.81e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) vrb_11 vrb_12 vrb_13 mtrx_1 mtrx_2 add_l1 sb_sb1 int_m1
## verbs_11     0.154                                                        
## verbs_12    -0.257 -0.366                                                 
## verbs_13    -0.032 -0.440 -0.160                                          
## matrix_sbj1 -0.072 -0.073  0.014  0.186                                   
## matrix_sbj2  0.132  0.119 -0.115 -0.111 -0.551                            
## add_elm1    -0.116 -0.039  0.003  0.117  0.024  0.021                     
## sub_subj1    0.186 -0.053  0.039  0.020 -0.068  0.029 -0.029              
## int_mat1    -0.195 -0.042  0.086  0.012  0.043 -0.035  0.033 -0.010       
## tense1       0.059 -0.013  0.029 -0.158  0.170  0.010 -0.104  0.123 -0.015

The next step is to calculate our factor weights by first multiplying our contrasts by our coefficients and then transforming the results from logits to probabilities.
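
If you are jumping into this section directly, the logit2prob() helper used below is just the inverse logit. A minimal version, assuming the same definition introduced earlier in this post, looks like this.

# inverse logit: a minimal version of the logit2prob() helper used below,
# assuming the same definition introduced earlier in this post
logit2prob <- function(logit) {
  odds <- exp(logit)   # log-odds to odds
  odds / (1 + odds)    # odds to probability
}
logit2prob(0)          # 0.5 -- a logit of zero means no deviation from the intercept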

Let’s start with pulling our contrasts and saving them to objects like we did before.

# pull contrasts
contrasts_verbs <- that_data %>%
  pull(verbs_1) %>%
  contrasts() 

contrasts_mat_sj <- that_data %>%
  pull(matrix_subj) %>%
  contrasts() 

contrasts_ad_el <- that_data %>%
  pull(add_elm) %>%
  contrasts() 

contrasts_sub_sj <- that_data %>%
  pull(sub_subj) %>%
  contrasts() 

contrasts_int_mt <- that_data %>%
  pull(int_mat) %>%
  contrasts() 

contrasts_tense <- that_data %>%
  pull(tense) %>%
  contrasts() 

Now we need to pull our coefficients.

# save all coefficients as data frame for filtering
all_coefficients <- ling_log_reg_model$coefficients %>%
  as.data.frame()

# add row names as an actual variable
all_coefficients$level <- row.names(all_coefficients)

# pull out our group estimate
group_input <- all_coefficients %>%
  filter(level == "(Intercept)") %>%
  pull(Estimate)

# pull out coefficients for verbs_1
verbs_1_coeffs <- all_coefficients %>%
  filter(grepl("verbs_1", level)) %>%
  pull(Estimate)

# pull out coefficients for matrix_subj
matrix_subj_coeffs <- all_coefficients %>%
  filter(grepl("matrix_subj", level)) %>%
  pull(Estimate)

# pull out coefficients for add_elm
add_elm_coeffs <- all_coefficients %>%
  filter(grepl("add_elm", level)) %>%
  pull(Estimate)

# pull out coefficients for sub_subj
sub_subj_coeffs <- all_coefficients %>%
  filter(grepl("sub_subj", level)) %>%
  pull(Estimate)

# pull out coefficients for int_mat
int_mat_coeffs <- all_coefficients %>%
  filter(grepl("int_mat", level)) %>%
  pull(Estimate)

# pull out coefficients for tense
tense_coeffs <- all_coefficients %>%
  filter(grepl("tense", level)) %>%
  pull(Estimate)

The next step is to multiply our contrasts by our coefficients. As I mentioned before, this matrix multiplication is a little more complex and requires some transposing of the contrast matrices. I will demonstrate how to do this below.

# transpose contrast matrix before multiplying it by coefficients
# sum columns to get to individual estimates
verbs_fws_logit <- (t(contrasts_verbs) * verbs_1_coeffs) %>%
  colSums() 

# transform logits to probabilities
verbs_fws_prob <- logit2prob(verbs_fws_logit)

# put everything in a data frame and then
# arrange by highest factor weight first and then
# calculate factor weight range
verbs_results <- data.frame(factor_group = "Lexical verb",
                            level = names(verbs_fws_prob),
                            logit = verbs_fws_logit,
                            fw = verbs_fws_prob) %>%
  arrange(-fw) %>%
  mutate(range = round((max(fw)-min(fw))*100, digits = 0))

# inspect results
verbs_results %>%
  kable(digits = 2)
| factor_group | level | logit | fw   | range |
|--------------|-------|------:|-----:|------:|
| Lexical verb | Think |  1.15 | 0.76 |    44 |
| Lexical verb | Say   |  0.26 | 0.57 |    44 |
| Lexical verb | Know  | -0.64 | 0.35 |    44 |
| Lexical verb | OTHER | -0.77 | 0.32 |    44 |
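
As a quick sanity check on the Think row: its contrast row is (-1, -1, -1), so its logit is the negative sum of the three verbs_1 coefficients reported by the model, and transforming that logit gives the factor weight in the table.

# sanity check for Think: negative sum of the three verbs_1 coefficients
-(-0.64083 + -0.76820 + 0.26155)   # ~ 1.15
logit2prob(1.14748)                # ~ 0.76, matching the factor weight above
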
# now we repeat everything for the other five factor groups
##############  matrix_subj #############
# transpose contrast matrix before multiplying it by coefficients
# sum columns to get to individual estimates
matrix_subj_fws_logit <- (t(contrasts_mat_sj) * matrix_subj_coeffs) %>%
  colSums() 

# transform logits to probabilities
matrix_subj_fws_prob <- logit2prob(matrix_subj_fws_logit)

# put everything in a data frame and then
# arrange by highest factor weight first and then
# calculate factor weight range
matrix_subj_results <- data.frame(factor_group = "Matrix subject",
                                  level = names(matrix_subj_fws_prob),
                                  logit = matrix_subj_fws_logit,
                                  fw = matrix_subj_fws_prob) %>%
  arrange(-fw) %>%
  mutate(range = round((max(fw)-min(fw))*100, digits = 0))

# inspect results
matrix_subj_results %>%
  kable(digits = 2)
| factor_group   | level                 | logit | fw   | range |
|----------------|-----------------------|------:|-----:|------:|
| Matrix subject | First Person Singular |  0.98 | 0.73 |    36 |
| Matrix subject | Noun Phrase           | -0.45 | 0.39 |    36 |
| Matrix subject | Other Pronoun         | -0.53 | 0.37 |    36 |
##############  add_elm #############
# transpose contrast matrix before multiplying it by coefficients
# sum columns to get to individual estimates
add_elm_fws_logit <- (t(contrasts_ad_el) * add_elm_coeffs) %>%
  colSums() 

# transform logits to probabilities
add_elm_fws_prob <- logit2prob(add_elm_fws_logit)

# put everything in a data frame and then
# arrange by highest factor weight first and then
# calculate factor weight range
add_elm_results <- data.frame(factor_group = "Additional elements",
                              level = names(add_elm_fws_prob),
                              logit = add_elm_fws_logit,
                              fw = add_elm_fws_prob) %>%
  arrange(-fw) %>%
  mutate(range = round((max(fw)-min(fw))*100, digits = 0))

# inspect results
add_elm_results %>%
  kable(digits = 2)
| factor_group        | level     | logit | fw   | range |
|---------------------|-----------|------:|-----:|------:|
| Additional elements | Nothing   |  0.45 | 0.61 |    22 |
| Additional elements | Something | -0.45 | 0.39 |    22 |
##############  sub_subj #############
# transpose contrast matrix before multiplying it by coefficients
# sum columns to get to individual estimates
sub_subj_fws_logit <- (t(contrasts_sub_sj) * sub_subj_coeffs) %>%
  colSums() 

# transform logits to probabilities
sub_subj_fws_prob <- logit2prob(sub_subj_fws_logit)

# put everything in a data frame and then
# arrange by highest factor weight first and then
# calculate factor weight range
sub_subj_results <- data.frame(factor_group = "Complement Clause Subject",
                              level = names(sub_subj_fws_prob),
                              logit = sub_subj_fws_logit,
                              fw = sub_subj_fws_prob) %>%
  arrange(-fw) %>%
  mutate(range = round((max(fw)-min(fw))*100, digits = 0))

# inspect results
sub_subj_results %>%
  kable(digits = 2)
| factor_group              | level            | logit | fw   | range |
|---------------------------|------------------|------:|-----:|------:|
| Complement Clause Subject | Personal Pronoun |  0.29 | 0.57 |    14 |
| Complement Clause Subject | Other            | -0.29 | 0.43 |    14 |
##############  int_mat #############
# transpose contrast matrix before multiplying it by coefficients
# sum columns to get to individual estimates
int_mat_fws_logit <- (t(contrasts_int_mt) * int_mat_coeffs) %>%
  colSums() 

# transform logits to probabilities
int_mat_fws_prob <- logit2prob(int_mat_fws_logit)

# put everything in a data frame and then
# arrange by highest factor weight first and then
# calculate factor weight range
int_mat_results <- data.frame(factor_group = "Intervening material",
                              level = names(int_mat_fws_prob),
                              logit = int_mat_fws_logit,
                              fw = int_mat_fws_prob) %>%
  arrange(-fw) %>%
  mutate(range = round((max(fw)-min(fw))*100, digits = 0))

# inspect results
int_mat_results %>%
  kable(digits = 2)
| factor_group         | level | logit | fw   | range |
|----------------------|-------|------:|-----:|------:|
| Intervening material | None  |  0.51 | 0.62 |    25 |
| Intervening material | Some  | -0.51 | 0.38 |    25 |
##############  tense #############
# transpose contrast matrix before multiplying it by coefficients
# sum columns to get to individual estimates
tense_fws_logit <- (t(contrasts_tense) * tense_coeffs) %>%
  colSums() 

# transform logits to probabilities
tense_fws_prob <- logit2prob(tense_fws_logit)

# put everything in a data frame and then
# arrange by highest factor weight first and then
# calculate factor weight range
tense_results <- data.frame(factor_group = "Verb tense",
                              level = names(tense_fws_prob),
                              logit = tense_fws_logit,
                              fw = tense_fws_prob) %>%
  arrange(-fw) %>%
  mutate(range = round((max(fw)-min(fw))*100, digits = 0))

# inspect results
tense_results %>%
  kable(digits = 2)
| factor_group | level   | logit | fw   | range |
|--------------|---------|------:|-----:|------:|
| Verb tense   | Present |  0.39 | 0.60 |    19 |
| Verb tense   | Past    | -0.39 | 0.40 |    19 |
###################### COMBINE EVERYTHING ####################
# combine all results and arrange by largest range first and then by fw
all_linguistic_results <- bind_rows(verbs_results, 
                                    matrix_subj_results, 
                                    add_elm_results, 
                                    sub_subj_results, 
                                    int_mat_results, 
                                    tense_results) %>%
  arrange(-range, -fw)
# print all results
all_linguistic_results 
Table 5.9: Final table with logistic regression of the linguistic factors conditioning that omission in our data.

| factor_group              | level                 | logit | fw   | range |
|---------------------------|-----------------------|------:|-----:|------:|
| Lexical verb              | Think                 |  1.15 | 0.76 |    44 |
| Lexical verb              | Say                   |  0.26 | 0.57 |    44 |
| Lexical verb              | Know                  | -0.64 | 0.35 |    44 |
| Lexical verb              | OTHER                 | -0.77 | 0.32 |    44 |
| Matrix subject            | First Person Singular |  0.98 | 0.73 |    36 |
| Matrix subject            | Noun Phrase           | -0.45 | 0.39 |    36 |
| Matrix subject            | Other Pronoun         | -0.53 | 0.37 |    36 |
| Intervening material      | None                  |  0.51 | 0.62 |    25 |
| Intervening material      | Some                  | -0.51 | 0.38 |    25 |
| Additional elements       | Nothing               |  0.45 | 0.61 |    22 |
| Additional elements       | Something             | -0.45 | 0.39 |    22 |
| Verb tense                | Present               |  0.39 | 0.60 |    19 |
| Verb tense                | Past                  | -0.39 | 0.40 |    19 |
| Complement Clause Subject | Personal Pronoun      |  0.29 | 0.57 |    14 |
| Complement Clause Subject | Other                 | -0.29 | 0.43 |    14 |

Table 5.9 shows all factor weights from our logistic regression with sum contrasts for the linguistic factors conditioning that omission in our data. For a complete discussion of these results, please consult chapter 5 (Quantitative Analysis) of Tagliamonte's Variationist Sociolinguistics book (Tagliamonte 2012). For example, the lexical verb in the matrix clause is the factor group that most strongly constrains that omission in complement clauses (i.e., it is the factor group with the largest range in its factor weights). The lexical verb think favors that omission with a factor weight of 0.76, while the lexical verb know favors that expression with a factor weight of 0.35. Remember that our community mean serves as the reference point for these factor weights (i.e., our community's overall behavior corresponds to a weight of .50). When members of this community use a complement clause controlled by think, they are more likely to omit that (compared to their overall rate of that complementizer omission). When they use a complement clause controlled by know, they are more likely to express that (compared to their overall rate of that complementizer expression).
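
To see how these factor weights relate back to predicted probabilities, here is a back-of-the-envelope illustration (not part of the analysis above): we add a level's logit to the model intercept and transform the sum, which corresponds to holding the other factor groups at their logit-scale average and the speaker random intercept at zero, and assumes, as in the discussion above, that 1 codes that omission.

# back-of-the-envelope illustration: overall tendency vs. a think context,
# holding the other factor groups at their logit-scale average and the
# speaker random intercept at zero (1 codes that omission)
logit2prob(0.74887)            # ~ 0.68, the community's overall tendency
logit2prob(0.74887 + 1.14748)  # ~ 0.87, a complement clause controlled by think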

I highly recommend also reporting \(p\) values and confidence intervals, but that's a topic for another post. Also outside the scope of this post are interactions and the question of how much statistical power is needed to include them in our regression modeling.

6 How to cite this blog post

Picoral, Adriana. (2020, December 20). Quantitative Language Data Analysis in R: Regression and Contrasts [Blog post]. Adriana Picoral. Retrieved from https://picoral.github.io/

References

Grieve, Jack. 2012. “Sociolinguistics: Quantitative Methods.” The Encyclopedia of Applied Linguistics.

Johnson, Daniel Ezra. 2009. “Getting Off the Goldvarb Standard: Introducing Rbrul for Mixed-Effects Variable Rule Analysis.” Language and Linguistics Compass 3 (1): 359–83.

Labov, William, Paul Cohen, Clarence Robins, and John Lewis. 1968. A Study of the Nonstandard English of Black and Puerto Rican Speakers in New York City. Philadelphia: US Office of Education Cooperative Research Project.

Sankoff, David. 1982. “Sociolinguistic Method and Linguistic Theory.” In Studies in Logic and the Foundations of Mathematics, 104:677–89. Elsevier.

Sankoff, David, and William Labov. 1979. “On the Uses of Variable Rules.” Language in Society 8 (2): 189–222.

Tagliamonte, Sali A. 2012. Variationist Sociolinguistics: Change, Observation, Interpretation. Vol. 39. John Wiley & Sons.

Tagliamonte, Sali A., and Alexandra D’Arcy. 2017. “Individuals, Communities and the Sociolinguistic Canon.” New Ways of Analyzing Variation (NWAV) 46. Madison, Wisconsin.