The R Analyst badge; learn it before you earn it.

print(summary)

[1] "The We R Analysts masterclass introduces basic frequentist data analysis in R. From elucidative descriptives to infallible inferences, R has you covered. And although R's decisive power is in its incredibly large and ever-expanding collection of advanced statistical packages, the foundations of data analysis in R that you'll learn in this masterclass will enable you to explore R's full potential."

2 Data

2.1 Demonstration

The demonstration gives you the opportunity to see everything you’ll learn in harmony, before we break it apart into all the various building blocks that you can explore on your own.

In this demonstration we use the data from a paper by Simmons, Nelson, and Simonsohn (2011) published in Psychological Science. The paper aimed to show how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates, i.e., statistically significant evidence for false hypotheses. Let’s call it an ethical guide to unethical p-hacking. The authors conducted two experiments; the data used in this demonstration is the data from their Study 1.

Reference: Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

# Load necessary package collections
library("tidyverse")
library("tidymodels")

# Read in data (We R Novices class)
study_1 <- readr::read_delim("data/Study 1 .txt", delim = "\t", escape_double = FALSE, trim_ws = TRUE)

# View data (We R Novices class)
study_1 %>% tibble::view()

# Mean age in years
study_1 %>%
  dplyr::summarize(aged365_mean = mean(aged365))

# Mean age in years, by group
study_1 %>%
  dplyr::group_by(cond) %>%
  dplyr::summarize(aged365_mean = mean(aged365))

# Means of all numeric variables, by group
study_1 %>%
  dplyr::group_by(cond) %>%
  dplyr::summarize(across(where(is.numeric), list(mean = mean)))

# Various descriptives for two variables, by group
study_1 %>%
  dplyr::group_by(cond) %>%
  dplyr::summarize(across(c(aged365, aged), list(mean = mean, sd = sd, min = min, max = max)),
                   n = n())

# Correlations
library("corrr")
study_1 %>%
  dplyr::select(where(is.numeric)) %>%
  corrr::correlate()

2.2 Resources

Yeh

Computing common summary statistics.
Computing correlations.

Nah

Reading in data from various sources, and viewing it, is discussed in the We R Novices masterclass.
Tidying and transforming data is discussed in the We R Transformers masterclass. For now, just tidy your data in your favorite data editor before reading it into R.
Histograms, box plots, and other visualizations are discussed in the We R Visualizers masterclass.

2.2.1 Packages

In We R Novices, you learned how to install, load, and update packages. In this masterclass, we’ll use two package collections: the tidyverse collection introduced in We R Novices and the tidymodels collection. Tidymodels is designed for statistical modeling and machine learning, using tidyverse principles.

install.packages("tidyverse")  # installs the tidyverse collection
install.packages("tidymodels")  # installs the tidymodels collection

library("tidyverse")  # loads the tidyverse collection
library("tidymodels")  # loads the tidymodels collection

tidyverse_update()  # updates the tidyverse collection
tidymodels_update()  # updates the tidymodels collection

tidyverse_packages()  # shows the tidyverse packages
tidymodels_packages()  # shows the tidymodels packages

We’ll use various of the included packages throughout the masterclass, including:

broom for tidying results
infer for some statistical tests

2.2.2 Split data

Give me more. Before you start exploring your data set and cherry-picking the significant p-values (this is not a tip), you might want to keep a holdout set untouched (this is a tip).

iris_split <- iris %>% rsample::initial_split(prop = .8)  # the prop argument determines the proportion of data you use for analysis
iris_explore <- iris_split %>% rsample::training()
iris_holdout <- iris_split %>% rsample::testing()

Indeed, this is a programming masterclass, not a statistics masterclass, but you should definitely read this essay if you want to understand the holdout method.

2.2.3 Summarize data

R helps you to easily compute some common summary statistics for location, spread, shape, and dependence.

Summary statistics

In We R Novices you applied the mean function to a vector. Here are a few other descriptive functions:

mean for the arithmetic mean
median for the median
sum for the sum
min for the minimum
max for the maximum
sd for the standard deviation
var for the variance
n for the group size
IQR for the interquartile range
mad for the median absolute deviation
range for the range
quantile for the quartiles (or other quantiles)
skewness for the skewness (requires the moments package)
kurtosis for the kurtosis (requires the moments package)

As we are now working with tibbles rather then vectors, we need the summarize function from the dplyr package to apply one of the above functions to a variable.

iris %>% dplyr::summarize(sepal_length_median = median(Sepal.Length))
iris %>% dplyr::summarise(sepal_length_median = median(Sepal.Length))  # British spelling works too!
iris %>% dplyr::samenvatten(sepal_length_median = median(Sepal.Length))  # Dutch spelling fails miserably, such a missed opportunity
iris %>% dplyr::summarize(sepal_length_median = median(Sepal.Length), sepal_width_min = min(Sepal.Width), n = n())  # add descriptives by separating them by a comma

Some remarks.

You can assign your summary statistics to an object. Great for reuse or viewing large summary tables with the view function.
R has no function for computing the mode (the most frequent value) built in. You will learn how to create your own mode function in the We R Programmers masterclass.
In the n function, you do not specify a variable, as it does not target a specific variable but simply computes the group size. Also, n is a special function that only works within the summarize function. If, for some reason, you want to know the size of a vector, use the length function.
The range and quantile functions return multiple values per variable and thus do not work well together with the other statistics.

Grouped statistics

Want to see the descriptives per group? Specify the variable that identifies the group in the group_by function, and R does the rest.

iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarize(sepal_length_median = median(Sepal.Length), sepal_width_min = min(Sepal.Width), n = n())

Multiple statistics

A very helpful function for computing various descriptive statistics of multiple variables is the across function. Within this function you first specify the desired variable(s) and then the desired statistic(s). You use the across function inside the summarize function.

# multiple descriptive statistics of one variable
iris %>% 
  dplyr::summarize(across(Sepal.Length, list(mean = mean, sd = sd, min = min, max = max)))

# the means of two variables
iris %>%
  dplyr::summarize(across(c(Sepal.Length, Sepal.Width), list(mean = mean)))

# the means and standard deviations of two variables per level of another variable
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarize(across(c(Sepal.Length, Sepal.Width), list(mean = mean, sd = sd)))

The examples, or your psychokinetic abilities, allow you to discover two stranger things:

Multiple variables are grouped in a vector (using the c function) and multiple statistics are grouped in a list (using the list function).
The n function is used outside of the across function, as it does not target a specific variable but simply computes the group size.

Give me more. If you want to get really funky, various helper functions may help you select multiple variables at once. These functions should be used within the across function.

starts_with for specifying what the variable name starts with
ends_with for specifying what the variable name ends with
contains for specifying what the variable name contains
where for specifying what the variable is

# the means of variables that start with Sepal
iris %>%
  dplyr::summarize(across(starts_with("Sepal"), list(mean = mean)))
# the means of variables that end with Width
iris %>%
  dplyr::summarize(across(ends_with("Width"), list(mean = mean)))
# the means of all numeric variables
iris %>%
  dplyr::summarize(across(where(is.numeric), list(mean = mean)))
# multiple variables and multiple descriptive statistics per species group
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarize(across(where(is.numeric), list(mean = mean, sd = sd, min = min, max = max)), n = n())

Frequencies

To obtain frequencies, use the count function from the dplyr package.

iris %>%
  dplyr::count(Species)

Or apply your newly obtained knowledge about the group_by and summarize functions.

iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarize(n = n())

Correlations

The corrr package deals with corrrrelations. Although it is part of the tidymodels collection, it is not (yet) automatically installed and loaded.

install.packages("corrr")
library("corrr")

Now let’s create a correlation matrix. Make sure to select numeric variables only.

iris %>%
  dplyr::select(where(is.numeric)) %>%
  corrr::correlate() %>%
  corrr::rearrange()  # optional, orders the correlations by magnitude

Check the help file of the correlate function to understand how you can change the method it uses, or how it deals with missings. And check the package website to see how to easily visualize the correlation matrix (and you may now guess the type of visualization on the badge for this masterclass).

Give me more. There are more advanced packages for intra-class correlations and repeated measures correlations, namely the ICC package and rmcorr package.

2.3 Exercises

In these exercises you will work with the data used in this chapter’s initial demonstration, or alternatively, you may grab some of your own data. A description of the p-hacking data is provided in the demonstration section.

Finish it

Instruction

Download the data for Study 1 of the paper here. Open the .zip file, and find the data file called Study 1 .txt. Read this data into R and name it study_1. A text file called Codebook.txt provides a short description of each variable.

The script below computes the mean of the variable computer.

study_1 %>% 
  dplyr::summarize(computer_mean = mean(computer))

Now, adapt the script to compute the following summary statistics:

Compute the mean of the variable computer per sex.
Compute the mean, sd, min, and max of the variable computer per sex.
Compute the mean, sd, min, and max of the variables computer and diner per sex.

Tips

Run the code to see what happens now, and find out at each step what you have to add.
Find working examples in the resources and copy from them.

DIY

Instruction

Compute the minimum, maximum, mean, and standard deviation of the following variables at once: computer, dad, mom, and political.
Compute the mean and standard deviation of age in years, per condition, and per sex, at once.
Compute quantiles for age in years.

Tips

Find everything you need to know in the resources and the demonstration script. You can group by multiple variables.

Frequencies

Instruction

First pick a categorical variable and compute the frequencies for its different values.
Then, try to compute the frequencies for variables other than categorical.

Tips

The resources describe two ways of computing frequencies. Moreover, the We R Novices masterclass describes how you can recognize numerical, categorical, and other variables in R.

Correlations

Instruction

First, compute the correlation coefficients for all pairs of numeric variables in your data.
Second, rearrange the coefficients by magnitude.
Finally, if you feel like it, create a heatmap or network from the coefficients.

Tips

Look into this masterclass’ resources and the documentation of the corrr package.

Solutions

Finding answers—to pretty much anything—is really not that hard.

2.4 Break

phdcomics.com

2.5 Discussion

Lingering questions or concerns? Use Piazza to post a question or up on our plenary discussion.

3 Analyses

3.1 Demonstration

Let us again first see everything we’ll learn in harmony, before breaking it down. In this demonstration, we’ll go over doing basic frequentist inference.

# Student's t-test: political orientation difference between two groups (listened to when64 vs. not)

# Data
study_1 %>%
  dplyr::select(political, when64) %>%
  dplyr::glimpse()

# Model
mod_ttest <- political ~ when64

# Test & view results
study_1 %>%
  infer::t_test(mod_ttest, order = c("0", "1"))

# Multiple linear regression: predicting how complicated the students find computers using their sex, age and political orientation

# Data
study_1 %>%
  dplyr::select(computer, female, aged365, political) %>%
  dplyr::glimpse()

# Model
mod_lm <- computer ~ female + aged365 + political

# Test
res_lm <- study_1 %>%
  stats::lm(mod_lm, data = .)

# View results
res_lm %>% broom::glance()  # whole model results
res_lm %>% broom::tidy()  # coefficients
res_lm %>% broom::augment()  # adds useful results to data (like std. residuals)

# Assumption check: normality of residuals
res_normality <- res_lm %>%
  broom::augment() %>%
  dplyr::pull(.std.resid) %>%
  stats::shapiro.test() 

res_normality %>%
  broom::tidy()

# Now visually
res_lm %>% broom::augment() %>%
  ggplot2::ggplot(aes(x = .std.resid)) +
  ggplot2::geom_density() + 
  ggplot2::stat_function(fun = dnorm, linetype = "dashed") + 
  ggplot2::theme_minimal()

3.2 Resources

Yeh

Frequentist inference
Some popular statistical analyses
Performing statistical analyses

Nah

Bayesian inference
Advanced statistical analyses
Understanding/interpreting statistical analyses

3.2.1 Formulate model

Most statistical techniques in R require you to explicate your model. Now, don’t hide right away, it’s not as scary as you might think it is. And conveniently, it is done in a uniform format. In R, you explicate your model in a formula:

y ~ model

As you probably know, y is a common synonym for the dependent/response variable. But, ever wondered y? It might have to do with x.
Within a formula, the tilde operator ~ simply states that the left-hand side y is modelled by the right-hand side model.
The model is where you specify the model. That is, the independent/predictor variables and all their crazy interactions you came up with in a lucid dream.

Now, let’s compare two formulas with different models. A model consists of different terms, such as main effects and interactions:

y ~ a + b + a:b and y ~ 0 + a - b

a and b represent main effects
the + operator adds a term to the model
the - operator removes a term from the model
the : operator creates an interaction term
starting the model with 0 removes the intercept

Be aware that as soon as the tilde operator ~ is used, you enter the formula realm, and the above operators receive their formula interpretation. However, if you need their arithmetic interpretation, you can wrap the term in the I function (short for as is) to inhibit the formula interpretation: e.g., I(a + b) creates a main effect for the sum of the variables a and b, rather than adding a and b as separate main effects.

When writing your own model, you naturally replace the y with the name of your dependent variable and the a and b (and so on) with the names of your independent variables. Conveniently, you can write your formula upfront and save it as an object:

my_model <- happyness ~ 0 + food + I(friends^2) + food:friends

Give me more. Three more operators help you to write your models more concisely (and more difficult to grasp). Run help("formula") to read more.

the * operator creates crossed terms, e.g., a * b expands to a + b + a:b
the %in% operator creates nested terms, e.g., a + b %in$ a expands to a + a:b
the ^ operator creates crossed terms to the specified degree, e.g., (a + b + c)^2 expands to (a + b + c) * (a + b + c) expands to a + b + c + a:b + a:c + b:c

Additionally, in multilevel modeling the | operator is used to specify the group level. R has very advanced packages for multilevel modeling that you can explore on your own (e.g., lme4 and nlme).

Finally, if you think main effects and interactions are for losers, you can go nuts with logarithms and orthogonal polynomials using the log and poly functions. Great tools for a phishing expedition, you slinky p-hacker.

3.2.2 Perform analysis

In this masterclass, we only cover a few of the most common statistical techniques. Most of these analyses are built in base R, but work well with tidyverse. Nonetheless, where possible, we show their tidyverse equivalents. The tidyverse is work in progress, so expect more equivalents to come in the near future.

Tidyverse

infer::t_test for performing a tidy t-test
infer::chisq_test for performing a tidy chi-squared test

Base R

stats::aov for performing an ANOVA (aov stands for analysis of variance)
stats::lm for performing a linear regression (lm stands for linear models)
stats::glm for performing a logistic regression (glm stands for generalized linear models)
~~stats::t.test for performing a t-test~~
~~stats::chisq.test for performing a chi-squared test~~

Base versus tidy. Remember from the We R Novices masterclass that you often need the . placeholder when piping data into a base R function.

sleep %>% infer::t_test(formula = extra ~ group)  # tidyverse's t-test 
sleep %>% stats::t.test(formula = extra ~ group, data = .)  # base r's t-test

Student’s t-test

Data

The built-in data set sleep includes two student groups group that each received a different drug, and an outcome variable extra that indicates how many hours they slept more than the control group.

?"sleep"  # data description
sleep <- sleep %>% dplyr::as_tibble()  # load data as tibble
sleep %>% dplyr::select(extra, group) %>% dplyr::glimpse()
## Rows: 20
## Columns: 2
## $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1.9, 0.8,…
## $ group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2

Model

Let’s say we’re interested to learn whether the number of extra hours of sleep differs between groups.

mod_ttest <- extra ~ group

Test

Below, we show an example of:

A two sample t-test to evaluate the differences between the two drug groups in extra hours of sleep.
A one-sided two sample t-test to test whether drug group 1 slept less extra than drug group 2.
A paired t-test to evaluate the differences between the two groups as if the groups were not different people, but the same people observed twice (for an illustrative purpose).

res_ttest1 <- sleep %>% infer::t_test(mod_ttest, order = c("1", "2"))  # two sample t-test (two-sided), group 1 - 2
res_ttest1 <- sleep %>% infer::t_test(mod_ttest, order = c("2", "1"))  # two sample t-test (two-sided), group 2 - 1
res_ttest2 <- sleep %>% infer::t_test(mod_ttest, order = c("1", "2"), alternative = "less")  # use "greater" or "less" for one-sided
res_ttest3 <- sleep %>% infer::t_test(mod_ttest, order = c("1", "2"), paired = TRUE)  # paired t-test

In a t-test, the order of the groups matters. Make sure the values in the order argument match the values in your data. If you do not pick an order, the t_test function picks one for you and gives the ordering in a warning message.

sleep %>% infer::t_test(mod_ttest)
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the order
## "1" - "2", or divided in the order "1" / "2" for ratio-based statistics. To
## specify this order yourself, supply `order = c("1", "2")`.
## # A tibble: 1 x 6
##   statistic  t_df p_value alternative lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>
## 1     -1.86  17.8  0.0794 two.sided      -3.37    0.205

Chi-squared test

Data

The infer::gss data set, a subset from the General Social Survey, provides information about American society and opinions, including college degree college and family income finrela.

?"gss"  # data description
gss %>% dplyr::select(college, finrela) %>% dplyr::glimpse()
## Rows: 500
## Columns: 2
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d…
## $ finrela <fct> below average, below average, below average, above average, a…

Model

Let’s say we’re interested in understanding whether college degree and family income are independent.

mod_chisq <- college ~ finrela

Test

The following function gives us the desired results.

res_chisq <- gss %>% infer::chisq_test(mod_chisq)

Note that the chi-squared test requires categorical variables. If one of the variables is not categorical, we should either change the variable type when reading in the data (see We R Novices) or coerce it into a factor (see We R Transformers). Here’s an example with the numeric hompop variable.

mod_chisq_num <- college ~ hompop

gss %>%
  dplyr::mutate(hompop = factor(hompop)) %>%  # coerce the numeric hompop variable into a factor
  infer::chisq_test(mod_chisq_num)

One-way ANOVA

Data

The built-in data set iris includes the sepal lengths Sepal.Length for three different iris species Species.

?"iris"  # data description
iris <- iris %>% dplyr::as_tibble()  # load data as tibble
iris %>% dplyr::select(Sepal.Length, Species) %>% dplyr::glimpse()
## Rows: 150
## Columns: 2
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …

Model

Let’s say we’re interested to learn whether sepal length differs between the three different iris species.

mod_aov1 <- Sepal.Length ~ Species

Test

We conduct a one-way ANOVA, followed by pairwise comparisons.

res_aov1 <- iris %>% stats::aov(mod_aov1, data = .)  # omnibus test
res_pairw1 <- res_aov1 %>% stats::TukeyHSD()  # pairwise comparisons

Two-way ANOVA and ANCOVA

Data

The built-in data set mtcars includes various categorical and numerical variables that describe 32 different cars, such as the fuel consumption mpg, the weight wt, the engine vs, and the transmission am.

?"mtcars"  # data description
mtcars <- mtcars %>% dplyr::as_tibble()  # load data as tibble
mtcars %>% dplyr::select(mpg, wt, vs, am) %>% dplyr::glimpse()
## Rows: 32
## Columns: 4
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ wt  <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ vs  <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am  <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…

Model

Let’s say we’re interested to learn whether the fuel consumption differs between the types of engine and transmissions.

mod_aov2 <- mpg ~ vs + am

And let’s say we’re interested to learn whether the fuel consumption differs between the types of engine, given the weight of the car.

mod_aov3 <- mpg ~ vs + wt

Test

We conduct a two-way ANOVA and an ANCOVA.

res_aov2 <- mtcars %>% stats::aov(mod_aov2, data = .)  # two-way ANOVA
res_aov3 <- mtcars %>% stats::aov(mod_aov3, data = .)  # ANCOVA

Importantly, in the aov function the order of the independent variables matters for the outcome, because it uses type 1 (sequential) sums of squares (whereas the type 3 sums of squares is the default in SPSS). If you prefer to use type 2 or type 3 sums of squares, you must put your result from the aov function into the Anova function from the car package.

res_aov2 %>% car::Anova(type = 3)
res_aov3 %>% car::Anova(type = 3)

Linear regression

Data

The built-in data set iris includes the petal and sepal lengths and widths for three different iris species.

?"iris"  # data description
iris <- iris %>% dplyr::as_tibble()  # load data as tibble
iris %>% dplyr::glimpse()
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …

Model

First, let’s repeat the one-way ANOVA example in a regression context. Importantly, when using a categorical / group variable in the lm function for linear models, it must be a factor variable (or alternatively, you’d have to create dummy variables). Check the We R Novices masterclass if you forgot about factor variables. Catching a glimpse of the data above, we see that the Species variable is a factor.

mod_aov1  # formula is specified in the one-way ANOVA example

Then, let’s say we’re interested in testing a few more hypotheses.

mod_lm2 <- Sepal.Length ~ Sepal.Width
mod_lm3 <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
mod_lm4 <- Sepal.Length ~ Petal.Length + Petal.Width + Petal.Length:Petal.Width

Test

Now, let’s perform the analyses.

res_lm1 <- iris %>% stats::lm(mod_aov1, data = .)  # look, we feed it the exact same formula as for the one-way ANOVA
res_lm2 <- iris %>% stats::lm(mod_lm2, data = .)  # simple linear regression
res_lm3 <- iris %>% stats::lm(mod_lm3, data = .)  # multiple linear regression
res_lm4 <- iris %>% stats::lm(mod_lm4, data = .)  # multiple linear regression with interaction term

Logistic regression

Data

The infer::gss data set, a subset from the General Social Survey, provides information about American society and opinions, including college degree college and household size hompop.

?"gss"  # data description
gss %>% dplyr::select(college, hompop) %>% dplyr::glimpse()
## Rows: 500
## Columns: 2
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d…
## $ hompop  <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5…

Model

Let’s say we’re interested in understanding whether household size predicts college degree.

mod_glm <- college ~ hompop

Test

For logistic regression, be sure to set the family argument to binomial. As you can read at ?"family", the default link function for the binomial family is the logit, which is what we need for logistic regression. It also tells you that when the independent variable is a factor variable, the binomial family takes the second level of that factor as the reference level. In the data paragraph above we can verify that the college variable is indeed a factor variable. Now, using the levels function and the summarize function discussed in the summary statistics section, we can check which of the two levels is the second level and thus the reference level. We’ll show you how to change the order of the levels in the We R Transformers masterclass.

gss %>% summarize(college_levels = levels(college))  # check reference level (second level)
res_glm <- gss %>% stats::glm(mod_glm, data = ., family = "binomial")

Give me more. Whereas the lm function only takes the Gaussian family, the glm function generalizes to many more families. Check them at ?"family".

Repeated measures / multilevel

Give me more. As we explained previously, R has very advanced packages for multilevel modeling that you can explore on your own (e.g., lme4 and nlme). Nevertheless, below is an example of a simple repeated measures ANOVA, which you can enhance on your own (don’t tell anyone we gave you this link).

Data

The storms data set includes multiple observations for storm names name, storm classifications status, maximum sustained wind speeds wind, and report dates and times datetime (combination of year, month, day, hour).

?"storms"  # data description
storms <- storms %>% dplyr::as_tibble()  # load data as tibble
storms <- storms %>% dplyr::mutate(datetime = lubridate::make_datetime(year = year, month = month, day = day, hour = hour)) %>% dplyr::select(-year, -month, -day, -hour)  # create a single date variable, we might cover this in the We R Transformers masterclass
storms %>% dplyr::select(wind, status, name, datetime) %>% dplyr::glimpse()
## Rows: 10,010
## Columns: 4
## $ wind     <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 45, 50, 50, 55, 60, …
## $ status   <chr> "tropical depression", "tropical depression", "tropical depr…
## $ name     <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy…
## $ datetime <dttm> 1975-06-27 00:00:00, 1975-06-27 06:00:00, 1975-06-27 12:00:…

Model

Let’s say we’re interested in understanding whether the maximum sustained wind speeds differ between storm classifications (and let’s forget this is a rhetorical question). Evidently, errors of measurement within storms are more alike than errors of measurement between storms. To account for this, you need to specify how the errors are clustered.

mod_rm <- wind ~ status + Error(name)

Test

Let’s perform a repeated measures ANOVA.

res_rm <- storms %>% stats::aov(mod_rm, data = .)

3.2.3 View results

Base versus tidy. Base R’s statistical packages output an object that can be best compared to a nasty bloodstain pattern after a brutal murder; only a few people have the expertise and desire to look into it. Tidyverse statistical packages on the other hand, output a tidy tibble. And thankfully, the tidyverse hosts the broom package that does some decent bloodstain pattern analysis and is able to return a tidy tibble that summarizes key information about all kinds of fitted statistical models for most base R analyses. Run vignette("available-methods") to see the statistical packages that currently work with broom.

From a tidyverse statistical package

Directly inspect the object that you assigned the results to.

res_ttest1

From a base R statistical package

Use broom to tidy the results for you.

broom::tidy summarizes info on the model components, like regression coefficients.
broom::glance reports information about the entire model, like the R squared and model fit measures.
broom::augment adds information such as the fitted values and the residuals to the columns of the data.

res_lm1 %>% broom::tidy()
res_lm1 %>% broom::glance()
res_lm1 %>% broom::augment()

Regretfully, the broom package does not support output from the car::Anova function yet, but it might be added in the future.

3.2.4 Check assumptions

Here, we discuss various checks for assumptions such as in linear regression and analysis of variance:

the linearity assumption
the normality of residuals assumption
the homoscedasticity (regression slang) / homogeneity of variance (ANOVA slang) assumption
the no multicollinearity assumption
influential observations

Visualizations

The autoplot function of ggplot2—tidyverse’s powerful graphical package that you’ll conquer in the We R Visualizers masterclass—helps us automatically generate default plots for all sorts of things. However, it doesn’t do this on its own and needs help from other packages.

The ggfortify package is there to help. It capitalizes on the autoplot function to create default diagnostic plots for a great variety of statistical packages, including lm and glm. As ggfortify is not part of the tidyverse, you’ll need to install and load it separately.

For a linear regression object from lm, it helps you check the linearity assumption (residuals vs. fitted plot), normality of residuals assumption (normal Q–Q plot), homoscedasticity assumption (scale-location plot), and influential observations (residuals vs. leverage plot).

library("ggfortify")
res_lm3 %>% 
  ggplot2::autoplot() +
  ggplot2::theme_minimal()

Boxplot

A boxplot can be useful for checking influential observations and visually checking the homogeneity of variance assumption when performing an ANOVA. The boxplot below shows the distributions of sepal length for the three iris species (corresponding to the one-way ANOVA example above).

iris %>% 
  ggplot2::ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
  ggplot2::geom_boxplot() +
  ggplot2::theme_minimal()

Density plot

In addition to the normal Q–Q plot, a density plot of the residuals can be used to check the normality of residuals assumption. Compare the density curve of the standardized residuals of your fitted multiple regression model (solid line) with the normal curve given the mean and standard deviation of those standardized residuals (dashed line).

res_lm3 %>% broom::augment() %>% 
  ggplot2::ggplot(aes(x = .std.resid)) +
  ggplot2::geom_density(linetype = "solid") +
  ggplot2::stat_function(fun = dnorm, linetype = "dashed") +
  ggplot2::theme_minimal()

Descriptives

Correlation matrix

In the Data section of this masterclass it is shown how a correlation matrix (of the independent variables) is created, which can be used to check the no multicollinearity assumption.

Variance inflation factor

To check the no multicollinearity assumption in base R, the variance inflation factors are easily computed by inserting the results object of a multiple regression analysis (using the lm function) into the vif function.

res_lm4 %>% car::vif()

Tests

Levene’s test

With multiple groups, such as in an ANOVA, the homogeneity of variance assumption can be checked with the Levene’s test in base R.

res_aov %>% car::leveneTest()  # the default center of each group is the median
res_aov %>% car::leveneTest(center = mean)  # but you can change the center to the mean

Shapiro–Wilk test

Base R supports a Shapiro–Wilk test on your standardized residuals, to check the normality of residuals assumption.

res_lm3 %>% 
  broom::augment() %>%
  dplyr::pull(.std.resid) %>%  # pulls out the vector of your standardized residuals, as discussed in the _We R Novices_ masterclass
  stats::shapiro.test()

Give me more. There is a tidyverse-friendly function for the Shapiro–Wilk test in the rstatix package, which enables you to perform the test per group, but which must be installed and loaded separately.

library(rstatix)
res_lm3 %>% 
  broom::augment() %>%
  dplyr::group_by(Species) %>%
  rstatix::shapiro_test(.std.resid)

3.3 Exercises

In these exercises you will work again with the data used in this chapter’s demonstration script. The demonstration script might serve as a helpful example for most of these exercises.

Student’s t-test

Instruction

Test whether males and females differ in their agreement with the statement “Computers are complicated machines”.

Tips

First, create the appropriate model formula using the variable names female and computers, and then use this formula object in the t-test function.

Chi-squared test

Instruction

The students were randomly assigned to three conditions; students in each group listened to a different song. Let’s check whether the random assignment yielded a roughly equal number of males and females in each of the three conditions.

Tips

First, create the appropriate model formula, and then use this formula object in the chi-squared test function.
After having performed both a t-test and chi-squared test, your chai tea will never taste the same.

One-way ANOVA

Instruction

Does listening to different songs influence how much the students would enjoy eating at a diner?

Test the differences in the answer to the statement “Imagine you were going to a diner for dinner tonight, how much do you think you would like the food?” between the three groups. Create the appropriate model formula using the variables cond and diner and use this model object in the aov function.
Extract the omnibus test results using the appropriate function from the broom package. Is the difference between the three groups significant?
Test the homogeneity of variance assumption using the Levene’s test.

Give me more. The number of observations is low, so even if the variances between the three groups are different, we might not have enough power to detect these differences with a Levene’s test. Let’s additionally check this assumption visually by creating a box plot.

Linear regression analysis

Instruction

Let’s try to predict students’ political orientation using their dad’s age, mom’s age, and their agreement with the statement “Computers are complicated machines”.

Fit the corresponding linear regression model by creating the model formula and inserting this into the appropriate function.
Extract the results. Get all relevant output such that you would be able to report the R squared of the model, and the regression coefficients (estimates) for all three independent variables and the corresponding p-values.
Check multiple assumptions of linear regression visually using the autoplot function from the ggplot2 package.

Logistic regression

Instruction

Predict students’ sex by their political orientation and by their agreement with the statement “Computers are complicated machines”. Fit the corresponding logistic regression model and extract the results such that you could report the BIC of the model and the regression coefficients.

Bug hunt

Instruction

Run the following lines of code. What happens? Why?

model <- extra ~ group
sleep %>% infer::t_test(model)
sleep %>% stats::lm(model)

Tips

The linear regression throws an error. It has something to do with using the pipe operator on a function that is not from the tidyverse.

Solutions

Use your psi to feel your future R champion and solve the exercises with the expertise that you will obtain in these masterclasses*.

3.4 Break

xkcd.com

3.5 Discussion

Lingering questions or concerns? Use Piazza to post a question or up on our plenary discussion.

4 Wrap Up

4.1 Practice data sets

Readily available data sets

Some R packages contain data that you can freely use. Unfortunately, many data does not come as a data frame or tibble, and is thus not suitable for your needs (below we give a short list of data that is).

Run data() to see which data sets are available in the (loaded) packages.
Run data(package = "dplyr") to look for data sets in the dplyr package (replace with any other package).
Run help("storms") to get information on the storms data set (replace with any other data set).
Run data("storms") to load the storms data set (replace with any other data set).

Data frames and tibbles

Data in packages that belong to the tidyverse (such as dplyr and tidyr), have a higher chance of being in the tibble() format that you are now familiar with. Here are some examples that you might find useful.

help("iris")
help("mtcars")
help("storms")  # longitudinal data

Data packages

Some packages contain only data. There are many such data packages, but you won’t be able to find them easily. One you should definitely check out is the modeldata package. It’s part of the tidymodels collections, so you’ve already loaded it. Check the link or run data(package = "modeldata") to see which data sets it includes. For a bunch of other data packages follow this link, it will give you an overview of some data packages from 2017.

4.2 Take home exercises

Finish the exercises

Continue with the exercises, use Piazza to ask questions.

Explore your data

Grab some of your own data, tidy it, and load it into R. Start applying what you’ve learned so far!

Don’t have your own data? Or want practice data? Check the previous section to get some tips for using readily available data.

Toss out TOSS

Working on a research project right now or starting one soon? It’s time to toss out That Other Statistical Software you normally use and force yourself to do the analyses in R. You now know where to find help. And you regularly compare your results with your co-pilot’s results anyway, don’t you?

5 Colophon

Created with R Markdown and generated on November 19, 2020.

Reproducibility receipt

Session information


─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.2 (2020-06-22)
 os       macOS Catalina 10.15.7      
 system   x86_64, darwin17.0          
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Amsterdam            
 date     2020-11-19                  

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version    date       lib source        
 assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
 backports      1.1.10     2020-09-15 [1] CRAN (R 4.0.2)
 BiocManager    1.30.10    2019-11-16 [1] CRAN (R 4.0.0)
 broom        * 0.7.2      2020-10-20 [1] CRAN (R 4.0.2)
 cellranger     1.1.0      2016-07-27 [1] CRAN (R 4.0.0)
 class          7.3-17     2020-04-26 [1] CRAN (R 4.0.2)
 cli            2.1.0      2020-10-12 [1] CRAN (R 4.0.2)
 clipr          0.7.0      2019-07-23 [1] CRAN (R 4.0.0)
 codetools      0.2-16     2018-12-24 [1] CRAN (R 4.0.2)
 colorspace     1.4-1      2019-03-18 [1] CRAN (R 4.0.0)
 corrr          0.4.2      2020-03-22 [1] CRAN (R 4.0.2)
 crayon         1.3.4      2017-09-16 [1] CRAN (R 4.0.0)
 DBI            1.1.0      2019-12-15 [1] CRAN (R 4.0.0)
 dbplyr         2.0.0      2020-11-03 [1] CRAN (R 4.0.2)
 desc           1.2.0      2018-05-01 [1] CRAN (R 4.0.0)
 details      * 0.2.1      2020-01-12 [1] CRAN (R 4.0.0)
 dials        * 0.0.9      2020-09-16 [1] CRAN (R 4.0.2)
 DiceDesign     1.8-1      2019-07-31 [1] CRAN (R 4.0.0)
 digest         0.6.25     2020-02-23 [1] CRAN (R 4.0.0)
 dplyr        * 1.0.2      2020-08-18 [1] CRAN (R 4.0.2)
 ellipsis       0.3.1      2020-05-15 [1] CRAN (R 4.0.0)
 evaluate       0.14       2019-05-28 [1] CRAN (R 4.0.0)
 fansi          0.4.1      2020-01-08 [1] CRAN (R 4.0.0)
 farver         2.0.3      2020-01-16 [1] CRAN (R 4.0.0)
 forcats      * 0.5.0      2020-03-01 [1] CRAN (R 4.0.0)
 foreach        1.5.0      2020-03-30 [1] CRAN (R 4.0.0)
 fs             1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
 furrr          0.1.0      2018-05-16 [1] CRAN (R 4.0.0)
 future         1.19.1     2020-09-22 [1] CRAN (R 4.0.2)
 generics       0.1.0      2020-10-31 [1] CRAN (R 4.0.2)
 ggfortify    * 0.4.10     2020-04-26 [1] CRAN (R 4.0.2)
 ggimage        0.2.8      2020-04-02 [1] CRAN (R 4.0.0)
 ggplot2      * 3.3.2      2020-06-19 [1] CRAN (R 4.0.0)
 ggplotify      0.0.5      2020-03-12 [1] CRAN (R 4.0.0)
 git2r        * 0.27.1     2020-05-03 [1] CRAN (R 4.0.0)
 globals        0.13.0     2020-09-17 [1] CRAN (R 4.0.2)
 glue           1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
 gower          0.2.2      2020-06-23 [1] CRAN (R 4.0.2)
 GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.0.0)
 gridExtra      2.3        2017-09-09 [1] CRAN (R 4.0.0)
 gridGraphics   0.5-0      2020-02-25 [1] CRAN (R 4.0.0)
 gtable         0.3.0      2019-03-25 [1] CRAN (R 4.0.0)
 haven          2.3.1      2020-06-01 [1] CRAN (R 4.0.2)
 hexbin         1.28.1     2020-02-03 [1] CRAN (R 4.0.0)
 hexSticker   * 0.4.7      2020-06-01 [1] CRAN (R 4.0.0)
 highr          0.8        2019-03-20 [1] CRAN (R 4.0.0)
 hms            0.5.3      2020-01-08 [1] CRAN (R 4.0.0)
 htmltools      0.5.0      2020-06-16 [1] CRAN (R 4.0.2)
 httr           1.4.2      2020-07-20 [1] CRAN (R 4.0.2)
 infer        * 0.5.3      2020-07-14 [1] CRAN (R 4.0.2)
 ipred          0.9-9      2019-04-28 [1] CRAN (R 4.0.0)
 iterators      1.0.12     2019-07-26 [1] CRAN (R 4.0.0)
 jsonlite       1.7.1      2020-09-07 [1] CRAN (R 4.0.2)
 knitr          1.30       2020-09-22 [1] CRAN (R 4.0.2)
 labeling       0.3        2014-08-23 [1] CRAN (R 4.0.0)
 lattice        0.20-41    2020-04-02 [1] CRAN (R 4.0.2)
 lava           1.6.8      2020-09-26 [1] CRAN (R 4.0.2)
 lhs            1.1.0      2020-09-29 [1] CRAN (R 4.0.2)
 lifecycle      0.2.0      2020-03-06 [1] CRAN (R 4.0.0)
 listenv        0.8.0      2019-12-05 [1] CRAN (R 4.0.0)
 lubridate      1.7.9      2020-06-08 [1] CRAN (R 4.0.2)
 magick         2.4.0      2020-06-23 [1] CRAN (R 4.0.2)
 magrittr       1.5        2014-11-22 [1] CRAN (R 4.0.0)
 MASS           7.3-53     2020-09-09 [1] CRAN (R 4.0.2)
 Matrix         1.2-18     2019-11-27 [1] CRAN (R 4.0.2)
 modeldata    * 0.1.0      2020-10-22 [1] CRAN (R 4.0.2)
 modelr         0.1.8      2020-05-19 [1] CRAN (R 4.0.2)
 munsell        0.5.0      2018-06-12 [1] CRAN (R 4.0.0)
 nnet           7.3-14     2020-04-26 [1] CRAN (R 4.0.2)
 parsnip      * 0.1.4      2020-10-27 [1] CRAN (R 4.0.2)
 pillar         1.4.6      2020-07-10 [1] CRAN (R 4.0.2)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.0.0)
 plotrix      * 3.7-8      2020-04-16 [1] CRAN (R 4.0.2)
 plyr           1.8.6      2020-03-03 [1] CRAN (R 4.0.0)
 png            0.1-7      2013-12-03 [1] CRAN (R 4.0.0)
 pROC           1.16.2     2020-03-19 [1] CRAN (R 4.0.0)
 prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.0.0)
 purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)
 R6             2.4.1      2019-11-12 [1] CRAN (R 4.0.0)
 Rcpp           1.0.5      2020-07-06 [1] CRAN (R 4.0.2)
 readr        * 1.4.0      2020-10-05 [1] CRAN (R 4.0.2)
 readxl         1.3.1      2019-03-13 [1] CRAN (R 4.0.0)
 recipes      * 0.1.15     2020-11-11 [1] CRAN (R 4.0.2)
 reprex         0.3.0      2019-05-16 [1] CRAN (R 4.0.0)
 rlang          0.4.8      2020-10-08 [1] CRAN (R 4.0.2)
 rmarkdown      2.4        2020-09-30 [1] CRAN (R 4.0.2)
 rpart          4.1-15     2019-04-12 [1] CRAN (R 4.0.2)
 rprojroot      1.3-2      2018-01-03 [1] CRAN (R 4.0.0)
 rsample      * 0.0.8      2020-09-23 [1] CRAN (R 4.0.2)
 rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.0.2)
 rvcheck        0.1.8      2020-03-01 [1] CRAN (R 4.0.0)
 rvest          0.3.6      2020-07-25 [1] CRAN (R 4.0.2)
 scales       * 1.1.1      2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo  * 1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
 showtext       0.9        2020-08-13 [1] CRAN (R 4.0.2)
 showtextdb     3.0        2020-06-04 [1] CRAN (R 4.0.0)
 stringi        1.5.3      2020-09-09 [1] CRAN (R 4.0.2)
 stringr      * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
 survival       3.2-7      2020-09-28 [1] CRAN (R 4.0.2)
 sysfonts       0.8.1      2020-05-08 [1] CRAN (R 4.0.0)
 tibble       * 3.0.4      2020-10-12 [1] CRAN (R 4.0.2)
 tidymodels   * 0.1.1      2020-07-14 [1] CRAN (R 4.0.2)
 tidyr        * 1.1.2      2020-08-27 [1] CRAN (R 4.0.2)
 tidyselect     1.1.0      2020-05-11 [1] CRAN (R 4.0.0)
 tidyverse    * 1.3.0      2019-11-21 [1] CRAN (R 4.0.2)
 timeDate       3043.102   2018-02-21 [1] CRAN (R 4.0.0)
 tune         * 0.1.1      2020-07-08 [1] CRAN (R 4.0.2)
 utf8           1.1.4      2018-05-24 [1] CRAN (R 4.0.0)
 vctrs          0.3.4      2020-08-29 [1] CRAN (R 4.0.2)
 withr          2.3.0      2020-09-22 [1] CRAN (R 4.0.2)
 workflows    * 0.2.1      2020-10-08 [1] CRAN (R 4.0.2)
 xfun           0.18       2020-09-29 [1] CRAN (R 4.0.2)
 xml2           1.3.2      2020-04-23 [1] CRAN (R 4.0.0)
 yaml           2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
 yardstick    * 0.0.7      2020-07-13 [1] CRAN (R 4.0.2)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

License

We R Analysts by Alexander Savi & Simone Plak is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. An Open Educational Resource. Approved for Free Cultural Works.

We R Analysts

Part of the We R Champions masterclass series

Alexander Savi, Amsterdam Center for Learning Analytics

Simone Plak, Amsterdam Center for Learning Analytics

November 17, 2020

1 Welcome

1.1 Academic countdown

1.2 Expectations

1.3 Prerequisites

1.4 Reading guide

2 Data

2.1 Demonstration

2.2 Resources

2.2.1 Packages

2.2.2 Split data

2.2.3 Summarize data

Summary statistics

Frequencies

Correlations

2.3 Exercises

Finish it

DIY

Frequencies

Correlations

Solutions

2.4 Break

2.5 Discussion

3 Analyses

3.1 Demonstration

3.2 Resources

3.2.1 Formulate model

3.2.2 Perform analysis

Student’s t-test

Chi-squared test

One-way ANOVA

Two-way ANOVA and ANCOVA

Linear regression

Logistic regression

Repeated measures / multilevel

3.2.3 View results

3.2.4 Check assumptions

Visualizations

Descriptives

Tests

3.3 Exercises

Student’s t-test

Chi-squared test

One-way ANOVA

Linear regression analysis

Logistic regression

Bug hunt

Solutions

3.4 Break

3.5 Discussion

4 Wrap Up

4.1 Practice data sets

4.2 Take home exercises

Finish the exercises

Explore your data

Toss out TOSS

5 Colophon