The R Analyst badge; learn it before you earn it.
[1] "The We R Analysts masterclass introduces basic frequentist data analysis in R. From elucidative descriptives to infallible inferences, R has you covered. And although R's decisive power is in its incredibly large and ever-expanding collection of advanced statistical packages, the foundations of data analysis in R that you'll learn in this masterclass will enable you to explore R's full potential."
Our online version of an academic quarter. We’ll take five minutes for everyone to get in and get settled. Are you ready?
All done? Familiarize yourself with the document.
Very basic skills in base R and tidyverse.
Yes, I know how to…
Basic skills in (frequentist) statistics.
Yes, I know that…
Sections.
Text blocks.
The demonstration gives you the opportunity to see everything you’ll learn in harmony, before we break it apart into all the various building blocks that you can explore on your own.
In this demonstration we use the data from a paper by Simmons, Nelson, and Simonsohn (2011) published in Psychological Science. The paper aimed to show how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates, i.e., statistically significant evidence for false hypotheses. Let’s call it an ethical guide to unethical p-hacking. The authors conducted two experiments; the data used in this demonstration is the data from their Study 1.
Reference: Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.
# Load necessary package collections
library("tidyverse")
library("tidymodels")
# Read in data (We R Novices class)
study_1 <- readr::read_delim("data/Study 1 .txt", delim = "\t", escape_double = FALSE, trim_ws = TRUE)
# View data (We R Novices class)
study_1 %>% tibble::view()
# Mean age in years
study_1 %>%
dplyr::summarize(aged365_mean = mean(aged365))
# Mean age in years, by group
study_1 %>%
dplyr::group_by(cond) %>%
dplyr::summarize(aged365_mean = mean(aged365))
# Means of all numeric variables, by group
study_1 %>%
dplyr::group_by(cond) %>%
dplyr::summarize(across(where(is.numeric), list(mean = mean)))
# Various descriptives for two variables, by group
study_1 %>%
dplyr::group_by(cond) %>%
dplyr::summarize(across(c(aged365, aged), list(mean = mean, sd = sd, min = min, max = max)),
n = n())
# Correlations
library("corrr")
study_1 %>%
dplyr::select(where(is.numeric)) %>%
corrr::correlate()
In We R Novices, you learned how to install, load, and update packages. In this masterclass, we’ll use two package collections: the tidyverse collection introduced in We R Novices and the tidymodels collection. Tidymodels is designed for statistical modeling and machine learning, using tidyverse principles.
install.packages("tidyverse") # installs the tidyverse collection
install.packages("tidymodels") # installs the tidymodels collection
library("tidyverse") # loads the tidyverse collection
library("tidymodels") # loads the tidymodels collection
tidyverse_update() # updates the tidyverse collection
tidymodels_update() # updates the tidymodels collection
tidyverse_packages() # shows the tidyverse packages
tidymodels_packages() # shows the tidymodels packages
We’ll use several of the included packages throughout the masterclass, including dplyr and ggplot2 from the tidyverse, and rsample, infer, and broom from tidymodels.
Give me more. Before you start exploring your data set and cherry-picking the significant p-values (this is not a tip), you might want to keep a holdout set untouched (this is a tip).
iris_split <- iris %>% rsample::initial_split(prop = .8) # the prop argument determines the proportion of data you use for analysis
iris_explore <- iris_split %>% rsample::training()
iris_holdout <- iris_split %>% rsample::testing()
R helps you to easily compute some common summary statistics for location, spread, shape, and dependence.
In We R Novices you applied the mean function to a vector. Here are a few other descriptive functions:
mean for the arithmetic mean
median for the median
sum for the sum
min for the minimum
max for the maximum
sd for the standard deviation
var for the variance
n for the group size
IQR for the interquartile range
mad for the median absolute deviation
range for the range
quantile for the quartiles (or other quantiles)
skewness for the skewness (requires the moments package)
kurtosis for the kurtosis (requires the moments package)
As we are now working with tibbles rather than vectors, we need the summarize function from the dplyr package to apply one of the above functions to a variable.
iris %>% dplyr::summarize(sepal_length_median = median(Sepal.Length))
iris %>% dplyr::summarise(sepal_length_median = median(Sepal.Length)) # British spelling works too!
iris %>% dplyr::samenvatten(sepal_length_median = median(Sepal.Length)) # Dutch spelling fails miserably, such a missed opportunity
iris %>% dplyr::summarize(sepal_length_median = median(Sepal.Length), sepal_width_min = min(Sepal.Width), n = n()) # add descriptives by separating them by a comma
Some remarks.
The result is itself a tibble, which you can inspect with the view function.
In the n function, you do not specify a variable, as it does not target a specific variable but simply computes the group size. Also, n is a special function that only works within the summarize function. If, for some reason, you want to know the size of a vector, use the length function.
The range and quantile functions return multiple values per variable and thus do not work well together with the other statistics.
Grouped statistics
Want to see the descriptives per group? Specify the variable that identifies the group in the group_by function, and R does the rest.
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarize(sepal_length_median = median(Sepal.Length), sepal_width_min = min(Sepal.Width), n = n())
Multiple statistics
A very helpful function for computing various descriptive statistics of multiple variables is the across function. Within this function you first specify the desired variable(s) and then the desired statistic(s). You use the across function inside the summarize function.
# multiple descriptive statistics of one variable
iris %>%
dplyr::summarize(across(Sepal.Length, list(mean = mean, sd = sd, min = min, max = max)))
# the means of two variables
iris %>%
dplyr::summarize(across(c(Sepal.Length, Sepal.Width), list(mean = mean)))
# the means and standard deviations of two variables per level of another variable
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarize(across(c(Sepal.Length, Sepal.Width), list(mean = mean, sd = sd)))
The examples, or your psychokinetic abilities, allow you to discover two stranger things:
Multiple variables are grouped in a vector (using the c function) and multiple statistics are grouped in a list (using the list function).
The n function is used outside of the across function, as it does not target a specific variable but simply computes the group size.
Give me more. If you want to get really funky, various helper functions may help you select multiple variables at once. These functions should be used within the across function.
starts_with for specifying what the variable name starts with
ends_with for specifying what the variable name ends with
contains for specifying what the variable name contains
where for specifying what the variable is
# the means of variables that start with Sepal
iris %>%
dplyr::summarize(across(starts_with("Sepal"), list(mean = mean)))
# the means of variables that end with Width
iris %>%
dplyr::summarize(across(ends_with("Width"), list(mean = mean)))
# the means of all numeric variables
iris %>%
dplyr::summarize(across(where(is.numeric), list(mean = mean)))
# multiple variables and multiple descriptive statistics per species group
iris %>%
dplyr::group_by(Species) %>%
dplyr::summarize(across(where(is.numeric), list(mean = mean, sd = sd, min = min, max = max)), n = n())
To obtain frequencies, use the count function from the dplyr package.
Or apply your newly obtained knowledge about the group_by and summarize functions.
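For example, the frequencies of the iris species, computed both ways:
iris %>% dplyr::count(Species) # frequencies with count
iris %>% dplyr::group_by(Species) %>% dplyr::summarize(n = n()) # the same frequencies with group_by and summarize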
The corrr package deals with corrrrelations. Although it is part of the tidymodels collection, it is not (yet) automatically installed and loaded.
Now let’s create a correlation matrix. Make sure to select numeric variables only.
iris %>%
dplyr::select(where(is.numeric)) %>%
corrr::correlate() %>%
corrr::rearrange() # optional, orders the correlations by magnitude
Check the help file of the correlate function to understand how you can change the method it uses, or how it deals with missing values. And check the package website to see how to easily visualize the correlation matrix (and you may now guess the type of visualization on the badge for this masterclass).
In these exercises you will work with the data used in this chapter’s initial demonstration, or alternatively, you may grab some of your own data. A description of the p-hacking data is provided in the demonstration section.
Instruction
Download the data for Study 1 of the paper here. Open the .zip file, and find the data file called Study 1 .txt. Read this data into R and name it study_1. A text file called Codebook.txt provides a short description of each variable.
The script below computes the mean of the variable computer.
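A minimal version of such a script:
study_1 %>%
dplyr::summarize(computer_mean = mean(computer))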
Now, adapt the script to compute the following summary statistics:
Tips
Instruction
Tips
Instruction
Tips
Instruction
Tips
Finding answers—to pretty much anything—is really not that hard.
phdcomics.com
Lingering questions or concerns? Use Piazza to post a question or bring it up in our plenary discussion.
Let us again first see everything we’ll learn in harmony, before breaking it down. In this demonstration, we’ll go over doing basic frequentist inference.
# Student's t-test: political orientation difference between two groups (listened to when64 vs. not)
# Data
study_1 %>%
dplyr::select(political, when64) %>%
dplyr::glimpse()
# Model
mod_ttest <- political ~ when64
# Test & view results
study_1 %>%
infer::t_test(mod_ttest, order = c("0", "1"))
# Multiple linear regression: predicting how complicated the students find computers using their sex, age and political orientation
# Data
study_1 %>%
dplyr::select(computer, female, aged365, political) %>%
dplyr::glimpse()
# Model
mod_lm <- computer ~ female + aged365 + political
# Test
res_lm <- study_1 %>%
stats::lm(mod_lm, data = .)
# View results
res_lm %>% broom::glance() # whole model results
res_lm %>% broom::tidy() # coefficients
res_lm %>% broom::augment() # adds useful results to data (like std. residuals)
# Assumption check: normality of residuals
res_normality <- res_lm %>%
broom::augment() %>%
dplyr::pull(.std.resid) %>%
stats::shapiro.test()
res_normality %>%
broom::tidy()
# Now visually
res_lm %>% broom::augment() %>%
ggplot2::ggplot(aes(x = .std.resid)) +
ggplot2::geom_density() +
ggplot2::stat_function(fun = dnorm, linetype = "dashed") +
ggplot2::theme_minimal()
Most statistical techniques in R require you to explicate your model. Now, don’t hide right away, it’s not as scary as you might think it is. And conveniently, it is done in a uniform format. In R, you explicate your model in a formula:
y ~ model
y is a common synonym for the dependent/response variable. But, ever wondered y? It might have to do with x.
~ simply states that the left-hand side y is modelled by the right-hand side model.
model is where you specify the model. That is, the independent/predictor variables and all their crazy interactions you came up with in a lucid dream.
Now, let’s compare two formulas with different models. A model consists of different terms, such as main effects and interactions:
y ~ a + b + a:b
and y ~ 0 + a - b
a and b represent main effects
the + operator adds a term to the model
the - operator removes a term from the model
the : operator creates an interaction term
0 removes the intercept
Be aware that as soon as the tilde operator ~ is used, you enter the formula realm, and the above operators receive their formula interpretation. However, if you need their arithmetic interpretation, you can wrap the term in the I function (short for as is) to inhibit the formula interpretation: e.g., I(a + b) creates a main effect for the sum of the variables a and b, rather than adding a and b as separate main effects.
When writing your own model, you naturally replace the y with the name of your dependent variable and the a and b (and so on) with the names of your independent variables. Conveniently, you can write your formula upfront and save it as an object:
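For example, with placeholder names:
mod <- y ~ a + b # save the formula as an object
# later, feed the object to an analysis function such as lm or aov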
Give me more. Three more operators help you to write your models more concisely (and make them more difficult to grasp). Run help("formula") to read more.
the * operator creates crossed terms, e.g., a * b expands to a + b + a:b
the %in% operator creates nested terms, e.g., a + b %in% a expands to a + a:b
the ^ operator creates crossed terms up to the specified degree, e.g., (a + b + c)^2 expands to (a + b + c) * (a + b + c), which in turn expands to a + b + c + a:b + a:c + b:c
Additionally, in multilevel modeling the | operator is used to specify the group level. R has very advanced packages for multilevel modeling that you can explore on your own (e.g., lme4 and nlme).
Finally, if you think main effects and interactions are for losers, you can go nuts with logarithms and orthogonal polynomials using the log and poly functions. Great tools for a phishing expedition, you slinky p-hacker.
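A few hypothetical formulas that illustrate these operators and functions:
y ~ a * b # same as y ~ a + b + a:b
y ~ (a + b + c)^2 # same as y ~ a + b + c + a:b + a:c + b:c
y ~ log(a) + poly(b, 2) # a log-transformed main effect and a second-degree orthogonal polynomial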
In this masterclass, we only cover a few of the most common statistical techniques. Most of these analyses are built into base R, but work well with the tidyverse. Nonetheless, where possible, we show their tidyverse equivalents. The tidyverse is a work in progress, so expect more equivalents to come in the near future.
Tidyverse
infer::t_test for performing a tidy t-test
infer::chisq_test for performing a tidy chi-squared test
Base R
stats::aov for performing an ANOVA (aov stands for analysis of variance)
stats::lm for performing a linear regression (lm stands for linear models)
stats::glm for performing a logistic regression (glm stands for generalized linear models)
stats::t.test for performing a t-test
stats::chisq.test for performing a chi-squared test
Base versus tidy. Remember from the We R Novices masterclass that you often need the . placeholder when piping data into a base R function.
Data
The built-in data set sleep includes two student groups (group) that each received a different drug, and an outcome variable (extra) that indicates how many hours they slept more than the control group.
?"sleep" # data description
sleep <- sleep %>% dplyr::as_tibble() # load data as tibble
sleep %>% dplyr::select(extra, group) %>% dplyr::glimpse()
## Rows: 20
## Columns: 2
## $ extra <dbl> 0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0, 1.9, 0.8,…
## $ group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
Model
Let’s say we’re interested in learning whether the number of extra hours of sleep differs between groups.
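The corresponding formula, saved upfront as the mod_ttest object that the tests below use:
mod_ttest <- extra ~ group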
Test
Below, we show examples of a two-sample t-test (with both group orderings), a one-sided t-test, and a paired t-test:
res_ttest1 <- sleep %>% infer::t_test(mod_ttest, order = c("1", "2")) # two sample t-test (two-sided), group 1 - 2
res_ttest1 <- sleep %>% infer::t_test(mod_ttest, order = c("2", "1")) # two sample t-test (two-sided), group 2 - 1
res_ttest2 <- sleep %>% infer::t_test(mod_ttest, order = c("1", "2"), alternative = "less") # use "greater" or "less" for one-sided
res_ttest3 <- sleep %>% infer::t_test(mod_ttest, order = c("1", "2"), paired = TRUE) # paired t-test
In a t-test, the order of the groups matters. Make sure the values in the order argument match the values in your data. If you do not pick an order, the t_test function picks one for you and gives the ordering in a warning message.
sleep %>% infer::t_test(mod_ttest)
## Warning: The statistic is based on a difference or ratio; by default, for
## difference-based statistics, the explanatory variable is subtracted in the order
## "1" - "2", or divided in the order "1" / "2" for ratio-based statistics. To
## specify this order yourself, supply `order = c("1", "2")`.
## # A tibble: 1 x 6
## statistic t_df p_value alternative lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 -1.86 17.8 0.0794 two.sided -3.37 0.205
Data
The infer::gss data set, a subset from the General Social Survey, provides information about American society and opinions, including college degree (college) and family income (finrela).
?"gss" # data description
gss %>% dplyr::select(college, finrela) %>% dplyr::glimpse()
## Rows: 500
## Columns: 2
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d…
## $ finrela <fct> below average, below average, below average, above average, a…
Model
Let’s say we’re interested in understanding whether college degree and family income are independent.
Test
The following function gives us the desired results.
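A sketch of that call; the mod_chisq name is our own, mirroring the formula-object style used elsewhere:
mod_chisq <- college ~ finrela
gss %>% infer::chisq_test(mod_chisq)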
Note that the chi-squared test requires categorical variables. If one of the variables is not categorical, we should either change the variable type when reading in the data (see We R Novices) or coerce it into a factor (see We R Transformers). Here’s an example with the numeric hompop variable.
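A minimal sketch of such a coercion:
gss %>%
dplyr::mutate(hompop = as.factor(hompop)) %>% # coerce the numeric variable into a factor
infer::chisq_test(college ~ hompop)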
Data
The built-in data set iris includes the sepal lengths (Sepal.Length) for three different iris species (Species).
?"iris" # data description
iris <- iris %>% dplyr::as_tibble() # load data as tibble
iris %>% dplyr::select(Sepal.Length, Species) %>% dplyr::glimpse()
## Rows: 150
## Columns: 2
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
Model
Let’s say we’re interested in learning whether sepal length differs between the three iris species.
Test
We conduct a one-way ANOVA, followed by pairwise comparisons.
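A sketch of both steps; the mod_aov1 formula is reused in the linear regression section below, and Tukey’s HSD is one of several options for the pairwise comparisons:
# Model
mod_aov1 <- Sepal.Length ~ Species
# Test
res_aov1 <- iris %>% stats::aov(mod_aov1, data = .) # one-way ANOVA
res_aov1 %>% broom::tidy() # view results
res_aov1 %>% stats::TukeyHSD() %>% broom::tidy() # pairwise comparisons (Tukey's HSD)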
Data
The built-in data set mtcars includes various categorical and numerical variables that describe 32 different cars, such as the fuel consumption (mpg), the weight (wt), the engine type (vs), and the transmission (am).
?"mtcars" # data description
mtcars <- mtcars %>% dplyr::as_tibble() # load data as tibble
mtcars %>% dplyr::select(mpg, wt, vs, am) %>% dplyr::glimpse()
## Rows: 32
## Columns: 4
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
Model
Let’s say we’re interested in learning whether the fuel consumption differs between the types of engine and transmission.
And let’s say we’re interested in learning whether the fuel consumption differs between the types of engine, given the weight of the car.
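The aov calls below rely on formula objects that are not defined above; plausible definitions, given the two questions, are:
mod_aov2 <- mpg ~ vs * am # two-way ANOVA: engine, transmission, and their interaction (our assumption)
mod_aov3 <- mpg ~ wt + vs # ANCOVA: weight entered first as covariate, which matters for type 1 sums of squares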
Test
We conduct a two-way ANOVA and an ANCOVA.
res_aov2 <- mtcars %>% stats::aov(mod_aov2, data = .) # two-way ANOVA
res_aov3 <- mtcars %>% stats::aov(mod_aov3, data = .) # ANCOVA
Importantly, in the aov function the order of the independent variables matters for the outcome, because it uses type 1 (sequential) sums of squares (whereas type 3 sums of squares are the default in SPSS). If you prefer to use type 2 or type 3 sums of squares, you must put your result from the aov function into the Anova function from the car package.
Data
The built-in data set iris includes the petal and sepal lengths and widths for three different iris species.
?"iris" # data description
iris <- iris %>% dplyr::as_tibble() # load data as tibble
iris %>% dplyr::glimpse()
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
Model
First, let’s repeat the one-way ANOVA example in a regression context. Importantly, when using a categorical / group variable in the lm function for linear models, it must be a factor variable (or alternatively, you’d have to create dummy variables). Check the We R Novices masterclass if you forgot about factor variables. Catching a glimpse of the data above, we see that the Species variable is a factor.
Then, let’s say we’re interested in testing a few more hypotheses.
mod_lm2 <- Sepal.Length ~ Sepal.Width
mod_lm3 <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
mod_lm4 <- Sepal.Length ~ Petal.Length + Petal.Width + Petal.Length:Petal.Width
Test
Now, let’s perform the analyses.
res_lm1 <- iris %>% stats::lm(mod_aov1, data = .) # look, we feed it the exact same formula as for the one-way ANOVA
res_lm2 <- iris %>% stats::lm(mod_lm2, data = .) # simple linear regression
res_lm3 <- iris %>% stats::lm(mod_lm3, data = .) # multiple linear regression
res_lm4 <- iris %>% stats::lm(mod_lm4, data = .) # multiple linear regression with interaction term
Data
The infer::gss data set, a subset from the General Social Survey, provides information about American society and opinions, including college degree (college) and household size (hompop).
?"gss" # data description
gss %>% dplyr::select(college, hompop) %>% dplyr::glimpse()
## Rows: 500
## Columns: 2
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d…
## $ hompop <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5…
Model
Let’s say we’re interested in understanding whether household size predicts college degree.
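The corresponding formula, saved upfront as the mod_glm object used in the test below:
mod_glm <- college ~ hompop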
Test
For logistic regression, be sure to set the family argument to binomial. As you can read at ?"family", the default link function for the binomial family is the logit, which is what we need for logistic regression. It also tells you that when the dependent variable is a factor variable, the binomial family takes the second level of that factor as the reference level. In the data paragraph above we can verify that the college variable is indeed a factor variable. Now, using the levels function and the summarize function discussed in the summary statistics section, we can check which of the two levels is the second level and thus the reference level. We’ll show you how to change the order of the levels in the We R Transformers masterclass.
gss %>% summarize(college_levels = levels(college)) # check reference level (second level)
res_glm <- gss %>% stats::glm(mod_glm, data = ., family = "binomial")
Whereas the lm function only takes the Gaussian family, the glm function generalizes to many more families. Check them at ?"family".
Give me more. As we explained previously, R has very advanced packages for multilevel modeling that you can explore on your own (e.g., lme4 and nlme). Nevertheless, below is an example of a simple repeated measures ANOVA, which you can enhance on your own (don’t tell anyone we gave you this link).
Data
The storms data set includes multiple observations for storm names (name), storm classifications (status), maximum sustained wind speeds (wind), and report dates and times (datetime, a combination of year, month, day, and hour).
?"storms" # data description
storms <- storms %>% dplyr::as_tibble() # load data as tibble
storms <- storms %>% dplyr::mutate(datetime = lubridate::make_datetime(year = year, month = month, day = day, hour = hour)) %>% dplyr::select(-year, -month, -day, -hour) # create a single date variable, we might cover this in the We R Transformers masterclass
storms %>% dplyr::select(wind, status, name, datetime) %>% dplyr::glimpse()
## Rows: 10,010
## Columns: 4
## $ wind <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 45, 50, 50, 55, 60, …
## $ status <chr> "tropical depression", "tropical depression", "tropical depr…
## $ name <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy", "Amy…
## $ datetime <dttm> 1975-06-27 00:00:00, 1975-06-27 06:00:00, 1975-06-27 12:00:…
Model
Let’s say we’re interested in understanding whether the maximum sustained wind speeds differ between storm classifications (and let’s forget this is a rhetorical question). Evidently, errors of measurement within storms are more alike than errors of measurement between storms. To account for this, you need to specify how the errors are clustered.
Test
Let’s perform a repeated measures ANOVA.
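A minimal sketch, using the Error term in the aov formula to cluster the errors within storms:
mod_aovr <- wind ~ status + Error(name) # errors clustered within storms
res_aovr <- storms %>% stats::aov(mod_aovr, data = .)
res_aovr %>% summary() # view results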
vignette("available-methods")
to see the statistical packages that currently work with broom.
From a tidyverse statistical package
Directly inspect the object that you assigned the results to.
From a base R statistical package
Use broom to tidy the results for you.
broom::tidy summarizes information on the model components, like regression coefficients.
broom::glance reports information about the entire model, like the R squared and model fit measures.
broom::augment adds information such as the fitted values and the residuals to the columns of the data.
Regretfully, the broom package does not support output from the car::Anova function yet, but it might be added in the future.
Here, we discuss various checks for assumptions such as in linear regression and analysis of variance:
The autoplot function of ggplot2—tidyverse’s powerful graphical package that you’ll conquer in the We R Visualizers masterclass—helps us automatically generate default plots for all sorts of things. However, it doesn’t do this on its own and needs help from other packages.
The ggfortify package is there to help. It capitalizes on the autoplot function to create default diagnostic plots for a great variety of statistical packages, including lm and glm. As ggfortify is not part of the tidyverse, you’ll need to install and load it separately.
For a linear regression object from lm, it helps you check the linearity assumption (residuals vs. fitted plot), the normality of residuals assumption (normal Q–Q plot), the homoscedasticity assumption (scale-location plot), and influential observations (residuals vs. leverage plot).
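For example, assuming the multiple regression result res_lm3 from above:
# install.packages("ggfortify") # installs ggfortify
library("ggfortify") # loads ggfortify
res_lm3 %>% ggplot2::autoplot() # the four default diagnostic plots for lm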
Boxplot
A boxplot can be useful for checking influential observations and visually checking the homogeneity of variance assumption when performing an ANOVA. The boxplot below shows the distributions of sepal length for the three iris species (corresponding to the one-way ANOVA example above).
iris %>%
ggplot2::ggplot(aes(x = Species, y = Sepal.Length, fill = Species)) +
ggplot2::geom_boxplot() +
ggplot2::theme_minimal()
Density plot
In addition to the normal Q–Q plot, a density plot of the residuals can be used to check the normality of residuals assumption. Compare the density curve of the standardized residuals of your fitted multiple regression model (solid line) with the normal curve given the mean and standard deviation of those standardized residuals (dashed line).
res_lm3 %>% broom::augment() %>%
ggplot2::ggplot(aes(x = .std.resid)) +
ggplot2::geom_density(linetype = "solid") +
ggplot2::stat_function(fun = dnorm, linetype = "dashed") +
ggplot2::theme_minimal()
Correlation matrix
In the Data section of this masterclass, we showed how to create a correlation matrix (of the independent variables), which can be used to check the no multicollinearity assumption.
Variance inflation factor
To check the no multicollinearity assumption in base R, the variance inflation factors are easily computed by inserting the results object of a multiple regression analysis (from the lm function) into the vif function from the car package.
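For example, with the multiple regression result res_lm3 from above:
res_lm3 %>% car::vif() # one (generalized) variance inflation factor per term; requires the car package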
Levene’s test
With multiple groups, such as in an ANOVA, the homogeneity of variance assumption can be checked with Levene’s test, using the leveneTest function from the car package.
res_aov %>% car::leveneTest() # the default center of each group is the median
res_aov %>% car::leveneTest(center = mean) # but you can change the center to the mean
Shapiro–Wilk test
Base R supports a Shapiro–Wilk test on your standardized residuals, to check the normality of residuals assumption.
res_lm3 %>%
broom::augment() %>%
dplyr::pull(.std.resid) %>% # pulls out the vector of your standardized residuals, as discussed in the _We R Novices_ masterclass
stats::shapiro.test()
Give me more. There is a tidyverse-friendly function for the Shapiro–Wilk test in the rstatix package, which enables you to perform the test per group, but which must be installed and loaded separately.
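A sketch, here testing sepal length per species:
# install.packages("rstatix") # installs rstatix
library("rstatix") # loads rstatix
iris %>%
dplyr::group_by(Species) %>%
rstatix::shapiro_test(Sepal.Length)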
In these exercises you will work again with the data used in this chapter’s demonstration script. The demonstration script might serve as a helpful example for most of these exercises.
Instruction
Test whether males and females differ in their agreement with the statement “Computers are complicated machines”.
Tips
Save your model as a formula with the variables female and computer, and then use this formula object in the t-test function.
Instruction
The students were randomly assigned to three conditions; students in each group listened to a different song. Let’s check whether the random assignment yielded a roughly equal number of males and females in each of the three conditions.
Tips
Instruction
Does listening to different songs influence how much the students would enjoy eating at a diner?
Tips
Save your model as a formula with the variables cond and diner and use this model object in the aov function.
Instruction
Let’s try to predict students’ political orientation using their dad’s age, mom’s age, and their agreement with the statement “Computers are complicated machines”.
Tips
Check the assumptions with the autoplot function from the ggplot2 package.
Instruction
Predict students’ sex by their political orientation and by their agreement with the statement “Computers are complicated machines”. Fit the corresponding logistic regression model and extract the results such that you could report the BIC of the model and the regression coefficients.
Instruction
Run the following lines of code. What happens? Why?
Tips
xkcd.com
Lingering questions or concerns? Use Piazza to post a question or bring it up in our plenary discussion.
Readily available data sets
Some R packages contain data that you can freely use. Unfortunately, much of that data does not come as a data frame or tibble, and is thus not suitable for your needs (below we give a short list of data that is).
data() to see which data sets are available in the (loaded) packages.
data(package = "dplyr") to look for data sets in the dplyr package (replace with any other package).
help("storms") to get information on the storms data set (replace with any other data set).
data("storms") to load the storms data set (replace with any other data set).
Data frames and tibbles
Data in packages that belong to the tidyverse (such as dplyr and tidyr) have a higher chance of being in the tibble format that you are now familiar with. Here are some examples that you might find useful.
Data packages
Some packages contain only data. There are many such data packages, but you won’t be able to find them easily. One you should definitely check out is the modeldata package. It’s part of the tidymodels collection, so you’ve already loaded it. Check the link or run data(package = "modeldata") to see which data sets it includes. For a bunch of other data packages, follow this link; it will give you an overview of some data packages from 2017.
Continue with the exercises, use Piazza to ask questions.
Grab some of your own data, tidy it, and load it into R. Start applying what you’ve learned so far!
Don’t have your own data? Or want practice data? Check the previous section to get some tips for using readily available data.
Working on a research project right now or starting one soon? It’s time to toss out That Other Statistical Software you normally use and force yourself to do the analyses in R. You now know where to find help. And you regularly compare your results with your co-pilot’s results anyway, don’t you?
Created with R Markdown and generated on November 19, 2020.
Reproducibility receipt
Session information
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.0.2 (2020-06-22)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Amsterdam
date 2020-11-19
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
BiocManager 1.30.10 2019-11-16 [1] CRAN (R 4.0.0)
broom * 0.7.2 2020-10-20 [1] CRAN (R 4.0.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
class 7.3-17 2020-04-26 [1] CRAN (R 4.0.2)
cli 2.1.0 2020-10-12 [1] CRAN (R 4.0.2)
clipr 0.7.0 2019-07-23 [1] CRAN (R 4.0.0)
codetools 0.2-16 2018-12-24 [1] CRAN (R 4.0.2)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0)
corrr 0.4.2 2020-03-22 [1] CRAN (R 4.0.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.0)
dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.2)
desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
details * 0.2.1 2020-01-12 [1] CRAN (R 4.0.0)
dials * 0.0.9 2020-09-16 [1] CRAN (R 4.0.2)
DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 4.0.0)
digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.0)
forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.0)
foreach 1.5.0 2020-03-30 [1] CRAN (R 4.0.0)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
furrr 0.1.0 2018-05-16 [1] CRAN (R 4.0.0)
future 1.19.1 2020-09-22 [1] CRAN (R 4.0.2)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
ggfortify * 0.4.10 2020-04-26 [1] CRAN (R 4.0.2)
ggimage 0.2.8 2020-04-02 [1] CRAN (R 4.0.0)
ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.0)
ggplotify 0.0.5 2020-03-12 [1] CRAN (R 4.0.0)
git2r * 0.27.1 2020-05-03 [1] CRAN (R 4.0.0)
globals 0.13.0 2020-09-17 [1] CRAN (R 4.0.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
gower 0.2.2 2020-06-23 [1] CRAN (R 4.0.2)
GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.0.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.0)
gridGraphics 0.5-0 2020-02-25 [1] CRAN (R 4.0.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.2)
hexbin 1.28.1 2020-02-03 [1] CRAN (R 4.0.0)
hexSticker * 0.4.7 2020-06-01 [1] CRAN (R 4.0.0)
highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.0)
htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
infer * 0.5.3 2020-07-14 [1] CRAN (R 4.0.2)
ipred 0.9-9 2019-04-28 [1] CRAN (R 4.0.0)
iterators 1.0.12 2019-07-26 [1] CRAN (R 4.0.0)
jsonlite 1.7.1 2020-09-07 [1] CRAN (R 4.0.2)
knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
labeling 0.3 2014-08-23 [1] CRAN (R 4.0.0)
lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.2)
lava 1.6.8 2020-09-26 [1] CRAN (R 4.0.2)
lhs 1.1.0 2020-09-29 [1] CRAN (R 4.0.2)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
listenv 0.8.0 2019-12-05 [1] CRAN (R 4.0.0)
lubridate 1.7.9 2020-06-08 [1] CRAN (R 4.0.2)
magick 2.4.0 2020-06-23 [1] CRAN (R 4.0.2)
magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
MASS 7.3-53 2020-09-09 [1] CRAN (R 4.0.2)
Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.2)
modeldata * 0.1.0 2020-10-22 [1] CRAN (R 4.0.2)
modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
nnet 7.3-14 2020-04-26 [1] CRAN (R 4.0.2)
parsnip * 0.1.4 2020-10-27 [1] CRAN (R 4.0.2)
pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
plotrix * 3.7-8 2020-04-16 [1] CRAN (R 4.0.2)
plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0)
png 0.1-7 2013-12-03 [1] CRAN (R 4.0.0)
pROC 1.16.2 2020-03-19 [1] CRAN (R 4.0.0)
prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.0.0)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.2)
readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.0)
recipes * 0.1.15 2020-11-11 [1] CRAN (R 4.0.2)
reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.0)
rlang 0.4.8 2020-10-08 [1] CRAN (R 4.0.2)
rmarkdown 2.4 2020-09-30 [1] CRAN (R 4.0.2)
rpart 4.1-15 2019-04-12 [1] CRAN (R 4.0.2)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
rsample * 0.0.8 2020-09-23 [1] CRAN (R 4.0.2)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
rvcheck 0.1.8 2020-03-01 [1] CRAN (R 4.0.0)
rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
scales * 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
sessioninfo * 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
showtext 0.9 2020-08-13 [1] CRAN (R 4.0.2)
showtextdb 3.0 2020-06-04 [1] CRAN (R 4.0.0)
stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
survival 3.2-7 2020-09-28 [1] CRAN (R 4.0.2)
sysfonts 0.8.1 2020-05-08 [1] CRAN (R 4.0.0)
tibble * 3.0.4 2020-10-12 [1] CRAN (R 4.0.2)
tidymodels * 0.1.1 2020-07-14 [1] CRAN (R 4.0.2)
tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
timeDate 3043.102 2018-02-21 [1] CRAN (R 4.0.0)
tune * 0.1.1 2020-07-08 [1] CRAN (R 4.0.2)
utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0)
vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
workflows * 0.2.1 2020-10-08 [1] CRAN (R 4.0.2)
xfun 0.18 2020-09-29 [1] CRAN (R 4.0.2)
xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
yardstick * 0.0.7 2020-07-13 [1] CRAN (R 4.0.2)
[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
License
We R Analysts by Alexander Savi & Simone Plak is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. An Open Educational Resource. Approved for Free Cultural Works.