The R Novices badge; learn it before you earn it.
[1] "If you know R only as a letter from the alphabet, you'll be surprised to learn it is an entire language. That is, a programming language for statistical computing and graphics. In the We R Champions masterclass series, the We R Novices masterclass lays the foundation for the other masterclasses. You'll not only learn the R syntax and a bit of vocabulary, but also its primary data science dialect, the 'tidyverse'. Although we understand there are few things more exciting than R's syntax, you'll never grasp R without also touching upon the utterly boring things it enables you to do, such as interactive visualizations. Yawn. By the end of the masterclass we'll give you a sneak preview of what's to come in the rest of this series."
Our online version of an academic quarter. We’ll take five minutes for everyone to get in and get settled. Are you ready?
All done? Familiarize yourself with the document.
None.
But of course, I successfully…
Sections.
Text boxes.
Everything you don’t (necessarily) need when working with the tidyverse, such as:
$, [], and [[]].
for loops and if statements.
Exactly, we don’t cover the things that most R learners find most difficult. Many of you will probably never miss them, but since they can be quite powerful for some of you, we’ll cover them in the We R Programmers masterclass.
The demonstration gives you the opportunity to see everything you’ll learn in harmony, before we break it apart into all the various building blocks that you can explore on your own.
# This script celebrates our first lines of R code
# Create cake object, which is added to environment tab
cake <- 3.14
cake
# Create slices vector
slices <- c(cake / 3, cake / 3, cake / 3) # a vector object
slices
# Plot slices with pie function
pie(x = slices) # warning: pie charts are ridiculous, only use them for celebrations
# Let's do it more efficiently
?"rep"
slices <- rep(x = cake / 3, times = 3)
slices
# Let's separate 'ingredients' from 'recipe'
n_participants <- 20
slice <- cake / n_participants
slices <- rep(x = slice, times = n_participants)
pie(x = slices)
# Now, let's give this cake a bit more body
library("plotrix") # first run install.packages("plotrix") once (we like to leave it out of our script)
plotrix::pie3D(x = slices, explode = .1, main = "There's cake for everyone!")
RStudio
Going to use R for a new project?
It is absolutely key to understand the idea of a function. Luckily, the idea is straightforward. You put something in a function, and you get out something else.
Functions always look like this: function(argument_name_1 = input_1, argument_name_2 = input_2). The argument names tell you what you need to input, and are separated by commas. Below, we dissect the mean, print, length, and round functions.
The mean function requires one primary argument, named x. The x argument takes a vector with data, of which the function computes the mean.
The length and round functions also take x as their first argument, and the round function takes an additional argument.
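A minimal sketch of the three functions described above. The heights vector and its values are made up for illustration only.

```r
# A made-up numeric vector, purely for illustration
heights <- c(178.0, 163.1, 191.8, 180.0, 175.5)

# mean() takes the data in its x argument
avg <- mean(x = heights)

# length() also takes x and counts the number of elements
n <- length(x = heights)

# round() takes x plus an additional argument: digits
avg_rounded <- round(x = avg, digits = 1)

print(avg)
print(n)
print(avg_rounded)
```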
Seriously busy persons can leave the argument names out, as R will nevertheless interpret the input.
However, seriously busy persons need to take care of the order in which the arguments appear, which we’ll show you in a bit.
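Here is a small sketch of the same call written with and without argument names. The scores vector is a hypothetical example.

```r
# Hypothetical data
scores <- c(2.5, 3.5, 4.25)

# With argument names
named <- round(x = mean(x = scores), digits = 1)

# Without argument names: R matches the inputs by position
unnamed <- round(mean(scores), 1)

# Both calls give the same result
identical(named, unnamed)
```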
Often, functions have default arguments. If you don’t give an input for such arguments, the function will use the default.
some_data <- c(1, 4, 4, NA, 2.356, 25) # create vector with (missing) data
mean(x = some_data)
## [1] NA
mean(x = some_data, na.rm = FALSE, trim = 0) # the mean function uses these default arguments, but you can change them, see below
## [1] NA
Of course, you may change these defaults. Below, we set the na.rm argument to TRUE in order to remove missing values (called NAs).
mean(x = some_data, na.rm = TRUE)
## [1] 7.2712
Here’s an example with the print function. The print function prints any R object to your console. Conveniently, most of the time you won’t need to explicitly call the print function.
When you name the arguments, the order in which you input the arguments is not important. Let’s trim 20% of the observations from each end of the input, before the mean is computed.
However, if you do not name the arguments, the order does matter!
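The two points above can be sketched as follows. The values vector is hypothetical, and we use round to illustrate positional matching because swapping its two inputs is harmless to run.

```r
# Hypothetical data with one extreme value
values <- c(1, 2, 3, 4, 100)

# Named arguments: the order is free
a <- mean(x = values, trim = 0.2) # trims 20% of the observations from each end
b <- mean(trim = 0.2, x = values) # same call, arguments swapped
identical(a, b)

# Unnamed arguments: R matches them by position, so the order matters
first <- round(3.14159, 2)  # x = 3.14159, digits = 2
second <- round(2, 3.14159) # x = 2, digits = 3.14159: a different question altogether
print(first)
print(second)
```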
Some functions have a ... argument. It signals that the function accepts any number of additional arguments. Check the help file for the mean function, for instance (?"mean"). Also, some functions take other functions as an argument. Yes, Matryoshka functions, why not. One day you’ll learn why this is all very powerful.
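A minimal sketch of both ideas, using base R only. The groups list is a made-up example.

```r
# sum() has a ... argument, so it accepts any number of inputs
sum(1, 2)
total <- sum(1, 2, 3, 4)

# sapply() takes another function (here: mean) as its FUN argument
groups <- list(a = c(1, 2, 3), b = c(4, 5, 6)) # made-up data
group_means <- sapply(X = groups, FUN = mean)
print(group_means)
```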
Functions live in packages. Base R comes with a bunch of pre-installed packages, such as the base package, and you can install additional packages and thus extend the functionality of R. There are currently 16618 available. Crazy! And even crazier, we’ll cover most of them in this masterclass.
No, of course we won’t.
Flattening the curve? Not quite. (source)
Installing and loading packages
Packages must first be installed and can then be loaded. As soon as the package is loaded, all its functions become readily available.
install.packages("package_name_here") # installs a package
library("package_name_here") # loads a package
library(package_name_here) # also loads a package
remove.packages("package_name_here") # removes a package
update.packages() # updates all packages
installed.packages() # see which packages are installed
available.packages() # see which packages are available
Installation of a package generally only needs to be done once for every project you are working on. On the other hand, loading a package generally needs to be done every time you reopen RStudio.
Give me more. Importantly, as anyone can contribute a package, functions from different packages may share the same name. Therefore, when loading a package, R ‘masks’ functions from previously loaded packages if those contain functions with an identical name. To ensure that you’re calling the right function, there are three options:
Use search() to see in which order R searches packages for functions.
Use package::function() rather than function().
Here, we choose the second option. It’s pedagogically sound and you will never make a mistake. For convenience, we’ll only call the package if it is not part of base R (that is, if we used the library function to load it).
base::mean() # the mean function is in the base package, which is pre-installed
mean() # so we'll just call it like this
tibble::tibble() # tibble is in the tibble package from the tidyverse collection, so we'll call it like this
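Masking can be mimicked without installing anything. In this sketch we define our own (deliberately silly) function called sum, which masks the one from the base package in the same way a later-loaded package would. The function body is of course made up.

```r
# Our own function named sum now masks base::sum
sum <- function(...) "not the sum you were looking for"

masked <- sum(1, 2)         # calls our own version
explicit <- base::sum(1, 2) # package::function() always reaches the intended one

# Remove our version, so that sum() refers to base::sum again
rm(sum)
restored <- sum(1, 2)

print(masked)
print(explicit)
print(restored)
```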
Cheat sheets
We love cheat sheets. Find them online or download them directly from your RStudio menu: Help > Cheatsheets. However, you’ll have to accept that especially the tidyverse dialect (more on that in the next chapter) is developed at a higher pace than the cheat sheets are updated.
Vignettes
We love vignettes. They are not as complete as help files, but provide an excellent start for getting to know a package and some of its most important functions. Unfortunately, many packages don’t come with a vignette.
Examples
Some functions come with an example. It can be useful to see how a function can be used.
Help
R has a help function with a convenient shortcut. Use it to get help on functions and packages. Be warned, whereas the shortcut is convenient, R’s help pages are generally not.
# Find function or package
apropos("mean") # find functions with 'mean' in their name
find("mean") # find package that contains the function 'mean'
# Get help on a function
help("help") # get help on the help function
?"help" # shortcut to get help on the help function
?"?" # shortcut to get help on the shortcut for the help function
?"=" # you can even get help on operators
?help # also works most of the time, but not always (try ?for)
??"mean" # find vignettes, code demonstration, help pages
# Get help on a package
help("dplyr") # r documentation for dplyr
?"dplyr" # shortcut
help(package = "dplyr") # help pages for all functions in dplyr
library(help = "dplyr") # general information on dplyr
Seized by despair?
You can store and reuse pretty much anything in R, by using the assignment operator. You’ll use it a lot. There are various ways you can assign values to an object, but luckily, you only need one.
# Assignment operators
a <- 1 # this is the best, forget the rest
1 -> a
a = 1
assign(x = "a", value = 1)
Remember that we do use the = operator, but only when specifying the arguments in a function.
We can assign all kinds of things to an object, although it only makes sense if we want to store and reuse it.
For clarity, use spaces around the assignment operator.
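A small sketch of storing and reusing objects. All object names below are made up.

```r
# Store values in objects
radius <- 2
shape <- "circle"

# Reuse a stored object in a computation, and store the result too
area <- pi * radius^2

# Stored objects can be reused as often as you like
print(shape)
print(area)
```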
R can handle several different data types and structures. Below, we summarize the types and structures that come with base R. You’ll recognize the data frame as the data structure that you use for data analysis. Nonetheless, it’s good to know that other structures exist too.
Importantly, notice that data types are nested in data structures, and some data structures are nested in other data structures. Vectors hold scalars, data frames and matrices hold vectors, and lists can simply hold anything.
The most basic data types are numeric, character, and logical. The first comes in two flavors, integer (e.g., 1L or 312L; the L suffix marks a number as an integer) and double (e.g., 3.14). A character can be anything from "R" to "Some long questionnaire input.". A logical can be either TRUE or FALSE.
The data structure that holds a single value, whether numeric, character, or logical, is referred to as a scalar.
s_int <- 1L # numeric (integer)
s_dbl <- 1 # numeric (double)
s_chr <- "R" # character (some call it a string)
s_lgl <- TRUE # logical
The logical values can be abbreviated to T for TRUE and F for FALSE, but this is not preferred. Also, TRUE is stored as 1 and FALSE as 0.
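A brief sketch of why T and F are not preferred, and of what the 1 and 0 behind TRUE and FALSE buy you. The answers vector is made up.

```r
# TRUE counts as 1 and FALSE as 0 in computations
answers <- c(TRUE, FALSE, TRUE, TRUE) # made-up data
n_true <- sum(answers)
print(n_true)

# T and F are ordinary objects that can be overwritten, TRUE and FALSE cannot
T <- 0
t_is_true <- isTRUE(T) # no longer TRUE!
print(t_is_true)
rm(T) # remove our own T, so that T refers to TRUE again
```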
Use the c() function to combine scalars into a vector.
v_dbl <- c(s_dbl, 2, 3, 4) # yes, you can simply insert the previously created numeric scalar
v_chr <- c("I", "love", s_chr, "!") # or insert the previously created character scalar
print(v_chr)
## [1] "I"    "love" "R"    "!"
Read this if you want to get to the bottom of vectors.
The data types you’ll want to use most are numeric and categorical. So where is categorical? It’s here, it’s factor. The factor type is a special kind of integer vector that can hold categorical data with different levels. Here, we create a nominal factor vector.
v_sex <- c("male", "female", "female")
fct_sex <- factor(x = v_sex, levels = c("female", "male", "other"))
print(fct_sex)
## [1] male   female female
## Levels: female male other
Likewise, here we create an ordinal factor vector.
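A minimal sketch of an ordinal factor. The v_level and fct_level names and the academic ranks are our own illustration; the key ingredient is the ordered argument of the factor function.

```r
# Made-up data: academic ranks
v_level <- c("student", "professor", "phd_candidate")

# The order of the levels defines the ranking, ordered = TRUE makes it ordinal
fct_level <- factor(x = v_level,
                    levels = c("student", "phd_candidate", "professor"),
                    ordered = TRUE)
print(fct_level)

# Ordinal factors can be compared
student_below_professor <- fct_level[1] < fct_level[2]
print(student_below_professor)
```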
Data frames are perfect for data analysis. You probably don’t want to use variable_a and so forth; pick a variable name that makes sense.
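A sketch of how such a data frame can be created. We assume it resembles the df object printed in the list example further on; the two vectors are repeated here so the code runs on its own.

```r
# The vectors created earlier, repeated to keep this example self-contained
v_dbl <- c(1, 2, 3, 4)
v_chr <- c("I", "love", "R", "!")

# Each vector becomes a column (variable), each element a row (observation)
df <- data.frame(variable_a = v_dbl, variable_b = v_chr)
print(df)

# str() shows the type of each variable
str(df)
```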
What are matrices good for? Think of correlation matrices or adjacency matrices (used for specifying the links in a network).
An array is a special type of matrix that stacks multiple matrices. Are you likely going to use it? No.
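A minimal sketch of a matrix and an array. We assume m_dbl looks like the matrix printed in the list example further on; the arr object is made up.

```r
# A 2 x 2 matrix, filled column by column
m_dbl <- matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
print(m_dbl)

# An array that stacks two 2 x 2 matrices
arr <- array(data = 1:8, dim = c(2, 2, 2))
print(dim(arr))

# The second matrix in the stack
print(arr[, , 2])
```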
lst <- list(data_a = s_lgl, data_b = v_chr, data_c = m_dbl, data_d = df, data_e = list(v_dbl)) # here, we insert various previously created data structures
print(lst)
## $data_a
## [1] TRUE
##
## $data_b
## [1] "I"    "love" "R"    "!"
##
## $data_c
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## $data_d
##   variable_a variable_b
## 1          1          I
## 2          2       love
## 3          3          R
## 4          4          !
##
## $data_e
## $data_e[[1]]
## [1] 1 2 3 4
Indeed, if you want to get funky, you can even nest a list in a list. Call it a tribute to Droste.
Conveniently, we can convert one data type into another if we need to. This is called coercion.
Give me more. Coercion is performed automatically when we try to create a vector with different data types. Values are then coerced to the type with the highest rank, where logical < integer < double < character. This is illustrated below.
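A sketch of both explicit and automatic coercion, using base R only.

```r
# Explicit coercion with the as.* functions
as.numeric("3.14")
as.character(42)
as.logical(0)

# Automatic coercion: the type with the highest rank wins
typeof(c(TRUE, 2L))  # logical and integer
typeof(c(2L, 3.5))   # integer and double
typeof(c(3.5, "a"))  # double and character
mixed <- c(TRUE, 2L, 3.5, "a")
print(mixed)
```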
There are many operators in R. Here are a few that may come in handy.
Give me more. Modulo operations and integer divisions can also be really useful.
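A small, non-exhaustive selection of base R operators. The numbers are arbitrary.

```r
# Arithmetic operators
7 + 2
7 - 2
7 * 2
7 / 2
7^2

# Relational operators
7 > 2
7 == 2
7 != 2

# Logical operators
TRUE & FALSE # and
TRUE | FALSE # or
!TRUE        # not

# Modulo and integer division
remainder <- 7 %% 2 # what is left after dividing 7 by 2
quotient <- 7 %/% 2 # how many times 2 fits in 7
print(remainder)
print(quotient)
```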
In data analysis, you’ll be mostly concerned with NA, used to indicate missing values. NA is used regardless of the type of data (character, logical, numeric). However, there’s more emptiness in R than just missing values.
You don’t need meditation to find the void:
void <- c(NA, NaN, Inf, NULL)
print(void) # NULL pertains to vectors and is thus (silently) removed
## [1] NA NaN Inf
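A brief sketch of how the various kinds of emptiness arise and how to test for them, using base R only.

```r
# Not a number and infinity arise from computations
not_a_number <- 0 / 0
infinite <- 1 / 0

# Test for the various kinds of emptiness
is.na(NA)            # TRUE
is.na(not_a_number)  # TRUE: NaN also counts as missing
is.nan(NA)           # FALSE: but NA is not NaN
is.infinite(infinite)

# NULL is the absence of an object and has length zero, NA has length one
length(NULL)
length(NA)
```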
“Youth always tries to fill the void, an old man learns to live with it.” — Mark Z. Danielewski
Make a habit of working in R projects.
Instruction
Create an R project and a script to work on these exercises.
Tips
Doing data analysis in R, you’ll work with data frames all the time. This is an exercise you should focus on.
Instruction
Create a data frame with five variables that each contain five observations. One logical variable, one double (numeric) variable, one character variable, one nominal categorical variable, and one ordinal categorical variable.
For each type, think of a variable that matches the type (e.g., sex as a nominal categorical variable), and give it a name that clearly communicates it.
Let’s make it look like real data and make sure each variable contains a missing value.
Tips
Missing values are indicated with NA. Check the ‘Into the void’ section.
Example
Here’s an example. Note that the eye_color variable was created as a character variable, the sex variable was created as a nominal variable, and the level variable was created as an ordinal variable.
##   body_length eye_color    sex         level control_condition
## 1       178.0  sky_blue   male       student              TRUE
## 2       163.1     brown female     professor              TRUE
## 3       191.8      blue female       student             FALSE
## 4       180.0  sea_blue   male phd_candidate             FALSE
## 5       175.5  greenish  other phd_candidate             FALSE
These exercises can be hard, but are important if you want to learn R. It doesn’t matter if you can’t do them during the masterclass, but make sure to work on them afterwards.
Instruction
Check and run the examples below.
Tips
# 1
s <- "Is R case-sensitive?"
print(S)
# 2
r <- n <- o <- v <- i <- c <- e <- 1
# 3
r < -1
# 4
v <- c(1, 2, 3, NA)
mean(v)
# 5
v <- c(1, 2, 3, "NA")
mean(v)
# 6
v <- c("a", "b", "c", "d")
mean(v)
# 7
v <- c(TRUE, FALSE, TRUE, FALSE)
mean(v)
# 8
mean <- c(1, 2, 3, 4)
mean(mean)
# 9
v <- c("1", "2", "3", "NA")
as.numeric(v)
# 10
TRUE != FALSE & (TRUE == !TRUE | TRUE >= FALSE)
# 11
NA == NA
# 12
?"?"
# 13
????"mean"
Instruction
Run the following line of code. What happens? Why? How did you notice?
Tips
Watch the > and + signs in the left margin of the Console panel.
You wish it would be easy like that.
Thankfully, you’re not the only one learning R. Ask and help each other. Use Piazza. Consult us. R is not easy, but we’re 100% sure you’ll grasp it when you put effort into it!
xkcd.com
Lingering questions or concerns? Use Piazza to follow up on our plenary discussion or post a new question.
Again, let’s first go through a demonstration script. The data we read into R in this script is from a paper published in 2011 that aimed to show how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates (Simmons, Nelson, & Simonsohn, 2011). They demonstrate how flexibility on these different aspects can result in statistically significant evidence for a false hypothesis. Here we only read in their data, but in the next masterclass—We R Analysts—you will work with this data in the exercises yourself!
# Load the tidyverse collection of packages
library("tidyverse")
# Read in and view data
study_1 <- readr::read_delim("FalsePositive_Data_in_/Study 1 .txt", delim = "\t", escape_double = FALSE, trim_ws = TRUE)
study_1
tibble::view(study_1)
tibble::glimpse(study_1)
# Use the pipe operator %>%
study_1 %>% tibble::glimpse()
# %>% more useful when performing sequences of operations
x <- c(2, 3, 7, 6, 9, 4)
x %>%
range() %>%
sum() %>%
sqrt()
# Subset data: columns
study_1 %>% dplyr::select(political)
study_1 %>% dplyr::select(contains("m"))
# Subset data: rows
study_1 %>% dplyr::slice(c(5:10)) # by their position
study_1 %>% dplyr::filter(dad >= 60) # by a certain criterion
study_1 %>% dplyr::filter(cond == "control")
This masterclass covers the R basics and the tidyverse dialect. These are the foundations of much more exciting visualizations and computations, which will be covered in the next masterclasses. The following code gives a sneak preview of what’s to come.
# Load additional packages and data
library("tidymodels")
data("gss")
# View data
gss %>% print()
# Transform data (We R Transformers)
gss %>%
dplyr::mutate(income = stringr::str_extract(income, "[\\d]{1,5}")) %>%
dplyr::mutate(income = as.numeric(income)) %>%
dplyr::mutate(weighted_income = income / hours)
# Explore and visualize data (We R Visualizers)
p <- gss %>%
ggplot2::ggplot(aes(x = year, y = hours, color = class, size = weight, shape = sex)) +
ggplot2::geom_point()
p
# Create standard error function (We R Programmers)
se <- function(x) {
sd(x) / sqrt(length(x))
}
# Summarize data (We R Analysts)
gss %>%
dplyr::group_by(class, sex) %>%
dplyr::summarize(hours_mean = mean(hours),
hours_se = se(hours))
# Perform simple regression (We R Analysts)
lm_gss <- gss %>% stats::lm(hours ~ sex + class, data = .)
lm_gss %>% broom::tidy()
lm_gss %>% broom::glance()
lm_gss %>% broom::augment()
# Check assumptions (We R Analysts)
library("ggfortify")
lm_gss %>%
ggplot2::autoplot() +
ggplot2::theme_minimal()
# Publication-ready visualization (We R Visualizers)
p_violin <- gss %>%
  ggplot2::ggplot(aes(x = class, y = hours, linetype = sex)) +
  ggplot2::geom_violin() +
  ggplot2::theme_classic() +
  ggplot2::labs(x = "Socioeconomic Class",
                y = "No. of Hours Worked",
                linetype = "Sex",
                title = "You'll Figure It Out",
                subtitle = "These violins look more like flutes.")
p_violin
# ggsave() is not a plot layer, so call it separately rather than adding it with + (recent ggplot2 versions reject that)
ggplot2::ggsave(filename = "gss.pdf",
                plot = p_violin,
                width = 15,
                height = 15,
                units = "cm",
                dpi = 300)
# Interactive visualization (as promised!)
library("plotly")
p %>% plotly::ggplotly() # created from the previous visualization
You can’t be introduced to the tidyverse without loading the tidyverse package.
install.packages("tidyverse") # installs the tidyverse collection
library("tidyverse") # loads the tidyverse collection
tidyverse_update() # updates the tidyverse collection
tidyverse_packages() # shows the tidyverse packages
A collection of R packages
Although to be honest, the tidyverse is not really a package, but rather a collection of packages designed for data science. Installing it installs all the packages in the collection, and loading it loads the core ones. We’ll use several of the included packages throughout the masterclasses, including:
An R dialect
The tidyverse is currently the most popular R dialect, and is being developed at breakneck speed. Reading and writing tidyverse code is to reading and writing base R code, as being read to by your older sister taking drama classes is to trying to understand your grumpy mumbling younger brother looking for missing Lego pieces.
A single-purpose-single-method philosophy
Whereas base R allows you to do virtually anything, and in annoyingly many different ways, the tidyverse is specifically built for the purpose of data science, and tries to give you a single ‘best’ method for performing an action. That means that the tidyverse is much more restricted than base R, but makes the insurmountable power of the R language—that makes paid and proprietary software like SPSS and Stata cry of embarrassment—much more accessible.
Base versus tidy. When starting to learn R, it can be hard to distinguish base R functions from tidyverse functions. Don’t worry, you’ll learn to distinguish them simply by using R. Passionate bird watchers may have an advantage though; they know how to spot subtle differences:
Tidyverse function names are consistently written with underscores, such as as_factor. There is no such uniformity in base R functions. If you spot a beautifully feathered as.factor defending its nest, you can be sure its songs won’t contain a single tidy verse.
Although the pipe operator %>% (more on that below) is not tied to the tidyverse, you won’t see it being used a lot in base R.
The tibble is the tidyverse version of a data frame. On the surface the differences are small, and some people don’t bother giving them different names. Creating a tibble works very similarly to creating a data frame (but compare how both are printed).
tb <- tibble::tibble(variable_a = v_dbl, variable_b = v_chr)
print(tb)
## # A tibble: 4 x 2
##   variable_a variable_b
##        <dbl> <chr>
## 1          1 I
## 2          2 love
## 3          3 R
## 4          4 !
If you already have a data frame, you can coerce it into a tibble.
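A minimal sketch of this coercion, assuming the tibble package is installed. The small data frame is made up, and as_tibble is the tibble function for the job.

```r
# A made-up data frame
df <- data.frame(variable_a = c(1, 2, 3, 4),
                 variable_b = c("I", "love", "R", "!"))

# Coerce the data frame into a tibble
tb_from_df <- tibble::as_tibble(df)
print(tb_from_df)

# Compare the classes of both objects
class(df)
class(tb_from_df)
```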
Tidyverse functions gratefully exploit the pipe operator: %>%. The pipe operator moves the object to its left into the first argument of the function on its right. Let’s compare the syntax for computing the mean of a vector with and without the use of the pipe.
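A sketch of that comparison, assuming the magrittr package (which provides %>% and is installed along with the tidyverse) is available. The vector v is a made-up example.

```r
library("magrittr") # provides the %>% operator

v <- c(1, 2, 3) # made-up data

# Without the pipe
without_pipe <- mean(x = v)

# With the pipe: v is moved into the first argument of mean()
with_pipe <- v %>% mean()

identical(without_pipe, with_pipe)
```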
Now, let’s create the vector c(1, 2, 3) using the pipe operator:
Here, 1 %>% c(2) evaluates to c(1, 2), such that we get c(1, 2) %>% c(3), which in turn evaluates to c(c(1, 2), 3). Think this through!
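The chain described above can be checked step by step. This sketch assumes the magrittr package is installed; the object names are our own.

```r
library("magrittr") # provides the %>% operator

# Step 1: 1 is moved into the first argument of c(2)
step_1 <- 1 %>% c(2)

# Step 2: the result is moved into the first argument of c(3)
piped <- step_1 %>% c(3)

# The whole chain in one go, and its nested equivalent
chained <- 1 %>% c(2) %>% c(3)
nested <- c(c(1, 2), 3)

identical(chained, nested)
```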
You’re right to think that in this example the pipe operator only complicates things. But now let’s do a series of transformations on a data frame or tibble, first in base R and then in the tidyverse. You don’t need to understand what it does (though you may try if you want), but notice that while the results are very similar, the syntax is very different. The example uses the iris data set that is readily available in R.
# base R transformations without using the pipe operator
iris_2 <- iris[iris[, 1] < 5, c(3, 5)]
iris_2[, 1] <- log(iris_2[, 1])
by(iris_2, list(iris_2$Species), function(x) mean(x$Petal.Length))
## : setosa
## [1] 0.3384166
## ------------------------------------------------------------
## : versicolor
## [1] 1.193922
## ------------------------------------------------------------
## : virginica
## [1] 1.504077
# piped tidyverse transformations
iris %>%
dplyr::filter(Sepal.Length < 5) %>%
dplyr::select(Petal.Length, Species) %>%
dplyr::mutate(Petal.Length.Log = log(Petal.Length)) %>%
dplyr::group_by(Species) %>%
dplyr::summarize(average = mean(Petal.Length.Log))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
##   Species    average
##   <fct>        <dbl>
## 1 setosa       0.338
## 2 versicolor   1.19
## 3 virginica    1.50
First, the tidyverse syntax explicitly communicates the function of the code (e.g., filter, summarize). Second, the pipe operator enables you to read the functions sequentially (that is, object %>% function() %>% function()), whereas without the pipe operator functions are often nested within functions (that is, function(function(object))). As a result, the tidyverse code actually reads like a verse: first filter specific rows, then select specific columns, then mutate one column into a new column, then group_by one of the selected columns to summarize the newly created one.
Base versus tidy. We’ll continue to use lots of base R functions. The pipe operator works fine with most of those functions, but there are two things you must know.
Piping with a placeholder
The pipe operator moves the object to its left into the first argument of the function on its right. Whereas in tidyverse functions the first argument is reserved for the data (stored in a tibble), in many base R functions it is not. Long story short, you can use the pipe operator with base R’s statistical functions, but often you must use the magical . placeholder to specify the location of the data argument.
sleep %>% infer::t_test(formula = extra ~ group) # tidyverse's t-test
sleep %>% infer::t_test(x = ., formula = extra ~ group) # identical t-test with explicit placeholder at the data argument
sleep %>% stats::t.test(formula = extra ~ group, data = .) # base R's t-test with placeholder at the data argument
Piping a vector out of a tibble
Tidyverse functions often work with tibbles, whereas some base R functions request vectors. Before piping a tibble into a function that requests a vector, you can pull out the vector.
Give me more. Also, the magrittr package has another special pipe operator %$% that allows you to use the variable names from your tibble, as if they were vectors.
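A sketch of both approaches, assuming the dplyr and magrittr packages are installed. We use the built-in iris data set; the object names are our own.

```r
library("magrittr") # provides the %>% and %$% operators

# Pull the vector out of the data, then pipe it into a function that requests a vector
via_pull <- iris %>%
  dplyr::pull(Sepal.Length) %>%
  mean()

# Or expose the variable names with %$% and use them as if they were vectors
via_exposition <- iris %$% mean(Sepal.Length)

print(via_pull)
print(via_exposition)
```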
Give me more. Another key characteristic of the tidyverse is the use of tidy evaluation. This is a very advanced topic that we don’t want you to understand. To quote Thomas Gray (1742): “Where ignorance is bliss, ’tis folly to be wise.”
Base versus tidy. Speaking about quotes, because of tidy evaluation (really, to us it sounds like spaghetti too), in the tidyverse we refer to a variable without using quotation marks (such as group), whereas in base R we would use quotation marks (such as "group"). This might be helpful when trying to distinguish tidyverse from base R solutions that you find on the internet.
The tidyverse contains various packages for reading data into R.
You may read in your data in two ways.
Coded
readr::read_csv("file.csv") # comma delimited files
readr::read_csv2("file2.csv") # semi-colon delimited files
readr::read_delim("file.txt", delim = "|") # files with any delimiter (e.g., |)
haven::read_sav("file.sav") # SPSS
haven::read_dta("file.dta") # Stata
haven::read_sas("file.sas7bdat") # SAS
readxl::read_excel("file.xlsx") # Excel
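To try this without a data file at hand, here is a sketch that first writes a tiny made-up CSV file to a temporary location and then reads it back in. It assumes the readr package is installed; the file contents and object names are invented for illustration.

```r
# Write a small, made-up CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,7.5", "2,8.0", "3,NA"), con = tmp)

# Read it in; readr guesses the column types and treats NA as missing
scores <- readr::read_csv(tmp)
print(scores)

# Clean up the temporary file
unlink(tmp)
```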
Visualized
In RStudio, you may also go to the Environment tab and hit the Import Dataset dropdown menu. Choose the From Text (readr), From Excel, From SPSS, From SAS, or From Stata option. You’ll be guided through the process and able to preview your choices. Fancy pancy!
Note that during this process R guesses the type of each variable, but in the data preview panel you can manually change it, by clicking the dropdown menu next to the variable name of interest and selecting the desired type. If you change the type to factor, you’ll be asked to provide a list with the levels of the factor. For instance, if the factor consists of two values, 1 for females and 0 for males, simply enter 1, 0 in the dialogue window.
Finally, as we love reproducibility, we force you to copy the generated code to your script. Of course, you should just pretend as if you wrote it all by yourself.
We will show some examples using the built-in data set in R called iris (as in the flower).
?iris # explanation of this data
data("iris") # load iris data
iris <- iris %>% dplyr::as_tibble() # make it a tibble
There are various ways to take a look at your data, here are a couple.
iris %>% tibble::glimpse() # get a concise overview of the data
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, …
iris %>% base::print() # view the first ten rows of the data
iris %>% base::print(n = Inf) # view the whole shebang in the console (n determines the number of rows)
iris %>% tibble::view() # view the whole shebang in a spreadsheet-style data viewer
iris %>% base::names() # print the column names (i.e., the variable names in a data set)
For those of you that think that Petal and Sepal are two forgotten Marvel characters, they’re not. And if you think below is a picture of an iris, it’s not.
wikipedia.com
The dplyr package in the tidyverse collection contains three useful functions for subsetting your data:
select for subsetting columns
filter for subsetting rows matching certain conditions
slice for subsetting rows using their positions
Select columns
You can select columns in a great many ways. Various helper functions can help you to select columns that satisfy a certain condition, such as columns with numeric values.
iris %>% dplyr::select(Species) # select one variable
iris %>% dplyr::select(Petal.Length, Species) # select more variables
iris %>% dplyr::select(Petal.Length:Species) # select all variables from Petal.Length to Species
iris %>% dplyr::select(-Species) # select all but one variable
iris %>% dplyr::select(starts_with("Petal")) # select variables whose names start with Petal
iris %>% dplyr::select(where(is.numeric)) # select numerical variables
iris %>% dplyr::select(where(is.numeric) & contains("S")) # select numerical variables that contain a capital S
Slice rows
Filtering rows using their row numbers is easy. Check the help file of the slice function for more interesting options.
iris %>% dplyr::slice(c(2:5)) # filter rows 2 to 5
iris %>% dplyr::slice(-c(2:5)) # filter everything but rows 2 to 5
iris %>% dplyr::slice_sample(n = 5) # filter 5 randomly selected rows
Filter rows
Rows can also be filtered by matching a certain condition. Finally, a good use for the relational operators that you learned in the R chapter.
iris %>% dplyr::filter(Sepal.Length < 6) # filter rows where the sepal length is smaller than 6
iris %>% dplyr::filter(Species == "versicolor") # filter rows of the versicolor species
iris %>% dplyr::filter(Species == "setosa" & Petal.Length > 1.5) # filter rows of the setosa species where the petal length is larger than 1.5
iris %>% dplyr::filter(Sepal.Width > mean(Sepal.Width)) # filter rows where the sepal width is bigger than the mean
Instruction
Recreate the data frame from the exercises of the previous section, but this time make it a tidy tibble.
Tips
The tibble function works exactly like the data.frame function.
Instruction
Functions can be nested and piped. In the following examples, determine whether the functions are nested or piped, and write them the other way.
Tips
First run this bit of code to create the two example vectors.
Exercises:
Instruction
Explore the tidyverse website and its various packages.
Tips
Instruction
Read in the data we used in the Demonstration script above. You can download the data for Study 1 of the paper here. Open the .zip file, and find the data file called Study 1 .txt. Read this data into R.
Alternatively, read in some of your own data.
Tips
Instruction
Sometimes you want to select only a part of your data. Try to:
Tips
It’s helpful to be able to distinguish a base R function from a tidyverse function. However, it’s certainly not necessary in most cases, so you may safely skip this exercise if you don’t feel like doing it.
Instruction
Below are a couple of functions. Can you distinguish the base R species from the tidyverse species?
Tips
Use the ? binoculars to get a closer look.
Learning R is a journey, not a destination
xkcd.com
Lingering questions or concerns? Use Piazza to follow up on our plenary discussion or post a new question.
Bugs hate a clean desk policy. That’s why you shouldn’t save—nor restore—your workspace. RStudio does it by default, but we can tell it not to. Do save your scripts though! Those will help you reproduce exactly what you did the last time.
Go to RStudio’s Preferences, hit the General tab, change the settings below, and hit Apply.
Marie Kondo will be proud of you.
Give me more. Here are some more advanced options. You won’t need these most of the time, especially if you’re already not saving and restoring your workspace.
rm(list = ls()) # clear environment
rm(list = ls(all.names = TRUE)) # clear environment including hidden objects
cat("\014") # clear console
dev.off(dev.list()["RStudioGD"]) # clear plots
gc() # free up memory with the garbage collector (R collects garbage automatically, thus it's only needed if you can't wait and want to directly free up the memory after having removed a large object)
Take-away: (1) research your question (2) make it specific (3) most problems have been solved before: find and learn (4) if not: make sure others will learn from it.
Where and how to search
Where to ask
How to ask
If learning R were mountaineering, you’ve made it to base camp. Barely. At base camp, there are three things you can do. First, you can descend back into the valley, order a flat white with oat milk in the nearest coffee bar, and marvel at your former aspirations of climbing that mountain. Second, overtaken by hubris, you can disdainfully pass base camp, and take the first steep turn towards the top. Third, you can take a bit of a rest, absorb everything you’ve seen during the past climb, and start exploring base camp with your fellow mountaineers.
So, what’s next is up to you!
Okay, we failed. But you should nevertheless try the great projects below. They use R in the background, but have very friendly interfaces. They are like diet R: not as versatile as writing your own scripts, but easy to use, free and open source, and accompanied by great textbooks.
Also, our way of teaching might just not be your way of learning. You can find many other approaches online, for instance on DataCamp or Coursera.
Check out the syllabus to see the other masterclasses in this series. And check out the graphics you can create with R and what you can do with R Markdown. Fancy, right?
Can’t wait for the next masterclass? You can explore the following resources on your own.
But. You’ll probably first want to practice your newly obtained skills. There’s a lot to do at base camp!
Great idea, it’s time to practice, practice, practice, practice, and conquer the learning curve! We’ll give you some tips and resources.
If you haven’t done so already, reserve a fixed time and day of the week for learning R. Mark it in your calendar. Now.
Continue with the exercises, and use Piazza to ask questions.
Sure, solo mountaineers exist, but there’s a reason most form a group. Find a colleague, practice together.
Finished all the exercises? Read in your own data, and play around with it.
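As a starting point, here is a minimal sketch of reading in a file and taking a first look. It assumes your data are stored as a CSV file. The temporary file and its contents are made up so the example runs on its own; swap path for the location of your own file. The sketch uses base R's read.csv(); readr's read_csv() is the tidyverse counterpart.

```r
# Stand-in for your own file: write a small, made-up CSV to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("city,visitors", "Amsterdam,120", "Utrecht,85"), path)

# Read the file into a data frame; replace `path` with your own file's location
visits <- read.csv(path)

# Take a first look at what you got
str(visits)      # column names and types
summary(visits)  # a quick summary per column
head(visits)     # the first rows
```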
Created with R Markdown and generated on November 19, 2020.
Reproducibility receipt
Session information
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.0.2 (2020-06-22)
os macOS Catalina 10.15.7
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Amsterdam
date 2020-11-19
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
BiocManager 1.30.10 2019-11-16 [1] CRAN (R 4.0.0)
brew * 1.0-6 2011-04-13 [1] CRAN (R 4.0.0)
broom * 0.7.2 2020-10-20 [1] CRAN (R 4.0.2)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
class 7.3-17 2020-04-26 [1] CRAN (R 4.0.2)
cli 2.1.0 2020-10-12 [1] CRAN (R 4.0.2)
clipr 0.7.0 2019-07-23 [1] CRAN (R 4.0.0)
codetools 0.2-16 2018-12-24 [1] CRAN (R 4.0.2)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0)
crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
crosstalk 1.1.0.1 2020-03-13 [1] CRAN (R 4.0.0)
data.table 1.13.0 2020-07-24 [1] CRAN (R 4.0.2)
DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.0)
dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.2)
desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
details * 0.2.1 2020-01-12 [1] CRAN (R 4.0.0)
dials * 0.0.9 2020-09-16 [1] CRAN (R 4.0.2)
DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 4.0.0)
digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
farver 2.0.3 2020-01-16 [1] CRAN (R 4.0.0)
forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.0)
foreach 1.5.0 2020-03-30 [1] CRAN (R 4.0.0)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
furrr 0.1.0 2018-05-16 [1] CRAN (R 4.0.0)
future 1.19.1 2020-09-22 [1] CRAN (R 4.0.2)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
ggimage 0.2.8 2020-04-02 [1] CRAN (R 4.0.0)
ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.0)
ggplotify 0.0.5 2020-03-12 [1] CRAN (R 4.0.0)
git2r * 0.27.1 2020-05-03 [1] CRAN (R 4.0.0)
globals 0.13.0 2020-09-17 [1] CRAN (R 4.0.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
gower 0.2.2 2020-06-23 [1] CRAN (R 4.0.2)
GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.0.0)
gridGraphics 0.5-0 2020-02-25 [1] CRAN (R 4.0.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.2)
hexbin 1.28.1 2020-02-03 [1] CRAN (R 4.0.0)
hexSticker * 0.4.7 2020-06-01 [1] CRAN (R 4.0.0)
highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.0)
htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
htmlwidgets 1.5.1 2019-10-08 [1] CRAN (R 4.0.0)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
infer * 0.5.3 2020-07-14 [1] CRAN (R 4.0.2)
ipred 0.9-9 2019-04-28 [1] CRAN (R 4.0.0)
iterators 1.0.12 2019-07-26 [1] CRAN (R 4.0.0)
jsonlite 1.7.1 2020-09-07 [1] CRAN (R 4.0.2)
knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
labeling 0.3 2014-08-23 [1] CRAN (R 4.0.0)
lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.2)
lava 1.6.8 2020-09-26 [1] CRAN (R 4.0.2)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.0.0)
lhs 1.1.0 2020-09-29 [1] CRAN (R 4.0.2)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
listenv 0.8.0 2019-12-05 [1] CRAN (R 4.0.0)
lubridate 1.7.9 2020-06-08 [1] CRAN (R 4.0.2)
magick 2.4.0 2020-06-23 [1] CRAN (R 4.0.2)
magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
MASS 7.3-53 2020-09-09 [1] CRAN (R 4.0.2)
Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.2)
modeldata * 0.1.0 2020-10-22 [1] CRAN (R 4.0.2)
modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
nnet 7.3-14 2020-04-26 [1] CRAN (R 4.0.2)
parsnip * 0.1.4 2020-10-27 [1] CRAN (R 4.0.2)
pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
plotly 4.9.2.1 2020-04-04 [1] CRAN (R 4.0.2)
plotrix * 3.7-8 2020-04-16 [1] CRAN (R 4.0.2)
plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0)
png 0.1-7 2013-12-03 [1] CRAN (R 4.0.0)
pROC 1.16.2 2020-03-19 [1] CRAN (R 4.0.0)
prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.0.0)
purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.0)
Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.2)
readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.0)
recipes * 0.1.15 2020-11-11 [1] CRAN (R 4.0.2)
reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.0)
rlang 0.4.8 2020-10-08 [1] CRAN (R 4.0.2)
rmarkdown 2.4 2020-09-30 [1] CRAN (R 4.0.2)
rpart 4.1-15 2019-04-12 [1] CRAN (R 4.0.2)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
rsample * 0.0.8 2020-09-23 [1] CRAN (R 4.0.2)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
rvcheck 0.1.8 2020-03-01 [1] CRAN (R 4.0.0)
rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
scales * 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
sessioninfo * 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
showtext 0.9 2020-08-13 [1] CRAN (R 4.0.2)
showtextdb 3.0 2020-06-04 [1] CRAN (R 4.0.0)
sos * 2.0-0 2017-07-03 [1] CRAN (R 4.0.0)
stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
survival 3.2-7 2020-09-28 [1] CRAN (R 4.0.2)
sysfonts 0.8.1 2020-05-08 [1] CRAN (R 4.0.0)
tibble * 3.0.4 2020-10-12 [1] CRAN (R 4.0.2)
tidymodels * 0.1.1 2020-07-14 [1] CRAN (R 4.0.2)
tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
timeDate 3043.102 2018-02-21 [1] CRAN (R 4.0.0)
tune * 0.1.1 2020-07-08 [1] CRAN (R 4.0.2)
utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0)
vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
viridisLite 0.3.0 2018-02-01 [1] CRAN (R 4.0.0)
withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
workflows * 0.2.1 2020-10-08 [1] CRAN (R 4.0.2)
xfun 0.18 2020-09-29 [1] CRAN (R 4.0.2)
xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
yardstick * 0.0.7 2020-07-13 [1] CRAN (R 4.0.2)
[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
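To produce a comparable receipt for your own work, a minimal sketch is to call base R's sessionInfo() at the end of a script or report. Its layout differs from the listing above, which appears to come from the sessioninfo package listed there, but it records the same kind of information: R version, operating system, and loaded packages.

```r
# Collect details about the current R session
receipt <- sessionInfo()

# Print the receipt: R version, platform, locale, and loaded packages
print(receipt)

# Version numbers of individual packages are available too
packageVersion("stats")
```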
License
We R Novices by Alexander Savi & Simone Plak is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. An Open Educational Resource. Approved for Free Cultural Works.