October 9, 2017

Assignments

  • Grades for assignments 2 & 3 are in today (fingers crossed)
  • Answer sheets for assignments 2 & 3 are available (handed out during tutorial)

Pub quiz

The author of the best question wins an analogue light-emitting sample function!

  • Make pairs
  • Log in at kahoot.it or download the app
  • Enter the game pin

Today

Lecture

  • Warming up
  • Linear regression
  • Model selection
  • Cooling down

Tutorial

  • Study advice and bachelor projects (Manon Slockers)
  • Assignment 4

Warming Up

Where are we?

Last time we learned …

  • when and how to use the sum and product rule (Monday)
  • what a probability distribution is (e.g., the quincunx) and …
  • how to calculate probabilities based on the binomial distribution (Monday)
  • what a distribution of sample means looks like (Wednesday)
  • why the central limit theorem is so useful (Wednesday)
  • and what a confidence interval means (Wednesday)
  • how we can test hypotheses (p-values) (Friday)
  • and why power is so important (Friday)

Today we'll learn …

  • to recap linear regression and fit models in R
  • to use all our skills in probabilities, inference, and validity in regression
  • and how to select the best model

Linear Regression

From means to relations

Until now:

  • is \(\bar{x}\) statistically different from 0 (\(\mu\))? (t.test(x, mu = 0))
  • is \(\bar{x}\) statistically different from \(\bar{y}\)? (t.test(x, y))

Today:

  • what is the relation between x and y?
  • or: can we predict the values of y with x?
  • or: how much variance can we explain in y with x?
  • goal: if we know x we can predict y in a new situation (experiment / data set / …)

Regression

data(women)
plot(women$height, women$weight)

Regression

\(y = b_0 + b_1 x + e\)

Informally: find a \(b_1\) such that \(\mathrm{var}(e)\) is as small as possible.
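A minimal sketch of that idea (illustrative code, not the lecture's): scan a grid of candidate slopes, give each the best matching intercept, and check that the residual variance is smallest near the slope that lm() finds.

set.seed(1)
x <- 1:10
y <- 5 + 1.5 * x + rnorm(length(x), 0, 2)
b1_grid <- seq(0, 3, by = 0.01)  # candidate slopes
res_var <- sapply(b1_grid, function(b1) {
  b0 <- mean(y) - b1 * mean(x)   # best intercept for this slope
  var(y - (b0 + b1 * x))         # variance of the resulting errors
})
b1_grid[which.min(res_var)]      # close to coef(lm(y ~ x))["x"]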

Regression

\(y = b_0 + b_1 x + e\) versus \(y = b_0 + e\) (i.e., \(b_1 = 0\))

To Rrrrrr

x <- 1:10                                  # predictor
y <- 5 + 1.5 * x + rnorm(length(x), 0, 2)  # true intercept 5, slope 1.5, noise sd 2
mod <- lm(y ~ x)                           # fit the simple linear regression

Formulas

mod <- lm(y ~ x)       # with intercept (implicit)
mod <- lm(y ~ 1 + x)   # with intercept (explicit)
mod <- lm(y ~ -1 + x)  # omit the intercept

mod <- lm(y ~ x * z)   # expands to x + z + x:z
mod <- lm(y ~ x:z)     # only the interaction x:z
mod <- lm(y ~ x1 * x2 * x3 - x1:x2:x3)  # all terms except the three-way interaction

mod <- lm(y ~ x + I(x^2))   # x + x^2
mod <- lm(y ~ poly(x, 2))   # orthogonal polynomials of degree 1 and 2 (raw = TRUE for raw powers)
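To check what a formula expands to, you can inspect the design matrix (a quick illustration with throwaway variables, not from the slides):

xx <- 1:4; zz <- c(0, 1, 0, 1)
colnames(model.matrix(~ xx * zz))   # "(Intercept)" "xx" "zz" "xx:zz"
colnames(model.matrix(~ xx:zz))     # "(Intercept)" "xx:zz"
colnames(model.matrix(~ -1 + xx))   # "xx"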

Functions

# recall the data-generating model: y <- 5 + 1.5 * x + rnorm(length(x), 0, 2)

summary(mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6148 -1.7504  0.5499  1.6458  2.3275 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0146     1.5158   1.989 0.081919 .  
## x             1.5379     0.2443   6.295 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.219 on 8 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.811 
## F-statistic: 39.63 on 1 and 8 DF,  p-value: 0.000234

Functions

coef(mod)
## (Intercept)           x 
##    3.014634    1.537854
coef(summary(mod))
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.014634  1.5157668 1.988851 0.081918642
## x           1.537854  0.2442879 6.295249 0.000233982

Get info

str(mod)
str(summary(mod))
names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
names(summary(mod))
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

Predictions

fitted(mod)
##         1         2         3         4         5         6         7 
##  4.552488  6.090341  7.628195  9.166048 10.703902 12.241755 13.779609 
##         8         9        10 
## 15.317462 16.855316 18.393170
predict(mod)
##         1         2         3         4         5         6         7 
##  4.552488  6.090341  7.628195  9.166048 10.703902 12.241755 13.779609 
##         8         9        10 
## 15.317462 16.855316 18.393170
newdata <- data.frame(x = 11)  # newdata must be a data frame with the predictor columns
predict(mod, newdata)
##        1 
## 19.93102
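predict() can also return interval estimates (standard lm functionality):

predict(mod, newdata, interval = "confidence")  # CI for the mean response at x = 11
predict(mod, newdata, interval = "prediction")  # wider: covers a single new observation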

More info

resid(mod)
##          1          2          3          4          5          6 
## -3.6147972  1.7341655  2.2823220  0.7060903  1.3805284 -2.0340745 
##          7          8          9         10 
## -0.8994253  2.3275091 -2.2760312  0.3937130
confint(mod)
##                  2.5 %   97.5 %
## (Intercept) -0.4807306 6.509999
## x            0.9745245 2.101183
AIC(mod)
## [1] 48.08717
BIC(mod)
## [1] 48.99492

Model Comparison

mod0 <- lm(y ~ 1)
anova(mod, mod0)
## Analysis of Variance Table
## 
## Model 1: y ~ x
## Model 2: y ~ 1
##   Res.Df     RSS Df Sum of Sq     F   Pr(>F)    
## 1      8  39.387                                
## 2      9 234.499 -1   -195.11 39.63 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Comparison

mod_women <- lm(weight ~ height, data = women)
mod_women0 <- lm(weight ~ 1, data = women)
anova(mod_women, mod_women0)
## Analysis of Variance Table
## 
## Model 1: weight ~ height
## Model 2: weight ~ 1
##   Res.Df    RSS Df Sum of Sq    F    Pr(>F)    
## 1     13   30.2                                
## 2     14 3362.9 -1   -3332.7 1433 1.091e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residuals

resid(mod_women)
##           1           2           3           4           5           6 
##  2.41666667  0.96666667  0.51666667  0.06666667 -0.38333333 -0.83333333 
##           7           8           9          10          11          12 
## -1.28333333 -1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333 
##          13          14          15 
##  0.01666667  1.56666667  3.11666667
resid(mod_women0)
##          1          2          3          4          5          6 
## -21.733333 -19.733333 -16.733333 -13.733333 -10.733333  -7.733333 
##          7          8          9         10         11         12 
##  -4.733333  -1.733333   2.266667   5.266667   9.266667  13.266667 
##         13         14         15 
##  17.266667  22.266667  27.266667

Correlation == Regression

round(cor(x, y), 3)
## [1] 0.912
round(coef(lm(y ~ x)), 3)
## (Intercept)           x 
##       3.015       1.538
round(coef(lm(scale(y) ~ scale(x))), 3)
## (Intercept)    scale(x) 
##       0.000       0.912
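Why this works (a standard identity, not shown in the slides): the least-squares slope is

\[ b_1 = r \, \frac{s_y}{s_x}, \]

so after standardizing both variables (\(s_x = s_y = 1\)) the slope reduces to the correlation \(r\).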

ANOVA == Regression

y <- rep(c(0, 3), each = 15) + rnorm(30)  # group means 0 and 3, n = 30 (rnorm(10) would silently recycle)
group <- as.factor(rep(c('A', 'B'), each = 15))
mod <- lm(y ~ group)

ANOVA == Regression

summary(mod)
## 
## Call:
## lm(formula = y ~ group)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8144 -0.1852  0.2326  0.5212  1.4105 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3174     0.2482  -1.279    0.212    
## groupB        3.2180     0.3511   9.166 6.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9614 on 28 degrees of freedom
## Multiple R-squared:   0.75,  Adjusted R-squared:  0.7411 
## F-statistic: 84.02 on 1 and 28 DF,  p-value: 6.343e-10

ANOVA == Regression

aov(mod)
## Call:
##    aov(formula = mod)
## 
## Terms:
##                    group Residuals
## Sum of Squares  77.66612  25.88235
## Deg. of Freedom        1        28
## 
## Residual standard error: 0.9614414
## Estimated effects may be unbalanced
table(predict(mod))
## 
## -0.317408484442345   2.90058509588782 
##                 15                 15
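A quick check (not in the slides): with the default treatment contrasts, the groupB coefficient is exactly the difference between the two group means, and the intercept is the mean of group A.

means <- tapply(y, group, mean)
means["B"] - means["A"]   # equals coef(mod)["groupB"]
means["A"]                # equals coef(mod)["(Intercept)"]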

Model Selection

Data

75 mood items from the Motivational State Questionnaire for 3896 participants:

library(psych)  # load psych package
library(help = "psych")  # inspect enclosed data sets
?msq  # inspect msq data set
data(msq)  # load msq
head(msq)
##   active afraid alert angry anxious aroused ashamed astonished at.ease
## 1      1      1     1     0       1       1       0          0       1
## 2      1      0     1     0       0       0       0          0       1
## 3      1      0     0     0       0       0       0          0       2
## 4      1      0     1     0       1       1       1          0       1
## 5      2      0     1     0      NA       2       0          3       3
## 6      2      0     1     0      NA       1       0          0       1
##   at.rest attentive blue bored calm cheerful clutched.up confident content
## 1       1         1    1     2    1        1           1         1       1
## 2       1         0    0     0    1        1           0         1       1
## 3       2         0    0     2    1        1           0         1       3
## 4       2         1    1     0    1        2           0         0       2
## 5       1         1    0     1    3       NA           0         2       2
## 6       1         1    1     1    2       NA           0         1       1
##   delighted depressed determined distressed drowsy dull elated energetic
## 1         1         0          1          1      1    1      1         1
## 2         1         0          1          0      1    0      0         1
## 3         0         0          0          0      0    0      0         0
## 4         2         1          2          1      1    0      1         2
## 5         0         0          2          0      1    1      0         0
## 6         0         1          0          0      2    1      0         1
##   enthusiastic excited fearful frustrated full.of.pep gloomy grouchy
## 1            1       1       1          1           0      1       1
## 2            0       0       0          0           1      0       0
## 3            0       0       0          1           0      0       0
## 4            3       3       0          1           2      1       0
## 5            1       2       0          0           2      0       0
## 6            1       0       0          1           1      1       1
##   guilty happy hostile idle inactive inspired intense interested irritable
## 1      0     1       0    1        1        1       1          1         1
## 2      0     1       0    0        1        1       0          1         0
## 3      0     1       0    0        0        0       0          0         0
## 4      2     2       0    0        1        2       1          2         1
## 5      0     2       0   NA       NA        0       1          1         0
## 6      0     1       0   NA       NA        0       0          1         1
##   jittery lively lonely nervous placid pleased proud quiescent quiet
## 1       1      1      1       1      1       1     1         1     1
## 2       0      1      0       0      2       1     1         0     0
## 3       0      0      0       0      1       0     0         0     2
## 4       3      2      1       2      1       1     2         0     1
## 5       1      2      0       0      3       1     0         0     3
## 6       0      1      0       0      1       1     1         0     2
##   relaxed sad satisfied scared serene sleepy sluggish sociable sorry still
## 1       1   1         1      0      1      1        1        1     0     1
## 2       1   0         1      0      1      1        1        1     0     0
## 3       2   0         2      0      0      0        1        1     0     1
## 4       2   1         2      1      1      1        1        3     0     0
## 5       3   0         2      0      3      1        1        1     0     3
## 6       2   1         2      0      1      2        1        2     0     1
##   strong surprised tense tired tranquil unhappy upset vigorous wakeful
## 1      1         1     1     1        1       1     0        1       1
## 2      1         0     0     1        1       0     0        1       1
## 3      0         0     0     1        1       0     0        0       0
## 4      2         1     0     1        0       1     1        0       3
## 5      1         3     0     1       NA       0     0        0       1
## 6      1         0     0     2       NA       1     0        0       2
##   warmhearted wide.awake alone kindly scornful EA TA PA NegAff
## 1           1          1    NA     NA       NA 12 15 10      5
## 2           1          0    NA     NA       NA 12 11  7      0
## 3           1          1    NA     NA       NA 10  8  1      0
## 4           3          1    NA     NA       NA NA NA NA     NA
## 5           2          0     2      2        0 13  4 11      1
## 6           2          1     0      2        0 11  8  8      1
##   Extraversion Neuroticism Lie Sociability Impulsivity MSQ_Round   ID
## 1            9          20   1           3           4        15  193
## 2           18           8   1          11           6        15  130
## 3           15          11   3           8           5        15 2135
## 4           20          20   1          12           7        NA   18
## 5           13          15   4           5           7         6    2
## 6           19          15   2          10           7         6    3
##   condition MSQ_Time   TOD TOD24 scale exper
## 1         2    15.30 15.00    NA     r Rim.1
## 2         2    15.30 15.00    NA     r Rim.2
## 3         2    15.30 15.00    NA     r Rim.2
## 4         2       NA    NA    NA     r  COPE
## 5         5     5.75  5.83   5.5   msq rob-1
## 6         5     5.75  5.83   5.5   msq rob-1

Model

An initial naive model:

dat <- msq[1:5, ]  # a tiny subset: the first five participants
mdl_1 <- lm(dat$Sociability ~ dat$confident)
summary(mdl_1)$r.squared
## [1] 0.4166667

The explained variance (\(R^2\)) is 0.42, in other words:

42% of the variance in sociability can be accounted for by your confidence.
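For reference, the usual definition of \(R^2\) in terms of residual and total sums of squares:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{SS_{res}}{SS_{tot}} \]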

\(R^2\)

See the Wikipedia article on \(R^2\) (coefficient of determination).

Model building

mdl_2 <- lm(dat$Sociability ~ dat$confident + dat$interested)
summary(mdl_2)$r.squared
## [1] 0.4195011
mdl_3 <- lm(dat$Sociability ~ dat$confident + dat$interested + dat$enthusiastic)
summary(mdl_3)$r.squared
## [1] 0.5102041
mdl_4 <- lm(dat$Sociability ~ dat$confident + dat$interested + dat$enthusiastic 
            + dat$hostile); summary(mdl_4)$r.squared
## [1] 0.5102041
mdl_5 <- lm(dat$Sociability ~ dat$confident + dat$interested + dat$enthusiastic 
            + dat$hostile + dat$ID); summary(mdl_5)$r.squared
## [1] 1

Model selection

  1. Model 5 is saturated: as many parameters as data points.
  2. Models 3 and 4 explain the same amount of variance; which one should we choose?
  3. I choose model 3: the more parsimonious of the two.
observed = dat$Sociability
fitted = fitted(mdl_3)
data.frame(observed, fitted, squared_error = (observed-fitted)^2)
##   observed fitted squared_error
## 1        3    7.8         23.04
## 2       11    9.8          1.44
## 3        8    6.8          1.44
## 4       12   10.8          1.44
## 5        5    3.8          1.44
  • We've got a winner!
  • Do we? Does the model generalize to new data?

New data

How does our model fare on new data?

dat_new <- msq[6:10,]
observed = dat_new$Sociability
predicted = predict(mdl_3, dat_new)
data.frame(observed, predicted, squared_error = (observed-predicted)^2)
##    observed predicted squared_error
## 6        10       7.8          4.84
## 7         6       9.8         14.44
## 8         8       6.8          1.44
## 9         5      10.8         33.64
## 10        9       3.8         27.04
  • Terrible!

What's wrong?

  • We modelled noise.
  • What is meant by this? And what do we call it?

Overfitting

x <- 1:10  # adapted from an example by Abe
y <- 5 + 1.5 * x + rnorm(length(x), 0, 5)
mod <- lm(y ~ poly(x, 9))  # 9th-order polynomial: as many parameters as data points
plot(x, y, bty = "n")
lines(x, fitted(mod))  # the fitted curve passes through every point exactly

  • How can we prevent this?

Occam's razor

"Among competing hypotheses, the one with the fewest assumptions should be selected."

Model selection without overfitting

Penalize the model for complexity. This can be done in many ways; we discuss three:

  1. Adjusted R-squared
  2. Cross-validation
  3. AIC/BIC

Adjusted R-squared

\(R^2\)

  • doesn't take complexity (i.e., the number of predictors) into account
  • almost always increases with a more complex model. Why?
  • is a biased population estimator: persistently too-high estimates (simulate and see, as sketched below :))
  • compare with the sample mean: an unbiased population estimator. Why?
  • and normally distributed, too!
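A minimal simulation of that bias (illustrative setup, not the lecture's code): y is pure noise, so the true population \(R^2\) is 0, yet the sample \(R^2\) stays well above 0 on average.

sim_r2 <- function(n = 20, p = 5) {
  x <- matrix(rnorm(n * p), nrow = n)  # p useless predictors
  y <- rnorm(n)                        # y is unrelated to x
  summary(lm(y ~ x))$r.squared
}
set.seed(1)
mean(replicate(1000, sim_r2()))  # persistently above the true value of 0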

Adjusted \(R^2\)

  • adjusted for model complexity
  • can decrease, and even become negative (then no longer interpretable as a proportion of variance)
  • (approximately) unbiased
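For reference, the standard adjustment, with \(n\) observations and \(p\) predictors:

\[ R^2_{adj} = 1 - (1 - R^2) \, \frac{n - 1}{n - p - 1} \]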

Adjusted R-squared

mdls <- list(mdl_1, mdl_2, mdl_3, mdl_4, mdl_5)
data.frame(R_squared = sapply(mdls, function(m) summary(m)$r.squared),
           adjusted_R_squared = sapply(mdls, function(m) summary(m)$adj.r.squared))
##   R_squared adjusted_R_squared
## 1 0.4166667          0.2222222
## 2 0.4195011         -0.1609977
## 3 0.5102041         -0.9591837
## 4 0.5102041         -0.9591837
## 5 1.0000000                NaN

Which model wins?

Cross-validation

What is cross-validation?

  • We just did it: training on data points 1:5, testing on data points 6:10.
  • How could we use this as a model selection principle?

Cross-validation

We trained the models on data points 1:5; now we test them on the test set (data points 6:10) and determine their fit:

observed = dat_new$Sociability
predicted_1 = predict(mdl_1, dat_new)
predicted_2 = predict(mdl_2, dat_new)
predicted_3 = predict(mdl_3, dat_new)
sum((observed-predicted_1)^2)
## [1] 69.9
sum((observed-predicted_2)^2)
## [1] 70.73333
sum((observed-predicted_3)^2)
## [1] 81.4

How can we judge their fit? Which model wins?

K-fold cross-validation

Can things go wrong with cross-validation? Or: can we increase reliability?

  • We train on just one part of the data, and test on just one other part.
  • K-fold cross-validation to the rescue:

  • You fit and test every model on every fold and average the error rates; see the sketch below.
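A minimal k-fold sketch (illustrative: the subset, fold assignment, and single-predictor model are assumptions, not the lecture's code):

set.seed(1)
k <- 5
cv_dat <- msq[1:100, c("Sociability", "confident")]
cv_dat <- cv_dat[complete.cases(cv_dat), ]            # drop missing values
folds <- sample(rep(1:k, length.out = nrow(cv_dat)))  # random fold labels
mse <- sapply(1:k, function(i) {
  fit <- lm(Sociability ~ confident, data = cv_dat[folds != i, ])
  pred <- predict(fit, newdata = cv_dat[folds == i, ])
  mean((cv_dat$Sociability[folds == i] - pred)^2)     # test error on fold i
})
mean(mse)  # average test error; compare this across competing models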

AIC and BIC

Finally the AIC and BIC. You'll find out more in the assignment.

In short:

data.frame(AIC = c(AIC(mdl_1), AIC(mdl_2), AIC(mdl_3)),
           BIC = c(BIC(mdl_1), BIC(mdl_2), BIC(mdl_3)))
##        AIC      BIC
## 1 29.81792 28.64624
## 2 31.79357 30.23132
## 3 32.94407 30.99126
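Both are standard penalized-fit measures; lower is better. With \(k\) estimated parameters, maximized likelihood \(\hat{L}\), and \(n\) observations:

\[ AIC = 2k - 2 \ln \hat{L}, \qquad BIC = k \ln(n) - 2 \ln \hat{L} \]

BIC punishes extra parameters more heavily than AIC once \(n > e^2 \approx 7.4\).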

Which model wins?

Model selection selection?!

Adjusted R-squared, cross-validation, AIC, BIC, help?!

  • Advanced material! But with the tools provided in this course, and a bit of puzzling, it's possible.

Dig deeper

Two visualizations for explaining variance explained by Joel Schneider.

Many kinds of cross-validation, such as leave-p-out cross-validation, on Wikipedia.

AIC, BIC, been there, done that. But DIC? FIC? WAIC? Learn more about information criteria or invent your own.

Cooling Down

Where are we?

descriptive statistics, inferential statistics, probability, sum rule, mutually exclusive events, independence, product rule, conditional probability, Venn diagram, discrete probability distribution, continuous probability distribution, binomial distribution, quincunx, binomial theorem, normal distribution, gamma distribution, central limit theorem, sample mean, sampling distribution, standard deviation, standard error, one-sample t-test, confidence interval, 1.96, null hypothesis, p-value, p-value distribution, test statistic, t-value, z-value, Student's t-test, two-sample t-test, one- and two-tailed tests, statistical significance, type 1 and type 2 errors, significance level, family-wise error rate, multiple comparisons problem, effect size, statistical power, prediction / association, linear regression, regression coefficients, polynomial regression, explained variation, errors and residuals, model selection, Occam's razor, saturated model, mean squared prediction error, overfitting, adjusted R-squared, cross-validation, information criterion

Our door is open for questions, even after the course!

How to approach learning statistics

Year 1                 Year 2-3                     (Research) Master
correlation            partial cor                  non-linear regression
lin regression         MANOVA                       logistic regression
1-way ANOVA (t-test)   MANCOVA                      IRT
validity               ANCOVA                       DIF
reliability            classical test theory        factor analysis
effect size            PCA                          time series
power                  meta-analysis
confidence intervals   cluster analysis
significance tests     latent class analysis
                       structural equation models   …

How to approach learning statistics

Level                        regr.   rasch   cluster
1  no idea…
2  find the button
3  understand output
4  when to use this method      X
5  simulate data                X
6  program the method           X
7  do the maths

How to approach learning statistics

  1. Pick a method you want to learn
  2. Read a book, paper, or tutorial
  3. Find out how to do it in R (you can do anything!)
  4. Check whether you understand what happened:
       a. simulate data that you can analyse, and check whether you can find 'it' back in your results
       b. test different scenarios!
  5. Apply the method to some real data
  6. Draw a justified inference based on your knowledge of statistics

Assignment 4

See syllabus and assignment.

Make sure to explain …

  • in your own words
  • what you do,
  • how you do it,
  • and why you do it.

Deadline tonight 23:59.

Accelerate your learning

Fool around

  • e.g., apply a method to a problem you personally like
  • try extreme values in a simulation
  • question every detail of a model
  • play with the parameters of a model
  • minimize your lines of code
  • relate a concept to something entirely different
  • can you create a puzzle for the class?
  • what's your favorite probability distribution?

That is, consolidate and deepen your understanding

Evaluation

Please tell us

  • what you liked about our lectures, tutorials, and assignments
  • and where there's room for improvement

Other tips or ideas are also welcome!

Exam

Preparation

  • Question time for mathematics, programming, and statistics: Wednesday, 9:00-11:00
  • Literature + exercises from assignments
  • Fool around

Exam

  • Part 1: pen & paper
  • Part 2: R and other materials allowed, but no communication

15-minute break in between

Be on time!