- Grades for assignments 2 & 3 should be in today (fingers crossed)
- Answer sheets for assignments 2 & 3 available (handed out during the tutorial)
October 9, 2017
The author of the best question wins an analogue light-emitting sample function!
Lecture
Tutorial
Last time we learned …
Today we'll learn …
Until now:
(t.test(x, mu = 0))
(t.test(x, y))
Today:
data(women)
plot(women$height, women$weight)
\(y = b_0 + b_1 x + e\)
Informally: find a \(b_1\) such that var(error) is as small as possible.
\(y = b_0 + b_1 x + e\) versus the null model \(y = b_0 + e\) (i.e., \(b_1 = 0\))
x <- 1:10
y <- 5 + 1.5 * x + rnorm(length(x), 0, 2)
mod <- lm(y ~ x)
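Under the hood, `lm` picks the coefficients that minimize the sum of squared errors. A minimal sketch of the same idea with a general-purpose optimizer, reusing the `x` and `y` generated above (the helper name `sse` is ours):

sse <- function(b) sum((y - (b[1] + b[2] * x))^2)  # squared errors for intercept b[1], slope b[2]
optim(c(0, 0), sse)$par  # numerical minimum; close to coef(mod)
coef(mod)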
mod <- lm(y ~ x)                        # with intercept (implicit)
mod <- lm(y ~ 1 + x)                    # with intercept (explicit)
mod <- lm(y ~ -1 + x)                   # omit intercept
mod <- lm(y ~ x * z)                    # x + z + x:z
mod <- lm(y ~ x:z)                      # only the interaction x:z
mod <- lm(y ~ x1 * x2 * x3 - x1:x2:x3)  # no third-order interaction
mod <- lm(y ~ x + I(x^2))               # x + x^2
mod <- lm(y ~ poly(x, 2))               # x + x^2 (orthogonal polynomials)
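If in doubt about which terms a formula expands to, inspect the design matrix R builds from it. A small sketch (the toy predictor `z` is ours; `x` and `y` are from above):

z <- rep(c(0, 1), 5)                    # toy second predictor
colnames(model.matrix(y ~ x * z))       # "(Intercept)" "x" "z" "x:z"
colnames(model.matrix(y ~ x + I(x^2)))  # "(Intercept)" "x" "I(x^2)"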
summary(mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6148 -1.7504  0.5499  1.6458  2.3275
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0146     1.5158   1.989 0.081919 .
## x             1.5379     0.2443   6.295 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.219 on 8 degrees of freedom
## Multiple R-squared:  0.832,  Adjusted R-squared:  0.811
## F-statistic: 39.63 on 1 and 8 DF,  p-value: 0.000234
coef(mod)
## (Intercept)           x
##    3.014634    1.537854
coef(summary(mod))
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.014634  1.5157668 1.988851 0.081918642
## x           1.537854  0.2442879 6.295249 0.000233982
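The t value is simply the estimate divided by its standard error, and the p value follows from the t distribution with the residual degrees of freedom. Reproducing the row for `x` by hand:

t_x <- 1.537854 / 0.2442879  # Estimate / Std. Error = 6.295249
2 * pt(-abs(t_x), df = 8)    # two-sided p value = 0.000234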
str(mod)
str(summary(mod))
names(mod)
## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "xlevels" "call" "terms" "model"
names(summary(mod))
## [1] "call" "terms" "residuals" "coefficients" ## [5] "aliased" "sigma" "df" "r.squared" ## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
fitted(mod)
##         1         2         3         4         5         6         7
##  4.552488  6.090341  7.628195  9.166048 10.703902 12.241755 13.779609
##         8         9        10
## 15.317462 16.855316 18.393170
predict(mod)
##         1         2         3         4         5         6         7
##  4.552488  6.090341  7.628195  9.166048 10.703902 12.241755 13.779609
##         8         9        10
## 15.317462 16.855316 18.393170
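Without a `newdata` argument, `predict` simply returns the fitted values:

all.equal(fitted(mod), predict(mod))  # TRUE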
newdata <- data.frame(x = 11)  # has to be a data frame!
predict(mod, newdata)
##        1
## 19.93102
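`predict` can also return an interval around the prediction via its `interval` argument, for example for the same `newdata`:

predict(mod, newdata, interval = "confidence")  # CI for the mean response at x = 11
predict(mod, newdata, interval = "prediction")  # wider: interval for a single new observation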
resid(mod)
##          1          2          3          4          5          6
## -3.6147972  1.7341655  2.2823220  0.7060903  1.3805284 -2.0340745
##          7          8          9         10
## -0.8994253  2.3275091 -2.2760312  0.3937130
confint(mod)
##                  2.5 %   97.5 %
## (Intercept) -0.4807306 6.509999
## x            0.9745245 2.101183
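The interval is the estimate plus or minus a t quantile times the standard error. Reproducing the row for `x` by hand:

b  <- coef(summary(mod))["x", "Estimate"]
se <- coef(summary(mod))["x", "Std. Error"]
b + c(-1, 1) * qt(0.975, df = df.residual(mod)) * se  # matches confint(mod)["x", ]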
AIC(mod)
## [1] 48.08717
BIC(mod)
## [1] 48.99492
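Both criteria penalize the maximized log-likelihood for the number of estimated parameters (here three: intercept, slope, and residual variance); BIC penalizes harder as the sample grows. A hand computation that should match the values above:

ll <- logLik(mod)
k  <- attr(ll, "df")       # number of estimated parameters
n  <- length(resid(mod))   # number of observations
-2 * as.numeric(ll) + 2 * k       # AIC
-2 * as.numeric(ll) + log(n) * k  # BIC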
mod0 <- lm(y ~ 1)
anova(mod, mod0)
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ 1
##   Res.Df     RSS Df Sum of Sq     F   Pr(>F)
## 1      8  39.387
## 2      9 234.499 -1   -195.11 39.63 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod_women <- lm(weight ~ height, data = women)
mod_women0 <- lm(weight ~ 1, data = women)
anova(mod_women, mod_women0)
## Analysis of Variance Table
##
## Model 1: weight ~ height
## Model 2: weight ~ 1
##   Res.Df    RSS Df Sum of Sq    F    Pr(>F)
## 1     13   30.2
## 2     14 3362.9 -1   -3332.7 1433 1.091e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
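The F statistic compares the drop in residual sum of squares to the residual variance of the larger model: \(F = \frac{(RSS_0 - RSS_1)/(df_0 - df_1)}{RSS_1/df_1}\). A hand check for the `women` models (matching the table up to rounding):

rss1 <- sum(resid(mod_women)^2)    # ~30.2, model with height
rss0 <- sum(resid(mod_women0)^2)   # ~3362.9, intercept-only model
((rss0 - rss1) / 1) / (rss1 / 13)  # F on 1 and 13 df, ~1433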
resid(mod_women)
##           1           2           3           4           5           6
##  2.41666667  0.96666667  0.51666667  0.06666667 -0.38333333 -0.83333333
##           7           8           9          10          11          12
## -1.28333333 -1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333
##          13          14          15
##  0.01666667  1.56666667  3.11666667
resid(mod_women0)
##          1          2          3          4          5          6
## -21.733333 -19.733333 -16.733333 -13.733333 -10.733333  -7.733333
##          7          8          9         10         11         12
##  -4.733333  -1.733333   2.266667   5.266667   9.266667  13.266667
##         13         14         15
##  17.266667  22.266667  27.266667
round(cor(x, y), 3)
## [1] 0.912
round(coef(lm(y ~ x)), 3)
## (Intercept)           x
##       3.015       1.538
round(coef(lm(scale(y) ~ scale(x))), 3)
## (Intercept)    scale(x)
##       0.000       0.912
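This is no coincidence: in simple regression the slope and the correlation are linked by \(b_1 = r \cdot \frac{sd(y)}{sd(x)}\), so standardizing both variables (sd = 1) turns the slope into the correlation. A quick check with the `x` and `y` from above:

cor(x, y) * sd(y) / sd(x)  # reproduces the unstandardized slope b1
coef(lm(y ~ x))[2]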
y <- rep(c(0, 3), each = 15) + rnorm(30)  # 30 observations in two groups
group <- as.factor(rep(c('A', 'B'), each = 15))
mod <- lm(y ~ group)
summary(mod)
##
## Call:
## lm(formula = y ~ group)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.8144 -0.1852  0.2326  0.5212  1.4105
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3174     0.2482  -1.279    0.212
## groupB        3.2180     0.3511   9.166 6.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9614 on 28 degrees of freedom
## Multiple R-squared:  0.75,  Adjusted R-squared:  0.7411
## F-statistic: 84.02 on 1 and 28 DF,  p-value: 6.343e-10
aov(mod)
## Call:
##    aov(formula = mod)
##
## Terms:
##                    group Residuals
## Sum of Squares  77.66612  25.88235
## Deg. of Freedom        1        28
##
## Residual standard error: 0.9614414
## Estimated effects may be unbalanced
table(predict(mod))
##
## -0.317408484442345   2.90058509588782
##                 15                 15
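A regression on a single two-level factor is equivalent to a two-sample t-test with equal variances: the intercept is the mean of group A, and the `groupB` coefficient is the difference between the group means. A quick check (the t and p values match the `groupB` row above, up to the sign of t):

t.test(y ~ group, var.equal = TRUE)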
http://rpsychologist.com/d3/correlation/
http://students.brown.edu/seeing-theory/regression/index.html#first
75 mood items from the Motivational State Questionnaire for 3896 participants:
library(psych)           # load psych package
library(help = "psych")  # inspect enclosed data sets
?msq                     # inspect msq data set
data(msq)                # load msq
head(msq)
## active afraid alert angry anxious aroused ashamed astonished at.ease
## 1 1 1 1 0 1 1 0 0 1
## 2 1 0 1 0 0 0 0 0 1
## 3 1 0 0 0 0 0 0 0 2
## 4 1 0 1 0 1 1 1 0 1
## 5 2 0 1 0 NA 2 0 3 3
## 6 2 0 1 0 NA 1 0 0 1
## at.rest attentive blue bored calm cheerful clutched.up confident content
## 1 1 1 1 2 1 1 1 1 1
## 2 1 0 0 0 1 1 0 1 1
## 3 2 0 0 2 1 1 0 1 3
## 4 2 1 1 0 1 2 0 0 2
## 5 1 1 0 1 3 NA 0 2 2
## 6 1 1 1 1 2 NA 0 1 1
## delighted depressed determined distressed drowsy dull elated energetic
## 1 1 0 1 1 1 1 1 1
## 2 1 0 1 0 1 0 0 1
## 3 0 0 0 0 0 0 0 0
## 4 2 1 2 1 1 0 1 2
## 5 0 0 2 0 1 1 0 0
## 6 0 1 0 0 2 1 0 1
## enthusiastic excited fearful frustrated full.of.pep gloomy grouchy
## 1 1 1 1 1 0 1 1
## 2 0 0 0 0 1 0 0
## 3 0 0 0 1 0 0 0
## 4 3 3 0 1 2 1 0
## 5 1 2 0 0 2 0 0
## 6 1 0 0 1 1 1 1
## guilty happy hostile idle inactive inspired intense interested irritable
## 1 0 1 0 1 1 1 1 1 1
## 2 0 1 0 0 1 1 0 1 0
## 3 0 1 0 0 0 0 0 0 0
## 4 2 2 0 0 1 2 1 2 1
## 5 0 2 0 NA NA 0 1 1 0
## 6 0 1 0 NA NA 0 0 1 1
## jittery lively lonely nervous placid pleased proud quiescent quiet
## 1 1 1 1 1 1 1 1 1 1
## 2 0 1 0 0 2 1 1 0 0
## 3 0 0 0 0 1 0 0 0 2
## 4 3 2 1 2 1 1 2 0 1
## 5 1 2 0 0 3 1 0 0 3
## 6 0 1 0 0 1 1 1 0 2
## relaxed sad satisfied scared serene sleepy sluggish sociable sorry still
## 1 1 1 1 0 1 1 1 1 0 1
## 2 1 0 1 0 1 1 1 1 0 0
## 3 2 0 2 0 0 0 1 1 0 1
## 4 2 1 2 1 1 1 1 3 0 0
## 5 3 0 2 0 3 1 1 1 0 3
## 6 2 1 2 0 1 2 1 2 0 1
## strong surprised tense tired tranquil unhappy upset vigorous wakeful
## 1 1 1 1 1 1 1 0 1 1
## 2 1 0 0 1 1 0 0 1 1
## 3 0 0 0 1 1 0 0 0 0
## 4 2 1 0 1 0 1 1 0 3
## 5 1 3 0 1 NA 0 0 0 1
## 6 1 0 0 2 NA 1 0 0 2
## warmhearted wide.awake alone kindly scornful EA TA PA NegAff
## 1 1 1 NA NA NA 12 15 10 5
## 2 1 0 NA NA NA 12 11 7 0
## 3 1 1 NA NA NA 10 8 1 0
## 4 3 1 NA NA NA NA NA NA NA
## 5 2 0 2 2 0 13 4 11 1
## 6 2 1 0 2 0 11 8 8 1
## Extraversion Neuroticism Lie Sociability Impulsivity MSQ_Round ID
## 1 9 20 1 3 4 15 193
## 2 18 8 1 11 6 15 130
## 3 15 11 3 8 5 15 2135
## 4 20 20 1 12 7 NA 18
## 5 13 15 4 5 7 6 2
## 6 19 15 2 10 7 6 3
## condition MSQ_Time TOD TOD24 scale exper
## 1 2 15.30 15.00 NA r Rim.1
## 2 2 15.30 15.00 NA r Rim.2
## 3 2 15.30 15.00 NA r Rim.2
## 4 2 NA NA NA r COPE
## 5 5 5.75 5.83 5.5 msq rob-1
## 6 5 5.75 5.83 5.5 msq rob-1
An initial naive model:
dat <- msq[1:5, ]  # subset of the data
mdl_1 <- lm(dat$Sociability ~ dat$confident)
summary(mdl_1)$r.squared
## [1] 0.4166667
The explained variance (\(R^2\)) is 0.42, in other words:
42% of the variance in sociability can be accounted for by your confidence.
Wikipedia
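\(R^2\) is defined as \(1 - SS_{res}/SS_{tot}\): one minus the residual sum of squares over the total sum of squares around the mean. Computing it by hand should reproduce `summary(mdl_1)$r.squared`:

ss_res <- sum(resid(mdl_1)^2)
ss_tot <- sum((dat$Sociability - mean(dat$Sociability))^2)
1 - ss_res / ss_tot  # 0.4166667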
mdl_2 <- lm(dat$Sociability ~ dat$confident + dat$interested)
summary(mdl_2)$r.squared
## [1] 0.4195011
mdl_3 <- lm(dat$Sociability ~ dat$confident + dat$interested + dat$enthusiastic)
summary(mdl_3)$r.squared
## [1] 0.5102041
mdl_4 <- lm(dat$Sociability ~ dat$confident + dat$interested + dat$enthusiastic + dat$hostile)
summary(mdl_4)$r.squared
## [1] 0.5102041
mdl_5 <- lm(dat$Sociability ~ dat$confident + dat$interested + dat$enthusiastic + dat$hostile + dat$ID)
summary(mdl_5)$r.squared
## [1] 1
observed <- dat$Sociability
fitted <- fitted(mdl_3)
data.frame(observed, fitted, squared_error = (observed - fitted)^2)
##   observed fitted squared_error
## 1        3    7.8         23.04
## 2       11    9.8          1.44
## 3        8    6.8          1.44
## 4       12   10.8          1.44
## 5        5    3.8          1.44
How does our model fare on new data?
dat_new <- msq[6:10, ]
observed <- dat_new$Sociability
predicted <- predict(mdl_3, dat_new)
# NB: because mdl_3's formula uses dat$..., predict() cannot find those names in
# dat_new and silently falls back to the training data (compare fitted(mdl_3) above);
# fit with lm(Sociability ~ ..., data = dat) to genuinely predict on new data
data.frame(observed, predicted, squared_error = (observed - predicted)^2)
##    observed predicted squared_error
## 6        10       7.8          4.84
## 7         6       9.8         14.44
## 8         8       6.8          1.44
## 9         5      10.8         33.64
## 10        9       3.8         27.04
x <- 1:10  # example adapted from Abe
y <- 5 + 1.5 * x + rnorm(length(x), 0, 5)
mod <- lm(y ~ poly(x, 9))  # 9th-order polynomial: one parameter per data point
plot(x, y, bty = "n")
lines(x, fitted(mod))
"Among competing hypotheses, the one with the fewest assumptions should be selected."
Penalize the model for complexity. This can be done in many ways; we discuss three:
\(R^2\)
data.frame(
  R_squared = c(summary(mdl_1)$r.squared, summary(mdl_2)$r.squared,
                summary(mdl_3)$r.squared, summary(mdl_4)$r.squared,
                summary(mdl_5)$r.squared),
  adjusted_R_squared = c(summary(mdl_1)$adj.r.squared, summary(mdl_2)$adj.r.squared,
                         summary(mdl_3)$adj.r.squared, summary(mdl_4)$adj.r.squared,
                         summary(mdl_5)$adj.r.squared)
)
##   R_squared adjusted_R_squared
## 1 0.4166667          0.2222222
## 2 0.4195011         -0.1609977
## 3 0.5102041         -0.9591837
## 4 0.5102041         -0.9591837
## 5 1.0000000                NaN
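Adjusted \(R^2\) shrinks \(R^2\) according to the number of predictors \(p\) relative to the number of observations \(n\): \(R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\). A hand check for `mdl_1` (\(n = 5\), \(p = 1\)) reproduces the first row of the table:

r2 <- summary(mdl_1)$r.squared
n  <- 5  # rows in dat
p  <- 1  # predictors in mdl_1
1 - (1 - r2) * (n - 1) / (n - p - 1)  # 0.2222222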
Which model wins?
What is cross-validation?
We trained the models on data points 1:5; now we test them on the test set and determine their fit:
observed <- dat_new$Sociability
predicted_1 <- predict(mdl_1, dat_new)
predicted_2 <- predict(mdl_2, dat_new)
predicted_3 <- predict(mdl_3, dat_new)
sum((observed - predicted_1)^2)
## [1] 69.9
sum((observed-predicted_2)^2)
## [1] 70.73333
sum((observed-predicted_3)^2)
## [1] 81.4
How can we judge their fit? Which model wins?
Can things go wrong with cross-validation? Or: can we increase reliability?
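One way to increase reliability is to average over many train/test splits instead of relying on a single one, as in leave-one-out cross-validation. A minimal sketch on the first ten rows of `msq` (the loop and variable names are ours; we deliberately use the `lm(Sociability ~ confident, data = ...)` form, since `predict` only uses `newdata` when the formula refers to plain column names):

library(psych)
data(msq)
d <- na.omit(msq[1:10, c("Sociability", "confident")])

# leave-one-out: fit on all rows except i, then predict row i
errors <- sapply(seq_len(nrow(d)), function(i) {
  fit <- lm(Sociability ~ confident, data = d[-i, ])
  d$Sociability[i] - predict(fit, d[i, ])
})
mean(errors^2)  # leave-one-out mean squared prediction error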
Finally, the AIC and BIC. You'll find out more in the assignment.
In short:
data.frame(AIC = c(AIC(mdl_1), AIC(mdl_2), AIC(mdl_3)),
           BIC = c(BIC(mdl_1), BIC(mdl_2), BIC(mdl_3)))
##        AIC      BIC
## 1 29.81792 28.64624
## 2 31.79357 30.23132
## 3 32.94407 30.99126
Which model wins?
Adjusted R-squared, cross-validation, AIC, BIC, help?!
Two visualizations by Joel Schneider for explaining explained variance.
Many kinds of cross-validation, such as leave-p-out cross-validation, on Wikipedia.
AIC, BIC, been there, done that. But DIC? FIC? WAIC? Learn more about information criteria or invent your own.
descriptive statistics, inferential statistics, probability, sum rule, mutually exclusive events, independence, product rule, conditional probability, Venn diagram, discrete probability distribution, continuous probability distribution, binomial distribution, quincunx, binomial theorem, normal distribution, gamma distribution, central limit theorem, sample mean, sampling distribution, standard deviation, standard error, one-sample t-test, confidence interval, 1.96, null hypothesis, p-value, p-value distribution, test statistic, t-value, z-value, Student's t-test, two-sample t-test, one- and two-tailed tests, statistical significance, type 1 and type 2 errors, significance level, family-wise error rate, multiple comparisons problem, effect size, statistical power, prediction / association, linear regression, regression coefficients, polynomial regression, explained variation, errors and residuals, model selection, Occam's razor, saturated model, mean squared prediction error, overfitting, adjusted R-squared, cross-validation, information criterion
Our door is open for questions, also after the course!
Year 1 | Year 2-3 | (Research) Master
---|---|---
correlation | partial correlation | non-linear regression
linear regression | MANOVA | logistic regression
1-way ANOVA (t-test) | MANCOVA | IRT
validity | ANCOVA | DIF
reliability | classical test theory | factor analysis
effect size | PCA | time series
power | meta-analysis |
confidence intervals | cluster analysis |
significance tests | latent class analysis |
 | structural equation models … |
Level | | regr. | rasch | cluster
---|---|---|---|---
1 | no idea… | | |
2 | find the button | | |
3 | understand output | | |
4 | when to use this method | X | |
5 | simulate data | X | |
6 | program the method | X | |
7 | do the maths | | |
See syllabus and assignment.
Make sure to explain …
Deadline tonight 23:59.
Fool around
That is, consolidate and deepen your understanding
Please tell us
Other tips or ideas are also welcome!
Preparation
Exam
15-minute break in between
Be on time!