October 4, 2017
Ever needed to use the sample function,
without a computer within reach?
And in the dark?
No correspondence will be entered into regarding the result.
Lecture
Tutorial
Last time we learned …
Today we'll learn …
"Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling."
Imagine that the mean psychology student's perseverance score is 20. In a sample of PML students, a mean of 26 is found.
Which sample means can we expect? We take (or simulate) many samples from the population and compare their means to our PML sample mean.
You can visualize the probability as follows:
The area is difficult to calculate. Why?
From which distribution does our sample mean originate?
Example 1: the binomial distribution (e.g., a Xhosa item).
m <- c()
for (i in 1:10000) {
  x <- rbinom(100, 1, .5)  # 100 Bernoulli trials
  m[i] <- mean(x)
}
hist(m, prob = TRUE, main = "sampling distribution")
Example 2: the gamma distribution (e.g., response time).
m <- c()
for (i in 1:10000) {
  x <- rgamma(100, 2)  # 100 gamma-distributed response times
  m[i] <- mean(x)
}
hist(m, prob = TRUE, main = "sampling distribution")
Milestone 1: If our sample is large enough, we know that its mean originates from an (approximately) normal distribution.
No! We don't know the mean and the standard deviation of this distribution.
How do we determine the mean?
How do we determine the standard deviation?
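A short simulation sketch (not from the slides) answers both questions at once: the mean of many simulated sample means sits at the population mean, and their standard deviation matches \(sd/\sqrt{N}\). The population parameters (mean 20, sd 5) are assumed for illustration:

```r
# Sketch with assumed population parameters: mean 20, sd 5.
set.seed(1)
n <- 100
m <- replicate(10000, mean(rnorm(n, mean = 20, sd = 5)))
mean(m)      # close to the population mean, 20
sd(m)        # close to the analytic standard error
5 / sqrt(n)  # sd/sqrt(N) = 0.5
```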
Milestone 2: If our sample is large enough, we can estimate the standard deviation of this distribution, the standard error, from the sample's own standard deviation: \(SE = sd / \sqrt{N}\).
Now that we know the distribution under our null hypothesis, a normal distribution with a mean of 20 and standard error \(SE\), we can calculate the area under the curve, and thus the probability of a sample mean at least as extreme as ours.
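As a sketch of that area calculation: the slides give no standard error for this example, so assume \(SE = 2\); the probability of drawing a sample mean of 26 or higher from the null distribution N(20, SE) is then the upper-tail area:

```r
# Assumed SE = 2 for illustration; the null distribution is then N(20, 2).
se <- 2
pnorm(26, mean = 20, sd = se, lower.tail = FALSE)  # area beyond 26
```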
That is, using data drawn from the population with some form of sampling, we can make a proposition about a population.
"A sampling distribution is the probability distribution of a given statistic based on a random sample."
Sampling distribution of the …
Standard error is the standard deviation of such a sampling distribution.
Watch a great animation that once again explains it very clearly (source: NYTimes / CreatureCast):
Watch another demonstration or fool around with it in R (use R console, don't use RStudio) (source: Vistat):
library(animation)
ani.options(interval = 1)
par(mar = c(3, 3, 1, 0.5), mgp = c(1.5, 0.5, 0), tcl = -0.3)
lambda <- 4
f <- function(n) rpois(n, lambda)
clt.ani(FUN = f, mean = lambda, sd = sqrt(lambda))  # a Poisson's sd is sqrt(lambda)
Or buy your favorite distribution.
## [1] 179 190 204 177 187 189 195 186 208 187 192 198 184 178 206 165 197
## [18] 188 198 192 209 176 204 208 188 163 193 182 196 191
What is your best guess of \(\mu\)?
mean(s) = 190.2867181
We have some randomness: How can we quantify this?
Confidence interval (CI)
qnorm(.025, 0, 1) # left border in standard normal distribution
## [1] -1.959964
pnorm(-1.96)
## [1] 0.0249979
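These two calls are each other's inverse: qnorm turns a tail probability into a border, pnorm turns a border back into a tail probability. The middle 95% of the standard normal lies between the two borders:

```r
qnorm(c(.025, .975))        # both borders: about -1.96 and 1.96
pnorm(1.96) - pnorm(-1.96)  # area between them: about 0.95
```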
If we were to repeat the experiment many times, the constructed interval would contain the true population mean (\(\mu\)) 95% of the time.
mean(s) - 1.96 * (sd(s) / sqrt(length(s))) # left border
## [1] 186.0866
mean(s) + 1.96 * (sd(s) / sqrt(length(s))) # right border
## [1] 194.4868
bounds <- matrix(NA, 100, 3)
for (i in 1:100) {
  mysample <- rnorm(30, 188, 10)  # some sample
  se_mysample <- sd(mysample) / sqrt(length(mysample))
  left_border  <- mean(mysample) - 1.96 * se_mysample
  right_border <- mean(mysample) + 1.96 * se_mysample
  in_interval <- 188 > left_border & 188 < right_border  # 1 Yes; 0 No
  bounds[i, ] <- c(left_border, right_border, in_interval)
}
head(bounds, 2)
##          [,1]     [,2] [,3]
## [1,] 183.5443 191.7547    1
## [2,] 181.6433 189.4397    1
head(bounds[bounds[,3]==0,], 2)
##          [,1]     [,2] [,3]
## [1,] 190.0145 195.6054    0
## [2,] 188.1520 195.2818    0
table(bounds[,3])
## 
##  0  1 
##  9 91
plot(1, xlim = c(1, 100), ylim = c(180, 200), type = 'n', axes = FALSE,
     ylab = 'CI', xlab = 'Experiment')
axis(1)
axis(2)
abline(h = 188, col = 'blue', lty = 3)
for (i in 1:100) {
  if (bounds[i, 3] == 1) {
    lines(x = c(i, i), y = c(bounds[i, 1:2]), col = 'darkgreen')
  } else {
    lines(x = c(i, i), y = c(bounds[i, 1:2]), col = 'darkred')
  }
}
What if we increase the sample size (e.g. 300)?
Do we get fewer CIs that don't include the true mean? Does the width of the CI change?
Tip: \(95\% = P(\bar{x} - 1.96 * \frac{sd}{\sqrt{N}} \le \mu \le \bar{x} + 1.96 * \frac{sd}{\sqrt{N}})\)
What if we increase the sample size (e.g. 300)?
s <- rnorm(300, 188, 10)
mean(s) - 1.96 * (sd(s) / sqrt(length(s))) # left border
## [1] 187.536
mean(s) + 1.96 * (sd(s) / sqrt(length(s))) # right border
## [1] 189.7576
Tip: \(95\% = P(\bar{x} - 1.96 * \frac{sd}{\sqrt{N}} \le \mu \le \bar{x} + 1.96 * \frac{sd}{\sqrt{N}})\)
x <- rnorm(30, 188, 10)
mean(x)
## [1] 187.7537
sd(x)/sqrt(length(x))
## [1] 1.609613
188.8 - 1.96 * 1.687 # left border (1.687 is the standard error)
## [1] 185.4935
qnorm(.025, 188.8, 1.687) # left border
## [1] 185.4935
Statements from: Rink Hoekstra et al. (2014). Robust misinterpretation of confidence intervals
x <- rnorm(30, 188, 10) t.test(x)
## 
##  One Sample t-test
## 
## data:  x
## t = 125.9, df = 29, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  184.0789 190.1585
## sample estimates:
## mean of x
##  187.1187
mean(x) + qt(.025, 29) * sd(x)/sqrt(length(x))
## [1] 184.0789
http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf
http://rpsychologist.com/d3/CI/
http://students.brown.edu/seeing-theory/statistical-inference/index.html#second
Today we learned …
Next we'll learn …
descriptive statistics, inferential statistics, probability, sum rule, mutually exclusive events, independence, product rule, conditional probability, Venn diagram, discrete probability distribution, continuous probability distribution, binomial distribution, quincunx, binomial theorem, normal distribution, gamma distribution, central limit theorem, sample mean, sampling distribution, standard deviation, standard error, one-sample t-test, confidence interval, 1.96, null hypothesis, p-value, p-value distribution, test statistic, t-value, z-value, Student's t-test, two-sample t-test, degrees of freedom, one- and two-tailed tests, statistical significance, type 1 and type 2 errors, significance level, family-wise error rate, multiple comparisons problem, false discovery rate, statistical power, observed / predicted power, Lindley's paradox, effect size, prediction / association, least squares, linear regression, linear equation, regression coefficients, polynomial regression, logistic regression, explained variation, errors and residuals, model selection, Occam's razor, saturated model, mean squared prediction error, bias-variance trade-off, overfitting / underfitting, adjusted R-squared, cross-validation, information criterion, statistical inference, frequentist inference, Bayesian inference, parametric statistics, nonparametric statistics, multicollinearity, heteroscedasticity
See syllabus and assignment.
Make sure to explain …
Deadline tomorrow 17:00.
Learn from your mistakes