October 6, 2017

Assignments

    • Grades are in!
    • Compare mistakes with answer sheet
    • Remaining questions during tutorial
    • 11c: What is the probability of obtaining any of the numbers 2, 4, or 6 if a die is tossed three times in a row?
      • Not tossing a 2, 4, or 6 in 3 tosses has probability \(.5^3\); tossing a 2, 4, or 6 at least once is therefore \(1-.5^3 = .875\).
      • Or sum over the toss on which the first even number appears: \(.5+.5^2+.5^3 = .875\) (see the check after this list).
    • We're grading!
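
A quick check in R that both routes agree:

1 - .5^3          # complement: at least one 2, 4, or 6 in three tosses
## [1] 0.875
.5 + .5^2 + .5^3  # first even number on toss 1, 2, or 3
## [1] 0.875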

Pub quiz

The author of the best question wins an analogue light-emitting sample function!

  • Make pairs
  • Log in to kahoot.it or download the app
  • Enter the game pin

Today

Lecture

  • Warming up
  • P-values
  • Errors in inference
  • Cooling down

Tutorial

  • Assignment 3

Warming Up

Where are we?

We have now learned …

  • when and how to use the sum and product rules (Monday),
  • what a probability distribution is (e.g., the quincunx), and …
  • how to calculate probabilities based on the binomial distribution (Monday),
  • what a distribution of sample means looks like (Wednesday),
  • why the central limit theorem is so useful (Wednesday),
  • and what a confidence interval means (Wednesday).

Today we'll learn …

  • how we can test hypotheses (p-values)
  • and why power is so important

What is validity?

"Validity is the extent to which a concept, conclusion or measurement is well-founded and corresponds accurately to the real world."

  • How can we make such a conclusion: p-values
  • How can we increase the validity: power

Recap: Where are we?

Recap: Calculate with binomial distributions

What is the probability of observing a certain value (discrete distribution)?

Example:

  • Last year on average 70% of the exam questions were answered correctly
  • How many of you (N=30) will score higher than an 8?
  • Exam is 10 questions
  • assumption?

Recap: Calculate with binomial distributions

We can simulate:

x <- rbinom(30, 10, .70) # we can simulate!
table(x > 8)
## 
## FALSE  TRUE 
##    27     3
x <- rbinom(30, 10, .70) # run it again: a different outcome
table(x > 8)
## 
## FALSE  TRUE 
##    26     4

Recap: Calculate with binomial distributions

But we can also calculate the true probability: \(P(s > 8) = 1 - P(s \leq 8)\)

1 - pbinom(8, 10, .7) # 1 - P(s <= 8), via the CDF
## [1] 0.1493083
dbinom(9, 10, .7) + dbinom(10, 10, .7) # P(s = 9) + P(s = 10)
## [1] 0.1493083

Recap: Calculate with binomial distributions

But we can also calculate the expected number of students scoring higher than 8: \(30 \cdot P(s > 8) = 30 \cdot (1 - P(s \leq 8))\)

(1 - pbinom(8, 10, .7)) * 30 # expected count out of N = 30
## [1] 4.47925

Recap: Distribution of sample means (sampling distribution)

Example Grades:

  • You all take the exam
  • And we calculate the mean score: \(\bar{x} = 7.51\)
  • What is the probability that we observe an \(\bar{x}\) within some range of values (continuous distribution), if we repeat our experiment? (Wednesday)
  • What is the probability that we observed an \(\bar{x}\) of 7.51 or higher?

Recap: Distribution of sample means

We can simulate:

m <- numeric()
for(i in 1:500000) {            # repeat the experiment many times
  grades <- rbinom(30, 10, .70) # 30 students, 10 questions, p = .70
  m[i] <- mean(grades)          # store the sample mean
}
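
From the simulated sampling distribution we can estimate the tail probability directly (compare it with the exact calculation on the next slide):

mean(m >= 7.51) # proportion of simulated means at or above 7.51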

Recap: Distribution of sample means

We can also calculate (thanks to the central limit theorem):

n <- 10; p <- .7
expected_mean <- n * p # (check wikipedia!)
expected_var <- n * p * (1-p) # (check wikipedia!)
expected_sd <- sqrt(expected_var)
expected_se <- expected_sd / sqrt(30)
1 - pnorm(7.51, expected_mean, expected_se) # normal density!
## [1] 0.02695128

Recap: Distribution of sample means

What is the probability that we observe an \(\bar{x}\) within some range of values (continuous distribution), if we repeat our experiment?

What was our (hidden) hypothesis / assumption?

Where is it hidden in our simulation?

m <- numeric()
for(i in 1:1000) { 
  cijfers <- rbinom(30, 10, .7) # "cijfers" = grades
  m[i] <- mean(cijfers) 
} 

P-Value…

P-Value

The probability of a certain \(\bar{x}\), or a more extreme one (area), given some hypothesis, if we were to keep repeating the experiment.

Definition p-value:

  • The probability of observing some test statistic, or a more extreme one, given some null hypothesis, if we were to keep repeating the experiment.
  • Given the null hypothesis!
  • \(p(\bar{x} \ or \ more \ extreme \ | \ H0)\) or \(p(Data \ | \ H0)\)

p-value vs confidence interval:

  • p-value compares the test statistic with an explicit hypothesis
  • CI quantifies the uncertainty in the estimation of some test statistic (independent of any hypothesis)

Example Grades

So we collected data and calculated:

1 - pnorm(7.51, expected_mean, expected_se) # expected = some hypothesis
## [1] 0.02695128

This is our p-value! What would we conclude?

Some test-statistics

  • Test statistics until now: \(\bar{x}\) and \(sd\).
  • But there are many more:
  • z-values (normally distributed; see book) or t-values (t-distribution):

\[t = \frac{\bar{x} - \mu_{0}}{\frac{\sigma}{\sqrt{n}}}\]

  • Just more of the same: we only standardize!
  • Distance between data and null hypothesis,
  • accounting for the variance in the data (our estimate of the population variance).

See: http://en.wikipedia.org/wiki/Student's_t-test

Test-statistic

m1 <- mean(rnorm(10, 3, 2))       # sample mean; population mean 3, sd 2
m2 <- mean(rnorm(10, -2, 5))      # sample mean; population mean -2, sd 5
t1 <- (m1 - 3) / (2 / sqrt(10))   # H0 = population mean of 3
t2 <- (m2 - -2) / (5 / sqrt(10))  # H0 = population mean of -2

Test-statistic

\[ t = \frac{\bar{x} - \mu_{0}}{\frac{\sigma} {\sqrt{n}}} \]

We achieve that:

  • the distribution of t-values depends only on the standardized difference between the data (\(\bar{x}\)) and our hypothesis (\(\mu_{0}\))
  • What is the effect of the sample size? Why?

Example: Ritalin

Example: Ritalin

Side effects:

nervousness, agitation, anxiety, sleep problems (insomnia), stomach pain, loss of appetite, weight loss, nausea, vomiting, dizziness, palpitations, headache, vision problems, increased heart rate, increased blood pressure, sweating, skin rash, psychosis, and numbness, tingling, or cold feeling in your hands or feet.

https://www.rxlist.com/ritalin-side-effects-drug-center.htm

Example: Ritalin

  • Theory: Ritalin has a positive effect on study performance (also for students without ADHD)
  • Prediction:
    • Use of ritalin increases your exam grade
    • Condition A: Ritalin & Condition B: Placebo
    • Average score of A will be higher than the average score of B
  • Experiment: double blind, 100 students (2 * 50).
  • Null hypothesis?

Unpaired two-sample t-test

Test statistic:

\[ t = \frac{(\bar{x}_{1} - \bar{x}_{2}) - (\mu_{1} - \mu_{2})}{\sqrt{s_{1}^{2} / n_{1} + s_{2}^{2} / n_{2}}} \]

\[ df = n_{1} + n_{2} - 2 \]

t.test(scoresA, scoresB)

http://en.wikipedia.org/wiki/Student's_t-test

Distribution of t-values given H0:

No effect:

  • \(\mu_{1} = \mu_{2}\)
  • \(\mu_{1} - \mu_{2} = 0\)
s1 <- rnorm(50, 7, 1); s2 <- rnorm(50, 7, 1) # both samples from the same population
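
A minimal sketch of how to simulate that distribution of t-values (reusing the line above: 50 students per condition, drawn from one population):

t_vals <- numeric()
for(i in 1:10000) {
  s1 <- rnorm(50, 7, 1) # H0 is true: both conditions share one population
  s2 <- rnorm(50, 7, 1)
  t_vals[i] <- t.test(s1, s2)$statistic
}
hist(t_vals) # centered on 0; follows the t-distribution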

P-values

The probability of observing some test statistic, or a more extreme one, given some null hypothesis, if we were to keep repeating the experiment

[Figure: distribution of the test statistic under \(H_0\), with the marked tail areas 0.048 and 0.952]

One or two sided testing?

  • \(\mu_{1} > \mu_{2}\)
  • What is the advantage?

Statistical Inference:

Example Ritalin:

  • What can we conclude if the p-value is .001?
  • The probability of observing some test statistic, or a more extreme one, given some null hypothesis, if we were to keep repeating the experiment

If \(H_{0}\) is true, is the probability of observing a p-value between .01-.05 bigger/the same/smaller than the probability of a p-value between .06-.10?

Distribution of p-values:

Given \(H_{0}\), the distribution of p-values is uniform.
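
A minimal simulation sketch of the \(H_{0}\)-is-true case (hypothetical setup: repeated two-sample t-tests on data with no true effect):

p_vals <- replicate(10000, t.test(rnorm(50, 7, 1), rnorm(50, 7, 1))$p.value)
hist(p_vals)       # roughly flat: uniform between 0 and 1
mean(p_vals < .05) # close to .05, as alpha promises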

Assignment: What does this distribution look like if \(H_{0}\) is false?

Dig deeper

Errors in Inference

Errors in inference

"Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling."

"Validity is the extent to which a concept, conclusion or measurement is well-founded and corresponds accurately to the real world."


Can we make invalid inferences?

Can we prevent that?

Statistical errors

Ritalin example. The figure below shows the null hypothesis (Ritalin does not affect study success) and \(\alpha\) (the significance level).

  • If the sample mean falls in the dark grey area, we presume it does not originate from this population (the null hypothesis). But what if we had an extreme sample and it actually does?
  • Then we speak of an error.

Type I error

In this case a type I error: incorrectly rejecting the null hypothesis.

How does this error affect our understanding of Ritalin?

  • We presume an effect of Ritalin that in reality is not there.
  • Write as \(P\left(reject \: H_0 \mid H_0 \: is \: true\right)\).
  • Probability lecture! What do we call such a probability?

Type I error

The type I error is just one possible outcome. These are all of them.

Decision / reality | \(H_0\) is true (Ritalin has no effect) | \(H_0\) is not true (Ritalin has an effect)
Reject \(H_0\) | \(P\left(reject \mid H_0\right) =\) ? | \(P\left(reject \mid \neg H_0\right) = 1-\beta\)
Don't reject \(H_0\) | \(P\left(\neg reject \mid H_0\right) = 1-\alpha\) | \(P\left(\neg reject \mid \neg H_0\right) = \beta\)

  • In which cell do we put the type I error? Why?
  • What do we put at the question mark? Why?

Type I error

\(\alpha\) (determined by ourselves) thus determines the probability of a type I error: incorrectly rejecting the null hypothesis.

Decision / reality | \(H_0\) is true (Ritalin has no effect) | \(H_0\) is not true (Ritalin has an effect)
Reject \(H_0\) | \(P\left(reject \mid H_0\right) = \alpha\) (type I error) | \(P\left(reject \mid \neg H_0\right) = 1-\beta\) (?)
Don't reject \(H_0\) | \(P\left(\neg reject \mid H_0\right) = 1-\alpha\) | \(P\left(\neg reject \mid \neg H_0\right) = \beta\) (?)

  • Whether we speak of \(P(reject \mid H_0)\), \(\alpha\), or the type I error, we speak of the same thing!

Family-wise error rate

With \(\alpha=.05\), what happens if we perform multiple tests? Say 5?

We can calculate it! (probability lecture, remember?)

alpha <- .05
n <- 5
fwer <- 1 - (1 - alpha)^n # P(at least one type I error)
fwer
## [1] 0.2262191

  • P(at least 1 error) = 1 − P(no errors at all)
  • independent probabilities, so we can multiply

Multiple comparisons problem!

Type II error

Our table shows we can make another error. Imagine our sample mean does originate from a different population than the null hypothesis assumes (students that take Ritalin perform either better or worse).

  • What type of error can we make?
  • Can you show it? Hint: \(P(\neg reject \mid \neg H_0)\)

Type II error

We speak of a type II error: incorrectly retaining (not rejecting) the null hypothesis.

  • We incorrectly presume that Ritalin has no effect.

Error prevention

How can we prevent statistical errors?

  • An obvious choice to minimize the chance of a type I error is to decrease \(\alpha\), since we set it ourselves.
  • Why does this only shift the problem?
  • If Ritalin does have an effect, we increase the probability of a type II error (see the figure and the sketch below).
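
A sketch of this trade-off with power.t.test (the effect size and sample size are hypothetical):

# power of a two-sample t-test (hypothetical: 50 per group, difference .3, sd 1)
# at increasingly strict significance levels: power drops as alpha shrinks
sapply(c(.05, .01, .001), function(a)
  power.t.test(n = 50, delta = .3, sd = 1, sig.level = a)$power)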

Error prevention

Is decreasing \(\alpha\) always a bad idea? Why?

  • Drugs like Ritalin have side effects.
  • In some cases you want to be very confident of a possible effect before you accept it. For that reason, in medical research a smaller \(\alpha\) is often used.
  • What can we do to prevent statistical errors?

Power

What happens if the standard error is decreased?

  • How can we decrease the standard error?
  • Yes, a bigger sample! (see the sketch below)
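
The same power.t.test sketch, now varying the sample size (hypothetical effect size of .3 again):

# power grows quickly with the sample size per group
sapply(c(10, 50, 200), function(n)
  power.t.test(n = n, delta = .3, sd = 1, sig.level = .05)$power)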

Power

We speak of increasing the power: the probability of correctly rejecting the null hypothesis.

Decision / reality | \(H_0\) is true (Ritalin has no effect) | \(H_0\) is not true (Ritalin has an effect)
Reject \(H_0\) | \(P\left(reject \mid H_0\right) = \alpha\) (type I error) | \(P\left(reject \mid \neg H_0\right) = 1-\beta\) (power)
Don't reject \(H_0\) | \(P\left(\neg reject \mid H_0\right) = 1-\alpha\) | \(P\left(\neg reject \mid \neg H_0\right) = \beta\) (type II error)

Power

Imagine what happens if you increase the power a lot.

  • Do you expect to find a significant effect more often or less often?
  • Hint: power = \(P\left(reject \mid \neg H_0\right)\)

Effect size

If the power is large enough, you can detect even a minuscule difference.

  • Example: students that use Ritalin are found to score on average .05 points higher on the exam (see the sketch below).
  • This effect is tiny. Can you call it significant?
  • Yes: the samples are huge and thus your measurements are very accurate; the effect indeed exists!
  • But is it of any clinical significance? And does it make up for the side effects and costs?
  • Therefore, effect size is of critical importance.
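
A sketch of that situation (all numbers hypothetical): with enormous samples, even a .05-point difference comes out statistically significant.

set.seed(1)
placebo <- rnorm(1e5, 7.00, 2)
ritalin <- rnorm(1e5, 7.05, 2)    # hypothetical true effect: .05 grade points
t.test(ritalin, placebo)$p.value  # tiny p-value: statistically significant
mean(ritalin) - mean(placebo)     # but the effect itself stays tiny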

Controlling for multiple comparisons

Correction

  • Bonferroni correction: test each p-value against \(\frac{\alpha}{m}\), with \(m\) the number of tests (see the sketch below)
  • Conservative: increases the type II error / decreases the power
  • Better alternatives exist
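
In R the correction can be applied with p.adjust (a sketch with made-up p-values):

p <- c(.001, .012, .040, .049, .300) # hypothetical p-values from m = 5 tests
p.adjust(p, method = "bonferroni")   # multiplies each p by m (capped at 1)
# equivalent to comparing each raw p against alpha / m = .05 / 5 = .01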

Better test

  • t-test 1: PML students vs SP students; t-test 2: PML students vs BC students; t-test 3: SP students vs BC students
  • ANOVA: PML students vs SP students vs BC students

Dig deeper

Cooling Down

Where are we?

Today we learned …

  • how we can test our hypotheses, building on Monday's and Wednesday's lectures,
  • what the p-value is,
  • what type I and type II errors are (and how to control them),
  • and why power and effect size make a big difference.

Next we'll learn …

  • how to apply everything we learned to fit linear regression models!

Where are we?

descriptive statistics inferential statistics probability sum rule mutually exclusive events independence product rule conditional probability venn diagram discrete probability distribution continuous probability distribution binomial distribution quincunx binomial theorem normal distribution gamma distribution central limit theorem sample mean sampling distribution standard deviation standard error one-sample t-test confidence interval 1.96 null hypothesis p-value p-value distribution test statistic t-value z-value student's t-test two-sample t-test one- and two-tailed tests statistical significance type 1 and type 2 errors significance level family-wise error rate multiple comparisons problem effect size statistical power prediction / association least squares linear regression linear equation regression coefficients polynomial regression logistic regression explained variation errors and residuals model selection occam’s razor saturated model mean squared prediction error bias-variance trade-off overfitting / underfitting adjusted r-squared cross-validation information criterion statistical inference frequentist inference bayesian inference parametric statistics nonparametric statistics multicollinearity heteroscedasticity

Assignment 3

See syllabus and assignment.

Make sure to explain …

  • in your own words
  • what you do,
  • how you do it,
  • and why you do it.

Deadline: Saturday 17:00.

Accelerate your learning

Space your practice

  • cramming before the exam is bad for retention
  • spaced practice gives one of the most robust effects in learning
  • exploit the spacing effect and redo past mathematics, programming, and statistics exercises once in a while