readings for today

one short note

  • these slides contain a lot of code
  • as well as some interactive apps
  • they were designed for you to go over at home (repeatedly)
    • as always, if you have any questions, just drop me an email

two main goals

  1. get a feeling for statistics as a discipline
    • with an idiosyncratic history
    • where heated debate was and is not uncommon

  2. get to know differences between frequentist and Bayesian statistics
    • on a theoretical level
    • on a practical / computational level

comparison

  Classical Statistics    Bayesian Statistics
  Ad-hoc                  Axiomatic
  Incoherent              Coherent
  Paradoxical             Intuitive
  Irrational              Rational
  Ugly                    Pretty
  Irrelevant              Relevant
  what's taught           what's not taught

  (tongue in cheek; borrowed from EJ Wagenmakers)

short note on history

  • the field of probability emerged with the treatment of "games of chance"
  • probability was initially viewed as degree of belief (Bayes, Laplace)
  • but was gradually replaced by concerns of frequency (Venn, Pearson)
  • because "degree of belief" was believed to be unscientific (Fisher, Neyman)
  • during the 20th century, frequentist and Bayesian statisticians were bitter enemies


short note on history

  • still today, the dominating approach to statistics is frequentist
  • especially so in the social and life sciences (e.g. psychology, biology, neuroscience)
  • what is often not realized is that the dominant approach is an incoherent hybrid
    • mixing ideas from Ronald Fisher (p value, evidential interpretation)
    • and the Neyman-Pearson framework (CIs, power, error control)


consequences

consequences

  • many statistical concepts are misunderstood by the majority of researchers
    • Oakes (1986): p values are misinterpreted all the time
    • Haller & Kraus (2002): also true for senior researchers and statistics teachers
    • Hoekstra et al. (2014): the same holds true for confidence intervals

consequences

  • my favourite misconceptions are:
    • p > .05 means that there is no difference
    • small p (e.g. \(p = .0001\)) means that there is a large difference
    • if we control at \(\alpha = .05\), that means that only 5% of research is false
    • high statistical power leads to highly informative data

outline for today

  • what is probability?
  • likelihood and statistical models
  • key concepts in frequentist statistics
    • estimators, consistency, bias
    • maximum likelihood, sampling distribution
    • significance testing and interval estimation
  • fundamental problems of frequentist inference
    • unable to support the null hypothesis
    • overestimate evidence against the null hypothesis

outline for today

  • likelihood ratios
  • Bayesian extension
    • Bayes factor for model selection
  • practical Bayesian problems
    • (Markov chain) Monte Carlo methods
  • apply all those concepts to the case of a t-test
    • and show why 30 years ago, nobody was a Bayesian

what is probability?

frequentist definition

  • frequentists view probability as long-run relative frequency:

\[ P(X = x) = \lim_{n \to \infty} \frac{n_x}{n}, \quad \text{where } n_x \text{ is the number of trials on which } X = x \]

  • to say that \(P(X = \text{heads}) = 0.5\) we need an infinite number of coin tosses \(n\)
  • \(P(X = \text{heads}) = 0.5\) then denotes the limiting proportion of heads
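
  • a quick simulation of this idea (the fair coin with prob = 0.5 is just an illustrative value):

set.seed(1)
n <- 10000
flips <- rbinom(n, size = 1, prob = 0.5)    # 1 = heads, 0 = tails
running_prop <- cumsum(flips) / seq_len(n)  # proportion of heads after each toss

plot(running_prop, type = 'l', ylim = c(0, 1),
     xlab = 'number of tosses', ylab = 'proportion of heads')
abline(h = 0.5, lty = 'dashed')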

frequentist definition

  • this definition carries some severe problems, because it assumes repeatability
  • frequentists thus cannot answer important questions:
    • what is the probability that the sea level will rise 5cm in the next 5 years?
    • what are the chances that Donald Trump will win the next election?
    • what is the probability that my hypothesis is true?
    • any event that has not yet occurred


Bayesian definition

  • probability as a (normative) measure of degree of belief
  • probability as an extension of logic to include inductive inferences
  • trick coin example: the coin has either two heads or two tails (we do not know which)
    • a Bayesian can still assign \(P(X = \text{heads}) = .5\) as a degree of belief

likelihood

statistical models

  • to give the world some structure, we use models
  • say I flip a coin to estimate its bias
    • what exactly do I mean by bias?
    • only becomes clear after specifying the functional relationship
    • of how a latent parameter relates to empirical observation
  • the statistical model, i.e. the likelihood, describes this functional relationship

coin flip model

  • in this simple case, I assume a Bernoulli model

\[ P(X = x|\theta) = \theta^x (1 - \theta)^{1 - x} \]

  • this describes one single coin flip
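
  • a minimal check in R: dbinom with size = 1 is the Bernoulli distribution (\(\theta = .7\) is just an example value)

theta <- .7
dbinom(1, size = 1, prob = theta)  # P(X = 1 | theta) = theta
dbinom(0, size = 1, prob = theta)  # P(X = 0 | theta) = 1 - theta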

coin flip model

  • to model \(n\) coin flips, one assumes that they are independently and identically distributed, yielding

\[ \begin{align*} P(X = x^n|\theta) &= \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \\ P(X = x^n|\theta) &= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i} \end{align*} \]

coin flip model

  • in general, I am not interested in the order of the outcomes
  • instead, I am interested in modeling the number of successes:

\[ Y = \sum_{i=1}^n x_i \]

  • noting that there are: \[ {n \choose y} = \frac{n!}{y!(n - y)!} \]

  • possible ways of obtaining \(y\) successes yields the Binomial model

coin flip model

  • the binomial model thus relates the number of successes to the bias:

\[ \begin{align*} P(X = x^n|\theta) &= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i} \\ P(Y = y|\theta) &= {n \choose y} \theta^y (1 - \theta)^{n - y} \end{align*} \]

  • note that instead of the whole data vector \(X = \{0, 1, 1, 0, 0, \ldots\}\)
    • we use the statistic \(Y = \sum_{i=1}^n x_i\) to summarize the data
    • \(Y\) is called a sufficient statistic
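
  • as a quick check (using \(y = 18\), \(n = 24\), and an example value \(\theta = .6\)), the hand-written formula agrees with R's built-in dbinom

n <- 24; y <- 18; theta <- .6
choose(n, y) * theta^y * (1 - theta)^(n - y)
dbinom(y, size = n, prob = theta)  # same value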

estimators and all that

good estimators

  • an estimator is a function of the data, returning an estimate
  • estimators are scrutinized along several dimensions
    • bias
      • is the estimate off with respect to the true population parameter?
    • consistency
      • does the estimate converge to the true population parameter?
    • efficiency
      • how fast does the estimate converge to the true value?

bias

  • check if \[ \mathbb{E}(\hat X) = \theta^{\star} \]

consistency

  • check if \[ \lim_{n \to \infty} \mathbb{Var}(\hat X) = 0 \]
    • together with vanishing bias, this implies that the estimate converges to the true value

efficiency

  • assume a normal model
    • both the median and the mean are unbiased, consistent estimators of \(\mu^{\star}\)
  • the mean is preferred because it is the more efficient estimator
    • that is, the mean comes closer to \(\mu^{\star}\) more quickly than the median
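
  • a small simulation sketch of this claim (\(\mu^{\star} = 0\), \(\sigma = 1\), \(n = 50\) are arbitrary choices); for the normal model, the sampling variance of the median exceeds that of the mean by a factor of roughly \(\pi/2\)

set.seed(1)
n <- 50
sims <- replicate(5000, {
  x <- rnorm(n, mean = 0, sd = 1)
  c(mean = mean(x), median = median(x))
})

apply(sims, 1, var)  # the mean has the smaller sampling variance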

bias, consistency, efficiency

  • these are popular dimensions to assess the properties of estimators
  • note, however, that this is rather ad-hoc
  • in fact, we do not always prefer unbiased estimators against biased estimators
    • in hierarchical models, the unbiased estimator for the variance can yield negative values
  • because this is a rather unprincipled approach, "paradoxes" lurk around
    • see for example Efron & Morris (1977) on Stein's paradox

maximum likelihood

  • most popular estimator is the maximum likelihood estimator (Fisher, 1922; Stigler, 2007)
  • once we have data and specified the likelihood, we can see how likely the data is given certain parameter values

  • assume coin flip data: \(\{0, 0, 1, 1, 0, 0, 1, 1\}\)

\[ f(Y = 4|n = 8, \theta = .4) = {8 \choose 4} .4^4 (1 - .4)^4 = 0.232 \\ f(Y = 4|n = 8, \theta = .7) = {8 \choose 4} .7^4 (1 - .7)^4 = 0.136 \]

  • let's look at the whole likelihood curve

likelihood curve
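
  • a few lines of R to draw this curve yourself (binomial likelihood for \(y = 4\) heads in \(n = 8\) flips):

theta <- seq(0, 1, length.out = 200)
likelihood <- dbinom(4, size = 8, prob = theta)

plot(theta, likelihood, type = 'l',
     xlab = expression(theta), ylab = 'likelihood')
abline(v = 4 / 8, lty = 'dashed')  # the maximum lies at y / n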

maximum likelihood

  • the MLE is the parameter value that maximizes the likelihood
  • in the binomial case, it can be easily shown that:

\[ \begin{align*} \log f(y|\theta) &= \log {n \choose y} + y \log(\theta) + (n - y) \log(1 - \theta) \\ \frac{d}{d\theta}\log f(y|\theta) &= \frac{y}{\theta} - \frac{n - y}{1 - \theta} \\ 0 &= \frac{y}{\theta} - \frac{n - y}{1 - \theta} \\ n\theta - y\theta &= y - y\theta \\ \hat\theta &= \frac{y}{n} \end{align*} \]
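
  • we can confirm this numerically, e.g. by maximizing the log-likelihood with optimize (a quick sketch for the 8-flip example):

loglik <- function(theta, y = 4, n = 8) dbinom(y, n, theta, log = TRUE)
optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum  # close to 4/8 = .5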

frequentist uncertainty

interpretation of the likelihood

  • frequentist statistics can be viewed as a data-generating paradigm
  • once we have estimated the maximum likelihood value, we plug it in:

\[ f(Y = y|\theta = .5) = {n \choose y} .5^y (1 - .5)^{n - y} \]

  • using this relation, we generate data
    • we simulate exactly 8 coin flips, and compute \(\theta_{mle} = \frac{y}{n}\) again

sampling distribution

  • we need this because our concept of probability affords repeatability
  • for every generated data set, we estimate the maximum likelihood again
    • thus we get a distribution of the maximum likelihood estimates
    • this we call the sampling distribution of the estimator (here the sample proportion, i.e. a mean)
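
  • a minimal sketch of this generate-and-re-estimate loop for the 8-flip example (plugging in \(\theta = .5\)):

set.seed(1)
theta_mle <- replicate(10000, {
  y <- rbinom(1, size = 8, prob = .5)  # generate a new set of 8 flips
  y / 8                                # re-estimate theta
})

hist(theta_mle, col = 'grey50', main = 'sampling distribution of the MLE')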

sampling distribution: analytical

  • in the binomial case, it is easy to derive the sampling distribution
  • we note that our estimate is:

\[ \hat X = \frac{1}{n} \sum_{i=1}^n X_i = \frac{1}{n} Y \]

  • so the estimate \(\hat X\) follows a scaled binomial distribution: \[ \hat X \sim \frac{1}{n} \mathrm{Bin}(\hat \theta, n) \]

sampling distribution: asymptotics

  • in general, the sampling distribution might be very hard to derive

  • when using the mean as estimator, the central limit theorem applies:
    • the average (or sum) of a large number of i.i.d. variables will be approximately normal regardless of the underlying distribution (but needs finite variance)

  • for more, see the optional reading on Fisher information

CLT: demonstration

n <- 100
x <- rbeta(n, 1, 100)    # n draws from a heavily right-skewed distribution
hist(x, col = 'grey50')

CLT: demonstration (bootstrap)

boot <- function(y, times = 1000) {
    # resample the data with replacement and compute the mean each time
    sapply(seq_len(times), function(i) {
        yy <- sample(y, replace = TRUE)
        mean(yy)
    })
}

CLT: demonstration (bootstrap)

xx <- boot(x)
hist(xx, col = 'grey50')

frequentist inference

key concepts

  • we have now seen how frequentist statistics works
  • later we will apply these concepts to the case of a t-test
  • let's talk about
    • the \(p\) value
    • confidence intervals
    • \(\alpha\) and \(\beta\)
    • statistical power

\(p\) value

  • popularized by R.A. Fisher as a measure of degree of evidence against \(H_0\)
  • for example, \(p = .02\) is considered stronger evidence than \(p = .04\)
  • in Fisher's view, the researcher only specifies \(H_0\)

\(p\) value

\(p\) value: example

  • test if the coin is biased, \(H_0: \theta = .5\)
  • flip it \(n = 24\) times, observe \(y = 18\) heads
  • we choose \(\hat \theta = \frac{18}{24}\) as our estimate
  • however, we need the sampling distribution of \(\hat \theta\) under \(H_0\), which is

\[ \hat \theta_{H_0} \sim \frac{1}{n} \mathrm{Bin}(0.5, n) \]

\(p\) value: example

n <- 24
heads <- 18
tails <- n - heads

# probability of obtaining at least as extreme data
sum(dbinom(heads:n, n, .5)) + sum(dbinom(0:tails, n, .5)) # two-tailed p value
## [1] 0.02265584

confidence intervals

  • when I prepared this, I thought this would be easy
  • alas, it is not! wikipedia lists 8 (!!) ways to compute CIs
    • that's insane
    • funnily enough, one very good method is Bayesian
  • wikipedia link

confidence intervals

  • again, it does not mean what you think it means
  • the standard deviation of the sampling distribution is called the standard error
  • approximating the sampling distribution by a Gaussian, the standard error is \(\frac{\hat \sigma}{\sqrt{n}}\)
  • a 95% CI is given by:

\[ \hat \theta \pm 1.96 \frac{\hat \sigma}{\sqrt{n}} \]

confidence intervals

  • if one takes repeated samples and computes a CI each time, 95% of those will contain the true \(\theta\)
  • for the specific data set collected, the true \(\theta\) either is or is not in the CI

confidence intervals

simulate.cis <- function(y, n, times = 100) {
    p <- y / n  # treat the observed proportion as the true bias
    cis <- matrix(NA, nrow = times, ncol = 2)
    
    for (i in 1:times) {
        # draw a new sample of size n and compute a Wald-type 95% CI
        y <- sample(c(1, 0), n, replace = TRUE, prob = c(p, 1-p))
        
        p.hat <- mean(y)
        se <- sd(y) / sqrt(n)
        
        cis[i, 1] <- p.hat - 1.96 * se
        cis[i, 2] <- p.hat + 1.96 * se
    }
    
    # flag whether each interval contains the true bias p
    within <- apply(cis, 1, function(row) row[1] < p && row[2] > p)
    cbind(cis, within)
}

times <- 100
cis <- simulate.cis(18, 24, times = times)

confidence intervals

plot.cis <- function(cis, times) {
    # set up an empty plotting canvas
    plot(0:times, seq(0, 1, 1 / times), type = 'n', xlab = 'i-th repeated sample',
         ylab = 'bias', main = '95% CIs for bias estimate')
    
    abline(h = 18 / 24, lwd = 3)  # the value treated as the true bias
    
    # draw each interval; intervals missing the true value are shown in red
    arrows(x0 = seq_len(times), y0 = cis[, 1], x1 = seq_len(times), y1 = cis[, 2],
           length = 0, lwd = 1.2,
           col = ifelse(cis[, 3] == 1, 'black', 'red'))
}

confidence intervals

\(p\) value & CI: example

binom.test(heads, n, .5)
## 
##  Exact binomial test
## 
## data:  heads and n
## number of successes = 18, number of trials = 24, p-value = 0.02266
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5328872 0.9022696
## sample estimates:
## probability of success 
##                   0.75

\(p\) value

  • is based upon the following logic

  • Premise 1: If \(H_0\), then \(\textbf{y}\) is very unlikely.
  • Premise 2: \(\textbf{y}\).
  • Conclusion: \(H_0\) is very unlikely.

\(p\) value

  • if the \(p\) value is small, say \(.03\), this means that
    • if \(H_0\) is true, the probability of obtaining at least as extreme data as the one observed is only 3%
  • so either something rare has occurred, or \(H_0\) is false

\(p\) value: flawed logic

  • Premise 1: If an individual is a man, he is unlikely to be the Pope.
  • Premise 2: Francis is the Pope.
  • Conclusion: Francis is probably not a man.

bias against \(H_0\)

quick note

  • \(p(H|\textbf{y}) \neq p(\textbf{y}|H)\)

  • \(p(\text{dead}|\text{head bitten off by a shark}) = 1\)
  • \(p(\text{head bitten off by a shark}|\text{dead}) < .0001\)

\(\alpha\) and \(\beta\)

  • Jerzy Neyman and Egon Pearson introduced the alternative hypothesis
  • now we can make two errors
    • reject \(H_0\), even though it is true (false positive)
    • keep \(H_0\), even though it is false (false negative)
  • \(\alpha\) controls the false positive rate (usually \(\alpha = .05\))
  • \(\beta\) controls the false negative rate (usually not specified)
  • the decision is binary: either \(p \leq \alpha\) (reject \(H_0\)) or \(p > \alpha\) (retain \(H_0\))

statistical power

  • is \(1 - \beta\)
  • is a function of \(\alpha\), the sample size, and the effect size (see the code sketch after this list)
  • gives the probability of rejecting \(H_0\) when it is really false
  • say \(\beta = .2\); this does not mean that
    • you have an 80% chance of correctly rejecting \(H_0\) in this experiment
  • again, the frequentist definition of probability bites us

  • averaged over all possible experiments, there's an 80% chance
    • for the specific experiment at hand, power tells us nothing
    • power is a pre-experimental concept, useful for planning

  • great visualization
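
  • for planning, such a calculation is built into R; a minimal sketch with assumed values (n = 40 per group and an effect of d = 0.5 are illustrative choices, not taken from the slides):

# power of a two-sample t-test at alpha = .05,
# assuming n = 40 per group and a true effect of d = 0.5
power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = .05, type = 'two.sample')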

\(p\) values kill – again!

  • \(p\) value cannot be interpreted in isolation
  • it depends on the sample size, \(n\), thus on statistical power

  • this is often overlooked and, paired with misconceptions, leads to grave errors
    • do an experiment, find \(p > .05\)
    • two possible reasons: either \(H_0\) is true, or the data are uninformative
    • we cannot distinguish between those cases

\(p\) values kill – again!

  • traffic experiments (70s)
    • change something, record number of deaths
    • failed to find an increase in deaths, \(p > .05\)
    • went ahead and implemented the changes
  • turned out the experiment was just underpowered; people died

  • Bayesian inference can indicate when data is uninformative

Bayesian hypothesis testing

likelihood ratios

  • one main issue is that (Fisherian) significance testing does not specify an alternative
  • use likelihood ratios to quantify evidence (inherently comparative)
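
  • a quick example for the coin data from before (18 heads in 24 flips), comparing two point hypotheses, \(\theta = .75\) against \(\theta = .5\):

# likelihood ratio for theta = .75 versus theta = .5, given y = 18, n = 24
dbinom(18, 24, .75) / dbinom(18, 24, .5)  # roughly 23, favouring theta = .75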

likelihood ratios

likelihood ratios

  • one important issue
    • have to specify the alternative as a point
    • this is too demanding

  • Bayes solution
    • specify a distribution to quantify uncertainty

Bayesian updating

Bayesian hypothesis testing

  • is a generalization of the simple likelihood ratio test
  • instead of testing two point hypotheses, one tests composite hypotheses
  • remember the normalizing constant?

\[ p(\theta|\textbf{y}) = \frac{p(\textbf{y}|\theta)p(\theta)}{\int p(\textbf{y}|\theta)p(\theta)\mathrm{d}\theta} \]

  • hypothesis testing boils down to comparing normalizing constants

Bayesian hypothesis testing

  • how strongly can we believe in our hypothesis, given the collected data?
  • that is, we want \(p(H|\textbf{y})\)
  • a hypothesis can be instantiated in a model
  • \(H_0: \delta = 0\) corresponds to a model where the parameter \(\delta\) is fixed to 0
  • \(H_1: \delta \neq 0\) corresponds to a model where the parameter \(\delta\) is free to vary

Bayesian hypothesis testing

  • assume two models, \(M_0\) and \(M_1\), that instantiate \(H_0\) and \(H_1\), respectively
  • after we have collected data which model should we prefer?
  • the probability of each model is computed using Bayes' rule:

\[ \begin{align} p(M_0|\textbf{y}) &= \frac{p(\textbf{y}|M_0)p(M_0)}{p(\textbf{y})} \\[1ex] p(M_1|\textbf{y}) &= \frac{p(\textbf{y}|M_1)p(M_1)}{p(\textbf{y})} \end{align} \]

Bayesian hypothesis testing

\[ \begin{align} \frac{p(M_0|\textbf{y})}{p(M_1|\textbf{y})} &= \frac{\frac{p(\textbf{y}|M_0)p(M_0)}{p(\textbf{y})}} {\frac{p(\textbf{y}|M_1)p(M_1)}{p(\textbf{y})}} \\[2ex] \frac{p(M_0|\textbf{y})}{p(M_1|\textbf{y})} &= \frac{p(M_0)}{p(M_1)}\frac{p(\textbf{y}|M_0)}{p(\textbf{y}|M_1)} \\[2ex] \text{posterior odds} &= \text{prior odds} \times \text{Bayes factor} \end{align} \]

Bayesian hypothesis testing

  • the Bayes factor is an updating factor
  • it tells us how to update our prior beliefs about the hypotheses, given the data
  • note that there are two priors: over models \(M\), and over parameters \(\theta\)
  • the term \(p(\textbf{y}|M_0)\) is the marginal likelihood under model \(M_0\)
  • in other words, it is the probability of the data under model \(M_0\)
  • it quantifies how well this model predicts the data
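
  • a minimal sketch for the earlier coin example (18 heads in 24 flips); the uniform Beta(1, 1) prior on \(\theta\) under \(M_1\) is an illustrative assumption, not the prior used for the t-test later:

n <- 24; y <- 18

# marginal likelihood under M0: theta fixed at .5
m0 <- dbinom(y, n, .5)

# marginal likelihood under M1: average the likelihood over a Beta(1, 1) prior
m1 <- integrate(function(theta) dbinom(y, n, theta) * dbeta(theta, 1, 1), 0, 1)$value

m1 / m0  # Bayes factor BF10, roughly 5 in favour of M1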

t-test

t-test: effect of hats on creativity

  • does wearing hats increase creativity?
    • one control group (no hat)
    • one "treatment" group (hat)
    • subsequent creativity test
  • via this example we will again see the fundamental differences

t-test: dataset

set.seed(1774)

n <- 40
dat <- data.frame(hat = rnorm(n, mean = 50, sd = 5),
                  nohat = rnorm(n, mean = 50, sd = 5))
head(dat)
##        hat    nohat
## 1 53.98915 55.36670
## 2 48.84655 50.20971
## 3 52.84362 52.68426
## 4 45.78909 52.44228
## 5 50.34128 53.13781
## 6 49.68294 44.85191

t-test: dataset

t-test: likelihood

  • what is our statistical model?
  • for each group, we assume a normal model:

\[ p(X = x^n|\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left\{ \frac{-(x_i - \mu)^2}{2\sigma^2} \right\}} \\ p(X = x^n|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp{\left\{ \frac{-\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2} \right\}} \]

MLE

  • the maximum likelihood estimates for \(\mu\) and \(\sigma^2\) are:

\[ \hat x = \frac{1}{n} \sum_{i=1}^n x_i \\ \hat s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat x)^2 \]

  • \(\hat s^2\) is biased; the unbiased estimate is:

\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \hat x)^2 \]

t-statistic

  • we need a distance metric to quantify difference between the two groups
  • this will be the t-statistic:

\[ t = \frac{\hat x_1 - \hat x_2}{ \sqrt{ \frac{s_1^2}{n} + \frac{s_2^2}{n} } } \]

  • which quantifies the difference between the two sample means relative to its standard error

classical inference

  • now we have a sample difference, t
  • to test \(H_0: \mu_1 = \mu_2\), we need the sampling distribution of t under \(H_0\)
  • unsurprisingly, this turns out to follow a Student's t-distribution with \(df = 2n - 2\)
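
  • computing the statistic and the two-sided \(p\) value by hand (with equal group sizes, the formula above coincides with the pooled-variance version used by t.test below):

t_stat <- (mean(dat$hat) - mean(dat$nohat)) /
  sqrt(var(dat$hat) / n + var(dat$nohat) / n)

df <- 2 * n - 2
2 * pt(-abs(t_stat), df)  # matches the p value reported by t.test below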


classical inference

t.test(dat$hat, dat$nohat, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  dat$hat and dat$nohat
## t = 0.28052, df = 78, p-value = 0.7798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.735484  2.304783
## sample estimates:
## mean of x mean of y 
##  50.51049  50.22584

Bayesian inference

  • let's estimate \(\mu\) and \(\sigma^2\) for each group separately
  • for convenience, we assume \(p(\mu, \sigma^2) = p(\mu)p(\sigma^2)\)

\[ \mu \sim \mathcal{N}(\mu_0, \omega^2_0) \\ \sigma^2 \sim \mathcal{IG}\left(\frac{v_0}{2}, \frac{v_0 \sigma_0^2}{2}\right) \]

  • these are (conditionally) conjugate priors, but there is a serious problem

Bayesian inference: mean

  • given these priors, the conditional posterior of the mean is:

\[ p(\mu|\sigma^2, \textbf{y}) = \mathcal{N}(\mu_1, \omega_1^2) \]

\[ \begin{align*} \mu_1 &= \frac{ \frac{n}{\sigma^2} \hat x + \frac{1}{\omega_0^2} \mu_0}{\frac{n}{\sigma^2} + \frac{1}{\omega_0^2}} \\ \omega_1^2 &= \left(\frac{n}{\sigma^2} + \frac{1}{\omega_0^2}\right)^{-1} \end{align*} \]

Bayesian inference: variance

  • given these priors, the conditional posterior of the variance is:

\[ p(\sigma^2|\mu, \textbf{y}) = \mathcal{IG}(\frac{v_1}{2}, \frac{v_1 \sigma_1^2}{2}) \]

where

\[ \begin{align*} v_1 &= v_0 + n \\ v_1 \sigma_1^2 &= v_0 \sigma_0^2 + S \\ S &= \sum_{i=1}^n (x_i - \mu)^2 \end{align*} \]

  • see this nice visualisation for various distributions

Bayesian inference: problem

  • posterior distribution of the mean is conditional on the variance
  • the posterior distribution of the variance is conditional on the mean
  • thus we cannot simply apply the sum rule and integrate out \(\mu\) or \(\sigma^2\)
  • instead, we have to resort to computational methods

Monte Carlo methods

  • anything we want to know about a random variable \(\theta\) can be learned by sampling many times from its density, \(f(\theta)\)
  • might well be the most important principle in modern statistics
  • let's look at three examples:
    • computing the expectation of a Poisson random variable
    • computing marginal densities using numerical integration
    • solving our problem using Gibbs sampling

Monte Carlo I

  • suppose I have a Poisson random variable with \(\lambda = 3\)
  • I can calculate the expectation numerically, relying on the law of large numbers: \[ \begin{align*} X &\sim \mathrm{Pois}(\lambda) \\ \mathbb{E}(X) &= \lim\limits_{n \to \infty} \frac{1}{n} \sum_{i=1}^n x_i \end{align*} \]
x <- rpois(10000, lambda = 3)
mean(x)
## [1] 3.027

Monte Carlo I

  • yep; the expectation, first moment, or mean of the Poisson distribution is given by:

\[ \begin{align*} \mathbb{E}(X) &= \sum_{x=1}^{\infty} xp(x) \\ &= \sum_{x=1}^{\infty} x \frac{\lambda^x\mathrm{e}^{-\lambda}}{x!} \\ &= \mathrm{e}^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^x}{(x - 1)!} \\ &= \mathrm{e}^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k + 1}}{k!} \quad (k = x - 1) \\ &= \lambda\mathrm{e}^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \\ &= \lambda\mathrm{e}^{-\lambda}\mathrm{e}^{\lambda} = \lambda \end{align*} \]

Monte Carlo II

  • an ex-Gaussian distribution is often used to model reaction time data
  • it is the sum of an exponential and a Gaussian

\[ x | \eta \sim \mathcal{N}(\mu + \eta, \sigma^2) \\[1ex] \eta \sim \mathcal{Exp}(\nu) \]

Monte Carlo II

  • but we do not want \(p(x|\eta)\) (conditional probability)
  • we want \(p(x)\) (marginal probability)
  • analytically, one would apply the sum rule and "integrate out" \(\eta\)
  • that integral is tedious to do by hand, so we use Monte Carlo methods instead

Monte Carlo II

exgauss <- function(mu, sigma, rate, times = 5000) {
    
  res <- rep(NA, times)
  
  for (i in 1:times) {
    nu <- rexp(1, rate)                 # draw the exponential component
    res[i] <- rnorm(1, mu + nu, sigma)  # draw x conditional on it
  }
  
  res
}

Monte Carlo II

Monte Carlo III

  • in our creativity example, both conditional posteriors depend on each other
  • in a sense, we now have to do Monte Carlo integration for both of them
  • we do this iteratively, in each step drawing from each distribution, conditional on the current value of the other
  • this process constitutes a Markov chain, where the draws are not independent

Monte Carlo III

gibbs <- function(mu0, w0, v0, sigma0, y, n.iter = 8000, burnin = n.iter / 4) {
  n <- length(y)
  
  v1 <- v0 + n
  mu_post <- rep(mean(y), n.iter)
  var_post <- rep(var(y), n.iter)
  
  for (i in 2:n.iter) {
    sigma2 <- var_post[i-1]
    
    # draw the mean, conditional on the current value of the variance
    w1 <- 1 / (n / sigma2 + 1 / w0^2)
    mu1 <- ((n / sigma2) * mean(y) + (1 / w0^2) * mu0) / (n / sigma2 + 1 / w0^2)
    mu_post[i] <- rnorm(1, mu1, sqrt(w1))
    
    # draw the variance, conditional on the freshly drawn mean;
    # the vague prior adds v0 * sigma0 to the sum of squares (negligible here)
    S <- sum((y - mu_post[i])^2)
    var_post[i] <- 1 / rgamma(1, v1 / 2, (v0 * sigma0 + S) / 2)
  }
  
  cbind(mu_post, var_post)[-(1:burnin), ]
}

estimating posterior distributions

  • for parameter estimation we can specify uninformative priors:

\[ \begin{align} \omega_0 &= 1000 \\ v_0 &= .001 \\ \sigma_0^2 &= .001 \end{align} \]

estimating posterior distributions

plot_post <- function(samples, title) {
  samples <- sort(samples)
  
  hist(samples, breaks = 60, main = title)
  abline(v = samples[length(samples) * 0.025],
         lwd = 2, lty = 'dashed', col = 'green')
  
  abline(v = samples[length(samples) * 0.975],
         lwd = 2, lty = 'dashed', col = 'green')
}

hat <- gibbs(50, 1000, .001, .001, dat$hat)
nohat <- gibbs(50, 1000, .001, .001, dat$nohat)

estimating posterior distributions

Bayesian inference: effect size

  • effect size is a deterministic function of the group parameters:

\[ d = \frac{\mu_1 - \mu_2}{\sigma} \]

  • in order to obtain the posterior \(p(d|\textbf{y})\), we just plug in our posterior samples

Bayesian inference: effect size

par(mfrow = c(1, 1))
d <- (hat[, 1] - nohat[, 1]) / sqrt(hat[, 2])
plot_post(d, 'Cohen\'s d of wearing hats')

Bayesian inference

  • we have estimated something, but we are not sure it exists!
  • that is, hypothesis testing is logically prior to estimation
  • before we estimate, we should test!

Bayesian inference: Bayes factor

  • pioneered by Harold Jeffreys (Ly et al., in press)
  • using Liang et al. (2008), Rouder et al. (2009) developed Bayes factor for t-test
  • the formula is rather unwieldy, but involves just a single integral
  • for more information, see Rouder et al. (2009)

Bayesian inference: Bayes factor

library('BayesFactor') # written by Richard Morey


ttestBF(dat$hat, dat$nohat)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 0.2404271 ±0.02%
## 
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS

Bayesian inference: Bayes factor

  • given the prior:

\[ d \sim \mathrm{Cauchy}(\sqrt{2} / 2) \]

  • we obtain a Bayes factor of only \(4.16\) in favour of the null
  • that is, the data are roughly 4 times more likely under \(H_0\) than under \(H_1\)

Summary: Frequentist deficits

  • \(p\) values
    • do not quantify statistical evidence & depend on \(n\)
    • overestimate the evidence against \(H_0\)
    • do depend on the intentions of the researcher
    • do depend on data not observed
    • cannot support \(H_0\)
  • confidence intervals
    • are by no means a solution
    • check out Richard Morey's work on that
  • optional stopping is not allowed

Summary: Bayesian benefits

  • Bayes factor
    • can support \(H_0\) (e.g. no gender differences)
    • monitor evidence as the data come in (optional stopping)
    • does not depend on \(n\), and can tell you if data are uninformative
  • posterior distributions

  • see the optional reading for a nice paper on Bayesian benefits

possible projects
  • more mathematical / simulation
    • write a paper on the Bayesian bootstrap
    • write a paper on error probabilities of the Bayes factor
    • summarize the literature on Markov chain Monte Carlo techniques
    • summarize the literature on finding objective prior distributions
    • summarize the literature on ways to compute the Bayes factor
  • more foundational / philosophical
    • write a paper on the difference between frequentism and Bayes
    • write a paper on the detrimental influence of Popper on statistics
  • more programming
    • implement an R package to do (Bayesian) task X
  • more analyzing
    • meta-analyze a certain field of research using a Bayesian mixture model
    • re-analyze studies published in the Journal in Support of the Null Hypothesis

next week

  • recap of today, answering questions
  • (Bayesian) regression modeling in R
  • check the course website for more info