readings for today

one short note

  • these slides contain a lot of code
  • as well as some interactive apps
  • they were designed for you to go over at home (repeatedly)
    • as always, if you have any questions, just drop me an email

two main goals

  1. get a feeling for statistics as a discipline
    • with an idiosyncratic history
    • where heated debate was and is not uncommon

  2. get to know differences between frequentist and Bayesian statistics
    • on a theoretical level
    • on a practical / computational level

comparison

  Classical Statistics    Bayesian Statistics
  Ad-hoc                  Axiomatic
  Incoherent              Coherent
  Paradoxical             Intuitive
  Irrational              Rational
  Ugly                    Pretty
  Irrelevant              Relevant
  what's taught           what's not taught

  (tongue in cheek; borrowed from EJ Wagenmakers)

short note on history

  • the field of probability emerged with the treatment of "games of chance"
  • probability was initially viewed as degree of belief (Bayes, Laplace)
  • but was gradually replaced by concerns of frequency (Venn, Pearson)
  • because "degree of belief" was believed to be unscientific (Fisher, Neyman)
  • during the 20th century, frequentist and Bayesian statisticians were bitter enemies


short note on history

  • still today, the dominating approach to statistics is frequentist
  • especially so in the social and life sciences (e.g. psychology, biology, neuroscience)
  • what is often not realized is that the dominant approach is an incoherent hybrid
    • mixing ideas from Ronald Fisher (p value, evidential interpretation)
    • and the Neyman-Pearson framework (CIs, power, error control)


consequences

consequences

  • many statistical concepts are misunderstood by the majority of researchers
    • Oakes (1986): p values are misinterpreted all the time
    • Haller & Kraus (2002): also true for senior researchers and statistics teachers
    • Hoekstra et al. (2014): the same holds true for confidence intervals

consequences

  • my favourite misconceptions are:
    • p > .05 means that there is no difference
    • small p (e.g. \(p = .0001\)) means that there is a large difference
    • if we control at \(\alpha = .05\), that means that only 5% of research is false
    • high statistical power leads to highly informative data

outline for today

  • what is probability?
  • likelihood and statistical models
  • key concepts in frequentist statistics
    • estimators, consistency, bias
    • maximum likelihood, sampling distribution
    • significance testing and interval estimation
  • fundamental problems of frequentist inference
    • unable to support the null hypothesis
    • overestimate evidence against the null hypothesis

outline for today

  • likelihood ratios
  • Bayesian extension
    • Bayes factor for model selection
  • practical Bayesian problems
    • (Markov chain) Monte Carlo methods
  • apply all those concepts to the case of a t-test
    • and show why 30 years ago, nobody was a Bayesian

what is probability?

frequentist definition

  • frequentists view probability as long-run relative frequency:

\[ P(X = x) = \lim_{n \to \infty} \frac{n_x}{n}, \quad \text{where } n_x \text{ is the number of trials on which } X = x \]

  • to say that \(P(X = \text{heads}) = 0.5\) we need an infinite number of coin tosses \(n\)
  • \(P(X = \text{heads}) = 0.5\) then denotes the limiting proportion of heads
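
  • a quick simulation of this idea (the fair coin with prob = 0.5 is just an illustrative value):

set.seed(1)
n <- 10000
flips <- rbinom(n, size = 1, prob = 0.5)    # 1 = heads, 0 = tails
running_prop <- cumsum(flips) / seq_len(n)  # proportion of heads after each toss

plot(running_prop, type = 'l', ylim = c(0, 1),
     xlab = 'number of tosses', ylab = 'proportion of heads')
abline(h = 0.5, lty = 'dashed')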

frequentist definition

  • this definition carries some severe problems, because it assumes repeatability
  • frequentists thus cannot answer important questions:
    • what is the probability that the sea level will rise 5cm in the next 5 years?
    • what are the chances that Donald Trump will win the next election?
    • what is the probability that my hypothesis is true?
    • any event that has not yet occurred


Bayesian definition

  • probability as a (normative) measure of degree of belief
  • probability as an extension of logic to include inductive inferences
  • trick coin example: the coin has either two heads or two tails (we do not know which)
    • a Bayesian can still assign \(P(X = \text{heads}) = .5\) as a degree of belief

likelihood

statistical models

  • to give the world some structure, we use models
  • say I flip a coin to estimate its bias
    • what exactly do I mean by bias?
    • only becomes clear after specifying the functional relationship
    • of how a latent parameter relates to empirical observation
  • the statistical model, i.e. the likelihood, describes this functional relationship

coin flip model

  • in this simple case, I assume a Bernoulli model

\[ P(X = x|\theta) = \theta^x (1 - \theta)^{1 - x} \]

  • this describes one single coin flip
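
  • a minimal check in R: dbinom with size = 1 is the Bernoulli distribution (\(\theta = .7\) is just an example value)

theta <- .7
dbinom(1, size = 1, prob = theta)  # P(X = 1 | theta) = theta
dbinom(0, size = 1, prob = theta)  # P(X = 0 | theta) = 1 - theta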

coin flip model

  • to model \(n\) coin flips, one assumes that they are independently and identically distributed, yielding

\[ \begin{align*} P(X = x^n|\theta) &= \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \\ P(X = x^n|\theta) &= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i} \end{align*} \]

coin flip model

  • in general, I am not interested in the order of the outcomes
  • instead, I am interested in modeling the number of successes:

\[ Y = \sum_{i=1}^n x_i \]

  • noting that there are: \[ {n \choose y} = \frac{n!}{y!(n - y)!} \]

  • possible ways of obtaining \(y\) successes yields the Binomial model

coin flip model

  • the binomial model thus relates the number of successes to the bias:

\[ \begin{align*} P(X = x^n|\theta) &= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{n - \sum_{i=1}^n x_i} \\ P(Y = y|\theta) &= {n \choose y} \theta^y (1 - \theta)^{n - y} \end{align*} \]

  • note that instead of the whole data vector \(X = \{0, 1, 1, 0, 0, \ldots\}\)
    • we use the statistic \(Y = \sum_{i=1}^n x_i\) to summarize the data
    • \(Y\) is called a sufficient statistic
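
  • as a quick check (using \(y = 18\), \(n = 24\), and an example value \(\theta = .6\)), the hand-written formula agrees with R's built-in dbinom

n <- 24; y <- 18; theta <- .6
choose(n, y) * theta^y * (1 - theta)^(n - y)
dbinom(y, size = n, prob = theta)  # same value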

estimators and all that

good estimators

  • an estimator is a function of the data, returning an estimate
  • estimators are scrutinized along several dimensions
    • bias
      • is the estimate off with respect to the true population parameter?
    • consistency
      • does the estimate converge to the true population parameter?
    • efficiency
      • how fast does the estimate converge to the true value?

bias

  • check if \[ \mathbb{E}(\hat X) = \theta^{\star} \]

consistency

  • check if \[ \lim_{n \to \infty} \mathbb{Var}(\hat X) = 0 \]
    • together with vanishing bias, this implies that the estimate converges to the true value

efficiency

  • assume a normal model
    • both the median and the mean are unbiased, consistent estimators of \(\mu^{\star}\)
  • the mean is preferred because it is the more efficient estimator
    • that is, the mean comes closer to \(\mu^{\star}\) more quickly than the median
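
  • a small simulation sketch of this claim (\(\mu^{\star} = 0\), \(\sigma = 1\), \(n = 50\) are arbitrary choices); for the normal model, the sampling variance of the median exceeds that of the mean by a factor of roughly \(\pi/2\)

set.seed(1)
n <- 50
sims <- replicate(5000, {
  x <- rnorm(n, mean = 0, sd = 1)
  c(mean = mean(x), median = median(x))
})

apply(sims, 1, var)  # the mean has the smaller sampling variance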

bias, consistency, efficiency

  • these are popular dimensions to assess the properties of estimators
  • note, however, that this is rather ad-hoc
  • in fact, we do not always prefer unbiased estimators against biased estimators
    • in hierarchical models, the unbiased estimator for the variance can yield negative values
  • because this is a rather unprincipled approach, "paradoxes" lurk around
    • see for example Efron & Morris (1977) on Stein's paradox

maximum likelihood

  • most popular estimator is the maximum likelihood estimator (Fisher, 1922; Stigler, 2007)
  • once we have data and specified the likelihood, we can see how likely the data is given certain parameter values

  • assume coin flip data: \(\{0, 0, 1, 1, 0, 0, 1, 1\}\)

\[ f(Y = 4|n = 8, \theta = .4) = {8 \choose 4} .4^4 (1 - .4)^4 = 0.232 \\ f(Y = 4|n = 8, \theta = .7) = {8 \choose 4} .7^4 (1 - .7)^4 = 0.136 \]

  • let's look at the whole likelihood curve

likelihood curve
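
  • a few lines of R to draw this curve yourself (binomial likelihood for \(y = 4\) heads in \(n = 8\) flips):

theta <- seq(0, 1, length.out = 200)
likelihood <- dbinom(4, size = 8, prob = theta)

plot(theta, likelihood, type = 'l',
     xlab = expression(theta), ylab = 'likelihood')
abline(v = 4 / 8, lty = 'dashed')  # the maximum lies at y / n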

maximum likelihood

  • the MLE is the parameter value that maximizes the likelihood
  • in the binomial case, it can be easily shown that:

\[ \begin{align*} \log f(y|\theta) &= \log {n \choose y} + y \log(\theta) + (n - y) \log(1 - \theta) \\ \frac{d}{d\theta}\log f(y|\theta) &= \frac{y}{\theta} - \frac{n - y}{1 - \theta} \\ 0 &= \frac{y}{\theta} - \frac{n - y}{1 - \theta} \\ n\theta - y\theta &= y - y\theta \\ \hat\theta &= \frac{y}{n} \end{align*} \]
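
  • we can confirm this numerically, e.g. by maximizing the log-likelihood with optimize (a quick sketch for the 8-flip example):

loglik <- function(theta, y = 4, n = 8) dbinom(y, n, theta, log = TRUE)
optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum  # close to 4/8 = .5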

frequentist uncertainty

interpretation of the likelihood

  • frequentist statistics can be viewed as a data-generating paradigm
  • once we have estimated the maximum likelihood value, we plug it in:

\[ f(Y = y|\theta = .5) = {n \choose y} .5^y (1 - .5)^{n - y} \]

  • using this relation, we generate data
    • we simulate exactly 8 coin flips, and compute \(\theta_{mle} = \frac{y}{n}\) again

sampling distribution

  • we need this because our concept of probability affords repeatability
  • for every generated data set, we estimate the maximum likelihood again
    • thus we get a distribution of the maximum likelihood estimates
    • this we call the sampling distribution of the estimator (here the sample proportion, i.e. a mean)
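
  • a minimal sketch of this generate-and-re-estimate loop for the 8-flip example (plugging in \(\theta = .5\)):

set.seed(1)
theta_mle <- replicate(10000, {
  y <- rbinom(1, size = 8, prob = .5)  # generate a new set of 8 flips
  y / 8                                # re-estimate theta
})

hist(theta_mle, col = 'grey50', main = 'sampling distribution of the MLE')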

sampling distribution: analytical

  • in the binomial case, it is easy to derive the sampling distribution
  • we note that our estimate is:

\[ \hat X = \frac{1}{n} \sum_{i=1}^n X_i = \frac{1}{n} Y \]

  • so the estimate \(\hat X\) follows a scaled binomial distribution: \[ \hat X \sim \frac{1}{n} \mathrm{Bin}(\hat \theta, n) \]

sampling distribution: asymptotics

  • in general, the sampling distribution might be very hard to derive

  • when using the mean as estimator, the central limit theorem applies:
    • the average (or sum) of a large number of i.i.d. variables will be approximately normal regardless of the underlying distribution (but needs finite variance)

  • for more, see the optional reading on Fisher information

CLT: demonstration

n <- 100
x <- rbeta(n, 1, 100)    # n draws from a heavily right-skewed distribution
hist(x, col = 'grey50')

CLT: demonstration (bootstrap)

boot <- function(y, times = 1000) {
    # resample the data with replacement and compute the mean each time
    sapply(seq_len(times), function(i) {
        yy <- sample(y, replace = TRUE)
        mean(yy)
    })
}

CLT: demonstration (bootstrap)

xx <- boot(x)
hist(xx, col = 'grey50')

frequentist inference

key concepts

  • we have now seen how frequentist statistics works
  • later we will apply these concepts to the case of a t-test
  • let's talk about
    • the \(p\) value
    • confidence intervals
    • \(\alpha\) and \(\beta\)
    • statistical power

\(p\) value

  • popularized by R.A. Fisher as a measure of degree of evidence against \(H_0\)
  • for example, \(p = .02\) is considered stronger evidence than \(p = .04\)
  • in Fisher's view, the researcher only specifies \(H_0\)

\(p\) value

\(p\) value: example

  • test if the coin is biased, \(H_0: \theta = .5\)
  • flip it \(n = 24\) times, observe \(y = 18\) heads
  • we choose \(\hat \theta = \frac{18}{24}\) as our estimate
  • however, we need the sampling distribution of \(\hat \theta\) under \(H_0\), which is

\[ \hat \theta_{H_0} \sim \frac{1}{n} \mathrm{Bin}(0.5, n) \]

\(p\) value: example

n <- 24
heads <- 18
tails <- n - heads

# probability of obtaining at least as extreme data
sum(dbinom(heads:n, n, .5)) + sum(dbinom(0:tails, n, .5)) # two-tailed p value
## [1] 0.02265584

confidence intervals

  • when I prepared this, I thought this would be easy
  • alas, it is not! wikipedia lists 8 (!!) ways to compute CIs
    • that's insane
    • funnily enough, one very good method is Bayesian
  • wikipedia link

confidence intervals

  • again, it does not mean what you think it means
  • the standard deviation of the sampling distribution is called the standard error
  • approximating the sampling distribution by a Gaussian, the standard error is \(\frac{\hat \sigma}{\sqrt{n}}\)
  • a 95% CI is given by:

\[ \hat \theta \pm 1.96 \frac{\hat \sigma}{\sqrt{n}} \]

confidence intervals

  • if one takes repeated samples and computes a CI each time, 95% of those will contain the true \(\theta\)
  • for the specific data set collected, the true \(\theta\) either is or is not in the CI

confidence intervals

simulate.cis <- function(y, n, times = 100) {
    p <- y / n  # treat the observed proportion as the true bias
    cis <- matrix(NA, nrow = times, ncol = 2)
    
    for (i in 1:times) {
        # draw a new sample of size n and compute a Wald-type 95% CI
        y <- sample(c(1, 0), n, replace = TRUE, prob = c(p, 1-p))
        
        p.hat <- mean(y)
        se <- sd(y) / sqrt(n)
        
        cis[i, 1] <- p.hat - 1.96 * se
        cis[i, 2] <- p.hat + 1.96 * se
    }
    
    # flag whether each interval contains the true bias p
    within <- apply(cis, 1, function(row) row[1] < p && row[2] > p)
    cbind(cis, within)
}

times <- 100
cis <- simulate.cis(18, 24, times = times)

confidence intervals

plot.cis <- function(cis, times) {
    # set up an empty plotting canvas
    plot(0:times, seq(0, 1, 1 / times), type = 'n', xlab = 'i-th repeated sample',
         ylab = 'bias', main = '95% CIs for bias estimate')
    
    abline(h = 18 / 24, lwd = 3)  # the value treated as the true bias
    
    # draw each interval; intervals missing the true value are shown in red
    arrows(x0 = seq_len(times), y0 = cis[, 1], x1 = seq_len(times), y1 = cis[, 2],
           length = 0, lwd = 1.2,
           col = ifelse(cis[, 3] == 1, 'black', 'red'))
}

confidence intervals

\(p\) value & CI: example

binom.test(heads, n, .5)
## 
##  Exact binomial test
## 
## data:  heads and n
## number of successes = 18, number of trials = 24, p-value = 0.02266
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5328872 0.9022696
## sample estimates:
## probability of success 
##                   0.75

\(p\) value

  • is based upon the following logic

  • Premise 1: If \(H_0\), then \(\textbf{y}\) is very unlikely.
  • Premise 2: \(\textbf{y}\).
  • Conclusion: \(H_0\) is very unlikely.

\(p\) value

  • if the \(p\) value is small, say \(.03\), this means that
    • if \(H_0\) is true, the probability of obtaining at least as extreme data as the one observed is only 3%
  • so either something rare has occurred, or \(H_0\) is false

\(p\) value: flawed logic

  • Premise 1: If an individual is a man, he is unlikely to be the Pope.
  • Premise 2: Francis is the Pope.
  • Conclusion: Francis is probably not a man.

bias against \(H_0\)

quick note

  • \(p(H|\textbf{y}) \neq p(\textbf{y}|H)\)

  • \(p(\text{dead}|\text{head bitten off by a shark}) = 1\)
  • \(p(\text{head bitten off by a shark}|\text{dead}) < .0001\)

\(\alpha\) and \(\beta\)

  • Jerzy Neyman and Egon Pearson introduced the alternative hypothesis
  • now we can make two errors
    • reject \(H_0\), even though it is true (false positive)
    • keep \(H_0\), even though it is false (false negative)
  • \(\alpha\) controls the false positive rate (usually \(\alpha = .05\))
  • \(\beta\) controls the false negative rate (usually not specified)
  • the decision is binary: either \(p \leq \alpha\) (reject \(H_0\)) or \(p > \alpha\) (retain \(H_0\))

statistical power

  • is \(1 - \beta\)
  • is a function of \(\alpha\), the sample size, and the effect size (see the code sketch after this list)
  • gives the probability of rejecting \(H_0\) when it is really false
  • say \(\beta = .2\); this does not mean that
    • you have an 80% chance of correctly rejecting \(H_0\) in this experiment
  • again, the frequentist definition of probability bites us

  • averaged over all possible experiments, there's an 80% chance
    • for the specific experiment at hand, power tells us nothing
    • power is a pre-experimental concept, useful for planning

  • great visualization
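
  • for planning, such a calculation is built into R; a minimal sketch with assumed values (n = 40 per group and an effect of d = 0.5 are illustrative choices, not taken from the slides):

# power of a two-sample t-test at alpha = .05,
# assuming n = 40 per group and a true effect of d = 0.5
power.t.test(n = 40, delta = 0.5, sd = 1, sig.level = .05, type = 'two.sample')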

\(p\) values kill – again!

  • \(p\) value cannot be interpreted in isolation
  • it depends on the sample size, \(n\), thus on statistical power

  • this is often overlooked and, paired with misconceptions, leads to grave errors
    • do an experiment, find \(p > .05\)
    • two possible reasons: either \(H_0\) is true, or the data are uninformative
    • we cannot distinguish between those cases

\(p\) values kill – again!

  • traffic experiments (70s)
    • change something, record number of deaths
    • failed to find an increase in deaths, \(p > .05\)
    • went ahead and implemented the changes
  • turned out the experiment was just underpowered; people died

  • Bayesian inference can indicate when data is uninformative

Bayesian hypothesis testing

likelihood ratios

  • one main issue is that (Fisherian) significance testing does not specify an alternative
  • use likelihood ratios to quantify evidence (inherently comparative)
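
  • a quick example for the coin data from before (18 heads in 24 flips), comparing two point hypotheses, \(\theta = .75\) against \(\theta = .5\):

# likelihood ratio for theta = .75 versus theta = .5, given y = 18, n = 24
dbinom(18, 24, .75) / dbinom(18, 24, .5)  # roughly 23, favouring theta = .75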

likelihood ratios

likelihood ratios

  • one important issue
    • have to specify the alternative as a point
    • this is too demanding

  • Bayes solution
    • specify a distribution to quantify uncertainty

Bayesian updating

Bayesian hypothesis testing

  • is a generalization of the simple likelihood ratio test
  • instead of testing two point hypotheses, one tests composite hypotheses
  • remember the normalizing constant?

\[ p(\theta|\textbf{y}) = \frac{p(\textbf{y}|\theta)p(\theta)}{\int p(\textbf{y}|\theta)p(\theta)\mathrm{d}\theta} \]

  • hypothesis testing boils down to comparing normalizing constants

Bayesian hypothesis testing

  • how strongly can we believe in our hypothesis, given the collected data?
  • that is, we want \(p(H|\textbf{y})\)
  • a hypothesis can be instantiated in a model
  • \(H_0: \delta = 0\) corresponds to a model where the parameter \(\delta\) is fixed to 0
  • \(H_1: \delta \neq 0\) corresponds to a model where the parameter \(\delta\) is free to vary

Bayesian hypothesis testing

  • assume two models, \(M_0\) and \(M_1\), that instantiate \(H_0\) and \(H_1\), respectively
  • after we have collected data which model should we prefer?
  • the probability of each model is computed using Bayes' rule:

\[ \begin{align} p(M_0|\textbf{y}) &= \frac{p(\textbf{y}|M_0)p(M_0)}{p(\textbf{y})} \\[1ex] p(M_1|\textbf{y}) &= \frac{p(\textbf{y}|M_1)p(M_1)}{p(\textbf{y})} \end{align} \]

Bayesian hypothesis testing

\[ \begin{align} \frac{p(M_0|\textbf{y})}{p(M_1|\textbf{y})} &= \frac{\frac{p(\textbf{y}|M_0)p(M_0)}{p(\textbf{y})}} {\frac{p(\textbf{y}|M_1)p(M_1)}{p(\textbf{y})}} \\[2ex] \frac{p(M_0|\textbf{y})}{p(M_1|\textbf{y})} &= \frac{p(M_0)}{p(M_1)}\frac{p(\textbf{y}|M_0)}{p(\textbf{y}|M_1)} \\[2ex] \text{posterior odds} &= \text{prior odds} \times \text{Bayes factor} \end{align} \]

Bayesian hypothesis testing

  • the Bayes factor is an updating factor
  • it tells us how to update our prior beliefs about the hypotheses, given the data
  • note that there are two priors: over models \(M\), and over parameters \(\theta\)
  • the term \(p(\textbf{y}|M_0)\) is the marginal likelihood under model \(M_0\)
  • in other words, it is the probability of the data under model \(M_0\)
  • it quantifies how well this model predicts the data
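
  • a minimal sketch for the earlier coin example (18 heads in 24 flips); the uniform Beta(1, 1) prior on \(\theta\) under \(M_1\) is an illustrative assumption, not the prior used for the t-test later:

n <- 24; y <- 18

# marginal likelihood under M0: theta fixed at .5
m0 <- dbinom(y, n, .5)

# marginal likelihood under M1: average the likelihood over a Beta(1, 1) prior
m1 <- integrate(function(theta) dbinom(y, n, theta) * dbeta(theta, 1, 1), 0, 1)$value

m1 / m0  # Bayes factor BF10, roughly 5 in favour of M1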

t-test

t-test: effect of hats on creativity

  • does wearing hats increase creativity?
    • one control group (no hat)
    • one "treatment" group (hat)
    • subsequent creativity test
  • via this example we will again see the fundamental differences

t-test: dataset

set.seed(1774)

n <- 40
dat <- data.frame(hat = rnorm(n, mean = 50, sd = 5),
                  nohat = rnorm(n, mean = 50, sd = 5))
head(dat)
##        hat    nohat
## 1 53.98915 55.36670
## 2 48.84655 50.20971
## 3 52.84362 52.68426
## 4 45.78909 52.44228
## 5 50.34128 53.13781
## 6 49.68294 44.85191

t-test: dataset

t-test: likelihood

  • what is our statistical model?
  • for each group, we assume a normal model:

\[ p(X = x^n|\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp{\left\{ \frac{-(x_i - \mu)^2}{2\sigma^2} \right\}} \\ p(X = x^n|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}} \exp{\left\{ \frac{-\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2} \right\}} \]

MLE

  • the maximum likelihood estimates for \(\mu\) and \(\sigma^2\) are:

\[ \hat x = \frac{1}{n} \sum_{i=1}^n x_i \\ \hat s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat x)^2 \]

  • \(\hat s^2\) is biased; the unbiased estimate is:

\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \hat x)^2 \]

t-statistic

  • we need a distance metric to quantify difference between the two groups
  • this will be the t-statistic:

\[ t = \frac{\hat x_1 - \hat x_2}{ \sqrt{ \frac{s_1^2}{n} + \frac{s_2^2}{n} } } \]

  • which quantifies the difference between the two sample means relative to its standard error

classical inference

  • now we have a sample difference, t
  • to test \(H_0: \mu_1 = \mu_2\), we need the sampling distribution of t under \(H_0\)
  • unsurprisingly, this turns out to follow a Student's t-distribution with \(df = 2n - 2\)
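
  • computing the statistic and the two-sided \(p\) value by hand (with equal group sizes, the formula above coincides with the pooled-variance version used by t.test below):

t_stat <- (mean(dat$hat) - mean(dat$nohat)) /
  sqrt(var(dat$hat) / n + var(dat$nohat) / n)

df <- 2 * n - 2
2 * pt(-abs(t_stat), df)  # matches the p value reported by t.test below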


classical inference

t.test(dat$hat, dat$nohat, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  dat$hat and dat$nohat
## t = 0.28052, df = 78, p-value = 0.7798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.735484  2.304783
## sample estimates:
## mean of x mean of y 
##  50.51049  50.22584

Bayesian inference

  • let's estimate \(\mu\) and \(\sigma^2\) for each group separately
  • for convenience, we assume \(p(\mu, \sigma^2) = p(\mu)p(\sigma^2)\)

\[ \mu \sim \mathcal{N}(\mu_0, \omega^2_0) \\ \sigma^2 \sim \mathcal{IG}\left(\frac{v_0}{2}, \frac{v_0 \sigma_0^2}{2}\right) \]

  • these are (conditionally) conjugate priors, but there is a serious problem

Bayesian inference: mean

  • given these priors, the conditional posterior of the mean is:

\[ p(\mu|\sigma^2, \textbf{y}) = \mathcal{N}(\mu_1, \omega_1^2) \]

\[ \begin{align*} \mu_1 &= \frac{ \frac{n}{\sigma^2} \hat x + \frac{1}{\omega_0^2} \mu_0}{\frac{n}{\sigma^2} + \frac{1}{\omega_0^2}} \\ \omega_1^2 &= \left(\frac{n}{\sigma^2} + \frac{1}{\omega_0^2}\right)^{-1} \end{align*} \]

Bayesian inference: variance

  • given these priors, the conditional posterior of the variance is:

\[ p(\sigma^2|\mu, \textbf{y}) = \mathcal{IG}(\frac{v_1}{2}, \frac{v_1 \sigma_1^2}{2}) \]

where

\[ \begin{align*} v_1 &= v_0 + n \\ v_1 \sigma_1^2 &= v_0 \sigma_0^2 + S \\ S &= \sum_{i=1}^n (x_i - \mu)^2 \end{align*} \]

  • see this nice visualisation for various distributions

Bayesian inference: problem

  • posterior distribution of the mean is conditional on the variance
  • the posterior distribution of the variance is conditional on the mean
  • thus we cannot simply apply the sum rule and integrate out \(\mu\) or \(\sigma^2\)
  • instead, we have to resort to computational methods

Monte Carlo methods

  • anything we want to know about a random variable \(\theta\) can be learned by sampling many times from its density, \(f(\theta)\)
  • might well be the most important principle in modern statistics
  • let's look at three examples:
    • computing the expectation of a Poisson random variable
    • computing marginal densities using numerical integration
    • solving our problem using Gibbs sampling

Monte Carlo I

  • suppose I have a Poisson random variable with \(\lambda = 3\)
  • I can calculate the expectation numerically, relying on the law of large numbers: \[ \begin{align*} X &\sim \mathrm{Pois}(\lambda) \\ \mathbb{E}(X) &= \lim\limits_{n \to \infty} \frac{1}{n} \sum_{i=1}^n x_i \end{align*} \]
x <- rpois(10000, lambda = 3)
mean(x)
## [1] 3.027

Monte Carlo I

  • yep; the expectation, first moment, or mean of the Poisson distribution is given by:

\[ \begin{align*} \mathbb{E}(X) &= \sum_{x=1}^{\infty} xp(x) \\ &= \sum_{x=1}^{\infty} x \frac{\lambda^x\mathrm{e}^{-\lambda}}{x!} \\ &= \mathrm{e}^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^x}{(x - 1)!} \\ &= \mathrm{e}^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k + 1}}{k!} \quad (k = x - 1) \\ &= \lambda\mathrm{e}^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \\ &= \lambda\mathrm{e}^{-\lambda}\mathrm{e}^{\lambda} = \lambda \end{align*} \]

Monte Carlo II

  • an ex-Gaussian distribution is often used to model reaction time data
  • it is the sum of an exponential and a Gaussian

\[ x | \eta \sim \mathcal{N}(\mu + \eta, \sigma^2) \\[1ex] \eta \sim \mathcal{Exp}(\nu) \]

Monte Carlo II

  • but we do not want \(p(x|\eta)\) (conditional probability)
  • we want \(p(x)\) (marginal probability)
  • analytically, one would apply the sum rule and "integrate out" \(\eta\)
  • that integral is tedious to do by hand, so we use Monte Carlo methods instead

Monte Carlo II

exgauss <- function(mu, sigma, rate, times = 5000) {
    
  res <- rep(NA, times)
  
  for (i in 1:times) {
    nu <- rexp(1, rate)                 # draw the exponential component
    res[i] <- rnorm(1, mu + nu, sigma)  # draw x conditional on it
  }
  
  res
}

Monte Carlo II

Monte Carlo III

  • in our creativity example, both conditional posteriors depend on each other
  • in a sense, we now have to do Monte Carlo integration for both of them
  • we do this iteratively, in each step drawing from each distribution, conditional on the current value of the other
  • this process constitutes a Markov chain, where the draws are not independent

Monte Carlo III

gibbs <- function(mu0, w0, v0, sigma0, y, n.iter = 8000, burnin = n.iter / 4) {
  n <- length(y)
  
  v1 <- v0 + n
  mu_post <- rep(mean(y), n.iter)
  var_post <- rep(var(y), n.iter)
  
  for (i in 2:n.iter) {
    sigma2 <- var_post[i-1]
    
    # draw the mean, conditional on the current value of the variance
    w1 <- 1 / (n / sigma2 + 1 / w0^2)
    mu1 <- ((n / sigma2) * mean(y) + (1 / w0^2) * mu0) / (n / sigma2 + 1 / w0^2)
    mu_post[i] <- rnorm(1, mu1, sqrt(w1))
    
    # draw the variance, conditional on the freshly drawn mean;
    # the vague prior adds v0 * sigma0 to the sum of squares (negligible here)
    S <- sum((y - mu_post[i])^2)
    var_post[i] <- 1 / rgamma(1, v1 / 2, (v0 * sigma0 + S) / 2)
  }
  
  cbind(mu_post, var_post)[-(1:burnin), ]
}

estimating posterior distributions

  • for parameter estimation we can specify uninformative priors:

\[ \begin{align} \omega_0 &= 1000 \\ v_0 &= .001 \\ \sigma_0^2 &= .001 \end{align} \]

estimating posterior distributions

plot_post <- function(samples, title) {
  samples <- sort(samples)
  
  hist(samples, breaks = 60, main = title)
  abline(v = samples[length(samples) * 0.025],
         lwd = 2, lty = 'dashed', col = 'green')
  
  abline(v = samples[length(samples) * 0.975],
         lwd = 2, lty = 'dashed', col = 'green')
}

hat <- gibbs(50, 1000, .001, .001, dat$hat)
nohat <- gibbs(50, 1000, .001, .001, dat$nohat)

estimating posterior distributions

Bayesian inference: effect size

  • effect size is a deterministic function of the group parameters:

\[ d = \frac{\mu_1 - \mu_2}{\sigma} \]

  • in order to obtain the posterior \(p(d|\textbf{y})\), we just plug in our posterior samples

Bayesian inference: effect size

par(mfrow = c(1, 1))
d <- (hat[, 1] - nohat[, 1]) / sqrt(hat[, 2])
plot_post(d, 'Cohen\'s d of wearing hats')

Bayesian inference

  • we have estimated something, but we are not sure it exists!
  • that is, hypothesis testing is logically prior to estimation
  • before we estimate, we should test!

Bayesian inference: Bayes factor

  • pioneered by Harold Jeffreys (Ly et al., in press)
  • using Liang et al. (2008), Rouder et al. (2009) developed Bayes factor for t-test
  • the formula is rather unwieldy, but involves just a single integral
  • for more information, see Rouder et al. (2009)

Bayesian inference: Bayes factor

library('BayesFactor') # written by Richard Morey


ttestBF(dat$hat, dat$nohat)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 0.2404271 ±0.02%
## 
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS

Bayesian inference: Bayes factor

  • given the prior:

\[ d \sim \mathrm{Cauchy}(\sqrt{2} / 2) \]

  • we obtain a Bayes factor of only \(4.16\) in favour of the null
  • that is, the data are roughly 4 times more likely under \(H_0\) than under \(H_1\)

Summary: Frequentist deficits

  • \(p\) values
    • do not quantify statistical evidence & depend on \(n\)
    • overestimate the evidence against \(H_0\)
    • do depend on the intentions of the researcher
    • do depend on data not observed
    • cannot support \(H_0\)
  • confidence intervals
    • are by no means a solution
    • check out Richard Morey's work on that
  • optional stopping is not allowed

Summary: Bayesian benefits

  • Bayes factor
    • can support \(H_0\) (e.g. no gender differences)
    • monitor evidence as the data come in (optional stopping)
    • does not depend on \(n\), and can tell you if data are uninformative
  • posterior distributions

  • see the optional reading for a nice paper on Bayesian benefits

possible projects
  • more mathematical / simulation
    • write a paper on the Bayesian bootstrap
    • write a paper on error probabilities of the Bayes factor
    • summarize the literature on Markov chain Monte Carlo techniques
    • summarize the literature on finding objective prior distributions
    • summarize the literature on ways to compute the Bayes factor
  • more foundational / philosophical
    • write a paper on the difference between frequentism and Bayes
    • write a paper on the detrimental influence of Popper on statistics
  • more programming
    • implement an R package to do (Bayesian) task X
  • more analyzing
    • meta-analyze a certain field of research using a Bayesian mixture model
    • re-analyze studies published in the Journal in Support of the Null Hypothesis

next week

  • recap of today, answering questions
  • (Bayesian) regression modeling in R
  • check the course website for more info