recap

Bayes rule for parameter estimation:

\[\underbrace{P(\theta \, | \, D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \times \underbrace{P(D \, | \, \theta)}_{\text{likelihood}}\]

Bayes factor for model comparison:

\[ \begin{align*} \underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} & = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}} \\ \underbrace{P(D \mid M_i)}_{\text{evidence}} & = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i \end{align*} \]

\(p\)-values for null-hypothesis significance testing

key notions

  • NHST by parameter estimation
    • region of practical equivalence (ROPE)
    • a posteriori credible values
  • NHST by Bayes factor model comparison
    • Savage-Dickey method
    • Lindley paradox
  • model criticism
    • posterior predictive check
    • posterior \(p\)-value

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

NHST example

recap

\(p\)-values for null-hypothesis significance testing:

  • fix a data set \(d^*\) from a set of possible observations \(D\)
  • fix a likelihood function \(P(D = d \mid \theta)\)
  • fix a null hypothesis: \(\theta = \theta^*\)
  • fix a test statistic \(t \colon D \rightarrow \mathbb{R}\)
  • yields a sampling distribution:

\[ \begin{align*} & P(t(d) \mid \theta^*) = \int \delta_{t(d') = t(d)} P(D = d' \mid \theta^*) \text{ d}d' \\ & \text{where } \delta_{ a = b } = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases} \end{align*} \]

  • define \(p\)-value:

\[ \int \delta_{ P(t(d) \mid \theta^*) \le P(t(d^*) \mid \theta^*) } \ P( t(d) \mid \theta^*) \text{ d}d \]

fixing a sample space

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • what's the set of possible observations?


(figure: sample spaces under fixed \(N\) vs. fixed \(k\))

fix \(N\)

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • set of possible observations: \(D = \left\{ 0, 1, \dots, 23, 24 \right\}\)
  • binomial likelihood function:

\[P(D = d \mid \theta) = {{N}\choose{d}} \theta^{d} \, (1-\theta)^{N - d}\]

  • null hypothesis: \(\theta = 0.5\) & test statistic: \(t(d) = d\)
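
A minimal R sketch of this \(p\)-value, computed directly from the density-ordering definition above (assuming the fixed-\(N\) sample space):

k = 7; N = 24
probs = dbinom(0:N, N, 0.5)             # sampling distribution of d under theta* = 0.5
sum(probs[probs <= dbinom(k, N, 0.5)])  # all outcomes at most as probable as d* = 7; about 0.064
# same value as binom.test(k, N)$p.value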

fix \(k\)

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • set of possible observations: \(D = \left\{ 7, 8, \dots \right\}\)
  • "negative binomial" likelihood function:

\[P(D = d \mid \theta) = \frac{k}{d} {{d}\choose{k}} \theta^{k} \, (1-\theta)^{d - k}\]

  • null hypothesis: \(\theta = 0.5\) & test statistic: \(t(d) = d\)
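
A companion sketch for the fixed-\(k\) sample space, truncating the infinite set of possible trial counts at 500; here dnbinom(d - k, k, 0.5) is the probability that exactly d flips are needed to reach k heads:

k = 7; d_obs = 24
d = k:500                                      # truncated sample space of possible trial counts
probs = dnbinom(d - k, size = k, prob = 0.5)   # sampling distribution of d under theta* = 0.5
sum(probs[probs <= probs[d == d_obs]])         # about 0.017: not the same as under fixed N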

upshot

  • \(p\)-values depend on the assumed sample space
    • e.g., stopping intention behind experiment / data collection
    • otherwise we cannot normalize \(P(D \mid \theta)\)
    • in other words, assumptions about the sample space / sampling procedure are part of the model that is to be tested
  • this can be good or bad:
    • good, if we know the sampling procedure
    • bad, if we must guess

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

NHST using HDIs

example

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: estimate \(P(\theta \mid D)\)
  • which likelihood function to use?


(figure: sample spaces under fixed \(N\) vs. fixed \(k\))

anything goes

  • estimation of \(P(\theta \mid D)\) is independent of assumptions about the sample space and the sampling procedure
  • any normalizing constant \(X\) cancels out:

\[ \begin{align*} P(\theta \mid D) & = \frac{P(\theta) \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{ \frac{1}{X} \ P(\theta) \ P(D \mid \theta)}{ \ \frac{1}{X}\ \int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{P(\theta) \ \frac{1}{X}\ P(D \mid \theta)}{ \int_{\theta'} P(\theta') \ \frac{1}{X}\ P(D \mid \theta')} \end{align*} \]
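
A quick numerical check of this point (a sketch, using the coin data \(k = 7\) out of \(N = 24\)): the grid posterior is identical whether we use the binomial kernel or the fixed-\(k\) kernel, which differs from it only by the constant factor \(k/N\):

k = 7; N = 24
theta = seq(0.001, 0.999, by = 0.001)          # grid over theta
post_fixN = dbinom(k, N, theta)                # binomial likelihood
post_fixK = (k/N) * dbinom(k, N, theta)        # fixed-k likelihood: same kernel times k/N
post_fixN = post_fixN / sum(post_fixN)         # normalize over the grid
post_fixK = post_fixK / sum(post_fixK)
max(abs(post_fixN - post_fixK))                # numerically zero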

posterior credible \(\theta\)'s

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: estimate \(P(\theta \mid D)\)
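
A minimal sketch of this estimate, assuming a uniform Beta(1,1) prior so that the posterior is Beta(k+1, N-k+1); the 95% HDI is the shortest interval containing 95% of the posterior mass:

k = 7; N = 24
a = k + 1; b = N - k + 1                       # posterior is Beta(a, b) under a Beta(1,1) prior
width = function(p) qbeta(p + 0.95, a, b) - qbeta(p, a, b)
opt = optimize(width, c(0, 0.05))              # shortest 95% interval over possible lower tails
hdi = c(qbeta(opt$minimum, a, b), qbeta(opt$minimum + 0.95, a, b))
hdi                                            # 95% HDI of P(theta | D)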

ROPEs and credible values

regions of practical equivalence

  • small regions \([\theta - \epsilon, \theta + \epsilon]\) around each \(\theta\)
    • values (practically) indistinguishable from \(\theta\)

credible values

  • value \(\theta\) is rejectable if its ROPE lies entirely outside of the posterior HDI
  • value \(\theta\) is believable if its ROPE lies entirely within the posterior HDI
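
A sketch of these two decision rules, reusing the hdi computed in the sketch above and a hypothetical ROPE half-width of epsilon = 0.01 around the null value theta = 0.5:

theta0 = 0.5; eps = 0.01                           # hypothetical ROPE half-width
rope = c(theta0 - eps, theta0 + eps)
rejectable = rope[2] < hdi[1] | rope[1] > hdi[2]   # ROPE entirely outside the HDI
believable = rope[1] > hdi[1] & rope[2] < hdi[2]   # ROPE entirely within the HDI
c(rejectable = rejectable, believable = believable)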

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

NHST using Bayes factors

model comparison

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
    • problem: how to specify \(M_0\) and \(M_1\)?


properly nested models

  • suppose there are \(n\) continuous parameters of interest \(\theta = \langle \theta_1, \dots, \theta_n \rangle\)
  • \(M_1\) is a model defined by \(P(\theta \mid M_1)\) & \(P(D \mid \theta, M_1)\)
  • \(M_0\) is properly nested under \(M_1\) if:
    • \(M_0\) assigns fixed values to parameters \(\theta_i = x_i, \dots, \theta_n = x_n\)
    • \(\lim_{\theta_i \rightarrow x_i, \dots, \theta_n \rightarrow x_n} P(\theta_1, \dots, \theta_{i-1} \mid \theta_i, \dots, \theta_n, M_1) = P(\theta_1, \dots, \theta_{i-1} \mid M_0)\)
    • \(P(D \mid \theta_1, \dots, \theta_{i-1}, M_0) = P(D \mid \theta_1, \dots, \theta_{i-1}, \theta_i = x_i, \dots, \theta_n = x_n, M_1)\)

example

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
  • model specification:
    • \(M_0\) has \(\theta = 0.5\) and \(k \sim \text{Binomial}(0.5, N)\)
    • \(M_1\) has \(\theta \sim \text{Beta}(1,1)\) and \(k \sim \text{Binomial}(\theta, N)\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{{{N}\choose{k}} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 {{N}\choose{k}} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{k} \, (1-0.5)^{N - k}}{\mathrm{B}(k+1, N-k+1)} \approx 0.516 \end{align*} \]
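
A numerical check of this Bayes factor (a sketch; \(\mathrm{B}\) above is the beta function, and the integral over \(\theta\) can equally be done with R's integrate):

k = 7; N = 24
evidence_M0 = dbinom(k, N, 0.5)
evidence_M1 = integrate(function(theta) dbeta(theta, 1, 1) * dbinom(k, N, theta), 0, 1)$value
evidence_M0 / evidence_M1                      # about 0.516, as derived above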

Savage-Dickey method

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]
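
For the coin example this becomes a one-liner (a sketch, assuming the uniform Beta(1,1) prior of \(M_1\), so that the posterior is Beta(k+1, N-k+1)):

k = 7; N = 24
dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)   # about 0.516, matching the direct computation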

Lindley paradox

new example

k = 49581
N = 98451

\(p\)-value NHST

binom.test(k, N)$p.value
## [1] 0.02364686

Savage-Dickey BF

dbeta(0.5, k+1, N - k + 1)
## [1] 19.21139
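
Written out in full, the Savage-Dickey ratio divides the posterior density at \(\theta = 0.5\) by the prior density at \(\theta = 0.5\), which is 1 under the flat Beta(1,1) prior; the resulting Bayes factor of about 19 favours the null model even though the \(p\)-value is below 0.05. This is the Lindley paradox.

dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)   # BF in favour of the null
## [1] 19.21139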

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

model criticism

motivation

  • parameter estimation: what \(\theta\) to believe in?
  • model comparison: which model is better than another?
  • model criticism: is a given model plausible (enough)?


posterior predictive checks

graphically compare simulated observations with the actual observations

Bayesian predictive \(p\)-values

measure surprise level of data under a model

posterior predictive checks

exponential forgetting model

y   = c(.94, .77, .40, .26, .24, .16)  # proportion recalled at each retention interval
t   = c(  1,   3,   6,   9,  12,  18)  # retention intervals
obs = y*100                            # number recalled out of 100 items
model{
  a ~ dunif(0, 1.5)
  b ~ dunif(0, 1.5)
  for (i in 1:6){
    p[i] = min(max(a*exp(-t[i]*b), 0.0001), 0.9999)  # exponential forgetting curve, clipped to (0,1)
    obs[i] ~ dbinom(p[i], 100)    # condition on data
    obsRep[i] ~ dbinom(p[i], 100) # replicate fake data
  }
}
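
A sketch of how the replicated data can be obtained with rjags (assuming JAGS is installed and the model block above is saved as "forgetting.txt"; any other MCMC interface to JAGS would work as well):

library(rjags)
dataList  = list(obs = obs, t = t)
jagsModel = jags.model("forgetting.txt", data = dataList, n.chains = 2)
samples   = coda.samples(jagsModel, variable.names = c("obsRep"), n.iter = 5000)
summary(samples)   # posterior means and intervals of the replicated counts, used for the PPC plots below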

PPC: exponential model

  • black dots: data
  • blue dots: mean of replicated fake data
  • blue bars: 95% HDIs of replicated fake data

PPC: power model

  • black dots: data
  • blue dots: mean of replicated fake data
  • blue bars: 95% HDIs of replicated fake data

Bayesian predictive model criticism

  • fix a data set \(d^*\) from a set of possible observations \(D\)
  • fix a model with \(P(\theta)\) and \(P(D = d \mid \theta)\)
  • fix a test statistic \(t \colon D, \theta \rightarrow \mathbb{R}\)
    • test statistic may depend on parameters
  • Bayesian predictive \(p\)-value:

\[ \int \int \delta_{ t(d',\theta) \ge t(d^*,\theta) } \ P(D = d' \mid \theta) \ P(\theta \mid d^*) \text{ d}\theta \text{ d}d' \]

example

obs = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k = sum(obs) # 7
N = length(obs) #20

\(p\)-value NHST:

  • do not reject the null hypothesis \(\theta = 0.5\)
binom.test(k, N, 0.5)$p.value
## [1] 0.263176

Bayesian posterior predictive \(p\)-value

  • test statistic: number of switches between 1 and 0
  • \(t(d^*) = 3\)
  • posterior predictive \(p\)-value \(\approx 0.028\)
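
A minimal simulation sketch of this posterior predictive \(p\)-value, assuming the uniform Beta(1,1) prior on \(\theta\) (posterior Beta(k+1, N-k+1)); since surprisingly few switches are what makes the data look non-independent, the lower tail is the relevant one:

set.seed(1)
t_obs = sum(diff(obs) != 0)                   # number of switches in the observed sequence (3)
theta_post = rbeta(10000, k + 1, N - k + 1)   # draws from the posterior of theta
t_rep = sapply(theta_post, function(theta) {
  d_rep = rbinom(N, 1, theta)                 # replicated sequence of N coin flips
  sum(diff(d_rep) != 0)                       # switches in the replicated sequence
})
mean(t_rep <= t_obs)                          # posterior predictive p-value, about 0.03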

(figure: distribution of the test statistic in replicated data, with the observed value in the tail)

Gelman et al. 2014, p.147–8

summary

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

next time

  • read Kruschke Chapter 14 in preparation
  • install STAN and JASP
  • think about project
  • finish homework 3