recap

Bayes rule for parameter estimation:

\[\underbrace{P(\theta \, | \, D)}_{\text{posterior}} \propto \underbrace{P(\theta)}_{\text{prior}} \times \underbrace{P(D \, | \, \theta)}_{\text{likelihood}}\]

Bayes factor for model comparison:

\[ \begin{align*} \underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} & = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}} \\ \underbrace{P(D \mid M_i)}_{\text{evidence}} & = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i \end{align*} \]

\(p\)-values for null-hypothesis significance testing

key notions

  • NHST by parameter estimation
    • region of practical equivalence (ROPE)
    • a posteriori credible values
  • NHST by Bayes factor model comparison
    • Savage-Dickey method
    • Lindley paradox
  • model criticism
    • posterior predictive check
    • posterior \(p\)-value

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

NHST example

recap

\(p\)-values for null-hypothesis significance testing:

  • fix a data set \(d^*\) from a set of possible observations \(D\)
  • fix a likelihood function \(P(D = d \mid \theta)\)
  • fix a null hypothesis: \(\theta = \theta^*\)
  • fix a test statistic \(t \colon D \rightarrow \mathbb{R}\)
  • yields a sampling distribution:

\[ \begin{align*} & P(t(d) \mid \theta^*) = \int \delta_{t(d') = t(d)} P(D = d' \mid \theta^*) \text{ d}d' \\ & \text{where } \delta_{ a = b } = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases} \end{align*} \]

  • define \(p\)-value:

\[ \int \delta_{ P(t(d) \mid \theta^*) \le P(t(d^*) \mid \theta^*) } \ P( t(d) \mid \theta^*) \text{ d}d \]

fixing a sample space

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • what's the set of possible observations?


(figure: sample spaces under fixed \(N\) vs. fixed \(k\))

fix \(N\)

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • set of possible observations: \(D = \left\{ 0, 1, \dots, 23, 24 \right\}\)
  • binomial likelihood function:

\[P(D = d \mid \theta) = {{N}\choose{d}} \theta^{d} \, (1-\theta)^{N - d}\]

  • null hypothesis: \(\theta = 0.5\) & test statistic: \(t(d) = d\)
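
A minimal R sketch of this \(p\)-value, computed directly from the density-ordering definition above (assuming the fixed-\(N\) sample space):

k = 7; N = 24
probs = dbinom(0:N, N, 0.5)             # sampling distribution of d under theta* = 0.5
sum(probs[probs <= dbinom(k, N, 0.5)])  # all outcomes at most as probable as d* = 7; about 0.064
# same value as binom.test(k, N)$p.value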

fix \(k\)

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • set of possible observations: \(D = \left\{ 7, 8, \dots \right\}\)
  • "negative binomial" likelihood function:

\[P(D = d \mid \theta) = \frac{k}{d} {{d}\choose{k}} \theta^{k} \, (1-\theta)^{d - k}\]

  • null hypothesis: \(\theta = 0.5\) & test statistic: \(t(d) = d\)
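
A companion sketch for the fixed-\(k\) sample space, truncating the infinite set of possible trial counts at 500; here dnbinom(d - k, k, 0.5) is the probability that exactly d flips are needed to reach k heads:

k = 7; d_obs = 24
d = k:500                                      # truncated sample space of possible trial counts
probs = dnbinom(d - k, size = k, prob = 0.5)   # sampling distribution of d under theta* = 0.5
sum(probs[probs <= probs[d == d_obs]])         # about 0.017: not the same as under fixed N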

upshot

  • \(p\)-values depend on the assumed sample space
    • e.g., stopping intention behind experiment / data collection
    • otherwise we cannot normalize \(P(D \mid \theta)\)
    • in other words, assumptions about the sample space / sampling procedure are part of the model that is to be tested
  • this can be good or bad:
    • good, if we know the sampling procedure
    • bad, if we must guess

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

NHST using HDIs

example

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: estimate \(P(\theta \mid D)\)
  • which likelihood function to use?


(figure: sample spaces under fixed \(N\) vs. fixed \(k\))

anything goes

  • estimation of \(P(\theta \mid D)\) is independent of assumptions about the sample space and the sampling procedure
  • any normalizing constant \(X\) cancels out:

\[ \begin{align*} P(\theta \mid D) & = \frac{P(\theta) \ P(D \mid \theta)}{\int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{ \frac{1}{X} \ P(\theta) \ P(D \mid \theta)}{ \ \frac{1}{X}\ \int_{\theta'} P(\theta') \ P(D \mid \theta')} \\ & = \frac{P(\theta) \ \frac{1}{X}\ P(D \mid \theta)}{ \int_{\theta'} P(\theta') \ \frac{1}{X}\ P(D \mid \theta')} \end{align*} \]
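
A quick numerical check of this point (a sketch, using the coin data \(k = 7\) out of \(N = 24\)): the grid posterior is identical whether we use the binomial kernel or the fixed-\(k\) kernel, which differs from it only by the constant factor \(k/N\):

k = 7; N = 24
theta = seq(0.001, 0.999, by = 0.001)          # grid over theta
post_fixN = dbinom(k, N, theta)                # binomial likelihood
post_fixK = (k/N) * dbinom(k, N, theta)        # fixed-k likelihood: same kernel times k/N
post_fixN = post_fixN / sum(post_fixN)         # normalize over the grid
post_fixK = post_fixK / sum(post_fixK)
max(abs(post_fixN - post_fixK))                # numerically zero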

posterior credible \(\theta\)'s

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: estimate \(P(\theta \mid D)\)
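
A minimal sketch of this estimate, assuming a uniform Beta(1,1) prior so that the posterior is Beta(k+1, N-k+1); the 95% HDI is the shortest interval containing 95% of the posterior mass:

k = 7; N = 24
a = k + 1; b = N - k + 1                       # posterior is Beta(a, b) under a Beta(1,1) prior
width = function(p) qbeta(p + 0.95, a, b) - qbeta(p, a, b)
opt = optimize(width, c(0, 0.05))              # shortest 95% interval over possible lower tails
hdi = c(qbeta(opt$minimum, a, b), qbeta(opt$minimum + 0.95, a, b))
hdi                                            # 95% HDI of P(theta | D)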

ROPEs and credible values

regions of practical equivalence

  • small regions \([\theta - \epsilon, \theta + \epsilon]\) around each \(\theta\)
    • values (practically) indistinguishable from \(\theta\)

credible values

  • value \(\theta\) is rejectable if its ROPE lies entirely outside of the posterior HDI
  • value \(\theta\) is believable if its ROPE lies entirely within the posterior HDI
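
A sketch of these two decision rules, reusing the hdi computed in the sketch above and a hypothetical ROPE half-width of epsilon = 0.01 around the null value theta = 0.5:

theta0 = 0.5; eps = 0.01                           # hypothetical ROPE half-width
rope = c(theta0 - eps, theta0 + eps)
rejectable = rope[2] < hdi[1] | rope[1] > hdi[2]   # ROPE entirely outside the HDI
believable = rope[1] > hdi[1] & rope[2] < hdi[2]   # ROPE entirely within the HDI
c(rejectable = rejectable, believable = believable)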

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

NHST using Bayes factors

model comparison

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
    • problem: how to specify \(M_0\) and \(M_1\)?


properly nested models

  • suppose there are \(n\) continuous parameters of interest \(\theta = \langle \theta_1, \dots, \theta_n \rangle\)
  • \(M_1\) is a model defined by \(P(\theta \mid M_1)\) & \(P(D \mid \theta, M_1)\)
  • \(M_0\) is properly nested under \(M_1\) if:
    • \(M_0\) assigns fixed values to parameters \(\theta_i = x_i, \dots, \theta_n = x_n\)
    • \(\lim_{\theta_i \rightarrow x_i, \dots, \theta_n \rightarrow x_n} P(\theta_1, \dots, \theta_{i-1} \mid \theta_i, \dots, \theta_n, M_1) = P(\theta_1, \dots, \theta_{i-1} \mid M_0)\)
    • \(P(D \mid \theta_1, \dots, \theta_{i-1}, M_0) = P(D \mid \theta_1, \dots, \theta_{i-1}, \theta_i = x_i, \dots, \theta_n = x_n, M_1)\)

example

  • observed: \(k = 7\) out of \(N = 24\) flips came up heads
  • goal: compare a null-model \(M_0\) with an alternative model \(M_1\)
  • model specification:
    • \(M_0\) has \(\theta = 0.5\) and \(k \sim \text{Binomial}(0.5, N)\)
    • \(M_1\) has \(\theta \sim \text{Beta}(1,1)\) and \(k \sim \text{Binomial}(\theta, N)\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{\text{Binomial}(k,N,0.5)}{\int_0^1 \text{Beta}(\theta, 1, 1) \ \text{Binomial}(k,N, \theta) \text{ d}\theta} \\ & = \frac{{{N}\choose{k}} 0.5^{k} \, (1-0.5)^{N - k}}{\int_0^1 {{N}\choose{k}} \theta^{k} \, (1-\theta)^{N - k} \text{ d}\theta} \\ & = \frac{0.5^{k} \, (1-0.5)^{N - k}}{\mathrm{B}(k+1, N-k+1)} \approx 0.516 \end{align*} \]
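
A numerical check of this Bayes factor (a sketch; \(\mathrm{B}\) above is the beta function, and the integral over \(\theta\) can equally be done with R's integrate):

k = 7; N = 24
evidence_M0 = dbinom(k, N, 0.5)
evidence_M1 = integrate(function(theta) dbeta(theta, 1, 1) * dbinom(k, N, theta), 0, 1)$value
evidence_M0 / evidence_M1                      # about 0.516, as derived above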

Savage-Dickey method

Savage-Dickey method

let \(M_0\) be properly nested under \(M_1\) s.t. \(M_0\) fixes \(\theta_i = x_i, \dots, \theta_n = x_n\)

\[ \begin{align*} \text{BF}(M_0 > M_1) & = \frac{P(D \mid M_0)}{P(D \mid M_1)} \\ & = \frac{P(\theta_i = x_i, \dots, \theta_n = x_n \mid D, M_1)}{P(\theta_i = x_i, \dots, \theta_n = x_n \mid M_1)} \end{align*} \]
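
For the coin example this becomes a one-liner (a sketch, assuming the uniform Beta(1,1) prior of \(M_1\), so that the posterior is Beta(k+1, N-k+1)):

k = 7; N = 24
dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)   # about 0.516, matching the direct computation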

Lindley paradox

new example

k = 49581
N = 98451

\(p\)-value NHST

binom.test(k, N)$p.value
## [1] 0.02364686

Savage-Dickey BF

dbeta(0.5, k+1, N - k + 1)
## [1] 19.21139
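
Written out in full, the Savage-Dickey ratio divides the posterior density at \(\theta = 0.5\) by the prior density at \(\theta = 0.5\), which is 1 under the flat Beta(1,1) prior; the resulting Bayes factor of about 19 favours the null model even though the \(p\)-value is below 0.05. This is the Lindley paradox.

dbeta(0.5, k + 1, N - k + 1) / dbeta(0.5, 1, 1)   # BF in favour of the null
## [1] 19.21139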

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

model criticism

motivation

  • parameter estimation: what \(\theta\) to believe in?
  • model comparison: which model is better than another?
  • model criticism: is a given model plausible (enough)?


posterior predictive checks

graphically compare simulated observations with the actual observations

Bayesian predictive \(p\)-values

measure surprise level of data under a model

posterior predictive checks

exponential forgetting model

y   = c(.94, .77, .40, .26, .24, .16)  # proportion recalled at each retention interval
t   = c(  1,   3,   6,   9,  12,  18)  # retention intervals
obs = y*100                            # number recalled out of 100 items
model{
  a ~ dunif(0, 1.5)
  b ~ dunif(0, 1.5)
  for (i in 1:6){
    p[i] = min(max(a*exp(-t[i]*b), 0.0001), 0.9999)  # exponential forgetting curve, clipped to (0,1)
    obs[i] ~ dbinom(p[i], 100)    # condition on data
    obsRep[i] ~ dbinom(p[i], 100) # replicate fake data
  }
}
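
A sketch of how the replicated data can be obtained with rjags (assuming JAGS is installed and the model block above is saved as "forgetting.txt"; any other MCMC interface to JAGS would work as well):

library(rjags)
dataList  = list(obs = obs, t = t)
jagsModel = jags.model("forgetting.txt", data = dataList, n.chains = 2)
samples   = coda.samples(jagsModel, variable.names = c("obsRep"), n.iter = 5000)
summary(samples)   # posterior means and intervals of the replicated counts, used for the PPC plots below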

PPC: exponential model

  • black dots: data
  • blue dots: mean of replicated fake data
  • blue bars: 95% HDIs of replicated fake data

PPC: power model

  • black dots: data
  • blue dots: mean of replicated fake data
  • blue bars: 95% HDIs of replicated fake data

Bayesian predictive model criticism

  • fix a data set \(d^*\) from a set of possible observations \(D\)
  • fix a model with \(P(\theta)\) and \(P(D = d \mid \theta)\)
  • fix a test statistic \(t \colon D, \theta \rightarrow \mathbb{R}\)
    • test statistic may depend on parameters
  • Bayesian predictive \(p\)-value:

\[ \int \int \delta_{ t(d',\theta) \ge t(d^*,\theta) } \ P(D = d' \mid \theta) \ P(\theta \mid d^*) \text{ d}\theta \text{ d}d' \]

example

obs = c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
k = sum(obs) # 7
N = length(obs) #20

\(p\)-value NHST:

  • do not reject the null hypothesis \(\theta = 0.5\)
binom.test(k, N, 0.5)$p.value
## [1] 0.263176

Bayesian posterior predictive \(p\)-value

  • test statistic: number of switches between 1 and 0
  • \(t(d^*) = 3\)
  • posterior predictive \(p\)-value \(\approx 0.028\)
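
A minimal simulation sketch of this posterior predictive \(p\)-value, assuming the uniform Beta(1,1) prior on \(\theta\) (posterior Beta(k+1, N-k+1)); since surprisingly few switches are what makes the data look non-independent, the lower tail is the relevant one:

set.seed(1)
t_obs = sum(diff(obs) != 0)                   # number of switches in the observed sequence (3)
theta_post = rbeta(10000, k + 1, N - k + 1)   # draws from the posterior of theta
t_rep = sapply(theta_post, function(theta) {
  d_rep = rbinom(N, 1, theta)                 # replicated sequence of N coin flips
  sum(diff(d_rep) != 0)                       # switches in the replicated sequence
})
mean(t_rep <= t_obs)                          # posterior predictive p-value, about 0.03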

(figure: distribution of the test statistic in replicated data, with the observed value in the tail)

Gelman et al. 2014, p.147–8

summary

overview

|               | estimation | comparison | criticism |
|---------------|------------|------------|-----------|
| goal          | which \(\theta\), given \(M\) & \(D\)? | which better: \(M_0\) or \(M_1\)? | \(M\) good model of \(D\)? |
| method        | Bayes rule | Bayes factor | \(p\)-value |
| no. of models | 1 | 2 | 1 |
| \(H_0\)       | subset of \(\theta\) | \(P(\theta \mid M_0), P(D \mid \theta, M_0)\) | \(P(\theta), P(D \mid \theta)\) |
| \(H_1\)       | | \(P(\theta \mid M_1), P(D \mid \theta, M_1)\) | |
| prerequisites | \(P(\theta), \alpha \times P(D \mid \theta)\) | | test statistic |
| pros          | lean, easy | intuitive, plausible, Ockham's razor | absolute |
| cons          | vagueness in ROPE | prior dependence, computational load | sample space? |

next time

  • read Kruschke Chapter 14 in preparation
  • install STAN and JASP
  • think about project
  • finish homework 3