recap

Bayes rule for data analysis:

\[\underbrace{P(\theta \, | \, D)}_{posterior} \propto \underbrace{P(\theta)}_{prior} \times \underbrace{P(D \, | \, \theta)}_{likelihood}\]

Markov Chain Monte Carlo

  • sequence of representative samples from \(X \sim P\)

JAGS

  • declarative language to specify data-generating model
    • model = prior + likelihood function

model comparison

model_1 = prior_1 + likelihood function_1

model_2 = prior_2 + likelihood function_2

which model is better given data?

key notions

  • maximum likelihood estimation
  • Akaike's information criterion
    • free parameters
    • model complexity
  • Bayes factors
    • grid approximations
    • marginal likelihood MCMC estimate
    • transdimensional MCMC
    • Savage-Dickey method (next time!)

maximum likelihood

forgetting data

  • subjects were each asked to remember items (Murdock 1961)
  • recall rates y for 100 subjects after time t in seconds
y = c(.94, .77, .40, .26, .24, .16)  # proportion recalled
t = c(  1,   3,   6,   9,  12,  18)  # retention interval in seconds
obs = y*100                          # number of correct recalls out of 100

example from Myung (2007), JoMP, tutorial on MLE

forgetting models

recall probabilities for different times \(t\)

exponential

\[P(t \ ; \ a, b) = a \exp (-bt)\]

\[\text{where } a,b>0 \]

power

\[P(t \ ; \ c, d) = ct^{-d}\]

\[\text{where } c,d>0 \]
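
for later use in the prediction and plotting code, the two models as plain R functions (the names expo and power are chosen to match the calls further below):

expo  = function(t, a, b) a * exp(-b * t)  # exponential forgetting curve
power = function(t, c, d) c * t^(-d)       # power forgetting curve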

likelihood

exponential

nLL.exp <- function(w1,w2) {
  # negative log-likelihood of the exponential model
  if (w1 < 0 | w2 < 0 | w1 > 20 | w2 > 20) {
    return(NA)  # keep the optimizer inside a sane parameter range
  }
  p = w1*exp(-w2*t)
  p[p <= 0.0] = 1.0e-5    # clamp predicted probabilities into (0,1)
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, size = 100, log = TRUE))
}
show(nLL.exp(1,1))
## [1] 1092.446

power

nLL.pow <- function(w1,w2) {
  # negative log-likelihood of the power model
  if (w1 < 0 | w2 < 0 | w1 > 20 | w2 > 20) {
    return(NA)  # keep the optimizer inside a sane parameter range
  }
  p = w1*t^(-w2)
  p[p <= 0.0] = 1.0e-5    # clamp predicted probabilities into (0,1)
  p[p >= 1.0] = 1-1.0e-5
  - sum(dbinom(x = obs, prob = p, size = 100, log = TRUE))
}
show(nLL.pow(1,1))
## [1] 142.1096

MLE

require('stats4')
bestExpo = mle(nLL.exp, start = list(w1=1,w2=1))
bestPow  = mle(nLL.pow, start = list(w1=1,w2=1))
MLEstimates = data.frame(model = rep(c("expo", "power"), each = 2),
                         parameter = c("a", "b", "c", "d"),
                         value = c(coef(bestExpo), coef(bestPow)))
show(MLEstimates)
##   model parameter     value
## 1  expo         a 1.0701173
## 2  expo         b 0.1308300
## 3 power         c 0.9531186
## 4 power         d 0.4979311

best fits visualization
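
a sketch of how the best-fitting curves can be plotted against the data (assumes ggplot2 and the expo/power helpers defined above):

require('ggplot2')
tGrid = seq(1, 18, by = 0.1)
predictions = data.frame(
  t = rep(tGrid, 2),
  p = c(expo(tGrid, coef(bestExpo)[1], coef(bestExpo)[2]),
        power(tGrid, coef(bestPow)[1], coef(bestPow)[2])),
  model = rep(c("expo", "power"), each = length(tGrid)))
ggplot(predictions, aes(x = t, y = p, color = model)) + geom_line() +
  geom_point(data = data.frame(t = t, y = y), aes(x = t, y = y), inherit.aes = FALSE)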

model comparison

which model is better?

thetaExp = coef(bestExpo)   # MLEs of a and b
thetaPow = coef(bestPow)    # MLEs of c and d
predExp = expo(t, thetaExp[1], thetaExp[2])
predPow = power(t, thetaPow[1], thetaPow[2])
modelStats = data.frame(model = c("expo", "power"),
                        logLike = round(c(logLik(bestExpo), logLik(bestPow)),3),
                        pData = exp(c(logLik(bestExpo), logLik(bestPow))),
                        r = round(c(cor(predExp, y), cor(predPow,y)),3),
                        LSE = round(c( sum((predExp-y)^2), sum((predPow-y)^2)),3))
show(modelStats)
##   model logLike        pData     r   LSE
## 1  expo -18.666 7.820887e-09 0.982 0.019
## 2 power -26.726 2.471738e-12 0.948 0.057

Akaike information criterion

motivation

  • the higher \(P(D \mid \hat{\theta})\), the better the model
    • where \(\hat{\theta} \in \arg \max_\theta P(D \mid \theta)\)
  • the more free parameters, the worse the model
    • principle of parsimony (Ockham's razor)
  • information-theoretic notion:
    • amount of information lost when we assume that the data was generated by the model under \(\hat{\theta}\)

definition

Let \(M\) be a model with \(k\) parameters, and \(D\) be some data:

\[\text{AIC}(M, D) = 2k - 2 \ln P(D \mid \hat{\theta})\]

The smaller the AIC, the better the model.
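
a quick sanity check of this definition, for the exponential model with \(k = 2\) free parameters (the value matches the AIC column below):

show(2*2 - 2*as.numeric(logLik(bestExpo)))
## [1] 41.33294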

model comparison by AIC

##   model logLike        pData     r   LSE      AIC
## 1  expo -18.666 7.820887e-09 0.982 0.019 41.33294
## 2 power -26.726 2.471738e-12 0.948 0.057 57.45220
show(AIC(bestExpo, bestPow))
##          df      AIC
## bestExpo  2 41.33294
## bestPow   2 57.45220

remarks

  • given more and more data, repeated model selection by AIC does not guarantee ending up with the true model
  • the "model" that AIC evaluates is just the likelihood function; the prior plays no role
  • discounting the number of parameters, as AIC does, does not take the effective strength of the parameters into account
  • there are other information criteria that overcome some of these problems:
    • Bayesian information criterion
    • deviance information criterion

Bayes factors

  • take two models:
    • \(P(\theta_1 \mid M_1)\) and \(P(D \mid \theta_1, M_1)\)
    • \(P(\theta_2 \mid M_2)\) and \(P(D \mid \theta_2, M_2)\)
  • ideally, we'd want to know the absolute probability of \(M_i\) given the data
    • but then we'd need to know the set of all models (for normalization)
  • alternatively, we take the odds of the two models given the data:

\[\underbrace{\frac{P(M_1 \mid D)}{P(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{P(D \mid M_1)}{P(D \mid M_2)}}_{\text{Bayes factor}} \ \underbrace{\frac{P(M_1)}{P(M_2)}}_{\text{prior odds}}\]

The Bayes factor is the factor by which our prior odds are changed by the data.

evidence

Bayes factor in favor of model \(M_1\)

\[\text{BF}(M_1 > M_2) = \frac{P(D \mid M_1)}{P(D \mid M_2)}\]

evidence of model \(M_i\) (= marginal likelihood of data)

\[P(D \mid M_i) = \int P(\theta_i \mid M_i) \ P(D \mid \theta_i, M_i) \text{ d}\theta_i\]

The evidence marginalizes out the parameters \(\theta_i\). It is a function of both the prior and the likelihood.

how to interpret Bayes factors

BF(M1 > M2)   interpretation
1             irrelevant data
1 - 3         hardly worth ink or breath
3 - 6         anecdotal
6 - 10        now we're talking: substantial
10 - 30       strong
30 - 100      very strong
100 +         decisive (bye, bye \(M_2\)!)

how to calculate Bayes factors

  1. calculate each model's evidence:
    • by brute force (clever math)
    • by grid approximation
    • by MCMC estimation
  2. calculate the Bayes factor directly:
    • transdimensional MCMC
    • Savage-Dickey method (next time)

grid approximation

  • consider a grid of discrete values for \(\theta\)
  • approximate the evidence by a sum over this grid (see the formula after this list)
  • works well for low-dimensional \(\theta\)
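
in formulas, for a one-dimensional \(\theta\) on a grid with spacing \(\Delta\theta\):

\[P(D \mid M) = \int P(\theta \mid M) \ P(D \mid \theta, M) \ \text{d}\theta \approx \sum_{j} P(\theta_j \mid M) \ P(D \mid \theta_j, M) \ \Delta\theta\]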

example

priorExpo = function(a, b){
  dunif(a, 0, 1.5) * dunif(b, 0, 1.5)
}
lhExpo = function(a, b){
  p = a*exp(-b*t)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}
priorPow = function(c, d){
  dunif(c, 0, 1.5) * dunif(d, 0, 1.5)
}
lhPow = function(c, d){
  p = c*t^(-d)
  p[p <= 0.0] = 1.0e-5
  p[p >= 1.0] = 1-1.0e-5
  prod(dbinom(x = obs, prob = p, size = 100))
}

example

the flat priors (and the grid spacing) are the same for both models, so they cancel in the Bayes factor; summing the likelihoods suffices

grid = seq(0.005, 1.495, by = 0.01)
evidenceExp = sum(sapply(grid, function(a) 
      sum(sapply (grid, function(b) lhExpo(a,b)))) )
evidencePow = sum(sapply(grid, function(a) 
      sum(sapply (grid, function(b) lhPow(a,b)))) )
show(as.numeric(evidenceExp / evidencePow))
## [1] 1221.392

overwhelming evidence in favor of the exponential model

MCMC estimation

recap: why simulate?

generally:

\[\int f(\theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} f(\theta_i)\]

so:

\[P(D) = \int P(D \mid \theta) \ P(\theta) \ \text{d}\theta \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta)} P(D \mid \theta_i)\]

example

require('ggplot2')            # for plotting the running estimate
lhExpo = Vectorize(lhExpo)    # make the likelihood functions accept vectors
lhPow  = Vectorize(lhPow)
nSamples = 40000
a = runif(nSamples, 0, 1.5)   # draws from the flat priors
b = runif(nSamples, 0, 1.5)
lhExpVec = lhExpo(a,b)
lhPowVec = lhPow(a,b)
# running Bayes factor estimate as a function of the number of samples used
BFVec = sapply(seq(1000,nSamples, by = 100), 
     function(i) sum(lhExpVec[1:i]) / sum(lhPowVec[1:i]))
ggplot(data.frame(i = seq(1000,nSamples, by = 100), BF = BFVec), aes(x=i, y=BF)) +
  geom_point()

example

(figure: the running Bayes factor estimate produced by the code above)

trick

take an arbitrary function \(h(\theta)\) such that \(\int h(\theta) \text{d}\theta = 1\)

\[ \begin{align*} \frac{1}{P(D)} & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)}\\ & = \frac{P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \int h(\theta) \text{d}\theta \\ & = \int \frac{h(\theta) P(\theta \mid D)}{P(D \mid \theta) P(\theta)} \text{d}\theta \\ & \approx \frac{1}{n} \sum^{n}_{\theta_i \sim P(\theta \mid D)} \frac{h(\theta_i)}{P(D \mid \theta_i) P(\theta_i)} \end{align*} \]

choose an \(h(\theta)\) that resembles the posterior (homework)
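
a self-contained toy check of the estimator (not part of the forgetting example; the data here are made up): for binomial data with a uniform prior the evidence is \(1/(n+1)\), and taking \(h\) to be the exact (conjugate) posterior recovers it without sampling error:

k = 7; n = 10                               # made-up toy data: 7 successes in 10 trials
lik   = function(th) dbinom(k, n, th)
prior = function(th) dbeta(th, 1, 1)        # uniform prior on the success probability
h     = function(th) dbeta(th, k+1, n-k+1)  # h = exact conjugate posterior
post  = rbeta(10000, k+1, n-k+1)            # direct samples from the posterior
show(1 / mean(h(post) / (lik(post) * prior(post))))  # estimated evidence P(D)
show(1 / (n+1))                                      # true evidence: 1/11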

transdimensional MCMC

idea

make the model index \(m\) a parameter; its posterior gives the posterior model odds, from which the Bayes factor follows (see the sketch after the JAGS model)

JAGS model

model{
  m ~ dbern(0.5)    # model index: m = 0 -> exponential, m = 1 -> power
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  c ~ dunif(0,1.5)
  d ~ dunif(0,1.5)
  for (i in 1:6){
    pT[i] = ifelse(m == 0, a*exp(-t[i]*b), c*t[i]^(-d))
    p[i] = min(max( pT[i], 0.0001), 0.9999)   # clamp predictions into (0,1)
    obs[i] ~ dbinom(p[i], 100)
  }
}
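
a minimal sketch of how the Bayes factor can be read off the sampled model indices (mSamples is a hypothetical vector of posterior draws of m, extracted from the JAGS output):

# with the symmetric dbern(0.5) prior, prior model odds are 1,
# so the posterior odds equal the Bayes factor
BF = sum(mSamples == 0) / sum(mSamples == 1)  # in favor of the exponential model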

posteriors

posterior of model 1

model{
  m = 0   # model index fixed: this samples the posterior of the exponential model
  a ~ dunif(0,1.5)
  b ~ dunif(0,1.5)
  c ~ dunif(0,1.5)
  d ~ dunif(0,1.5)
  for (i in 1: 6){
    pT[i] = ifelse(m == 0, a*exp(-t[i]*b), c*t[i]^(-d))
    p[i] = min(max( pT[i], 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)
  }
}

posterior of model 1

getGammaApprox = function(samples){
  # gamma distribution whose mode equals the sample mean
  # and whose standard deviation equals the sample sd
  s = sd(samples)
  m = mean(samples)
  ra = ( m + sqrt( m^2 + 4*s^2 ) ) / ( 2 * s^2 )
  sh = 1 + m * ra
  return(c(shape = sh, rate = ra))
}
require('reshape2')   # provides dcast()
wideSamples = dcast(ms, Iteration + Chain ~ Parameter)  # ms: samples in long format
paramA = getGammaApprox(wideSamples$a)

posterior of model 2

##   parameter     shape      rate
## 1         a 1123.5593 1056.7669
## 2         b  204.3565 1557.6671
## 3         c 2339.2259 2471.4932
## 4         d  237.0689  475.1929

trickery

tweaking prior model odds & using pseudo-priors

model{
  m ~ dcat(c(0.001,0.999))               # tweaked prior model odds: P(m=1) = 0.001
  a[1] ~ dunif(0,1.5)                    # true priors, used when model m is active
  b[1] ~ dunif(0,1.5)
  c[2] ~ dunif(0,1.5)
  d[2] ~ dunif(0,1.5)
  a[2] ~ dgamma(1124.84336, 1058.83299)  # pseudo-priors (gamma approximations of
  b[2] ~ dgamma(199.82804, 1527.78959)   # the posteriors), used when model m is
  c[1] ~ dgamma(61.88311, 29.28403)      # inactive
  d[1] ~ dgamma(115.44879, 126.33878)
  for (i in 1: 6){
    pT[i] = ifelse(m == 1, a[m]*exp(-t[i]*b[m]), c[m]*t[i]^(-d[m]))
    p[i] = min(max( pT[i], 0.0001), 0.9999)
    obs[i] ~ dbinom(p[i], 100)
  }
}
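
with tweaked prior odds, the posterior odds must be divided by the prior odds to recover the Bayes factor; a minimal sketch (mSamples again a hypothetical vector of posterior draws of m, coded 1 = exponential, 2 = power):

postOdds  = sum(mSamples == 1) / sum(mSamples == 2)
priorOdds = 0.001 / 0.999                 # as set in dcat() above
BF = postOdds / priorOdds                 # Bayes factor in favor of the exponential model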

posteriors

summary

  • Bayes factors are cool, but can be difficult to compute
    • approximate by brute-force grid computation
    • approximate by (clever) sampling
    • transdimensional MCMC
    • Savage-Dickey for nested models (next time)

homework for next week