What do you do when you don't know what's going on?

A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional parameter space.

overview

  • Intro & Motivation
    • motivation and misnomers
    • Bayesian linear regression on non-linear data
  • Gaussian process regression
    • weight-view & kernel trick
    • function-view and Stan implementation
  • Dirichlet process mixture
    • Finite mixture models
    • Good science, bad science
    • going infinite & modeling the distribution of \(p\)

  • Note: We'll skip most of the math!

nonparametric models

  • misnomer: they have parameters!
  • in fact, they have infinitely many (in the limit of \(n \rightarrow \infty\))


A Gaussian process regression is a Bayesian linear regression with infinitely many basis expansions.

data set

helper functions

library('rjags')
library('ggplot2')

# fit a JAGS model string ms to outcome y and design matrix X, return posterior samples of beta
fit_model = function(ms, y, X, n.iter = 2000) {
  dat = list('y' = y, 'X' = X, 'n' = length(y), 'p' = ncol(X))
  model = jags.model(textConnection(ms), dat, n.chains = 3, quiet = TRUE)
  coda.samples(model, variable.names = 'beta', n.iter = n.iter)
}

# posterior means of the coefficients
extract_coefs = function(samples) apply(as.matrix(samples), 2, mean)

# polynomial basis expansion, e.g. expand(x, 1:3) gives (x, x^2, x^3)
expand = function(x, powers) sapply(powers, function(i) x^i)

# plot the data together with a fitted function fn
plot_fit = function(fn) {
  ggplot(dat, aes(x = Timebin, y = basePupil)) +
         theme_bw() + geom_line(colour = 'red1') +
         geom_point() + stat_function(fun = fn, colour = 'darkblue')
}

Bayesian linear regression

regression = '
model {
  # prior on the residual standard deviation (JAGS parameterizes the normal by precision tau)
  sigma ~ dunif(0, 100)
  tau = pow(sigma, -2)

  # priors on the regression weights
  for (i in 1:p) {
    beta[i] ~ dnorm(0, .001)
  }

  # likelihood
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] = inprod(X[i, ], beta[])
  }
}'
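
A minimal usage sketch, assuming dat is the pupil-dilation data frame with columns Timebin and basePupil (the ones plot_fit uses):

X = cbind(1, dat$Timebin)                        # intercept + Timebin
samples = fit_model(regression, dat$basePupil, X)
b = extract_coefs(samples)                       # posterior means of beta
plot_fit(function(x) b[1] + b[2] * x)            # overlay the fitted straight line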

Linear regression

Basis expansions

  • linear model is quite rigid

\[ RSS = \sum_{i=1}^n (y_i - f(x_i))^2 \]

  • can expand to model 'non-straight' lines
    • instead of \(f(x) = x\beta\), use \(f(x) = \phi(x)\beta\)
    • where \(\phi(x)\) is a basis expansion, e.g. \(\phi(x_i) = (x_i, x_i^2, x_i^3)\) (see the sketch below)
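
For instance, a cubic fit could be obtained with the helpers from before (a sketch, assuming dat is the pupil-dilation data frame used by plot_fit):

X = cbind(1, expand(dat$Timebin, 1:3))           # phi(x) = (1, x, x^2, x^3)
samples = fit_model(regression, dat$basePupil, X)
b = extract_coefs(samples)
plot_fit(function(x) b[1] + b[2] * x + b[3] * x^2 + b[4] * x^3)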

Quadratic regression

Cubic regression

I think we are onto something.

Octic regression

30th order polynomial regression

Are we explaining anything?

splines

smoothing splines

\[ PRSS = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int f''(t)^2 \mathrm{d}t \]

library('mgcv'); m = gam(basePupil ~ s(Timebin, k = 10), data = dat); plot(m)

smoothing splines: note

  • Baayen teaches a class on generalized additive models in the spring (Tuesday, 18-21)
  • I can wholeheartedly recommend it as a data analysis class (there's no math!)
  • if you need a teaser, read his recently arXived paper: Out of the Cage of Shadows

Mathematical note

\[ p(\beta|y, X) = \frac{\pi(\beta)\prod_{i=1}^n p(y_i|X, \beta)}{\int \pi(\beta)\prod_{i=1}^n p(y_i|X, \beta) \mathrm{d}\beta} \]



\[ \begin{align*} \beta &\sim \mathcal{N}(m_0, \Sigma_p) \\ y|\beta, X &\sim \mathcal{N}(X\beta, C := I\sigma^2) \\ \beta|y, X &\sim \mathcal{N}(\mu, A^{-1}) \end{align*} \]

  • where

\[ \begin{align*} A &= \Sigma_p^{-1} + \sigma^{-2} X^T X \\[1.5ex] \mu &= A^{-1} \left( \Sigma_p^{-1} m_0 + \sigma^{-2} X^T \boldsymbol{y} \right) \end{align*} \]
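
A small numerical sketch of these closed-form updates (all data and prior values below are made up for illustration):

set.seed(1)
n = 50; sigma = 1
X = cbind(1, rnorm(n))                        # design matrix: intercept + one predictor
y = X %*% c(2, -1) + rnorm(n, 0, sigma)       # simulate data with beta = (2, -1)

m0 = c(0, 0); Sigma_p = diag(10, 2)           # prior mean and covariance
A  = solve(Sigma_p) + t(X) %*% X / sigma^2    # posterior precision
mu = solve(A, solve(Sigma_p) %*% m0 + t(X) %*% y / sigma^2)  # posterior mean, close to (2, -1)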

Mathematical note

  • basis expansion maps the design matrix \(X\) into a wider matrix \(\Phi\), i.e. \(X^{n \times p} \stackrel{\phi}{\longrightarrow} \Phi^{n \times pm}\)

    \[ \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} \stackrel{\phi}{\longrightarrow} \begin{pmatrix} \phi(x_{11}) & \phi(x_{12}) & \dots & \phi(x_{1p}) \\ \phi(x_{21}) & \phi(x_{22}) & \dots & \phi(x_{2p}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(x_{n1}) & \phi(x_{n2}) & \dots & \phi(x_{np}) \end{pmatrix} \]

Kernel trick

  • let \(z = \phi(x)\)
  • in linear regression, all our computations involving \(z\) are inner products, \(z^Tz'\)
  • thus, even when \(m \rightarrow \infty\), i.e. our feature space is infinite-dimensional,
  • the inner product still returns a single number

    \[ \begin{pmatrix} z_1, z_2, \ldots, z_m \end{pmatrix} \times \begin{pmatrix} z_1' \\ z_2' \\ \vdots \\ z_m' \end{pmatrix} = z_1\cdot z_1' + z_2 \cdot z_2' + \ldots + z_m \cdot z_m' \]

(for gory details, see Rasmussen & Williams, 2006; p.12)

Before the feature space

Kernel trick

  • let \(z^Tz' = K(x, x')\) and call it the Kernel
  • goal:
    • instead of computing all the powers in the (possibly infinite) feature space
    • we use the Kernel to do this computation implicitly

  • claim:
    • \(K(x, x') = (1 + x^T x')^d\) corresponds to an inner product \(z^T z'\),
    • where \(z\) and \(z'\) are the \(d^{\text{th}}\)-order polynomial expansions of \(x\) and \(x'\)

Kernel trick

  • claim: \(K(x, x') = (1 + x^T x')^d\) is the same as
    • transforming \(x\) and \(x'\) to \(z\) and \(z'\), respectively
    • and computing their (possibly very high dimensional) inner product

  • example: \(x = (x_1, x_2)\), \(x' = (x_1', x_2')\)
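
A quick numerical check of the claim for the two-dimensional example above, with \(d = 2\) (a sketch; the feature map is the usual quadratic expansion, scaled so the identity holds exactly):

# explicit quadratic feature map: phi(x) = (1, x1^2, x2^2, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2)
phi = function(x) c(1, x[1]^2, x[2]^2, sqrt(2) * x[1], sqrt(2) * x[2], sqrt(2) * x[1] * x[2])

x = c(1, 2); xp = c(3, -1)      # two arbitrary points
sum(phi(x) * phi(xp))           # inner product in feature space: 4
(1 + sum(x * xp))^2             # Kernel K(x, x') with d = 2: also 4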

Kernel trick

  • so instead of actually transforming our data, we simply use the Kernel
  • the Kernel becomes the object of primary interest
  • instead of reasoning about possible transformations (polynomial? sine?)
    • reason in terms of the Kernel, and its properties; e.g.
    • Radial Basis Function \(K(x, x') = \alpha \exp \left (-\gamma||x - x'||^2 \right)\)

A Gaussian process is an infinite dimensional multivariate Normal distribution.

key property

\[ \begin{align*} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} &\sim N \begin{bmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix} \end{bmatrix} \\[2ex] \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} &\sim N \begin{bmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \end{bmatrix} \\[2ex] X_1 &\sim N(\mu_1, \Sigma_{11}) \end{align*} \]
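
A quick empirical check of this marginalization property (a sketch; MASS and the particular \(\mu\) and \(\Sigma\) below are arbitrary):

library('MASS')

mu = c(0, 1, 2)
Sigma = matrix(c(1, .5, .2,
                 .5, 1, .3,
                 .2, .3, 1), nrow = 3)

x = mvrnorm(1e5, mu, Sigma)
colMeans(x[, 1:2])   # close to (0, 1)
cov(x[, 1:2])        # close to the top-left 2 x 2 block of Sigma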

Gaussian process

  • is a stochastic process
    • generalizes the notion of a probability distribution to functions
  • key idea: instead of restricting the set of functions, estimate them
  • concretely, specify a prior over the (infinite set) of functions
  • estimate them from the data

    \[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \]

  • function is just a very long vector!

Gaussian process

  • the probability distribution over \(f(x)\) is a Gaussian process if, for any finite set of inputs \(x_1, x_2, \ldots, x_n\), the marginal density \(p(f(x_1), f(x_2), \ldots, f(x_n))\) is Gaussian (cf., MacKay, 1998)

    \[ \begin{pmatrix} f(x_1) \\ f(x_2) \\ f(x_3) \\ \vdots \\ f(x_n) \end{pmatrix} \sim N \begin{bmatrix} m(x), \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \ldots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \ldots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \ldots & K(x_n, x_n) \end{pmatrix} \end{bmatrix} \]
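
To get a feel for this prior over functions, one can draw a few samples from such a multivariate normal, here with the RBF Kernel from before (a sketch; MASS, the input grid, and the kernel parameters are arbitrary choices):

library('MASS')

rbf = function(x, xp, alpha = 1, gamma = 1) alpha * exp(-gamma * (x - xp)^2)

x = seq(0, 10, length.out = 100)
K = outer(x, x, rbf) + diag(1e-8, length(x))   # kernel matrix plus a little jitter for stability
f = mvrnorm(5, rep(0, length(x)), K)           # 5 draws from the GP prior
matplot(x, t(f), type = 'l', lty = 1)          # each line is one sampled function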

Gaussian process

Gaussian process

Gaussian process in Stan: setup

gp = '
data {
  int N1;
  int N2;
  vector[N1] y1;
  vector[N1] x1;
  vector[N2] x2;
}

transformed data {
  int<lower=1> N;
  vector[N1 + N2] x;
  vector[N1 + N2] mu;
  N <- N1 + N2;
  for (i in 1:N) mu[i] <- 0;
  for (i in 1:N1) x[i] <- x1[i];
  for (i in 1:N2) x[N1 + i] <- x2[i];
}

...
'

Gaussian process in Stan: model

'parameters {
  vector[N2] y2;
  real<lower=0> eta_sq;
  real<lower=0> rho_sq;
  real<lower=0> sigma_sq;
}

model {
  vector[N] y;
  matrix[N, N] Sigma;
  eta_sq ~ cauchy(0,5);
  rho_sq ~ cauchy(0,5);
  sigma_sq ~ cauchy(0, 5);

  for (i in 1:N1) y[i] <- y1[i];
  for (i in 1:N2) y[N1 + i] <- y2[i];

  // squared-exponential (RBF) covariance, with observation noise on the diagonal
  for (i in 1:N) {
    for (j in 1:N) {
      Sigma[i,j] <- eta_sq * exp(-rho_sq * pow(x[i] - x[j], 2)) +
                    if_else(i == j, sigma_sq, 0.0);
    }
  }

  y ~ multi_normal(mu, Sigma);
}'

Gaussian process in Stan

library('rstan')
library('parallel')
# maybe find a better data set??

N1 = 21
N = length(y)
N2 = N - N1
stan_dat = list('y1' = y[1:N1], 'y2' = y[(N1+1):N],
                'x1' = X[1:N1], 'x2' = X[(N1+1):N],
                'N' = N, 'N1' = N1, 'N2' = N2)
compfit = stan(model_code = gp, data = stan_dat, chains = 0)

sflist = mclapply(1:4, mc.cores = 4,
           function(i) stan(fit = compfit, data = stan_dat, seed = sample.int(.Machine$integer.max, 1),
                            chains = 1, chain_id = i, refresh = -1, iter = 8000))

fit = sflist2stanfit(sflist)

Gaussian process prediction

ypred = apply(rstan::extract(fit, 'y2')[['y2']], 2, mean)
dat[(N1+1):N, 2] = ypred # beware of the high variance!
ggplot(dat, aes(x = Timebin, y = basePupil)) + theme_bw() + geom_point() +
       geom_line(colour = 'blue') + ggtitle('Pupil dilation over time')

wrap up

  • distinguish between restriction bias and preference bias
    • in 'standard' regression, we specify a constraint in terms of the basis expansion
    • in Gaussian process regression, we specify a preference via the Kernel
  • Gaussian process specifies a prior over functions
  • the Kernel encodes what type of functions we prefer
  • actually possible to compute all this in closed form (see RW, 2006, Ch. 2)

  • see here for a great podcast on Gaussian processes (episode 2)
  • see here for an amazing (and funny!) explanation of the Kernel trick
  • see here for the Stan manual (which includes Gaussian process regression)

Finite Mixtures

remember?

therapeuticTouches


  • TT practitioners claim to sense energy fields
  • they try to sense which of their hands is near another person's body

(Kruschke, chapter 9.2.4)

remember?

Iris clustering

kmeans clustering

Gaussian mixture model

library('ggmcmc')
library('mixtools')


# fit a k-component Gaussian mixture model string ms to data y; returns samples of mu and sigma
fit_mixture = function(ms, y, k, n.iter = 2000) {
  dat = list('y' = y, 'n' = length(y), 'alpha' = rep(1, k), 'k' = k)
  model = jags.model(textConnection(ms), dat, n.chains = 3, quiet = TRUE)
  update(model, n.iter / 4)
  coda.samples(model, variable.names = c('mu', 'sigma'), n.iter = n.iter)
}

Gaussian mixture model

mix = '
model {
  theta[1:k] ~ ddirch(alpha[]) # mixing distribution

  # class conditional Gaussians
  for (i in 1:k) {
    mu[i] ~ dnorm(0, .0001)
    sigma[i] ~ dunif(0, 100)
    tau[i] = pow(sigma[i], -2)
  }

  # likelihood
  for (i in 1:n) {
    c[i] ~ dcat(theta[1:k])
    y[i] ~ dnorm(mu[c[i]], tau[c[i]])
  }
}'

Gaussian mixture

ss = ggs(fit_mixture(mix, faithful$waiting, 2))
ggs_density(ss)
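
mixtools is loaded above but not used in this snippet; as a quick sanity check (a sketch, not part of the analysis above), the posterior means can be compared against a classical EM fit:

m = normalmixEM(faithful$waiting, k = 2)   # EM fit of a two-component Gaussian mixture
m$mu                                       # component means, roughly 54 and 80
m$sigma                                    # component standard deviations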

a note on meta-analysis

  • science is a cumulative endeavour
  • no single experiment will be decisive (except if your name is Stroop)
  • thus, we want to aggregate all the experiments done in a certain area
  • in order to uncover the truth, because every experiment is noisy

  • the enemy of all science: publication bias
    • in short: crap in -> meta-analysis -> crap out
    • there is no substitute for registered replications (van Elk et al., 2015)

a note on meta-analysis

  • instead of trying to uncover the truth, we can ask whether a research area has evidential value
  • similar to the distinction between model comparison and parameter estimation
    • before we estimate an effect, let's see if we can trust the research
    • the most popular method to do this is \(p\) curve
    • we will look at a recently proposed Bayesian \(p\) curve

\(p\) values drawn from a mixture

\[ p(p_i) = \alpha f_{H_0} + (1 - \alpha) f_{H_1} \]

distribution of \(p\) values under \(H_0\)

x = replicate(10000, rnorm(100))
ps = apply(x, 2, function(dat) t.test(dat)[['p.value']])
hist(ps)

distribution of \(p\) values under \(H_1\)

  • complex function of sample size, true effect, etc.
    • any set of \(p\) values we analyze will be heterogeneous
    • and thus their distribution will be a mixture (see the simulation after this list)

  • don't want to use a restrictive parametric model here
    • the resulting model mis-specification might tamper with our inference
    • instead, use a Dirichlet process mixture!
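
A small simulation of this heterogeneity (a sketch; the range of true effect sizes is made up): each study gets its own effect size, so the resulting \(p\) values are a mixture of differently right-skewed distributions.

ps = replicate(10000, {
  d = runif(1, 0.1, 0.8)                     # study-specific true effect size
  t.test(rnorm(100, mean = d))[['p.value']]
})
hist(ps)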

pcurve demo

Chinese restaurant process

Chinese restaurant process

\[ p(c_n = k|c_{1:n-1}) = \begin{cases} \frac{m_k}{n - 1 + \alpha} & \mbox{if $k \le K_{+}$} \\ \frac{\alpha}{n - 1 + \alpha} & \mbox{otherwise} \end{cases} \]


  • where \(m_k\) is the number of people sitting at table \(k\)
  • \(K_{+}\) is the number of tables for which \(m_k > 0\)
  • \(\alpha\) is the concentration parameter; larger values imply more clusters
  • cool thing: this sequence is exchangeable! (for more, see Gershman & Blei, 2012)
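
A minimal simulation of this seating process (a sketch; \(\alpha\) and the number of customers are arbitrary):

crp = function(n, alpha) {
  tables = integer(0)                              # m_k: number of customers at each table
  for (i in 1:n) {
    probs = c(tables, alpha) / (i - 1 + alpha)     # existing tables vs. opening a new one
    k = sample(length(probs), 1, prob = probs)
    if (k > length(tables)) tables = c(tables, 1) else tables[k] = tables[k] + 1
  }
  tables
}

set.seed(1)
crp(100, alpha = 1)    # cluster sizes for 100 customers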

Gronau et al.'s model

\[ \begin{align*} G &\sim DP(\cdot|G_0, \alpha) \\[1.5ex] \theta_i &\sim G(\cdot) \\[1.5ex] \Phi^{-1}(p_i^{H_1}) &\sim N(\cdot|\theta_i)_{T(, -1.64485)} \end{align*} \]

stick breaking

stick breaking

  • sampling from a Dirichlet process prior can be decomposed into two parts (see the sketch after this list)
    • breaking the stick: draw proportions \(q_j \sim \mathcal{Beta}(1, \alpha)\) and set the weights \(\pi_j = q_j \prod_{k=1}^{j-1} (1 - q_k)\)
    • drawing the associated locations \(\theta_j \sim G_0\)
  • the resulting probability measure \(\sum_j \pi_j \delta_{\theta_j}\) is one draw from the Dirichlet process prior
  • for more, see the short and spicy paper by Gronau et al. (submitted)
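
A (truncated) stick-breaking draw from a DP prior, as a sketch in R; the base distribution \(G_0\) is taken to be a standard normal and the truncation level is arbitrary (an illustration, not Gronau et al.'s code):

stick_breaking = function(alpha, nmax = 50) {
  q = rbeta(nmax, 1, alpha)                  # stick-breaking proportions q_j
  w = q * cumprod(c(1, 1 - q[-nmax]))        # weights: pi_j = q_j * prod_{k<j} (1 - q_k)
  theta = rnorm(nmax)                        # locations drawn from G_0 = N(0, 1)
  data.frame(weight = w, location = theta)
}

set.seed(1)
G = stick_breaking(alpha = 1)
plot(G$location, G$weight, type = 'h')       # one (truncated) draw from the DP prior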

Likelihood

bayesmix = '
model{
  alpha ~ dunif(.5,7) # hyperprior
  phi ~ dbeta(1,1) # mixing weight H0 & H1

  for (i in 1:n) {
    ind[i] ~ dbern(phi) # 1 = H0 component, 0 = one of the H1 clusters
    z[i] ~ dcat(prop_stick[]) # H1 cluster assignment (stick-breaking weights)
    z1[i] = z[i] + 1
    z2[i] = ifelse(ind[i] == 1, 1, z1[i])
    qp[i] ~ dnorm(mu[z2[i]], lambda[z2[i]])T(,-1.64485) # probit-transformed p value
  }

  ...
}'

Stick breaking

'
  # stick-breaking construction of the mixture weights
  for (j in 1:(nmaxclust - 1)) {
    prop_remaining[j] ~ dbeta(1, alpha)
  }
  
  prop_stick[1] = prop_remaining[1]
  
  for (j in 2:(nmaxclust - 1)) {
    prop_stick[j] = prop_remaining[j] * prod(1 - prop_remaining[1:(j-1)])
  }
  
  prop_stick[nmaxclust] = 1 - sum(prop_stick[1:(nmaxclust - 1)])
  
  for (j in 1:nmaxclust) {
    mu_h1[j] ~ dnorm(0, 1)T(-6,0)
    lambda_h1[j] ~ dgamma(0.001, 0.001)T(,1)
  }
  
  mu[1] = 0
  lambda[1] = 1
  mu[2:(nmaxclust + 1)] = mu_h1
  lambda[2:(nmaxclust + 1)] = lambda_h1
}'

wrap up

  • science is a cumulative endeavour
  • publication bias is the enemy of all science (and the misuse of \(p\) values)
  • model the distribution of \(p\) values to get a sense of the evidential value
    • use a Dirichlet process mixture to allow for a flexible model
    • be Bayesian in order to get what you really want

final wrap up

  • Bayesian nonparametrics are a very flexible class of models
  • if you are still unsure about projects, you can look at their applications
    • e.g., in category learning, intelligence research, topic modeling

  • next week: your turn!
  • prepare a short presentation about
    • the current state of your project
    • and what you plan to do before submission (2nd March)
    • send me your slides / code by email by Tuesday, 23:59