What do you do when you don't know what's going on?

A Bayesian nonparametric model is a Bayesian model on an infinite-dimensional parameter space.

overview

  • Intro & Motivation
    • motivation and misnomers
    • Bayesian linear regression on non-linear data
  • Gaussian process regression
    • weight-view & kernel trick
    • function-view and Stan implementation
  • Dirichlet process mixture
    • Finite mixture models
    • Good science, bad science
    • going infinite & modeling the distribution of \(p\)

  • Note: We'll skip most of the math!

nonparametric models

  • misnomer: they have parameters!
  • in fact, they have infinitely many (in the limit of \(n \rightarrow \infty\))


A Gaussian process regression is a Bayesian linear regression with infinitely many basis expansions.

data set

helper functions

library('rjags')
library('ggplot2')

# fit a JAGS model string ms to outcome y and design matrix X, return posterior samples of beta
fit_model = function(ms, y, X, n.iter = 2000) {
  dat = list('y' = y, 'X' = X, 'n' = length(y), 'p' = ncol(X))
  model = jags.model(textConnection(ms), dat, n.chains = 3, quiet = TRUE)
  coda.samples(model, variable.names = 'beta', n.iter = n.iter)
}

# posterior means of the coefficients
extract_coefs = function(samples) apply(as.matrix(samples), 2, mean)

# polynomial basis expansion, e.g. expand(x, 1:3) gives (x, x^2, x^3)
expand = function(x, powers) sapply(powers, function(i) x^i)

# plot the data together with a fitted function fn
plot_fit = function(fn) {
  ggplot(dat, aes(x = Timebin, y = basePupil)) +
         theme_bw() + geom_line(colour = 'red1') +
         geom_point() + stat_function(fun = fn, colour = 'darkblue')
}

Bayesian linear regression

regression = '
model {
  # prior on the residual standard deviation (JAGS parameterizes the normal by precision tau)
  sigma ~ dunif(0, 100)
  tau = pow(sigma, -2)

  # priors on the regression weights
  for (i in 1:p) {
    beta[i] ~ dnorm(0, .001)
  }

  # likelihood
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] = inprod(X[i, ], beta[])
  }
}'
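
A minimal usage sketch, assuming dat is the pupil-dilation data frame with columns Timebin and basePupil (the ones plot_fit uses):

X = cbind(1, dat$Timebin)                        # intercept + Timebin
samples = fit_model(regression, dat$basePupil, X)
b = extract_coefs(samples)                       # posterior means of beta
plot_fit(function(x) b[1] + b[2] * x)            # overlay the fitted straight line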

Linear regression

Basis expansions

  • linear model is quite rigid

\[ RSS = \sum_{i=1}^n (y_i - f(x_i))^2 \]

  • can expand to model 'non-straight' lines
    • instead of \(f(x) = x\beta\), use \(f(x) = \phi(x)\beta\)
    • where \(\phi(x)\) is a basis expansion, e.g. \(\phi(x_i) = (x_i, x_i^2, x_i^3)\) (see the sketch below)
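
For instance, a cubic fit could be obtained with the helpers from before (a sketch, assuming dat is the pupil-dilation data frame used by plot_fit):

X = cbind(1, expand(dat$Timebin, 1:3))           # phi(x) = (1, x, x^2, x^3)
samples = fit_model(regression, dat$basePupil, X)
b = extract_coefs(samples)
plot_fit(function(x) b[1] + b[2] * x + b[3] * x^2 + b[4] * x^3)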

Quadratic regression

Cubic regression

I think we are onto something.

Octic regression

30th order polynomial regression

Are we explaining anything?

splines

smoothing splines

\[ PRSS = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int f''(t)^2 \mathrm{d}t \]

library('mgcv'); m = gam(basePupil ~ s(Timebin, k = 10), data = dat); plot(m)

smoothing splines: note

  • Baayen teaches a class on generalized additive models in the spring (Tuesday, 18-21)
  • I can wholeheartedly recommend it as a data analysis class (there's no math!)
  • if you need a teaser, read his recently arXived paper: Out of the Cage of Shadows

Mathematical note

\[ p(\beta|y, X) = \frac{\pi(\beta)\prod_{i=1}^n p(y_i|X, \beta)}{\int \pi(\beta)\prod_{i=1}^n p(y_i|X, \beta) \mathrm{d}\beta} \]



\[ \begin{align*} \beta &\sim \mathcal{N}(m_0, \Sigma_p) \\ y|\beta, X &\sim \mathcal{N}(X\beta, C := I\sigma^2) \\ \beta|y, X &\sim \mathcal{N}(\mu, A^{-1}) \end{align*} \]

  • where

\[ \begin{align*} A &= \Sigma_p^{-1} + \sigma^{-2} X^T X \\[1.5ex] \mu &= A^{-1} \left( \Sigma_p^{-1} m_0 + \sigma^{-2} X^T \boldsymbol{y} \right) \end{align*} \]
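
A small numerical sketch of these closed-form updates (all data and prior values below are made up for illustration):

set.seed(1)
n = 50; sigma = 1
X = cbind(1, rnorm(n))                        # design matrix: intercept + one predictor
y = X %*% c(2, -1) + rnorm(n, 0, sigma)       # simulate data with beta = (2, -1)

m0 = c(0, 0); Sigma_p = diag(10, 2)           # prior mean and covariance
A  = solve(Sigma_p) + t(X) %*% X / sigma^2    # posterior precision
mu = solve(A, solve(Sigma_p) %*% m0 + t(X) %*% y / sigma^2)  # posterior mean, close to (2, -1)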

Mathematical note

  • basis expansion maps the design matrix \(X\) into a wider matrix \(\Phi\), i.e. \(X^{n \times p} \stackrel{\phi}{\longrightarrow} \Phi^{n \times pm}\)

    \[ \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix} \stackrel{\phi}{\longrightarrow} \begin{pmatrix} \phi(x_{11}) & \phi(x_{12}) & \dots & \phi(x_{1p}) \\ \phi(x_{21}) & \phi(x_{22}) & \dots & \phi(x_{2p}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(x_{n1}) & \phi(x_{n2}) & \dots & \phi(x_{np}) \end{pmatrix} \]

Kernel trick

  • let \(z = \phi(x)\)
  • in linear regression, all our computations involving \(z\) are inner products, \(z^Tz'\)
  • thus, even when \(m \rightarrow \infty\), i.e. our feature space is infinite-dimensional,
  • the inner product still returns a single number

    \[ \begin{pmatrix} z_1, z_2, \ldots, z_m \end{pmatrix} \times \begin{pmatrix} z_1' \\ z_2' \\ \vdots \\ z_m' \end{pmatrix} = z_1\cdot z_1' + z_2 \cdot z_2' + \ldots + z_m \cdot z_m' \]

(for gory details, see Rasmussen & Williams, 2006; p.12)

Before the feature space

Kernel trick

  • let \(z^Tz' = K(x, x')\) and call it the Kernel
  • goal:
    • instead of computing all the powers in the (possibly infinite) feature space
    • we use the Kernel to do this computation implicitly

  • claim:
    • \(K(x, x') = (1 + x^T x')^d\) corresponds to an inner product \(z^T z'\),
    • where \(z\) and \(z'\) are the \(d^{\text{th}}\)-order polynomial expansions of \(x\) and \(x'\)

Kernel trick

  • claim: \(K(x, x') = (1 + x^T x')^d\) is the same as
    • transforming \(x\) and \(x'\) to \(z\) and \(z'\), respectively
    • and computing their (possibly very high dimensional) inner product

  • example: \(x = (x_1, x_2)\), \(x' = (x_1', x_2')\)
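
A quick numerical check of the claim for the two-dimensional example above, with \(d = 2\) (a sketch; the feature map is the usual quadratic expansion, scaled so the identity holds exactly):

# explicit quadratic feature map: phi(x) = (1, x1^2, x2^2, sqrt(2) x1, sqrt(2) x2, sqrt(2) x1 x2)
phi = function(x) c(1, x[1]^2, x[2]^2, sqrt(2) * x[1], sqrt(2) * x[2], sqrt(2) * x[1] * x[2])

x = c(1, 2); xp = c(3, -1)      # two arbitrary points
sum(phi(x) * phi(xp))           # inner product in feature space: 4
(1 + sum(x * xp))^2             # Kernel K(x, x') with d = 2: also 4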

Kernel trick

  • so instead of actually transforming our data, we simply use the Kernel
  • the Kernel becomes the object of primary interest
  • instead of reasoning about possible transformations (polynomial? sine?)
    • reason in terms of the Kernel, and its properties; e.g.
    • Radial Basis Function \(K(x, x') = \alpha \exp \left (-\gamma||x - x'||^2 \right)\)

A Gaussian process is an infinite dimensional multivariate Normal distribution.

key property

\[ \begin{align*} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} &\sim N \begin{bmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix} \end{bmatrix} \\[2ex] \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} &\sim N \begin{bmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \end{bmatrix} \\[2ex] X_1 &\sim N(\mu_1, \Sigma_{11}) \end{align*} \]
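
A quick empirical check of this marginalization property (a sketch; MASS and the particular \(\mu\) and \(\Sigma\) below are arbitrary):

library('MASS')

mu = c(0, 1, 2)
Sigma = matrix(c(1, .5, .2,
                 .5, 1, .3,
                 .2, .3, 1), nrow = 3)

x = mvrnorm(1e5, mu, Sigma)
colMeans(x[, 1:2])   # close to (0, 1)
cov(x[, 1:2])        # close to the top-left 2 x 2 block of Sigma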

Gaussian process

  • is a stochastic process
    • generalizes the notion of a probability distribution to functions
  • key idea: instead of restricting the set of functions, estimate them
  • concretely, specify a prior over the (infinite set) of functions
  • estimate them from the data

    \[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \]

  • function is just a very long vector!

Gaussian process

  • the probability distribution over \(f(x)\) is a Gaussian process if, for any finite set of inputs \(x_1, x_2, \ldots, x_n\), the marginal density \(p(f(x_1), f(x_2), \ldots, f(x_n))\) is Gaussian (cf., MacKay, 1998)

    \[ \begin{pmatrix} f(x_1) \\ f(x_2) \\ f(x_3) \\ \vdots \\ f(x_n) \end{pmatrix} \sim N \begin{bmatrix} m(x), \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \ldots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \ldots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \ldots & K(x_n, x_n) \end{pmatrix} \end{bmatrix} \]
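
To get a feel for this prior over functions, one can draw a few samples from such a multivariate normal, here with the RBF Kernel from before (a sketch; MASS, the input grid, and the kernel parameters are arbitrary choices):

library('MASS')

rbf = function(x, xp, alpha = 1, gamma = 1) alpha * exp(-gamma * (x - xp)^2)

x = seq(0, 10, length.out = 100)
K = outer(x, x, rbf) + diag(1e-8, length(x))   # kernel matrix plus a little jitter for stability
f = mvrnorm(5, rep(0, length(x)), K)           # 5 draws from the GP prior
matplot(x, t(f), type = 'l', lty = 1)          # each line is one sampled function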

Gaussian process

Gaussian process

Gaussian process in Stan: setup

gp = '
data {
  int N1;
  int N2;
  vector[N1] y1;
  vector[N1] x1;
  vector[N2] x2;
}

transformed data {
  int<lower=1> N;
  vector[N1 + N2] x;
  vector[N1 + N2] mu;
  N <- N1 + N2;
  for (i in 1:N) mu[i] <- 0;
  for (i in 1:N1) x[i] <- x1[i];
  for (i in 1:N2) x[N1 + i] <- x2[i];
}

...
'

Gaussian process in Stan: model

'parameters {
  vector[N2] y2;
  real<lower=0> eta_sq;
  real<lower=0> rho_sq;
  real<lower=0> sigma_sq;
}

model {
  vector[N] y;
  matrix[N, N] Sigma;
  eta_sq ~ cauchy(0,5);
  rho_sq ~ cauchy(0,5);
  sigma_sq ~ cauchy(0, 5);

  for (i in 1:N1) y[i] <- y1[i];
  for (i in 1:N2) y[N1 + i] <- y2[i];

  // squared-exponential (RBF) covariance, with observation noise on the diagonal
  for (i in 1:N) {
    for (j in 1:N) {
      Sigma[i,j] <- eta_sq * exp(-rho_sq * pow(x[i] - x[j], 2)) +
                    if_else(i == j, sigma_sq, 0.0);
    }
  }

  y ~ multi_normal(mu, Sigma);
}'

Gaussian process in Stan

library('rstan')
library('parallel')
# maybe find a better data set??

N1 = 21
N = length(y)
N2 = N - N1
stan_dat = list('y1' = y[1:N1], 'y2' = y[(N1+1):N],
                'x1' = X[1:N1], 'x2' = X[(N1+1):N],
                'N' = N, 'N1' = N1, 'N2' = N2)
compfit = stan(model_code = gp, data = stan_dat, chains = 0)

sflist = mclapply(1:4, mc.cores = 4,
           function(i) stan(fit = compfit, data = stan_dat, seed = sample.int(.Machine$integer.max, 1),
                            chains = 1, chain_id = i, refresh = -1, iter = 8000))

fit = sflist2stanfit(sflist)

Gaussian process prediction

ypred = apply(rstan::extract(fit, 'y2')[['y2']], 2, mean)
dat[(N1+1):N, 2] = ypred # beware of the high variance!
ggplot(dat, aes(x = Timebin, y = basePupil)) + theme_bw() + geom_point() +
       geom_line(colour = 'blue') + ggtitle('Pupil dilation over time')

wrap up

  • distinguish between restriction bias and preference bias
    • in 'standard' regression, we specify a constraint in terms of the basis expansion
    • in Gaussian process regression, we specify a preference via the Kernel
  • Gaussian process specifies a prior over functions
  • the Kernel encodes what type of functions we prefer
  • actually possible to compute all this in closed form (see RW, 2006, Ch. 2)

  • see here for a great podcast on Gaussian processes (episode 2)
  • see here for an amazing (and funny!) explanation of the Kernel trick
  • see here for the Stan manual (which includes Gaussian process regression)

Finite Mixtures

remember?

therapeuticTouches


  • TT practitioners claim to sense energy fields
  • they try to sense which of their hands is near another person's body

(Kruschke, chapter 9.2.4)

remember?

Iris clustering

kmeans clustering

Gaussian mixture model

library('ggmcmc')
library('mixtools')


# fit a k-component Gaussian mixture model string ms to data y; returns samples of mu and sigma
fit_mixture = function(ms, y, k, n.iter = 2000) {
  dat = list('y' = y, 'n' = length(y), 'alpha' = rep(1, k), 'k' = k)
  model = jags.model(textConnection(ms), dat, n.chains = 3, quiet = TRUE)
  update(model, n.iter / 4)
  coda.samples(model, variable.names = c('mu', 'sigma'), n.iter = n.iter)
}

Gaussian mixture model

mix = '
model {
  theta[1:k] ~ ddirch(alpha[]) # mixing distribution

  # class conditional Gaussians
  for (i in 1:k) {
    mu[i] ~ dnorm(0, .0001)
    sigma[i] ~ dunif(0, 100)
    tau[i] = pow(sigma[i], -2)
  }

  # likelihood
  for (i in 1:n) {
    c[i] ~ dcat(theta[1:k])
    y[i] ~ dnorm(mu[c[i]], tau[c[i]])
  }
}'

Gaussian mixture

ss = ggs(fit_mixture(mix, faithful$waiting, 2))
ggs_density(ss)
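
mixtools is loaded above but not used in this snippet; as a quick sanity check (a sketch, not part of the analysis above), the posterior means can be compared against a classical EM fit:

m = normalmixEM(faithful$waiting, k = 2)   # EM fit of a two-component Gaussian mixture
m$mu                                       # component means, roughly 54 and 80
m$sigma                                    # component standard deviations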

a note on meta-analysis

  • science is a cumulative endeavour
  • no single experiment will be decisive (except if your name is Stroop)
  • thus, we want to aggregate all the experiments done in a certain area
  • in order to uncover the truth, because every experiment is noisy

  • the enemy of all science: publication bias
    • in short: crap in -> meta-analysis -> crap out
    • there is no substitute for registered replications (van Elk et al., 2015)

a note on meta-analysis

  • instead of trying to uncover the truth, we can ask whether a research area has evidential value
  • similar to the distinction between model comparison and parameter estimation
    • before we estimate an effect, let's see if we can trust the research
    • the most popular method to do this is \(p\) curve
    • we will look at a recently proposed Bayesian \(p\) curve

\(p\) values drawn from a mixture

\[ p(p_i) = \alpha f_{H_0} + (1 - \alpha) f_{H_1} \]

distribution of \(p\) values under \(H_0\)

x = replicate(10000, rnorm(100))
ps = apply(x, 2, function(dat) t.test(dat)[['p.value']])
hist(ps)

distribution of \(p\) values under \(H_1\)

  • complex function of sample size, true effect, etc.
    • any set of \(p\) values we analyze will be heterogeneous
    • and thus their distribution will be a mixture (see the simulation after this list)

  • don't want to use a restrictive parametric model here
    • the resulting model mis-specification might tamper with our inference
    • instead, use a Dirichlet process mixture!
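
A small simulation of this heterogeneity (a sketch; the range of true effect sizes is made up): each study gets its own effect size, so the resulting \(p\) values are a mixture of differently right-skewed distributions.

ps = replicate(10000, {
  d = runif(1, 0.1, 0.8)                     # study-specific true effect size
  t.test(rnorm(100, mean = d))[['p.value']]
})
hist(ps)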

pcurve demo

Chinese restaurant process

Chinese restaurant process

\[ p(c_n = k|c_{1:n-1}) = \begin{cases} \frac{m_k}{n - 1 + \alpha} & \mbox{if $k \le K_{+}$} \\ \frac{\alpha}{n - 1 + \alpha} & \mbox{otherwise} \end{cases} \]


  • where \(m_k\) is the number of people sitting at table \(k\)
  • \(K_{+}\) is the number of tables for which \(m_k > 0\)
  • \(\alpha\) is the concentration parameter; larger values imply more clusters
  • cool thing: this sequence is exchangeable! (for more, see Gershman & Blei, 2012)
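
A minimal simulation of this seating process (a sketch; \(\alpha\) and the number of customers are arbitrary):

crp = function(n, alpha) {
  tables = integer(0)                              # m_k: number of customers at each table
  for (i in 1:n) {
    probs = c(tables, alpha) / (i - 1 + alpha)     # existing tables vs. opening a new one
    k = sample(length(probs), 1, prob = probs)
    if (k > length(tables)) tables = c(tables, 1) else tables[k] = tables[k] + 1
  }
  tables
}

set.seed(1)
crp(100, alpha = 1)    # cluster sizes for 100 customers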

Gronau et al.'s model

\[ \begin{align*} G &\sim DP(\cdot|G_0, \alpha) \\[1.5ex] \theta_i &\sim G(\cdot) \\[1.5ex] \Phi^{-1}(p_i^{H_1}) &\sim N(\cdot|\theta_i)_{T(, -1.64485)} \end{align*} \]

stick breaking

stick breaking

  • sampling from a Dirichlet process prior can be decomposed into two parts (see the sketch after this list)
    • breaking the stick: draw proportions \(q_j \sim \mathcal{Beta}(1, \alpha)\) and set the weights \(\pi_j = q_j \prod_{k=1}^{j-1} (1 - q_k)\)
    • drawing the associated locations \(\theta_j \sim G_0\)
  • the resulting probability measure \(\sum_j \pi_j \delta_{\theta_j}\) is one draw from the Dirichlet process prior
  • for more, see the short and spicy paper by Gronau et al. (submitted)
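
A (truncated) stick-breaking draw from a DP prior, as a sketch in R; the base distribution \(G_0\) is taken to be a standard normal and the truncation level is arbitrary (an illustration, not Gronau et al.'s code):

stick_breaking = function(alpha, nmax = 50) {
  q = rbeta(nmax, 1, alpha)                  # stick-breaking proportions q_j
  w = q * cumprod(c(1, 1 - q[-nmax]))        # weights: pi_j = q_j * prod_{k<j} (1 - q_k)
  theta = rnorm(nmax)                        # locations drawn from G_0 = N(0, 1)
  data.frame(weight = w, location = theta)
}

set.seed(1)
G = stick_breaking(alpha = 1)
plot(G$location, G$weight, type = 'h')       # one (truncated) draw from the DP prior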

Likelihood

bayesmix = '
model{
  alpha ~ dunif(.5,7) # hyperprior
  phi ~ dbeta(1,1) # mixing weight H0 & H1

  for (i in 1:n) {
    ind[i] ~ dbern(phi) # 1 = H0 component, 0 = one of the H1 clusters
    z[i] ~ dcat(prop_stick[]) # H1 cluster assignment (stick-breaking weights)
    z1[i] = z[i] + 1
    z2[i] = ifelse(ind[i] == 1, 1, z1[i])
    qp[i] ~ dnorm(mu[z2[i]], lambda[z2[i]])T(,-1.64485) # probit-transformed p value
  }

  ...
}'

Stick breaking

'
  # stick-breaking construction of the mixture weights
  for (j in 1:(nmaxclust - 1)) {
    prop_remaining[j] ~ dbeta(1, alpha)
  }
  
  prop_stick[1] = prop_remaining[1]
  
  for (j in 2:(nmaxclust - 1)) {
    prop_stick[j] = prop_remaining[j] * prod(1 - prop_remaining[1:(j-1)])
  }
  
  prop_stick[nmaxclust] = 1 - sum(prop_stick[1:(nmaxclust - 1)])
  
  for (j in 1:nmaxclust) {
    mu_h1[j] ~ dnorm(0, 1)T(-6,0)
    lambda_h1[j] ~ dgamma(0.001, 0.001)T(,1)
  }
  
  mu[1] = 0
  lambda[1] = 1
  mu[2:(nmaxclust + 1)] = mu_h1
  lambda[2:(nmaxclust + 1)] = lambda_h1
}'

wrap up

  • science is a cumulative endeavour
  • publication bias is the enemy of all science (and the misuse of \(p\) values)
  • model the distribution of \(p\) values to get a sense of the evidential value
    • use a Dirichlet process mixture to allow for a flexible model
    • be Bayesian in order to get what you really want

final wrap up

  • Bayesian nonparametrics are a very flexible class of models
  • if you are still unsure about projects, you can look at their applications
    • e.g., in category learning, intelligence research, topic modeling

  • next week: your turn!
  • prepare a short presentation about
    • the current state of your project
    • and what you plan to do before submission (2nd March)
    • send me your slides / code by email by Tuesday, 23:59