outline

  • recap of last time
    • likelihood principle
    • some notes on Bayes factor

  • linear regression
    • motivating examples
    • least squares estimation

outline

  • Bayesian linear regression
    • estimation using JAGS
    • regularization
    • Bayes factor

  • some cool R stuff
    • RShiny, RStudio, RMarkdown
    • magrittr, dplyr

the likelihood principle

  • all information in statistical inference is incorporated via the likelihood
  • classical statistics violates this principle
    • as a consequence, the sampling plan (how the data were collected) matters
    • and optional stopping is not allowed

likelihood principle

  • say I ask you single choice questions about Bayesian statistics
  • since it's early in the course, you only get 17 out of 24 correct
  • were you better than chance?

  • as always, we use a binomial model \[ f(k|\theta, n) = {n \choose k} \theta^k (1 - \theta)^{n - k} \]

likelihood principle

  • assume you got the last question correct
  • in carrying out the experiment, I could have had two distinct intentions:
    • ask you exactly 24 questions, and see how many you get right
    • ask you questions until you get 17 right

  • the second intention gives rise to a negative binomial model \[ f(n|\theta, k) = {n-1 \choose k-1} \theta^k (1 - \theta)^{n - k} \]
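
  • both intentions carry the same information about \(\theta\): the two likelihoods differ only by a constant factor, as this small check of my own shows (R's dnbinom counts the failures before the \(k\)-th success)

theta <- seq(0.1, 0.9, by = 0.1)

lik.bin  <- dbinom(17, size = 24, prob = theta)       # fixed n = 24, count the successes
lik.nbin <- dnbinom(24 - 17, size = 17, prob = theta) # fixed k = 17, count the 7 failures

round(lik.bin / lik.nbin, 4) # constant (24/17) across theta: the likelihoods are proportional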

likelihood principle

  • Bayesians obey the likelihood principle \[ \begin{align*} P_{bin}(\theta|\textbf{y}) &= \frac{p(\textbf{y}|\theta)p(\theta)}{p(\textbf{y})} \\[2ex] &= \frac{{n \choose k} \theta^{k}(1-\theta)^{n-k}p(\theta)}{\int_0^1 {n \choose k} \theta^{k}(1-\theta)^{n-k}p(\theta)\mathrm{d}\theta} \\[2ex] &= \frac{{n \choose k} \theta^{k}(1-\theta)^{n-k}p(\theta)}{ {n \choose k } \int_0^1\theta^{k}(1-\theta)^{n-k}p(\theta)\mathrm{d}\theta} \\[2ex] &= \frac{\theta^{k}(1-\theta)^{n-k}p(\theta)}{\int_0^1\theta^{k}(1-\theta)^{n-k}p(\theta)\mathrm{d}\theta} \end{align*} \]

  • the binomial coefficient cancels; the negative binomial's \({n-1 \choose k-1}\) cancels in exactly the same way, so both intentions yield the same posterior

Bayes factor

  • is a generic model comparison method
    • relative measure of evidence
    • does not require nested models (unlike the \(p\) value)
    • instantiates an automatic Ockham's razor

  • goes beyond just counting parameters (like AIC and BIC)
  • incorporates the functional form of the model

readings

quick note

  • I want this session to be more interactive
    • if you don't understand some R code, SCREAM

  • the primary literature for this session is:
    • Bishop, C. M. (2006). Pattern Recognition and Machine Learning (ch 1)
    • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (ch 2+3)
    • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R (ch 2+3)
    • Cohen, J. (1968). Multiple regression as a general data-analytic system

three goals

  • get a feeling for how and why linear regression works

  • preview computational tools for Bayesian regression
    • see the beauty of this framework with respect to regularization

  • show some cool new developments in the R world

jumping in

primer

adv <- read.csv('data/Advertising.csv')[, -1] # drop the row-index column
head(adv)
##      TV Radio Newspaper Sales
## 1 230.1  37.8      69.2  22.1
## 2  44.5  39.3      45.1  10.4
## 3  17.2  45.9      69.3   9.3
## 4 151.5  41.3      58.5  18.5
## 5 180.8  10.8      58.4  12.9
## 6   8.7  48.9      75.0   7.2

primer

with(adv, plot(TV, Sales, col = 'red'))
abline(lm(Sales ~ TV, data = adv), col = 'blue', lwd = 2)

primer: linear regression

m <- lm(Sales ~ TV, data = adv)
summary(m)
## 
## Call:
## lm(formula = Sales ~ TV, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

primer: multiple regression

m <- lm(Sales ~ TV + Radio, data = adv)
summary(m)
## 
## Call:
## lm(formula = Sales ~ TV + Radio, data = adv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## Radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

primer: t-test

set.seed(1774)

n <- 50
hat <- rnorm(n, mean = 50, sd = 5)
nohat <- rnorm(n, mean = 50, sd = 5)
dat <- data.frame(creativity = c(nohat, hat), hat = rep(0:1, each = n))

head(dat)
##   creativity hat
## 1   45.42617   0
## 2   56.49689   0
## 3   51.56313   0
## 4   48.18651   0
## 5   46.22818   0
## 6   49.85087   0

primer: t-test

m <- lm(creativity ~ hat, data = dat)
summary(m)
## 
## Call:
## lm(formula = creativity ~ hat, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.1598  -2.7061  -0.1059   2.7183   9.8746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.2720     0.6617  74.467   <2e-16 ***
## hat           1.3873     0.9357   1.483    0.141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.679 on 98 degrees of freedom
## Multiple R-squared:  0.02194,    Adjusted R-squared:  0.01196 
## F-statistic: 2.198 on 1 and 98 DF,  p-value: 0.1414

general linear model

general linear model

  • denote \(y\) as our {dependent, output} variable and \(X\) as our {independent, predictor} variable, where

\[ y = f(X) + \epsilon \]

  • with \(\epsilon \sim {\mathcal{N}}(0, \sigma^2)\)
  • and \(f(X)\) being a linear function

simulate from linear model

set.seed(1774) # make example reproducible

n <- 100
x <- rnorm(n, mean = 50, sd = 10)
error <- rnorm(n, mean = 0, sd = 20)

f <- function(x, b0 = 10, b1 = 3) b0 + b1*x

y <- f(x) + error
head(cbind(y, x), 10)
##               y        x
##  [1,] 165.36433 57.97829
##  [2,] 120.13588 47.69310
##  [3,] 188.86584 55.68723
##  [4,] 134.62419 41.57817
##  [5,] 185.08043 50.68255
##  [6,] 134.88544 49.36588
##  [7,] 231.22224 63.81083
##  [8,] 143.19750 54.03224
##  [9,] 121.70756 37.80916
## [10,]  93.51492 39.35512

simulate from linear model

estimation

\[ f(X) = X\beta \]

  • which yields \[ y = X\beta + \epsilon \]

  • need to estimate the coefficient vector \(\hat \beta\) and \(\hat \sigma^2\) such that

\[ \hat \beta = \underset{\beta} {\text{argmin}} \, \sum_{i=1}^n (y_i - \hat y_i)^2 \]

  • why? what just happened?
  • why not just solve the equations as in high school?

least squares: start

least squares: end

least squares

\[ \begin{align*} 1b_1 + 1b_2 &= 1 \\ 1b_1 + 2b_2 &= 2 \\ 1b_1 + 3b_2 &= 2 \end{align*} \]

  • or more concisely \[ \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix} \]

  • this system is overdetermined, so there is no exact solution
  • we therefore look for the best approximate solution

best solution (squared loss)

\[ \begin{align*} RSS &= \sum_{i=1}^{n} (y_i - \hat y_i)^2 \\[2ex] &= (y - X\hat \beta)^2 \\[2ex] &= (y - X\hat \beta)^T(y - X\hat \beta) \\[2ex] &= ||y - X\hat \beta||^2 \\[2ex] &= ||\mathbf{e}||^2 \end{align*} \]
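
  • to make the \(\text{argmin}\) concrete, here is a small sketch of my own: minimizing the RSS numerically with optim() recovers the least squares solution (reusing the simulated x and y from the earlier slide)

rss <- function(beta, X, y) sum((y - X %*% beta)^2) # residual sum of squares

Xmat <- cbind(1, x)                                 # design matrix with an intercept column
fit <- optim(c(0, 0), rss, X = Xmat, y = y, method = 'BFGS')

fit$par         # numerical minimizer of the RSS
coef(lm(y ~ x)) # closed-form least squares: (essentially) the same values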

best solution (squared loss)

projection: simple case

projection: simple case

\[ \begin{align*} x^Te &= 0 \\[1ex] x^T(y - x \hat b) &= 0 \\[1ex] x^Ty &= x^Tx \hat b \\[1ex] \hat b &= \frac{x^Ty}{x^Tx} \end{align*} \]
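
  • as a quick sanity check (my own example, not from the slides): the projection formula reproduces a regression through the origin

set.seed(2015)
x0 <- rnorm(10)
y0 <- 2 * x0 + rnorm(10)

sum(x0 * y0) / sum(x0 * x0) # b-hat = x'y / x'x
coef(lm(y0 ~ 0 + x0))       # regression through the origin gives the same value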

projection: extension

projection: extension

\[ \begin{align*} X^Te &= 0 \\[1ex] X^T(y - X \hat b) &= 0 \\[1ex] X^T X \hat b &= X^Ty \\[1ex] \hat b &= (X^T X)^{-1} X^Ty \end{align*} \]

least squares: recap

  • can't solve

\[ X\beta = y \]

  • because \(y\) is not in the column space of \(X\)
  • project \(y\) onto the column space, resulting in \(X\hat \beta\), or \(\hat y\)

  • using least squares, solve for \(\hat \beta\)

\[ \begin{align*} X^T X \hat \beta &= X^T y \\[2ex] \hat \beta &= (X^TX)^{-1} X^T y \end{align*} \]

least squares

\[ \begin{align*} 1b_1 + 1b_2 &= 1 \\ 1b_1 + 2b_2 &= 2 \\ 1b_1 + 3b_2 &= 2 \end{align*} \]

  • or more concisely \[ \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix} \]

least squares

xx <- c(1, 2, 3); yy <- c(1, 2, 2) # the system from the previous slide
X <- unname(cbind(1, xx))

estimate <- function(X, y) {
  solve(t(X) %*% X) %*% (t(X) %*% y)
}

estimate(X, yy)
##           [,1]
## [1,] 0.6666667
## [2,] 0.5000000
coef(lm(yy ~ xx))
## (Intercept)          xx 
##   0.6666667   0.5000000

least squares: plugging in \(\hat \beta\)

\[ \begin{align*} 2/3 + 1/2 &\approx 1 \\ 2/3 + 2 \times 1/2 &\approx 2 \\ 2/3 + 3 \times 1/2 &\approx 2 \end{align*} \]

\[ \begin{align*} \text{-}1/6 &= \epsilon_1 \\ 1/3 &= \epsilon_2 \\ \text{-}1/6 & = \epsilon_3 \end{align*} \]
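
  • with xx and yy as defined above, lm() returns exactly these residuals

round(resid(lm(yy ~ xx)), 4) # -0.1667, 0.3333, -0.1667, i.e. -1/6, 1/3, -1/6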

least squares: one more dimension

least squares: nothing changes

\[ \begin{align*} 1b_1 + 1b_2 + 0b_3 &= 1 \\ 1b_1 + 2b_2 + 2b_3 &= 2 \\ 1b_1 + 1b_2 + 0b_3 &= 2 \\ 1b_1 + 2b_2 - 2b_3 &= 3 \end{align*} \]

  • or more concisely

\[ \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 2 \\ 1 & 1 & 0 \\ 1 & 2 & \text{-} 2 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 2 \\ 3 \end{pmatrix} \]

least squares: nothing changes

\[ \begin{align*} X \hat \beta &= y \\[2ex] X^T X \hat \beta &= X^T y \\[2ex] \hat \beta &= (X^TX)^{-1} X^T y \end{align*} \]

least squares: nothing changes

x1 <- c(1, 2, 1, 2); x2 <- c(0, 2, 0, -2); yy2 <- c(1, 2, 2, 3) # the system above
X <- unname(cbind(1, x1, x2))
estimate(X, yy2)
##       [,1]
## [1,]  0.50
## [2,]  1.00
## [3,] -0.25
coef(lm(yy2 ~ x1 + x2))
## (Intercept)          x1          x2 
##        0.50        1.00       -0.25

least squares: notes

  • is a completely generic estimator that requires no distributional assumptions
  • is also used to assess how well cognitive models fit data

  • to derive the sampling properties of the estimates \(\hat \beta\), however, we need some assumptions

general linear model: assumptions

  • recall \[ y = X\beta + \epsilon \]

  • \(\epsilon \sim \mathcal{N}(0, \sigma^2)\)
  • \(y_i\) are uncorrelated and have constant variance (homoscedasticity)
  • under these assumptions, the least squares estimator corresponds to the maximum likelihood estimator

more estimates

  • \(\hat \beta = (X^T X)^{-1} X^T y\)
  • \(\mathrm{Var}(\hat \beta) = (X^T X)^{-1}\sigma^2\)
  • \(\hat \sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i - \hat y_i)^2\)

  • the coefficients are multivariate normals: \[ \hat \beta \sim \mathcal{N}(\beta, (X^T X)^{-1}\sigma^2) \]

significance testing

  • use a t-test to check if coefficients are significantly different from 0

\[ t_j = \frac{\hat \beta_j - 0}{\sqrt{\mathrm{Var}(\hat \beta)_{jj}}} \]

  • use an F-test for the whole model, comparing against the intercept-only model \[ F = \frac{(RSS_0 - RSS_1) / (p_1 - p_0)}{\hat \sigma_1^2} \]
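
  • in R, anova() carries out exactly this comparison for nested linear models; for example, with the advertising data from the primer (my own snippet)

m0 <- lm(Sales ~ 1, data = adv)          # intercept-only model, gives RSS_0
m1 <- lm(Sales ~ TV + Radio, data = adv) # full model, gives RSS_1
anova(m0, m1)                            # F-test for the whole model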

wrapping up

estimate <- function(y, X) {
  p <- ncol(X)
  n <- length(y)
  df <- n - p # p = ncol(X) already counts the intercept column
  inv <- solve(t(X) %*% X)
  
  beta <- inv %*% (t(X) %*% y)
  err <- y - X %*% beta
  sigma2 <- 1/df * sum(err^2)
  
  covmat <- inv * as.numeric(sigma2) # Var(beta-hat) = (X'X)^-1 * sigma^2
  varbeta <- diag(covmat)
  
  # two-sided t-tests of each coefficient against zero
  tstats <- beta / sqrt(varbeta)
  pvals <- 2 * pt(-abs(tstats), df = df)
  
  # wrap it up
  res <- cbind(beta, tstats, pvals)
  colnames(res) <- c('beta', 'tstat', 'pval')
  res
}
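
  • a quick sanity check of this function against lm() (my own example; the design matrix needs an explicit intercept column)

X <- cbind(1, as.matrix(adv[, c('TV', 'Radio')]))

round(estimate(adv$Sales, X), 5)                  # beta, tstat, pval
coef(summary(lm(Sales ~ TV + Radio, data = adv))) # estimates and t values should agree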

general linear model: violations

  • non-i.i.d. \(y\)
    • e.g. observations nested within participants
  • heteroscedasticity
    • e.g. a Poisson response
  • collinearity
    • linear dependence among predictors
  • outliers
    • bias the estimates

Bayesian estimation

primer: graph theory

  • [on board]

Bayesian linear regression

glm.bayes <- function(y, X, n.iter = 10000) {
    
    ms <- '
    model {
        sigma ~ dunif(0, 100)
        tau <- pow(sigma, -2)   # JAGS parameterizes dnorm by precision, not variance

        for (i in 1:p) {
            b[i] ~ dnorm(0, .001)   # vague normal prior on each coefficient
        }
        mu <- X %*% b

        for (i in 1:n) {
            y[i] ~ dnorm(mu[i], tau)
        }
    }'
    params <- c('b', 'sigma')
    
    jags.dat <- list('n' = length(y), 'y' = y, 'X' = X, 'p' = ncol(X))
    model <- jags.model(textConnection(ms), data = jags.dat, n.chains = 3, quiet = TRUE)
    
    samples <- coda.samples(model, variable.names = params,
                            n.iter = n.iter, progress.bar = 'none')
    samples
}

MCMC samples

library('rjags')

X <- cbind(1, adv[, 1:2]) # TV, Radio
samples <- glm.bayes(adv$Sales, X)

head(samples)[[1]] # look at first chain
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1001 
## End = 1007 
## Thinning interval = 1 
##          b[1]       b[2]      b[3]    sigma
## [1,] 3.245187 0.04469467 0.1798017 1.567023
## [2,] 3.084721 0.04468296 0.1833741 1.619081
## [3,] 2.762136 0.04673752 0.1818689 1.750262
## [4,] 2.974022 0.04465194 0.1907283 1.681974
## [5,] 2.788118 0.04436529 0.1966705 1.715625
## [6,] 2.698615 0.04629846 0.1952554 1.694359
## [7,] 2.713838 0.04561158 0.2020620 1.700925

summary MCMC

summary(samples)
## 
## Iterations = 1001:11000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean       SD  Naive SE Time-series SE
## b[1]  2.91883 0.302523 1.747e-03      5.382e-03
## b[2]  0.04577 0.001427 8.236e-06      2.117e-05
## b[3]  0.18799 0.008219 4.745e-05      1.179e-04
## sigma 1.69201 0.085606 4.942e-04      6.426e-04
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%     50%     75%   97.5%
## b[1]  2.32780 2.71275 2.91880 3.12309 3.50848
## b[2]  0.04297 0.04482 0.04577 0.04673 0.04856
## b[3]  0.17171 0.18250 0.18798 0.19349 0.20417
## sigma 1.53555 1.63254 1.68827 1.74783 1.86926

convergence

library('coda') # for convergence checks

gelman.diag(samples)
## Potential scale reduction factors:
## 
##       Point est. Upper C.I.
## b[1]           1          1
## b[2]           1          1
## b[3]           1          1
## sigma          1          1
## 
## Multivariate psrf
## 
## 1

convergence

library('ggmcmc')

ss <- ggs(samples)
ggs_autocorrelation(ss)

posterior distributions

wait a minute …

regularization

maximum likelihood overfitting

  • assume 3 coin tosses, all three land heads
  • what's the prediction for head on the next flip?

  • assume 3 coin tosses, all three land tails
  • what's the prediction for head on the next flip?
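
  • the maximum likelihood estimate \(\hat \theta = k/n\) says 1 in the first case and 0 in the second, which is wildly overconfident after three tosses; even a uniform prior tempers the prediction (a small sketch of my own)

k <- 3; n <- 3    # three tosses, all heads

k / n             # MLE: predicts heads on the next flip with probability 1
(k + 1) / (n + 2) # posterior mean under a uniform Beta(1, 1) prior: 0.8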

variable selection

  • two reasons why we might not be happy with MLE
    • prediction accuracy: MLE has low bias but large variance
    • interpretability: want to find the best predictors

  • possible solutions:
    • backward selection
    • forward selection
    • combination of the two

  • all of these are rather ad hoc
  • better: penalize the loss function

penalized loss function

  • instead of simply using the residual sum of squares, we add some penalization: \[ \hat \beta = \underset{\beta} {\text{argmin}} \, \left( \sum_{i=1}^n (y_i - \hat y_i)^2 + \lambda \sum_{j=1}^p |\beta_j|^q \right) \]

  • q = 1 results in the lasso (\(L_1\) norm)
  • q = 2 results in ridge regression (\(L_2\) norm)

  • note: the technical details are not important for this class
  • important: regularization is built into the Bayesian framework
    • MAP / mean estimate with normal prior gives ridge regression
    • MAP estimate with Laplace prior gives the lasso
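
  • to connect the penalized loss to something computable: ridge regression has the closed form \(\hat \beta = (X^TX + \lambda I)^{-1}X^Ty\); below is a minimal sketch of my own, with an arbitrary \(\lambda\), standardized predictors, and a centered outcome so the intercept needs no penalty

ridge <- function(y, X, lambda) {
  # closed-form ridge estimate: (X'X + lambda * I)^-1 X'y
  solve(t(X) %*% X + lambda * diag(ncol(X))) %*% (t(X) %*% y)
}

Xs <- scale(as.matrix(adv[, c('TV', 'Radio', 'Newspaper')])) # standardized predictors
ys <- adv$Sales - mean(adv$Sales)                            # centered outcome

round(cbind(ridge(ys, Xs, 0), ridge(ys, Xs, 50)), 3) # lambda = 0 (= OLS) vs lambda = 50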

Bayesian lasso

lasso.bayes <- function(y, X, n.iter = 10000) {
    ms <- '
    model {
        sigma ~ dunif(0, 100)
        tau <- pow(sigma, -2)

        lambda ~ dunif(0, 10)   # hyperprior on the lasso penalty
        b[1] ~ dnorm(0, .001)   # intercept: vague normal prior, not penalized

        for (i in 2:p) {
            b[i] ~ ddexp(0, lambda)   # Laplace (double-exponential) prior on the slopes
        }
        mu <- X %*% b

        for (i in 1:n) {
            y[i] ~ dnorm(mu[i], tau)
        }
    }' # for the paper, see Park & Casella (2008)
    params <- c('b', 'sigma', 'lambda')
    
    jags.dat <- list('n' = length(y), 'y' = y, 'X' = X, 'p' = ncol(X))
    model <- jags.model(textConnection(ms), data = jags.dat, quiet = TRUE)
    
    samples <- coda.samples(model, variable.names = params,
                            n.iter = n.iter, progress.bar = 'none')
    samples
}
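
  • one possible way to run it on the advertising data (my own call; the predictors are standardized so that a single \(\lambda\) penalizes all slopes on a comparable scale)

Xl <- cbind(1, scale(as.matrix(adv[, c('TV', 'Radio', 'Newspaper')])))
lasso.samples <- lasso.bayes(adv$Sales, Xl)

summary(lasso.samples) # posteriors for the coefficients, sigma, and the penalty lambda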

the wonderful Bayesian world

  • everything is structured
  • instead of fiddling around with different estimation procedures
  • we just write it down and be done with it

  • the lasso shows how cool this is
    • we didn't change our framework, and it just worked
    • estimating \(\lambda\) using cross-validation is suboptimal
      • with a hyperprior, we estimate it along with everything else
    • standard errors for the lasso are hard to compute classically
      • here they simply fall out of the posterior

Bayes factor

Bayes factor

library('BayesFactor')

bf <- regressionBF(Sales ~ TV*Radio*Newspaper, data = adv, progress = FALSE)
bf
## Bayes factor analysis
## --------------
## [1] TV                     : 9.455818e+38 ±0%
## [2] Radio                  : 9.019635e+15 ±0%
## [3] Newspaper              : 22.92162     ±0%
## [4] TV + Radio             : 1.391438e+94 ±0%
## [5] TV + Newspaper         : 4.806461e+41 ±0%
## [6] Radio + Newspaper      : 1.131977e+15 ±0.01%
## [7] TV + Radio + Newspaper : 5.437431e+92 ±0%
## 
## Against denominator:
##   Intercept only 
## ---
## Bayes factor type: BFlinearModel, JZS

Bayes factor

post <- posterior(bf, index = 4, iterations = 10000, progress = FALSE)

head(post)
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1 
## End = 7 
## Thinning interval = 1 
##              TV     Radio     sig2         g
## [1,] 0.04885797 0.3061050 5.103436  1.298927
## [2,] 0.04563330 0.1842089 2.478365 10.798266
## [3,] 0.04478827 0.1928241 2.803676  6.213414
## [4,] 0.04558144 0.1813746 2.704055  5.332902
## [5,] 0.04567917 0.1926027 2.933725  3.606492
## [6,] 0.04645037 0.1959740 2.273238  1.335846
## [7,] 0.04468148 0.1888816 2.370580  3.017834

Bayes factor

plot(post[, 1:2])

cool R stuff

RShiny

RMarkdown

RStudio

magrittr %>%

magrittr %>%

library('dplyr')
library('ggplot2')
library('magrittr')
library('babynames')

babynames
## Source: local data frame [1,792,091 x 5]
## 
##    year sex      name    n       prop
## 1  1880   F      Mary 7065 0.07238359
## 2  1880   F      Anna 2604 0.02667896
## 3  1880   F      Emma 2003 0.02052149
## 4  1880   F Elizabeth 1939 0.01986579
## 5  1880   F    Minnie 1746 0.01788843
## 6  1880   F  Margaret 1578 0.01616720
## 7  1880   F       Ida 1472 0.01508119
## 8  1880   F     Alice 1414 0.01448696
## 9  1880   F    Bertha 1320 0.01352390
## 10 1880   F     Sarah 1288 0.01319605
## ..  ... ...       ...  ...        ...

magrittr %>%

babynames %>%
  filter(name %>% substr(1, 3) %>% equals("Mar")) %>%
  group_by(year, sex) %>%
  summarize(total = sum(n)) %>%
  qplot(year, total, color = sex, data = ., geom = "line") %>%
  add(ggtitle('Names starting with "Mar"'))

magrittr %>%

possible projects

  • more mathematical / simulation
    • write a paper on the Bayesian bootstrap
    • write a paper on error probabilities of the Bayes factor
    • write a paper on regularization from a Bayesian perspective
    • summarize the literature on Markov chain Monte Carlo techniques
    • summarize the literature on finding objective prior distributions
    • summarize the literature on ways to compute the Bayes factor

  • more foundational / philosophical
    • write a paper on the likelihood principle
    • write a paper summarizing the difference between frequentism and Bayes
    • write a paper on Popper and null hypothesis significance testing

possible projects

  • more programming
    • implement an R package to do Bayesian task X
    • implement classical tests in a Bayesian way (R package)
    • write an RShiny app doing some shiny stuff

  • more analyzing
    • meta-analyze a certain field of research using a Bayesian mixture model
    • re-analyze studies published in the Journal of Articles in Support of the Null Hypothesis