Bayesian data analysis: Homework set 2

This set is due 11/18 before class. The submission should be in Rmarkdown, and should go to Fabian with subject BDA: Homework 2 (submit both .Rmd and .html / .pdf). We encourage discussing exercises with fellow students, but only individual solutions will be accepted.

1. Why most published research findings are false (6)

Read the blog post on p-value logic.

Explain in your own words what “power” is in this context.

Statistical power is the probability of rejecting the null hypothesis given that it really is false. By probability we mean here a limiting relative frequency (i.e. repeating the experiment an infinite amount of time). Power depends on the size of the effect in question, the nominal \(\alpha\) value, and the sample size.

Write a function, ppv, that gives the posterior predictive value, and a function, fdr, that gives the false discovery rate. Both functions take as arguments the prior probability of \(H_1\), the nominal \(\alpha\) rate, and the statistical power \(1 - \beta\).

ppv <- function(prior, alpha, power) {
  power * prior / (power * prior + alpha * (1 - prior))
}

fdr <- function(prior, alpha, power) {
  1 - ppv(prior, alpha, power)
}

Plot the posterior predictive value for interestingly different values of \(P(H_1)\) and \(\beta\). What do you observe? How much do your results depend on \(P(H_1)\) and \(\beta\)? Why is most published research false? As hint, the posterior predictive value and fdr are intimitately related (see blog post) and the former is given by Bayes’ rule:

\[ P(H_1|p < \alpha) = \frac{P(p < \alpha|H_1)P(H_1)}{P(p < \alpha)} \]

prior <- c(.1, .3, .5, .7, .9)
power <- seq(.01, .99, length.out = 100)
ppvs <- do.call('cbind', lapply(prior, function(p) ppv(p, .05, power)))

matplot(ppvs, type = 'l', lty = 1, lwd = 2, main = 'PPV as a function of power and prior',
        ylab = 'posterior predictive value', xlab = 'power', col = 8:12)
abline(v = 21, col = 'red', lwd = 1.5, lty = 2)
legend('bottomright', inset = .02, legend = prior, col = 8:12,
       lty = 1, lwd = 2, title = 'prior')

We observe that the prior plausibility of \(H_1\) plays a crucial role. This, combined with low power, results in ’’most published research findings being false``. Especially in the neurosciences, where average power is about 21% (Button et al., 2013) – indicated by the dotted red line – \(p < \alpha\) does not lead to high credibility. This is especially true for social cognitive neuroscience, where researchers look at the neuronal correlates of, say, the number of facebook friends (Kanai et al., 2012; see also Boekel, et al. 2015). Apriori, any specific hypothesis in this field has a rather low credibility.

2. MLE vs. Bayes (4)

Given three coin flips, all coming up heads, what is the probability that the next coin flip will show heads? Give the prediction via maximum likelihood and – assuming an uninformative prior – Bayesian inference (use the mean, median, and mode of the posterior). Discuss your findings.

Note: Make every step explicit. How do you derive the MLE? How do you derive the posterior? Etc.

MLE

Since the log is a monotonically increasing function it does not change the maximum, but makes for easier computation.

\[ \begin{align*} f(y|\theta) &\propto \theta^y + (1 - \theta)^{n - y} \\ \ell(y|\theta) &= y \log(\theta) + (n - y) \log(1 - \theta) \\ \end{align*} \]

We set the first derivative to zero to get the maximum (we don’t have to check by using the second derivative because the likelihood surface is convex).

\[ \begin{align*} \frac{d}{d\theta}\ell(y|\theta) &= \frac{y}{\theta} - \frac{n - y}{1 - \theta} \stackrel{!}{=} 0 \\ 0 &= \frac{y}{\theta} - \frac{n - y}{1 - \theta} \\ n\theta - y\theta &= y - y\theta \\ \theta &= \frac{y}{n} = 3/3 = 1 \end{align*} \]

We plugin our value \(\theta = 1\) to get our prediction:

\[ f(y' = 1|\theta = 1, y = 3, n = 3) = {3 \choose 3}1^3 + (1 - \theta)^0 = 1 \]

Bayes

From conjugacy we get the updating rule for the posterior:

\[ p(\theta|y, n, a, b) \sim \mathcal{Beta}(a + y, b + n - y) \]

Plugging in our values gives

\[ p(\theta|k = 3, n = 3, a = 1, b = 1) \sim \mathcal{Beta}(4, 1) \]

1) maximum a posteriori (MAP, mode)

\[ \begin{align*} \text{MAP} = \underset{\theta} {\text{argmax}} \, \mathcal{Beta}(4, 1) = \frac{a - 1}{a + b - 2} = 1 \end{align*} \]

2) Median

\[ \begin{align*} \int_{0}^{1/2} \mathcal{Beta}(4, 1) \mathrm{d}\theta = 1 / 2 \end{align*} \]

Because I don’t know the formula, and I am too lazy to look it up, I use monte carlo methods:

median(rbeta(100000, 4, 1))

## [1] 0.8406909

3) Mean

\[ \mathbb{E}[p(\theta|y, n)] = \int_{0}^1 \theta \cdot p(\theta|\textbf{y}) \mathrm{d}\theta = \frac{a}{a + b} = .8 \]

Comparison

Maximum likelihood is very certain about its prediction, while the median and the mean estimates from the posterior incorporate some of the uncertainty. They are more cautious in their predictions. Given a uniform prior, however, the MAP estimate – that is, the mode of the posterior – coincides with the maximum likelihood estimate:

\[ \begin{align*} \theta_{MLE} &= \underset{\theta} {\text{argmax}} \, p(\textbf{y}|\theta) \\ \theta_{MAP} &= \underset{\theta} {\text{argmax}} \, p(\textbf{y}|\theta) \cdot p(\theta) \end{align*} \]

we don’t need the normalizing constant in calculating \(\theta_{MAP}\), because it does not shift the maximum. Additionally, if we have a uniform prior, \(\theta_{MAP} = \theta_{MLE}\) (given we obey Cromwell’s rule).

Additionally, note again that the posterior mean can be written as:

\[ \mathbb{E}[p(\theta|y, n)] = \frac{a + y}{a + y + b + n - y} = \frac{a + y}{a + b + n} \]

with some clever rearranging, we can write this as:

\[ \mathbb{E}[p(\theta|y, n)]= \frac{n}{a + b + n} (\frac{y}{n}) + \frac{a + b}{a + b + n}(\frac{a}{a + b}) \]

We see that the posterior mean is a weighted combination of the prior mean and the maximum likelihood estimate. This is termed shrinkage. The Bayesian point estimate is shrunk towards the prior mean, implementing some kind of regularization. We note additionally that

\[ \begin{align*} \lim_{n \to \infty} &= \frac{n}{a + b + n}(\frac{y}{n}) + \frac{a + b}{a + b + n}(\frac{a}{a + b}) \\[1.5ex] &= 1 \cdot \frac{y}{n} + 0 \cdot \frac{a}{a + b} = \frac{y}{n} \end{align*} \]

in the limit, the maximum likelihood estimate and the Bayesian posterior mean agree.

3. General Linear Model (6)

Let’s look at data for sale rates after spending various amounts on advertising on TV, radio or in newspapers. We are interested in the question which outlet for advertising “significantly” influences sale rates. Download the data from here, and import it like so:

dat <- read.csv('../Advertising.csv', header = TRUE)[, -1]

Use the built-in function lm from R to run a linear model that predicts sales, our dependent variable, as a function of money spend advertising on TV, radio and in newspapers, our independent variables. Inspect the output using function summary and explain the most relevant aspects of the returned output table. What do you conclude from that about the effects of different outlets for advertising on sales figures?

m <- lm(Sales ~ TV + Radio + Newspaper, data = dat)
summary(m)

## 
## Call:
## lm(formula = Sales ~ TV + Radio + Newspaper, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## Radio        0.188530   0.008611  21.893   <2e-16 ***
## Newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Our solution to the regression problem (using least squares) is:

\[ \begin{align*} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \\ y &= 2.94 + .046x_1 + .189x_2 - .001x_3 \end{align*} \]

Because the predictors are not standardized (\(\mu = 0, \sigma^2 = 1\)), the intercept gives the prediction for sales given we spend no money on TV, radio, nor the newspapers. Even if this were the case, we would still sell \(2.94\) units in sales. Given the very low \(p\) values for TV and Radio, we conclude that increasing our advertisements on TV or radio would increase or sales; the coefficients are positive. For each unit increase in TV advertisement, we would get an increase of \(.046\) in sales. For each unit increase in tv advertisement, we would get an increase of \(.189\) in sales. although these two advertisement campaigns might interact, we have not tested this in this model. Newspaper, the dying medium, finally does not seem to show an effect on sales; at least we fail to find one. In general, our simple model here captures a huge amount of the variance in the dependent variable, \(R^2 = .897\)! Our final model (leaving out possible interactions etc.) would be:

m <- lm(Sales ~ TV + Radio, data = dat)

Use the BayesFactor package to run a Bayesian regression; predict Sales using TV. Print the model summary and interpret the resulting Bayes factor.

library('BayesFactor')

bf <- regressionBF(Sales ~ TV + Radio + Newspaper, data = dat, progress = FALSE)
bf <- sort(bf) # sort based on strength
bf

## Bayes factor analysis
## --------------
## [1] Newspaper              : 22.92162     ±0.01%
## [2] Radio + Newspaper      : 1.131977e+15 ±0%
## [3] Radio                  : 9.019635e+15 ±0%
## [4] TV                     : 9.455818e+38 ±0.01%
## [5] TV + Newspaper         : 4.806461e+41 ±0%
## [6] TV + Radio + Newspaper : 5.437431e+92 ±0%
## [7] TV + Radio             : 1.391436e+94 ±0%
## 
## Against denominator:
##   Intercept only 
## ---
## Bayes factor type: BFlinearModel, JZS

Instead of doing variable selection with \(p\) values, we are using the Bayes factor to select models which differ in their variables. For each model, the Bayes factor against the null model (no coefficients, prediction is the mean of the dependent variable) is computed. Via transitivity we are able to compare the specific models against each other. For example:

\[ BF_{76} = \frac{p_7(\textbf{y}|\phi)}{p_6(\textbf{y}|\phi)} = \frac{p_7(\textbf{y}|\phi)}{p_0(\textbf{y}|\phi)} \cdot \frac{p_0(\textbf{y}|\phi)}{p_6(\textbf{y}|\phi)} \]

bf[7] / bf[6]

## Bayes factor analysis
## --------------
## [1] TV + Radio : 25.58996 ±0%
## 
## Against denominator:
##   Sales ~ TV + Radio + Newspaper 
## ---
## Bayes factor type: BFlinearModel, JZS

The simpler model is favoured by a factor of about 26, meaning that we have evidence against newspaper being a relevant predictor. The simple summary from the regressionBF function does not allow us to make statements about the effect size of the individual predictors. To get this, we sample from the posterior distribution:

post <- posterior(bf, index = which.max(bf), iterations = 30000, progress = FALSE)
summary(post[, -4])

## 
## Iterations = 1:30000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 30000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##          Mean       SD  Naive SE Time-series SE
## TV    0.04566 0.001403 8.098e-06      8.099e-06
## Radio 0.18766 0.008085 4.668e-05      4.668e-05
## sig2  2.85666 0.290070 1.675e-03      1.719e-03
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%     50%     75%   97.5%
## TV    0.04292 0.04471 0.04566 0.04661 0.04841
## Radio 0.17178 0.18228 0.18767 0.19311 0.20345
## sig2  2.34319 2.65306 2.83592 3.03858 3.47546

Explain what the main conceptual difference is between the classical and the Bayesian approach. Hint: one is about factors within a model, another about comparison between models.

Bayes factor uses the marginal likelihood to compare models and do variable selection. Estimating the effect is again accomplished using Bayes’ rule, but at the parameter level, not the model level. We require a prior over each parameter, which in the regression problem is – per default – a Cauchy prior with width paramater \(r = \frac{\sqrt{2}}{2}\). Classical regression just results in point estimates; confidence intervals are based on normal approximations of the sampling distribution. These estimates neglect prior information, and are prone to overfitting (see exercise 1).

4. MCMC (6)

Look at this code whose MCMC sample chain output we inspected in class:

require('coda')

fakeData = rnorm(200, mean = 0, sd = 1)

# implements unnormalized Bayes' rule for a normal
# model with priors \mu = Unif(-4, 4) and \sigma = Unif(0, 4)
# who are assumed to be independent from another
f = function(mu, sigma){
  if (sigma <= 0 ){
    return(0)
  }
  priorMu = dunif(mu, min = -4, max = 4)
  priorSigma = dunif(sigma, min = 0, max = 4)
  likelihood =  prod(dnorm(fakeData, mean = mu, sd = sigma))
  return(priorMu * priorSigma * likelihood)
}

MH = function(f, iterations = 50, chains = 2, burnIn = 0){
  
  # initialize an array to store the results for each chain
  # and iteration of parameters \mu and \sigma
  out = array(0, dim = c(chains, iterations - burnIn, 2))
  dimnames(out) = list("chain" = 1:chains, 
                       "iteration" = 1:(iterations-burnIn), 
                       "variable" = c("mu", "sigma"))
  
  # for each chain
  for (c in 1:chains) {
    
    # we start the first iteration with a uniformly random
    # sample over the parameter's support
    mu = runif(1, min = -4, max = 4)
    sigma = runif(1, min = 0, max = 4)
    
    # for each chain, we have a bunch of iterations
    for (i in 1:iterations) {
      
      # in which the proposal distribution is just
      # a function of the current state, plus some random jittering
      muNext = mu + runif(1, min = -0.25, max = 0.25) # NOTE: changed from min = -1.25 to min = -.25
      sigmaNext = sigma + runif(1, min = -0.25, max = 0.25)
      rndm = runif(1, 0, 1)
      
      # if the proposed state is more likely, we definitely go there
      # if this is not the case (lazy evaluation of booleans here), we still
      # go there probabilistically
      if (f(mu, sigma) < f(muNext, sigmaNext) | f(muNext, sigmaNext) >= f(mu, sigma) * rndm) {
        mu = muNext
        sigma = sigmaNext
      }
      
      # finally, if we don't burn the sample away, we add it to the chain
      if (i >= burnIn) {
        out[c, i-burnIn, 1] = mu
        out[c, i-burnIn, 2] = sigma
      }
    }
  }
  
  # for future use, the thing we return is a coda mcmc.list
  return(mcmc.list(mcmc(out[1,,]), mcmc(out[2,,])))
}

out = MH(f, 60000,2,10000)

Explain what this code is doing. What algorithm is implemented here? How is it implemented? What is approximated here?

This implements a Metropolis-Hastings algorithm, where the we approximate the posterior distribution of \(\mu\) and \(\sigma^2\).

Does this code differ substantially from the version of the algorithm that we discussed in class?

Instead of one discrete parameter, we consider two continous ones. The core part seems to be the same. However, sometimes the algorithm seems to get stuck because the proposal density for \(\mu\) is not symmetric. Note that I have changed this here in the code.

Mess with the proposal distribution in such a way that the algorith becomes less efficient. Use appropriate plots and metrics (trace plots, R-hat, effective sample size) to show that, all else equal, your messed up proposal distribution is worse than what’s in the code above.

Changing the proposal distribution such that we ‘never go back’ pretty much kills the algorithm.

MHH = function(f, iterations = 50, chains = 2, burnIn = 0){
  
  out = array(0, dim = c(chains, iterations - burnIn, 2))
  dimnames(out) = list("chain" = 1:chains, 
                       "iteration" = 1:(iterations-burnIn), 
                       "variable" = c("mu", "sigma"))
  
  for (c in 1:chains) {
    
    mu = runif(1, min = -4, max = 4)
    sigma = runif(1, min = 0, max = 4)
    
    for (i in 1:iterations) {
      
      muNext = mu + runif(1, min = 0, max = 0.25)
      sigmaNext = sigma + runif(1, min = 0, max = 0.25)
      rndm = runif(1, 0, 1)
      
      if (f(mu, sigma) < f(muNext, sigmaNext) | f(muNext, sigmaNext) >= f(mu, sigma) * rndm) {
        mu = muNext
        sigma = sigmaNext
      }
      
      if (i >= burnIn) {
        out[c, i-burnIn, 1] = mu
        out[c, i-burnIn, 2] = sigma
      }
    }
  }
  
  return(mcmc.list(mcmc(out[1,,]), mcmc(out[2,,])))
}

out2 = MHH(f, 60000,2,10000)

plot(out)

plot(out2)

effectiveSize(out)

##        mu     sigma 
##  9253.851 11222.004

effectiveSize(out2)

##       mu    sigma 
## 9.459916 4.627615