Please read this: Name your file homework3_lastname_firstname. This set is due 12/2 before class. The submission should be in R Markdown and should go to Fabian with subject BDA: Homework 3 (submit both .Rmd and .pdf). We encourage discussing exercises with fellow students, but only individual solutions will be accepted.

1. Inferring … yeah, what actually? (6)

Check out the following code.

require('rjags')
require('ggmcmc')
require('reshape2')

set.seed(1638) 
coolData = data.frame(A = rnorm(100, mean = 100, sd = 10),
                      B = rnorm(100, mean = 110, sd = 15))
coolData = melt(coolData, variable.name = "Drug", value.name = "HeartRate")

# specify model
modelString = "
model{
  # priors
  muA ~ dunif(30,200)
  muB ~ dunif(30,200)
  varA ~ dunif(0, 1000)
  varB ~ dunif(0, 1000)
  tauA = 1/varA
  tauB = 1/varB

  # likelihood
  for (i in 1:200){
    HeartRate[i] ~ dnorm(ifelse(Drug[i] == 0, muA, muB), 
                         ifelse(Drug[i] == 0, tauA, tauB))
  }

  muDiff = muB - muA
  varDiff = varB - varA
}
"

# prepare for JAGS
dataList = list(Drug = ifelse(coolData$Drug == "A", 0, 1),
                HeartRate = coolData$HeartRate)

# set up and run model
jagsModel = jags.model(file = textConnection(modelString), data = dataList, n.chains = 3, quiet = TRUE)
codaSamples = coda.samples(jagsModel, variable.names = c("muDiff", "varDiff"), n.iter = 10000)
  1. Describe what is going on in a few words, and try to guess the intention of this analysis. I.e., what is most likely the research question that this analysis is trying to answer?

Before we even start, let us first plot the raw data to get a better grip on what is going on. This example is instructive in that it points out the shortcomings of bar graphs; see also Weissgerber et al. (2015). Data visualization is extremely important.

ggplot(coolData, aes(x = Drug, y = HeartRate)) +
  geom_bar(stat = "summary", fun.y = mean) +
  stat_summary(fun.y = mean, fun.ymin = min,
               fun.ymax = max, colour = "red",
               geom = "errorbar", width = .2) # rather uninformative plot

ggplot(coolData, aes(x = Drug, y = HeartRate)) +
  geom_violin(aes(fill = Drug)) + geom_jitter() # great plot!

Based on the plots above, we immediately see that the biggest difference between the two groups lies in their dispersion. Concretely, Drug B seems to exhibit higher variance in heart rate than Drug A. It looks like Drug B also increases the mean heart rate more strongly, but given the high variability this is hard to pin down. The JAGS code above makes these intuitive statements more precise. What we attempt here is to solve a problem of parameter estimation: which mean and variance values are credible after we see this data? Of particular interest are the differences in both mean and variance. Assuming that both groups are normally distributed, we specify non-informative priors (allowing biologically implausible values – we’re leaving money on the table!).

  2. Run the analysis and report the result. Include (i) a summary of the sampling outcome, (ii) density plots of the relevant parameters (individually per chain), and (iii) checks on the quality of the sampling (trace plots, R-hat and effective sample size). Interpret your results (i.e., should we think that we have approximated the true posterior? Is our sampling efficient?).
summary(codaSamples)
## 
## Iterations = 1001:11000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 10000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##            Mean     SD Naive SE Time-series SE
## muDiff    7.607  1.791  0.01034        0.01301
## varDiff 150.136 37.041  0.21386        0.29266
## 
## 2. Quantiles for each variable:
## 
##           2.5%     25%     50%    75%  97.5%
## muDiff   4.064   6.408   7.623   8.81  11.11
## varDiff 85.377 124.056 147.523 173.05 230.66
ss = ggs(codaSamples)
ggs_density(ss)

ggs_autocorrelation(ss)
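
For completeness, the question also asks for trace plots; these can be obtained from the same ss object:

ggs_traceplot(ss)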

gelman.diag(codaSamples)
## Potential scale reduction factors:
## 
##         Point est. Upper C.I.
## muDiff           1          1
## varDiff          1          1
## 
## Multivariate psrf
## 
## 1
effectiveSize(codaSamples)
##   muDiff  varDiff 
## 18996.43 16044.75
  3. Given your answer to the question above about the likely research question behind this analysis, what would your conclusion be?

To get a better grip on the analysis problem, I have computed 95% credible intervals below, and included a standardized difference measure. Based on them, we can credibly believe that both the mean and the variance of Drug B are higher than those of Drug A. In effect, the drug seems to increase the mean heart rate – but it might have quite different effects on individual participants, thus creating high variability. However, note that we are being unprincipled here: estimating parameters is not testing hypotheses. If we specifically wanted to ask “Does Drug B increase heart rate compared to Drug A? Does Drug B further increase dispersion around the mean compared to Drug A?”, we would need to specify a prior distribution over this difference and then, for each model, marginalize out its parameters. The ratio of these marginal likelihoods yields the Bayes factor, which provides us with a principled mechanism for answering the hypothesis-testing question.

combined = do.call('rbind', codaSamples)
combined = cbind(combined, combined[, 1] / sqrt(combined[, 2])) # standardize difference
coda::HPDinterval(coda::mcmc(combined)) # combine all samples and return 95% HDI
##              lower       upper
## muDiff   4.1582310  11.1878245
## varDiff 80.2233850 223.5609278
##          0.3079799   0.9749059
## attr(,"Probability")
## [1] 0.95
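
As a quick directional read (a sketch computing posterior probabilities of the directional claims from the combined samples above – these are not Bayes factors):

mean(combined[, "muDiff"] > 0)  # posterior probability that Drug B has the higher mean
mean(combined[, "varDiff"] > 0) # posterior probability that Drug B has the higher variance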

2. Simple linear regression (8)

We will work with data about the speed of a car and the distance needed to stop.

# install.packages('car') # if you don't have this package yet
require('car') # note: the cars data set itself ships with base R
head(cars) # this is the data set we work with
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

It is intuitive that there should be some kind of functional dependency between the speed a car travels at and the distance it takes to come to a halt when braking. Indeed, we can verify this by plotting ‘distance’ against ‘speed’ and then adding a simple linear regression line, obtained here with the R function lm.

We could get this regression line directly from ggplot, but then we might not realize that this is actually a regression analysis. We also want to know the values of slope and intercept of the regression line later on.

coeff = coef(lm(formula = dist ~ speed, data = cars)) # get slope & intercept of linear model
qplot(cars[,'speed'], cars[,'dist']) + xlab('speed') + ylab('distance') + theme_bw() + 
  geom_abline(intercept = coeff[1], slope = coeff[2], color = "skyblue")

We are going to write a simple Bayesian linear estimator for ‘distance’ as a function of ‘speed’. Concretely, we want to implement the following Bayesian model, where \(x_{i}\) is the \(i\)-th entry in the data frame for variable speed and \(y_{i}\) is the \(i\)-th entry for variable distance:

\[\beta_0 \sim \text{Norm}(0, 1000)\] \[\beta_1 \sim \text{Norm}(0, 1000)\] \[\sigma^2_{\epsilon} \sim \text{Unif}(0, 1000)\]

\[\mu_i = \beta_0 + \beta_1 x_i\] \[y_i \sim \text{Norm}(\mu_i, \sigma^2_{\epsilon})\]

To explain, \(\beta_1\) is the slope, \(\beta_0\) is the intercept. We assume that these give us a mean \(\mu_i\) for each \(x_i\). The value \(y_i\) is then assumed to be normally distributed around this \(\mu_i\).

  1. Implement this model in JAGS. Remember that JAGS wants precision, not variance, as the second argument to a normal distribution!
# specify model
modelString2 = "
model{
  # priors
  var_e ~ dunif(0, 1000) 
  tau_e = 1/var_e

  b0 ~ dnorm(0, 1/1000)
  b1 ~ dnorm(0, 1/1000)

  for (i in 1:50){
    mu[i] = b0 + b1 * speed[i]
    dist[i] ~ dnorm(mu[i], tau_e)
  }

}
"
# prepare for JAGS
dataList = as.list(cars)

# set up and run model
params <- c('b0', 'b1', 'var_e')
jagsModel = jags.model(file = textConnection(modelString2), data = dataList, n.chains = 3, quiet = TRUE)
update(jagsModel, 5000) # 5000 burn-in samples
codaSamples = coda.samples(jagsModel, variable.names = params, n.iter = 20000)

ms = ggs(codaSamples)
  2. Run the model with sufficient burn-in and samples. Report the usual diagnostics on the samples (summary, traceplots, R-hat, effective sample size) to make sure that your analysis is sound.
ggs_autocorrelation(ms)

ggs_traceplot(ms)

gelman.diag(codaSamples)
## Potential scale reduction factors:
## 
##       Point est. Upper C.I.
## b0             1          1
## b1             1          1
## var_e          1          1
## 
## Multivariate psrf
## 
## 1
effectiveSize(codaSamples)
##        b0        b1     var_e 
##  3541.576  3534.060 23705.202

All convergence checks indicate that our chains have converged to the stationary distribution (the posterior distribution). However, the autocorrelation for the regression coefficients is quite high, and their effective sample size is thus quite low.
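
A common remedy is to mean-center the predictor, which reduces the posterior correlation between intercept and slope and typically boosts the effective sample size. A minimal sketch, reusing modelString2 and params from above (the C-suffixed names are just for illustration; note that b0 now denotes the expected distance at the average speed):

dataListC = list(speed = cars$speed - mean(cars$speed), dist = cars$dist)
jagsModelC = jags.model(file = textConnection(modelString2), data = dataListC, n.chains = 3, quiet = TRUE)
update(jagsModelC, 5000)
codaSamplesC = coda.samples(jagsModelC, variable.names = params, n.iter = 20000)
effectiveSize(codaSamplesC) # b0 and b1 should now mix much better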

  3. Plot the density of the posteriors of \(\beta_0\) and \(\beta_1\) and get the 95% HDIs for both variables. Compare your results with the coefficients returned by lm. (If you get completely different results, there’s something wrong.)
ggs_density(ms)

summary(codaSamples)
## 
## Iterations = 6001:26000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 20000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean     SD Naive SE Time-series SE
## b0    -16.67  6.793  0.02773       0.114219
## b1      3.88  0.419  0.00171       0.007057
## var_e 257.67 55.902  0.22822       0.364124
## 
## 2. Quantiles for each variable:
## 
##          2.5%     25%     50%    75%   97.5%
## b0    -29.943 -21.198 -16.690 -12.16  -3.260
## b1      3.054   3.599   3.881   4.16   4.695
## var_e 170.296 218.008 250.069 289.24 386.707
library('broom')

m = lm(dist ~ speed, data = cars)
tidy(m)
##          term   estimate std.error statistic      p.value
## 1 (Intercept) -17.579095 6.7584402 -2.601058 1.231882e-02
## 2       speed   3.932409 0.4155128  9.463990 1.489836e-12

It seems that the Bayesian analysis and the frequentist analysis give similar results. They differ slightly, however, because our prior specification is not quite as uninformative as intended. If we were to increase the variance of the coefficient priors by a factor of 10, the posterior means and the classical estimates would agree. However, if we have prior knowledge, we would be silly not to include it: including it would increase prediction accuracy.
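
A minimal way to check this claim is to widen the coefficient priors tenfold and re-run; a sketch using plain text substitution on the model string (modelStringFlat is a hypothetical name):

modelStringFlat = gsub("dnorm(0, 1/1000)", "dnorm(0, 1/10000)", modelString2, fixed = TRUE)
jagsModelFlat = jags.model(file = textConnection(modelStringFlat), data = dataList, n.chains = 3, quiet = TRUE)
update(jagsModelFlat, 5000)
summary(coda.samples(jagsModelFlat, variable.names = params, n.iter = 20000))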

  4. Which information do you get from the Bayesian analysis that you do not get from calling lm?

We get a whole posterior distribution quantifying our uncertainty about each parameter, instead of a point estimate and a confidence interval. We can answer several inference questions that are unanswerable within the frequentist paradigm (see here).
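
For instance (a sketch; the threshold 3.5 is an arbitrary illustration), the posterior probability that the slope exceeds some value of interest is just a proportion of samples:

samples = do.call('rbind', codaSamples) # stack all chains, as before
mean(samples[, "b1"] > 3.5) # P(slope > 3.5 | data)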

3. Discrete therapeutic touchers (8)

We introduced a latent-mixture model for the therapeutic touch data in class (following Kruschke). Here, we would like to explore a related model, which assumes that there are three types of touchers: (i) suckers, (ii) guessers, and (iii) sensors. Suckers have a low chance of success, guessers operate at around chance level, and sensors have an above-chance probability of success. We would like to use the data to infer the latent touch-type of each subject.

More concretely, for each subject i we are going to infer a latent type type[i]. The prior for type[i] is an unbiased categorical distribution (each type is equally likely): type[i] ~ dcat(c(1/3, 1/3, 1/3)). Touch-types come from three distinct populations, each with its own mode omega[j], for \(j \in \{1,2,3\}\) the type of toucher, of a beta distribution for the success chance theta. Suckers have omega[1] = 0.25, guessers have omega[2] = 0.5, and sensors have omega[3] = 0.75. The type of a subject thus determines the mode of a beta distribution: if subject i is of type type[i], then the mode of the beta distribution from which i’s theta[i] is drawn is omega[type[i]]. We assume that there is a single concentration kappa shared among all three type-populations. Given the mode (for each subject) and the concentration parameter, we determine a theta[i] (for each subject) as in the previous touch-model from class. The theta[i] feeds into a binomial likelihood function as before.

  1. Load the data from the file TherapeuticTouchData.csv, which is part of DBDA2Eprograms.zip, which is in our Dropbox in folder Kruschke_2015. Reformat it just like we did in class.
require('dplyr')
TTdata = read.csv('TherapeuticTouchData.csv') %>% 
         group_by(s) %>% dplyr::summarize(k = sum(y), N = length(y))
  2. Draw the hierarchical model described above. To save time, feel free to draw by hand (legibly, please!), take a picture, and include it in your file. (The figure below is taken from a student solution I quite liked.)

[Figure: hand-drawn graphical model of the hierarchy, from a student solution]

  3. Implement the hierarchical model in JAGS. Run it (with sufficient burn-in to ensure convergence), and use the code below to plot your outcome:
modelString = "
model{
  omega[1] = 0.25 # suckers
  omega[2] = 0.5  # guessers
  omega[3] = 0.75 # sensors
  kappaMin2 ~ dgamma(0.01,0.01)
  kappa = kappaMin2+2

  for (i in 1:28){
    type[i] ~ dcat(c(1/3,1/3,1/3))

    a[i] = omega[type[i]]*kappaMin2+1
    b[i] = (1-omega[type[i]])*kappaMin2+1

    theta[i] ~ dbeta(a[i],b[i])
    k[i] ~ dbinom(theta[i], N[i])
  }

}"
# prepare for JAGS
dataList = list(k = TTdata$k, N = TTdata$N) # pass only the numeric variables the model uses
# set up and run model
jagsModel = jags.model(file = textConnection(modelString), 
                       data = dataList, n.chains = 2, quiet = TRUE)
update(jagsModel, 25000)
codaSamples = coda.samples(jagsModel, n.iter = 50000,
                           variable.names = c("theta","type", "kappa"))
ggplot(TTdata, aes(x = s, y = k/N)) + geom_col() # raw data

ms = ggs(codaSamples)
msType = ms %>% filter(substr(x = Parameter, start = 1, stop = 4) == "type") %>%
    mutate(subject = factor(regmatches(Parameter, regexpr("[0-9]+", Parameter)), 
                              levels = 1:28))
ggplot(msType, aes(x = subject, fill = factor(value))) + geom_bar()
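
To complement the plot with numbers, a sketch that turns the sampled types into posterior type probabilities per subject (using the msType data frame from above):

msType %>% group_by(subject) %>%
  summarize(pSucker = mean(value == 1),
            pGuesser = mean(value == 2),
            pSensor = mean(value == 3))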

  4. What would you conclude from this analysis about whether therapeutic touch works? Do you think that this is a good analysis (why/why not)? (You could think about stuff like: shrinkage, over-fitting, model complexity, …)

I think this is pretty cool. The posterior probabilities of belonging to the individual groups (suckers, guessers, sensors) make sense, given the plot of the raw data. Specifically, the more correct answers a therapeutic toucher gives, the higher the probability that they belong to the “sensors”.

I wonder if the individual therapeutic touchers would object to hierarchical modeling. It is true that, when estimating multiple means (at least three, to be precise), hierarchical modeling is more efficient than simply taking each mean separately (this is related to Stein’s paradox, see e.g. Efron & Morris, 1977); however, similar to research in psychophysics, we are primarily interested in the individual. More self-servingly, the “sensors” might object to being modeled jointly with the “suckers”, which shrinks their means towards the grand mean.

Further, we have to acknowledge that this is a problem of hypothesis testing. Does therapeutic touching exist? Clearly, by positing three different latent groups and estimating each toucher’s posterior probability of belonging to them, we are not answering this question. A skeptic wouldn’t be convinced – we have presupposed the existence of therapeutic touching! A skeptic might use an uninformative prior for the alternative hypothesis, and then compute a Bayes factor:

\[ \begin{align*} p(y \mid H_1) &= \int {N \choose k} \theta^k (1 - \theta)^{N-k} \cdot \frac{1}{\text{Beta}(1, 1)} \theta^{1 - 1} (1 - \theta)^{1 - 1} \, d\theta \\[1.5ex] &= {N \choose k} \cdot \frac{\text{Beta}(k + 1, N - k + 1)}{\text{Beta}(1, 1)} \end{align*} \]

binom_BF <- function(k, N) {
  H0 <- dbinom(k, N, .5) # likelihood under the point null (theta = .5)
  H1 <- choose(N, k) * beta(k + 1, N - k + 1) # marginal likelihood under the uniform Beta(1,1) prior
  
  H1 / H0
}

TTdata = TTdata %>% mutate(BF10 = binom_BF(k, N))
ggplot(TTdata, aes(x = k/N, y = BF10)) + geom_col()

We see that even the “strongest sensor” has only a Bayes factor of about 2.5 in his favour. Interestingly, the “weakest sucker” has a higher Bayes factor! This is because we did an undirected test. If we want to test the directed hypothesis that therapeutic touchers operate above chance, we need to test exactly that! Without going too far (see e.g. here), the Bayes factor in favour of the directed hypothesis is just the proportion of posterior samples greater than \(\theta = .5\).

Noting that the posterior for our weakest sucker is

plot(function(x) dbeta(x, 2, 10), col = 'blue', lwd = 2, main = 'posterior weakest sucker')

We calculate the Bayes factor in favour of the hypothesis \(H_1: \theta > .5\)

BF10 = integrate(function(x) dbeta(x, 2, 10), .5, Inf)

1 / BF10$value # in favour of the null
## [1] 170.6667
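
Under the skeptic’s uniform prior, the posterior for any subject is Beta(k + 1, N - k + 1), so the same quantity is available in closed form for everyone; a sketch:

TTdata %>% mutate(pAbove = pbeta(.5, k + 1, N - k + 1, lower.tail = FALSE))
# pAbove = P(theta > .5 | data); for the weakest sucker, 1/pAbove reproduces the 170.67 above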

Also note that the Bayes factor only quantifies the evidence in the data. To reach a conclusion, I would need to specify my prior belief about therapeutic touching: how certain am I that it exists? Concretely,

\[ \frac{p(H_1|D)}{p(H_0|D)} = \frac{P(D|H_1)}{P(D|H_0)} \frac{P(H_1)}{P(H_0)} \]

which for me personally yields \(\frac{1}{1700000} = \frac{1}{170} \frac{1}{10000}\) odds in favour of the alternative hypothesis, i.e. that therapeutic touching exists (is not merely placebo etc.).
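
In code, this update is plain arithmetic on the odds (numbers as above):

prior_odds = 1 / 10000 # my personal prior odds that therapeutic touch exists
bf_10 = 1 / 170        # Bayes factor in favour of the directed alternative, from above
bf_10 * prior_odds     # posterior odds, roughly 1/1,700,000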

Wrapping up, I very much like the flexibility of JAGS, which lets us easily estimate these kinds of latent-class models. However, in this specific case, I think they answer the wrong question, and Bayes factors would be more appropriate. The question also is, of course, what it even means to perform worse than chance.