This set is due 11/4 before class. Any form of readable and intelligible presentation will be accepted. Electronic submissions should go to Fabian with the subject line “BDA: Homework 1”. We encourage discussing exercises with fellow students, but only individual solutions will be accepted.

1. Basic notions of probability (2)

Consider the following two-dimensional matrix:

##       blond brown  red black
## blue   0.22  0.21 0.00  0.01
## green  0.00  0.14 0.06  0.01
## brown  0.16  0.15 0.00  0.04
  1. Calculate the marginal probabilities.
##       blond brown  red black  Sum
## blue   0.22  0.21 0.00  0.01 0.44
## green  0.00  0.14 0.06  0.01 0.21
## brown  0.16  0.15 0.00  0.04 0.35
## Sum    0.38  0.50 0.06  0.06 1.00
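
For reference, a minimal R sketch of this computation (the variable name hair_eye is chosen here purely for illustration and is not part of the assignment):

# joint distribution: eye color in rows, hair color in columns
hair_eye <- matrix(
  c(0.22, 0.00, 0.16,   # blond
    0.21, 0.14, 0.15,   # brown
    0.00, 0.06, 0.00,   # red
    0.01, 0.01, 0.04),  # black
  nrow = 3,
  dimnames = list(c("blue", "green", "brown"),
                  c("blond", "brown", "red", "black"))
)
addmargins(hair_eye)  # appends the row and column sums, i.e. the marginals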
  1. Calculate the conditional probability of eye color given blond hair.

\[ p(\text{blue} \mid \text{blond}) = \frac{p(\text{blond}, \text{blue})}{p(\text{blond})} = \frac{0.22}{0.38} \\[3ex] p(\text{green} \mid \text{blond}) = \frac{p(\text{blond}, \text{green})}{p(\text{blond})} = \frac{0}{0.38} \\[3ex] p(\text{brown} \mid \text{blond}) = \frac{p(\text{blond}, \text{brown})}{p(\text{blond})} = \frac{0.16}{0.38} \]
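
In R, using the hair_eye matrix sketched above, these conditional probabilities are obtained by dividing the blond column of the joint distribution by its marginal:

hair_eye[, "blond"] / sum(hair_eye[, "blond"])  # approximately 0.58, 0.00, 0.42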

2. Intuitions about (conditional) probability (5)

Imagine that there are three cards: one is red on both sides, one is white on both sides, and the third is red on one side and white on the other. Suppose a confederate draws one of these three cards completely at random and shows you a randomly chosen side of that card.

##      rr    rw    ww
## r 0.333 0.167 0.000
## w 0.000 0.167 0.333
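
One way this matrix could have been constructed in R (a sketch; the name x is chosen so that it matches the code used further below): each card is drawn with probability 1/3, and given the card, each of its two sides is shown with probability 1/2.

x <- matrix(
  c(1/3, 0,     # rr: both sides red, so a red side is always shown
    1/6, 1/6,   # rw: the red and the white side are equally likely to be shown
    0,   1/3),  # ww: both sides white
  nrow = 2,
  dimnames = list(c("r", "w"), c("rr", "rw", "ww"))
)
round(x, 3)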
  1. What is the probability that what you see is red?

\[ p(r) = p(r \mid rr)p(rr) + p(r \mid rw)p(rw) + p(r \mid ww)p(ww) \\[3ex] p(r) = 1 \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} = \frac{1}{2} \]

  1. What is the probability that, given that you see red, the other side of the card that the confederate holds is white?

\[ p(rw \mid r) = \frac{p(rw, r)}{p(r)} = \frac{1/6}{1/2} = \frac{1}{3} \]
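
The same two quantities can be read off the matrix x from the sketch above, as a quick check:

p_r <- sum(x["r", ])   # p(r) = 1/3 + 1/6 = 1/2
x["r", "rw"] / p_r     # p(rw | r) = (1/6) / (1/2) = 1/3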

  1. Write down a two-dimensional probability matrix to support your answers. See the matrix given above.

  2. Now imagine that the confederate draws a card at random but then presents “red” with probability \(\frac{1}{4}\) if there is a red side. What is the conditional probability that the other side of the card is white if you are shown red? (Best use a new two-dimensional probability matrix in support of your answer.)

x[1, 2] <- 1/4          # p(r | rw) = 1/4: a red side is presented with probability 1/4
x[2, 2] <- 3/4          # p(w | rw) = 3/4
x[, 2] <- x[, 2] * 1/3  # multiply by p(rw) = 1/3 to recover the joint probabilities
x
##      rr         rw    ww
## r 0.333 0.08333333 0.000
## w 0.000 0.25000000 0.333

\[ p(rw \mid r) = \frac{p(rw, r)}{p(r)} = \frac{1/12}{5/12} = \frac{1}{5} \]
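
And once more as a check against the updated matrix:

x["r", "rw"] / sum(x["r", ])  # (1/12) / (5/12) = 1/5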

3. Compare likelihood functions for coin flips (2)

We saw two different likelihood functions for coin flips in the second lecture. The first one is the binomial distribution:

\[P_{\text{binom}}(\langle n_h, n_t \rangle \, | \, \theta) = {n \choose n_h} \theta^{n_h} \, (1-\theta)^{n_t}\]

The second one was Kruschke’s generalization of the Bernoulli distribution:

\[P_{\text{Bern}}(\langle n_h, n_t \rangle \, | \, \theta) = \theta^{n_h} \, (1-\theta)^{n_t}\]

Prove that no matter what the priors \(P(\theta)\) are and no matter what \(n_h\) and \(n_t\) we observe, the posterior \(P_{\text{binom}}(\theta \, | \, \langle n_h, n_t \rangle)\) derived from the first likelihood function will be identical to the posterior \(P_{\text{Bern}}(\theta \, | \, \langle n_h, n_t \rangle)\) derived from the second.

Please spell out and comment on/explain each relevant derivation step. Pay close attention to spelling out and manipulating the normalizing constants.

\[ \begin{align*} P_\text{Bern}(\theta \mid \langle n_h, n_t \rangle) &= \frac{P_\text{Bern}(\langle n_h, n_t \rangle \mid \theta)p(\theta)}{p(\langle n_h, n_t \rangle)} = \frac{\theta^{n_h}(1-\theta)^{n_t}p(\theta)}{\int_0^1\theta^{n_h}(1-\theta)^{n_t}p(\theta)\mathrm{d}\theta} \\[3ex] &= \frac{{n \choose n_h} \theta^{n_h}(1-\theta)^{n_t}p(\theta)}{ {n \choose n_h } \int_0^1\theta^{n_h}(1-\theta)^{n_t}p(\theta)\mathrm{d}\theta} = \frac{{n \choose n_h} \theta^{n_h}(1-\theta)^{n_t}p(\theta)}{\int_0^1 {n \choose n_h} \theta^{n_h}(1-\theta)^{n_t}p(\theta)\mathrm{d}\theta} = P_\text{binom}(\theta \mid \langle n_h, n_t \rangle) \end{align*} \]

The first line is Bayes' rule with the Bernoulli-style likelihood, with the normalizing constant \(p(\langle n_h, n_t \rangle)\) spelled out as the integral of likelihood times prior over \(\theta \in [0, 1]\). In the second line, numerator and denominator are multiplied by the same constant \({n \choose n_h}\), which leaves the fraction unchanged; because \({n \choose n_h}\) does not depend on \(\theta\), it can be pulled inside the integral. What remains is exactly Bayes' rule with the binomial likelihood, so the two posteriors coincide for every prior \(p(\theta)\) and all observed \(n_h\) and \(n_t\).
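
A quick numerical sanity check of this result (a sketch; the Beta(2, 5) prior and the data n_h = 7, n_t = 3 are arbitrary choices, not prescribed by the assignment): on a grid of \(\theta\) values, both likelihoods yield the same normalized posterior because the constant \({n \choose n_h}\) cancels.

theta <- seq(0.001, 0.999, length.out = 999)
prior <- dbeta(theta, 2, 5)                   # any prior works; Beta(2, 5) is arbitrary
n_h <- 7; n_t <- 3; n <- n_h + n_t
post_bern  <- theta^n_h * (1 - theta)^n_t * prior
post_binom <- choose(n, n_h) * theta^n_h * (1 - theta)^n_t * prior
post_bern  <- post_bern / sum(post_bern)      # normalize on the grid
post_binom <- post_binom / sum(post_binom)
all.equal(post_bern, post_binom)              # TRUE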

4. Wagenmakers’ critique of p-value logic (6)

Read Wagenmakers (2007) and answer the following questions:

  1. List the three problems of p-values Wagenmakers discusses. Explain them in your own words. Which one do you think is most severe?

The sampling distribution of an estimator, say the mean, describes the variability of its estimates under repeated sampling. Frequentists need the notion of repeated sampling because they define probability as a limiting relative frequency. To quantify uncertainty, a frequentist therefore imagines repeating the experiment an infinite number of times, each time applying the estimator (a function) to the resulting data set. The \(p\) value operates on this sampling distribution. Since the probability of any particular data set is virtually or exactly zero (cf. probability density), Fisher needed a way out: the \(p\) value is a tail integral, summing the probability of all data sets that are at least as extreme as the one actually observed. As a result, \(p\) values depend on data that were never observed.

Additionally, because the data-generating process matters for the notion of repeated sampling (how exactly should I imagine my experiment being replicated?), the possible intentions of the researcher become crucial. Did she use a binomial or a negative binomial sampling plan? Would she have collected more participants if she had received that grant? This opens a Pandora's box of paradoxes: \(p\) values depend on the intentions of the researcher.

Both criticisms are objectively true of \(p\) values; however, a pragmatic researcher might dismiss them as nitpicking. The third, and most damning, shortcoming is that \(p\) values do not quantify statistical evidence. A minimum requirement for a measure of evidence is that identical \(p\) values convey the same evidence regardless of sample size. \(p\) values, however, depend strongly on sample size: \(p = .04\) with \(n = 100\), for example, is more evidence against \(H_0\) than \(p = .04\) with \(n = 1000\). In fact, the latter is support for \(H_0\) (cf. Lindley’s paradox)!
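
A small R sketch of this point (the helper function lindley and the concrete counts 61/100 and 533/1000 are made up here purely for illustration): keep the \(p\) value of a binomial test of \(H_0: \theta = 0.5\) just below .05 while increasing \(n\), and compare it with the Bayes factor \(\text{BF}_{01}\) against the alternative \(H_1: \theta \sim \text{Uniform}(0, 1)\).

lindley <- function(k, n) {
  p_value <- binom.test(k, n, p = 0.5)$p.value
  m_h0 <- dbinom(k, n, 0.5)                                   # marginal likelihood under H0
  m_h1 <- integrate(function(t) dbinom(k, n, t), 0, 1)$value  # under H1: theta ~ Uniform(0, 1)
  c(p = p_value, BF01 = m_h0 / m_h1)
}
lindley(61, 100)    # p < .05, BF01 < 1: weak evidence against H0
lindley(533, 1000)  # p < .05 as well, but BF01 > 1: the data now favor H0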

  1. Think of three possible cases – in science or real life – where strong reliance on p-value significance testing might be harmful.

Power and \(p\) values. If \(p > \alpha\), we cannot tell whether the null hypothesis is true or the data are simply uninformative. However, people often draw the former conclusion, which can lead to serious problems across a range of domains. For example, assume that I test a specific traffic intervention and compare the number of deaths before and after. Misunderstanding basic statistical concepts (as nearly everybody does!), I conclude from \(p > \alpha\) that there is no increase in deaths. I go ahead and implement the change, and people die (it turns out my study was underpowered, and there really was an effect!).

Sequential testing. \(p\) values violate the likelihood principle and are therefore invalidated by optional stopping. In other words, \(p\) values require a fixed-\(N\) design. However, the data might be informative much earlier, so sticking to the fixed design potentially wastes participants’ time. This is even more problematic in medical trials, where a failure to stop early might put people’s lives at risk.

Invariances. Most of science is poisoned by a search for significance. Does this intervention work? Are there gender differences in this domain? Do video games make children more violent? Using \(p\)-value significance testing, we can only either reject \(H_0\) or fail to reject \(H_0\). If the latter occurs, that is, if \(p > \alpha\), researchers often attribute this to low statistical power or to issues with the experimental setup or participants, and in general try to rescue their initial hypothesis (often via verbiage). Because of this asymmetry, the focus is shifted away from invariances, i.e. statements of equivalence. I submit that those are equally important, especially in theory testing (cf. strong inference).

  1. What is the definition of a p-value? You can draw from this and this. Note, however, that Gelman misses something crucial in his definition of the p-value (second link).

Strictly speaking, the \(p\) value is the probability of obtaining data at least as extreme as the data observed, given that the null hypothesis is true and the data were generated according to a specific sampling plan. It is by far the most popular method for drawing inferences from single studies. Conventionally, \(p < \alpha\), with \(\alpha = .05\) or \(\alpha = .001\), is considered statistically significant and merits rejection of the null hypothesis. The reasoning is that either something very rare has occurred, or the null hypothesis is false.
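
To make the definition concrete, here is a toy example (the numbers, 8 heads out of 10 flips, are made up for illustration): under \(H_0: \theta = 0.5\) and a binomial sampling plan, the \(p\) value sums the probabilities of all outcomes at least as extreme as the observed one.

# two-sided: 8 or more heads, or, by symmetry, 2 or fewer heads
sum(dbinom(c(0:2, 8:10), size = 10, prob = 0.5))
# the same value from the built-in exact binomial test
binom.test(8, 10, p = 0.5)$p.value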