Methods for estimating Bayes factors

Friel & Wyse (2012) report on a number of methods for estimating Bayes factors.
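As a baseline for the more refined estimators, here is a minimal sketch of the simplest one: averaging the likelihood over draws from the prior. The setting (a binomial likelihood with a uniform prior on \(\theta\), tested against the point null \(\theta = .5\)) and the data `k, n` are assumptions made up for illustration. They are convenient because the exact marginal likelihood is \(1/(n+1)\), so the estimate can be checked.

```python
import math
import random


def binom_pmf(k, n, theta):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)


def marginal_likelihood_mc(k, n, draws=100_000, seed=1):
    """Naive Monte Carlo estimate: average the likelihood over prior draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        theta = rng.random()  # draw from the uniform Beta(1, 1) prior
        total += binom_pmf(k, n, theta)
    return total / draws


k, n = 7, 10                      # hypothetical data
m1_mc = marginal_likelihood_mc(k, n)
m1_exact = 1 / (n + 1)            # closed form under the uniform prior
m0 = binom_pmf(k, n, 0.5)         # point null: no parameter to integrate out

bf10_mc = m1_mc / m0
bf10_exact = m1_exact / m0
print(f"BF10 (Monte Carlo) = {bf10_mc:.3f}, BF10 (exact) = {bf10_exact:.3f}")
```

Prior sampling works here because the posterior is not much narrower than the prior. In higher dimensions most prior draws hit regions of negligible likelihood, which is one motivation for the alternative estimators.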

Winter is coming, but what’s it gonna bring?

A few enthusiasts have used Bayesian inference to predict what is going to happen in George R.R. Martin’s A Song of Ice and Fire.

  1. Allen Downey reports on a student project from his class that uses Bayesian survival analysis: which characters are most likely to survive, given their house, gender …?
  2. Richard Vale has concocted a Poisson model, aiming to guesstimate how many chapters in books 6 and 7 are likely to be told from the point of view of each major character.

You could do something like this:

Salience priors in reference games

Frank & Goodman’s (2012) short and cool paper introduces the rational speech-act model, which we also touched upon in class. The paper feeds data from a so-called salience prior estimation condition into a probabilistic model of language use. That means that the data from that condition is treated as unexplained input. Your project could try to shed light on how choices in this condition came about by formulating and testing a data-generating model for salience data.

Comparing model selection criteria

We have discussed some model selection criteria in class, among them ‘Akaike’s Information Criterion’ (AIC; Akaike, 1974), the ‘Bayesian Information Criterion’ (BIC; Schwarz, 1978), the ‘Deviance Information Criterion’ (DIC; Spiegelhalter et al. 2014), and of course the Bayes factor. Recently, a ‘Widely Applicable (Bayesian) Information Criterion’ has been proposed (WAIC; Watanabe, 2013).

Gelman et al. (2013) provide a great review of those methods. Discuss them with respect to their philosophy, assumptions, computation, and specific problems. You might also take a look at Aho et al. (2014) and Vandekerckhove et al. (2015).
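To make the difference between the penalties concrete, here is a small sketch that computes AIC and BIC from a maximized log-likelihood. The coin-flip example (60 successes in 100 trials, a fixed \(\theta = .5\) versus a freely estimated \(\theta\)) is a made-up illustration, not taken from any of the cited papers.

```python
import math


def aic(log_lik, k):
    """AIC = 2k - 2 log L, with k free parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """BIC = k log(n) - 2 log L, with n observations."""
    return k * math.log(n) - 2 * log_lik


def binom_log_lik(successes, n, theta):
    """Binomial log-likelihood of the data at a given theta."""
    return (
        math.log(math.comb(n, successes))
        + successes * math.log(theta)
        + (n - successes) * math.log(1 - theta)
    )


successes, n = 60, 100                                  # hypothetical data
ll_null = binom_log_lik(successes, n, 0.5)              # theta fixed, k = 0
ll_free = binom_log_lik(successes, n, successes / n)    # theta at its MLE, k = 1

scores = {
    "null": {"AIC": aic(ll_null, 0), "BIC": bic(ll_null, 0, n)},
    "free": {"AIC": aic(ll_free, 1), "BIC": bic(ll_free, 1, n)},
}
for name, s in scores.items():
    print(f"{name}: AIC = {s['AIC']:.2f}, BIC = {s['BIC']:.2f}")
```

With these numbers the two criteria disagree: AIC is lower for the free model, while BIC, whose penalty grows with \(\log n\), is lower for the null model. That disagreement is a useful starting point for discussing what each criterion is trying to approximate.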

Regularization from a Bayesian standpoint

We can scrutinize models and estimators along two dimensions: bias and variance. While least squares is unbiased (when the assumptions are met, of course), it can exhibit high variance. To increase prediction accuracy, we can introduce bias to decrease variance. The key is to find a good tradeoff between the two. Ridge regression and the lasso are two popular ways of doing so.

\[ \begin{align} \hat \beta &= \underset{\beta} {\text{argmin}} \, \left( \sum_{i=1}^n (y_i - \beta_0 - \beta^T x_i)^2 \right) \, \, \, \ldots \, \text{least squares }\\[1.5ex] \hat \beta &= \underset{\beta} {\text{argmin}} \, \left( \sum_{i=1}^n (y_i - \beta_0 - \beta^T x_i)^2 + \lambda \sum_{j=1}^p |\beta_j| \right) \, \, \, \ldots \, \text{lasso} \\[1.5ex] \hat \beta &= \underset{\beta} {\text{argmin}} \, \left( \sum_{i=1}^n (y_i - \beta_0 - \beta^T x_i)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right) \, \, \, \ldots \, \text{ridge} \\[1.5ex] \end{align} \]

Compare these regularization methods – what do they do? – and discuss them from a Bayesian standpoint. What is the advantage of the Bayesian lasso? The lasso was proposed by Tibshirani (1996). Park & Casella (2008) discuss the ‘Bayesian lasso’ (is there such a thing?). An application example in bioinformatics is given in Li et al. (2010).
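One way into the Bayesian reading: the ridge estimate coincides with the posterior mean (and mode) of \(\beta\) under a zero-mean Gaussian prior, and the lasso estimate with the posterior mode under a Laplace prior. The sketch below checks the ridge case numerically for a single predictor without intercept. The data, the noise variance `sigma2`, and the penalty `lam` are made-up values, and the prior variance is set to `sigma2 / lam` so that it matches the penalized objective above.

```python
def ridge_slope(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for one predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def posterior_mean_slope(x, y, sigma2, tau2):
    """Posterior mean of b for y ~ N(b*x, sigma2) with prior b ~ N(0, tau2)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    precision = sxx / sigma2 + 1.0 / tau2   # posterior precision of b
    return (sxy / sigma2) / precision


# Hypothetical data, for illustration only
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [1.1, 1.9, 3.2, 3.9, 5.1]

sigma2 = 2.0            # assumed known noise variance
lam = 3.0               # ridge penalty
tau2 = sigma2 / lam     # prior variance implied by the penalty

b_ols = ridge_slope(x, y, 0.0)
b_ridge = ridge_slope(x, y, lam)
b_bayes = posterior_mean_slope(x, y, sigma2, tau2)
print(f"OLS = {b_ols:.4f}, ridge = {b_ridge:.4f}, posterior mean = {b_bayes:.4f}")
```

A larger `lam` corresponds to a tighter prior around zero, which is the shrinkage seen when comparing `b_ridge` with `b_ols`.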

Hierarchical Signal Detection Theory

Signal detection theory is widely used in psychophysics and cognitive science. However, the same caveats apply as with other experimental data: trials are nested within participants, violating the i.i.d. assumption. Additionally, say in memory research, stimuli are drawn from a broader universe of possible stimuli. Both facts call for hierarchical modeling. Because the signal detection model is non-linear, classical estimation is asymptotically biased when these sources of variance are left unmodelled.

Discuss these issues and implement Bayesian hierarchical signal detection models. Show the advantages by means of a simulation study. Additionally, you can look at a real data set, taken for example from the reproducibility project, and apply your fancy model. Finally, discuss the issues with the Bayesian model.

The extension of signal detection models was proposed by Rouder & Lu (2005) and Rouder et al. (2007). There is also a chapter about signal detection theory in Lee & Wagenmakers (2013), our cognitive modeling book.
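As a starting point before going hierarchical, the sketch below computes the classical equal-variance estimates, \(d' = z(H) - z(F)\) and \(c = -\tfrac{1}{2}(z(H) + z(F))\), and shows how pooling over participants distorts them. The two participants and their hit and false-alarm rates are invented for illustration, and equal trial counts per participant are assumed.

```python
from statistics import NormalDist, mean

# Inverse of the standard normal CDF (the z-transform)
_z = NormalDist().inv_cdf


def d_prime(hit_rate, fa_rate):
    """Sensitivity under the equal-variance Gaussian model."""
    return _z(hit_rate) - _z(fa_rate)


def criterion(hit_rate, fa_rate):
    """Response criterion c; 0 means no bias."""
    return -0.5 * (_z(hit_rate) + _z(fa_rate))


# Hypothetical (hit rate, false-alarm rate) pairs, one per participant
participants = [(0.60, 0.40), (0.95, 0.05)]

individual = [d_prime(h, f) for h, f in participants]
mean_individual = mean(individual)

# Pooling: average the rates first, then transform
pooled_hit = mean(h for h, _ in participants)
pooled_fa = mean(f for _, f in participants)
pooled = d_prime(pooled_hit, pooled_fa)

print(f"mean of individual d' = {mean_individual:.3f}, pooled d' = {pooled:.3f}")
```

Because the z-transform is non-linear, the pooled estimate is smaller than the average of the individual estimates in this example. A hierarchical model avoids the choice between pooling and fully separate estimates.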

Multinomial processing trees (MPTs)

MPTs are simple, yet powerful tools. The multinomial distribution is an extension of the binomial distribution to \(p > 2\) outcome categories, each with its own probability \(\theta_j\).

\[ p(y_1, \ldots, y_p|N, \theta_1, \ldots, \theta_p) = \frac{N!}{y_1! \ldots y_p!} \theta_1^{y_1} \ldots \theta_p^{y_p} \]

A picture might help (taken from Lee & Wagenmakers). Below you see an MPT model for pair-clustering effects in recall from memory.

Each parameter has a psychological interpretation; \(c\) is the cluster-storage, \(r\) is the cluster-retrieval, and \(u\) is the unique storage-retrieval. Specific combinations lead to certain outcomes; the processes are modeled as being independent, and subsequently estimated.
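The following sketch maps \(c\), \(r\), and \(u\) onto the four response categories and evaluates the multinomial log-likelihood. The category equations are the usual pair-clustering parameterization as I recall it, so check them against the tree in Lee & Wagenmakers before relying on them. The counts and parameter values are made up.

```python
import math


def pair_clustering_probs(c, r, u):
    """Category probabilities implied by the tree (assumed parameterization)."""
    return [
        c * r,                                  # pair stored and retrieved as a cluster
        (1 - c) * u * u,                        # both items recalled separately
        (1 - c) * 2 * u * (1 - u),              # exactly one item recalled
        c * (1 - r) + (1 - c) * (1 - u) ** 2,   # neither item recalled
    ]


def multinomial_log_pmf(counts, probs):
    """Log of N! / (y_1! ... y_p!) * prod(theta_j ** y_j)."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(y + 1) for y in counts)
    log_kernel = sum(y * math.log(p) for y, p in zip(counts, probs) if y > 0)
    return log_coef + log_kernel


counts = [20, 5, 10, 15]                         # hypothetical response counts
probs = pair_clustering_probs(c=0.5, r=0.6, u=0.4)
log_lik = multinomial_log_pmf(counts, probs)
print(f"probs = {probs}, log-likelihood = {log_lik:.3f}")
```

The four probabilities sum to one for any parameter values in \([0, 1]\). Three parameters for three free category probabilities means the model is saturated, which matters when you think about identifiability and model fit.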

You can take two routes from here on:

‘Informative hypotheses’ and directed testing

In one of the homeworks, we were interested in whether therapeutic touching exists; that is, we should have tested \(H_0: \theta = .5\) against \(H_1: \theta > .5\). You can generalize this to more complex designs, say in ANOVA, where you want to test a specific contrast. This is highly awkward and complicated from a classical standpoint, but easy from a Bayesian standpoint (for a good overview, see Klugkist, 2005; Hoijtink, 2011). In fact, the Bayes factor is trivial to compute, using only the prior and posterior distributions (no computation of the marginal likelihood required!).

The proof is in Klugkist et al. (2007). Wetzels et al. (2011) draw the connection to Savage-Dickey.
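A minimal sketch of the idea, under assumptions that are mine rather than the papers': the constrained hypothesis \(H_1: \theta > .5\) is compared with an unconstrained ("encompassing") model with a uniform prior on \(\theta\), and the Bayes factor is estimated as the proportion of posterior draws that satisfy the constraint divided by the proportion of prior draws that do. The data are hypothetical.

```python
import random


def encompassing_bf(successes, n, draws=100_000, seed=1):
    """Bayes factor of H1: theta > .5 against the unconstrained model.

    Uses a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + successes, 1 + n - successes).
    """
    rng = random.Random(seed)
    a, b = 1 + successes, 1 + n - successes
    # Proportion of prior draws in agreement with the constraint
    prior_hits = sum(rng.random() > 0.5 for _ in range(draws))
    # Proportion of posterior draws in agreement with the constraint
    post_hits = sum(rng.betavariate(a, b) > 0.5 for _ in range(draws))
    return (post_hits / draws) / (prior_hits / draws)


bf_directional = encompassing_bf(successes=8, n=10)   # hypothetical data
bf_uninformative = encompassing_bf(successes=5, n=10)
print(f"8/10 correct: BF = {bf_directional:.2f}; 5/10 correct: BF = {bf_uninformative:.2f}")
```

Note that this Bayes factor is bounded above by one over the prior proportion (here 2), because it compares \(H_1\) with the encompassing model rather than with the point null. Comparing two constrained hypotheses is then a ratio of two such Bayes factors.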

Discuss the advantages of (Bayesian) ‘informative hypotheses’, both from a theoretical standpoint (it makes for stronger inference), and from a computational standpoint (why does classical testing fail here?). A nice introduction is given by Richard Morey on his blog.

Be critical

This project idea is tailored towards ‘substantive’ researchers. Research your research topic, and scrutinize the data analysis methods employed in this field. Did they use the correct model, e.g. Jaeger (2008), or should they have used a tailor-made cognitive model, making for stronger inference (Franke, 2016)? Do they test what they say they test, e.g. Nieuwenhuis et al. (2011) and here? Did they overstate the evidence against the null, by employing \(p\) value testing (Berger & Delampady, 1987)? Is their favoured hypothesis the null, and did they try to support it with \(p\) values? Etc.

Be a data scientist

Get a mac. Write a well-documented, tested, nice-to-read R, Python, or Julia package that solves a task related to Bayesian inference. If you want to use any other programming language, talk to us (but neither Java nor Matlab is acceptable). Before you start, pitch your package to us.

Be a shiny data scientist

Similar to the previous project, but write a Shiny app.