Part 6 - MCMC for Bayesian Inference

Introduction

In this part we will introduce how MCMC can be used from a Bayesian perspective when doing inference.

Discrete-Time Markov Chains with Continuous State Space

Definition 1 (Discrete-Time Markov Chains with Continuous State Space)

Our so far basic definitions are the same. {Xi}i=0,1,\{X_i\}_{i = 0, 1, \ldots} is a set of continuous random satisfying the Markov property,

π(Xn+1X0,X1,,Xn)=π(Xn+1Xn)\pi(X_{n + 1} \mid X_0, X_1, \ldots, X_n) = \pi(X_{n + 1} \mid X_n)

for all nn.

The limiting distribution has the same definition. A stationary distribution is a density f(x)f(x) satisfying,

f(xn+1)=π(xn+1xn)f(xn) dxn.f(x_{n + 1}) = \int \pi(x_{n + 1} \mid x_n) f(x_n) \ dx_n.

Ergodicity still means (with adjusted definitions) that the Markov chain is irreducible, aperiodic, and positive recurrent.

Ergodic Markov chains have a unique limiting distribution; Theorem same as before.

The Strong Law of Large Numbers for Markov Chains also holds in this setting. Lastly, the theory for Metropolis-Hastings is also OK (although, some technical complications in the proof arise).

Bayesian Inference

Intuition (MCMC for Bayesian Inference)

Consider that we have some data y1,,yny_1, \ldots, y_n, and we want to make a probability prediction for ynewy_{\text{new}}.

We (often) define a parameter θ\theta, and a probabilistic model so that y1,,yn,ynewy_1, \ldots, y_n, y_{\text{new}} are all conditionally independent given θ\theta,

π(y1,,yn,ynew,θ)=[i=1nπ(yiθ)]π(ynewθπ(θ).\pi(y_1, \ldots, y_n, y_{\text{new}}, \theta) = \left[\prod_{i=1}^{n} \pi(y_i \mid \theta) \right] \pi(y_{\text{new} \mid \theta} \pi(\theta).

Then,

π(ynewy1,,yn)=θπ(ynewθ)π(θy1,,yn) dθ=Eθy1,,yn[π(ynewθ)]\begin{align*} \pi(y_{\text{new}} \mid y_1, \ldots, y_n) & = \int_{\theta} \pi(y_{\text{new}} \mid \theta) \pi(\theta \mid y_1, \ldots, y_n) \ d\theta \newline & = \mathbb{E}_{\theta \mid y_1, \ldots, y_n}[\pi(y_{\text{new}} \mid \theta)] \newline \end{align*}

Using strong law of large numbers, we can make predictions by,

  • Simulating θ1,,θk\theta_1, \ldots, \theta_k from the posterior π(θy1,,yn)\pi(\theta \mid y_1, \ldots, y_n).
  • Averaging,
Eθy1,,yn[π(ynewθ)]1ki=1kπ(ynewθi)\mathbb{E}_{\theta \mid y_1, \ldots, y_n}[\pi(y_{\text{new}} \mid \theta)] \approx \frac{1}{k} \sum_{i = 1}^{k} \pi(y_{\text{new}} \mid \theta_i)

However, simulating from the posterior is often difficult. MCMC can help us here by simulating from a Markov chain having the posterior as its limiting distribution.

By the strong law of large numbers for Markov chains, we still have (as kk \to \infty),

Eθy1,,yn[π(ynewθ)]=limk1ki=1kπ(ynewθi)\mathbb{E}_{\theta \mid y_1, \ldots, y_n}[\pi(y_{\text{new}} \mid \theta)] = \lim_{k \to \infty} \frac{1}{k} \sum_{i = 1}^{k} \pi(y_{\text{new}} \mid \theta_i)

when θ1,,θk\theta_1, \ldots, \theta_k is a realization from a Markov chain with the posterior π(θy1,,yn)\pi(\theta \mid y_1, \ldots, y_n) as its limiting distribution.

As π(θy1,,yn)i=1nπ(yiθ)π(θ)\pi(\theta \mid y_1, \ldots, y_n) \propto \prod_{i = 1}^{n} \pi(y_i \mid \theta) \pi(\theta), we can use Metropolis-Hastings with some proposal distribution q(θθ)q(\theta^{\star} \mid \theta) and acceptance probability,

a=min(1,π(θy1,,yn)q(θθ)π(θy1,,yn)q(θθ))a = \min\left(1, \frac{\pi(\theta^{\star} \mid y_1, \ldots, y_n) q(\theta \mid \theta^{\star})}{\pi(\theta \mid y_1, \ldots, y_n) q(\theta^{\star} \mid \theta)}\right)

To avoid underflow we generate UUniform(0,1)U \sim \mathrm{Uniform}(0, 1) and accept if,

U>exp(i=1n(logπ(yiθ)logπ(yiθ))+logπ(θ)logπ(θ)+logq(θθ)logq(θθ))U > \exp\left(\sum_{i = 1}^{n} \left(\log \pi(y_i \mid \theta^{\star}) - \log \pi(y_i \mid \theta) \right) + \log \pi(\theta^{\star}) - \log \pi(\theta) + \log q(\theta \mid \theta^{\star}) - \log q(\theta^{\star} \mid \theta)\right)
Example 1 (Toy Example)

Consider the following,

ypBinomial(n=17,p)pBeta(α=2.3,β=4.1)ynewpBinomial(n=3,p)\begin{align*} y \mid p & \sim \mathrm{Binomial}(n = 17, p) \newline p & \sim \mathrm{Beta}(\alpha = 2.3, \beta = 4.1) \newline y_{\text{new}} \mid p & \sim \mathrm{Binomial}(n = 3, p) \newline \end{align*}

We would like to compute P(ynew=1y=4)P(y_{\text{new}} = 1 \mid y = 4).

We first note that,

pyBeta(α=2.3+y,β=4.1+ny)p \mid y \sim \mathrm{Beta}(\alpha = 2.3 + y, \beta = 4.1 + n - y)

Thus, the predictive distribution,

π(yp)=π(yp)π(p)π(py)=Binomial(yn=17,p)Beta(pα=2.3,β=4.1)Beta(pα=2.3+y,β=4.1+ny)=(ny)py(1p)nypα1(1p)β1B(α,β)pα+y1(1p)β+ny1B(α+y,β+ny)=(ny)B(α+y,β+ny)B(α,β)\begin{align*} \pi(y \mid p) & = \frac{\pi(y \mid p) \pi(p)}{\pi(p \mid y)} \newline & = \frac{\mathrm{Binomial}(y \mid n = 17, p) \mathrm{Beta}(p \mid \alpha = 2.3, \beta = 4.1)}{\mathrm{Beta}(p \mid \alpha = 2.3 + y, \beta = 4.1 + n - y)} \newline & = \frac{\binom{n}{y} p^y (1 - p)^{n - y} \cdot \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{B(\alpha, \beta)}}{\frac{p^{\alpha + y - 1} (1 - p)^{\beta + n - y - 1}}{B(\alpha + y, \beta + n - y)}} \newline & = \binom{n}{y} \frac{B(\alpha + y, \beta + n - y)}{B(\alpha, \beta)} \newline \end{align*}

Since y=4y = 4 and n=17n = 17, the posterior is Beta(6.3,17.1)\mathrm{Beta}(6.3, 17.1), predicting 1 success among 3 trials gives,

(31)B(6.3+1,17.1+2)B(6.3,17.1)=0.3933\binom{3}{1} \frac{B(6.3 + 1, 17.1 + 2)}{B(6.3, 17.1)} = 0.3933

Alternatively, we can use MCMC to estimate this value.