Part 5 - Branching Processes and MCMC

Introduction

In this part, we will introduce branching processes, their properties, some useful theorems, and applications. Then, we will introduce Markov Chain Monte Carlo (MCMC) and some algorithms related to it.

Branching Processes

Intuition (Branching Processes)

Many real-world phenomena can be described as developing with a tree-like structure, for example,

  • Growth of cells.

  • Spread of viruses or other pathogens in a population.

  • Nuclear chain reactions.

  • Spread of funny cat videos on the internet.

  • Spread of a surname over generations. 1This was the original motivation for studying branching processes by Francis Galton and Henry William Watson in the 19th century.

The process with which one node gives rise to “children” can be described as random. We will assume the probabilistic properties of this process is the same for all nodes.

Further, we will assume all nodes are organized into generatiomns, we are only concerned with the size of each generation.

Lastly, the important questions we are interested in are; How large are the generations? How much does the size vary? Will the process die out eventually (i.e., reach a generation with size 0)?

Definitions and Properties

Definition 1 (Galton-Watson Branching Process)

A (Galton-Watson) branching process is a discrete Markov chain Z0,Z1,,Zn,Z_0, Z_1, \ldots, Z_n, \ldots where,

  • The state space is the non-negative integers.

  • Z0=1Z_0 = 1 (i.e., the process starts with one individual).

  • ZnZ_n is the sum X1+X2++XZn1X_1 + X_2 + \ldots + X_{Z_{n - 1}}, where XiX_i are independent random non-negative integers all with the same offspring distribution,

Zn=i=1Zn1XiZ_n = \sum_{i = 1}^{Z_{n - 1}} X_i

Connecting each of the ZnZ_n individuals in generation nn with their offspring in generation n+1n + 1 we get a tree illustrating the branching process.

The offspring distribution is described by the probability vector a=(a0,a1,)a = (a_0, a_1, \ldots) where aj=P(Xi=j)a_j = P(X_i = j).

We will mainly focus on the interesting cases where we assume a0>0a_0 > 0 (i.e., there is a positive probability of having no offspring) and a0+a1<1a_0 + a_1 < 1 (i.e., there is a positive probability of having more than one offspring).

Definition 2 (Expected Size of Generations in a Branching Process and Classification of Branching Processes)
Note

Note that the state 0 is abosribing, this absorbtion is called extinction. Also, as a0>0a_0 > 0, all nonzero states are transient. Let μ=E[Xi]=j=0jaj\mu = \mathbb{E}[X_i] = \sum_{j=0}^{\infty} j \cdot a_j, i.e., the expected number of offspring per individual.

Then, we may compute that,

E[Zn]=E(i=1Zn1)=E(E(i=1Zn1XiZn1))==E[Zn1]μ\begin{align*} \mathbb{E}[Z_n] & = \mathbb{E}\left(\sum_{i = 1}^{Z_{n-1}}\right) \newline & = \mathbb{E}\left(\mathbb{E}\left(\sum_{i = 1}^{Z_{n-1}} X_i \mid Z_{n - 1} \right)\right) & = \ldots & = \mathbb{E}[Z_{n - 1}] \cdot \mu \newline \end{align*}

Thus, we directly get that,

E[Zn]=μnE[Z0]1=μn\mathbb{E}[Z_n] = \mu^n \cdot \underbrace{\mathbb{E}[Z_0]}_{\coloneqq 1} = \mu^n

Based on the value of μ\mu, we classify branching processes into three categories:

  1. Subcritical branching processes: μ<1\mu < 1. The expected generation size decreases over time, limnE[Zn]=0\lim_{n \to \infty} \mathbb{E}[Z_n] = 0.

  2. Critical branching processes: μ=1\mu = 1. The expected generation size remains constant over time, limnE[Zn]=1\lim_{n \to \infty} \mathbb{E}[Z_n] = 1.

  3. Supercritical branching processes: μ>1\mu > 1. The expected generation size increases over time, limnE[Zn]=\lim_{n \to \infty} \mathbb{E}[Z_n] = \infty.

Note

One can prove that if limnZn=0\lim_{n \to \infty} Z_n = 0, then the probability of extinction is 1. We will do this later.

Definition 3 (Variance of Generation Sizes in a Branching Process)

Let μ\mu be defined as above, and let σ2=Var(Xi)\sigma^2 = \mathrm{Var}(X_i) be the variance of the number of children.

Using the law of total variance we get,

Var(Zn)=Var(E[ZnZn1])+E[Var(ZnZn1)]=Var(E(i=1Zn1XiZn1))+E[Var(i=1Zn1XiZn1)]=Var(μZn1)+E[σ2Zn1]=μ2Var(Zn1)+σ2E[Zn1]\begin{align*} \mathrm{Var}(Z_n) & = \mathrm{Var}(\mathbb{E}[Z_n \mid Z_{n - 1}]) + \mathbb{E}[\mathrm{Var}(Z_n \mid Z_{n - 1})] \newline & = \mathrm{Var}\left(\mathbb{E}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right) + \mathbb{E}\left[\mathrm{Var}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right] \newline & = \mathrm{Var}(\mu Z_{n - 1}) + \mathbb{E}[\sigma^2 Z_{n - 1}] \newline & = \mu^2 \mathrm{Var}(Z_{n - 1}) + \sigma^2 \mathbb{E}[Z_{n - 1}] \newline \end{align*}

From this, we can prove by induction that, for n1n \geq 1,

Var(Zn)=σ2μn1k=0n1μk={nσ2if μ=1σ2μn1μn1μ1if μ1\mathrm{Var}(Z_n) = \sigma^2 \mu^{n - 1} \sum_{k = 0}^{n - 1} \mu^k = \begin{cases} n \sigma^2 & \text{if } \mu = 1 \newline \sigma^2 \mu^{n - 1} \frac{\mu^n - 1}{\mu - 1} & \text{if } \mu \neq 1 \end{cases}
Definition 4 (Extinction Probability of a Branching Process)

Let XX denote an offspring variable and ee the probability of extinction.

Using first-step analysis,

e=k=0P(X=k)P(extinction of k processes)=k=0P(X=k)ek=EX[eX]=GX(e)\begin{align*} e & = \sum_{k = 0}^{\infty} P(X = k) P(\text{extinction of } k \text{ processes}) \newline & = \sum_{k = 0}^{\infty} P(X = k) e^k = \mathbb{E}_X[e^X] = G_X(e) \end{align*}

where we define the probability generating function (PGF) GX(s)G_X(s) as,

GX(s)=E[sX]G_X(s) = \mathbb{E}[s^X]

Let ene_n be the probability of extinction by generation nn. Using first-step analysis again, we get,

en=k=0P(X=k)P(extinction by generation n1 of k processes)=k=0P(X=k)en1k=EX[en1X]=GX(en1)\begin{align*} e_n & = \sum_{k = 0}^{\infty} P(X = k) P(\text{extinction by generation } n - 1 \text{ of } k \text{ processes}) \newline & = \sum_{k = 0}^{\infty} P(X = k) e_{n - 1}^k = \mathbb{E}_X[e_{n - 1}^X] = G_X(e_{n - 1}) \end{align*}
Theorem 1 (Extinction Probability Theorem)

Let GXG_X be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation GX(s)=sG_X(s) = s.

Proof

If s=xs = x is any positive solution to GX(s)=sG_X(s) = s, then exe \leq x. We have 0=e0<x0 = e_0 < x. We have that GXG_X is an increasing function, as,

s0<s1    k=0P(X=k)s0k<k=0P(X=k)s1k    GX(s0)<GX(s1)s_0 < s_1 \implies \sum_{k = 0}^{\infty} P(X = k) s_0^k < \sum_{k = 0}^{\infty} P(X = k) s_1^k \implies G_X(s_0) < G_X(s_1)

Thus, applying GXG_X repeatedly we get en<xe_n < x and thus e=limnenxe = \lim_{n \to \infty} e_n \leq x.

Probability Generating Functions

Intuition (Probability Generating Functions)

For any discrete random variable XX taking values in {0,1,2,}\{0, 1, 2, \ldots\}, define the probability function GX(s)G_X(s) as,

GX(s)=E[sX]=k=0P(X=k)skG_X(s) = \mathbb{E}[s^X] = \sum_{k = 0}^{\infty} P(X = k) s^k

The series converges absolutely for s1\Vert s \Vert \leq 1. We assume ss is a real number in [0,1][0, 1].

We get a 1-1 correspondence between probability vectors on {0,1,2,}\{0, 1, 2, \ldots\} and functions represented by a series where the coefficients sum to 1.

Specifically, if GX(s)=GY(s)G_X(s) = G_Y(s) for all ss for random variables XX and YY, then XX and YY have the same distribution.

Note (Properties of Probability Generating Functions)
  • To go from XX to GX(s)G_X(s), we compute the (in)finite sum.

  • To go from GX(s)G_X(s) to XX, we use that we have,

P(X=k)=GX(k)(0)k!P(X = k) = \frac{G_X^{(k)}(0)}{k!}
  • If XX and YY are independent random variables, then,
GX+Y(s)=E[sX+Y]=E[sXsY]=E[sX]E[sY]GX(s)GY(s)G_{X + Y}(s) = \mathbb{E}[s^{X + Y}] = \mathbb{E}[s^X s^Y] = \mathbb{E}[s^X] \mathbb{E}[s^Y] \eqqcolon G_X(s) G_Y(s)
  • E[X]=GX(1)\mathbb{E}[X] = G_X^{\prime}(1)

  • E[X(X1)]=GX(1)\mathbb{E}[X(X - 1)] = G_X^{\prime \prime}(1)

  • As a consequence, Var(X)=GX(1)+GX(1)(GX(1))2\mathrm{Var}(X) = G_X^{\prime \prime}(1) + G_X^{\prime}(1) - (G_X^{\prime}(1))^2

Further, assume we have a branching process Z0,Z1,Z_0, Z_1, \ldots with independent offspring variables XiX_i all with the same distribution as XX. By writing Gn(s)=GZn(s)=E[sZn]G_n(s) = G_{Z_n}(s) = \mathbb{E}[s^{Z_n}] and GX(s)=E[sX]G_X(s) = \mathbb{E}[s^X], we get,

Gn(s)=E[sk=1Zn1Xk]=E[E[sk=1Zn1XkZn1]]=E[E[k=1Zn1sXkZn1]]=E[G(s)Zn1]=Gn1(GX(s))\begin{align*} G_n(s) & = \mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k}] \newline & = \mathbb{E}[\mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k} \mid Z_{n - 1}]] \newline & = \mathbb{E}\left[\mathbb{E}\left[\prod_{k = 1}^{Z_{n - 1}} s^{X_k} \mid Z_{n - 1}\right]\right] \newline & = \mathbb{E}[G(s)^{Z_{n - 1}}] \newline & = G_{n - 1}(G_X(s)) \newline \end{align*}

As G0(s)=E[sZ0]=sG_0(s) = \mathbb{E}[s^{Z_0}] = s, it follows that,

Gn(s)=GX(GX(GX(s)))n timesG_n(s) = \underbrace{G_X(G_X(\ldots G_X(s) \ldots))}_{n \text{ times}}
Theorem 2 (Extinction Probability Theorem (Addition))

Let GXG_X be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation GX(s)=sG_X(s) = s. Also, if the process is critical (μ=1\mu = 1) then the extinction probability is 1.

Proof (Of the last sentence)

In the critical case,

GX(1)=E[X]=μ=1G_X^{\prime}(1) = \mathbb{E}[X] = \mu = 1

As GX(s)>0G_X^{\prime \prime}(s) > 0 for s(0,1)s \in (0, 1), we get that GX(s)<1G_X^{\prime}(s) < 1 for s(0,1)s \in (0, 1), and dds(G(s)s)<0\frac{d}{ds}(G(s) - s) < 0 for s(0,1)s \in (0, 1). As G(1)1=0G(1) - 1 = 0 for any probability generating function, we get that G(s)=sG(s) = s has its smallest positive root at s=1s = 1.

Markov Chain Monte Carlo (MCMC)

Intuition (Markov Chain Monte Carlo (MCMC))

So far we have used Markov chains in the following way

  1. Define a Markov chain

  2. Find its limiting distribution, if it exists

Now, we want to,

  1. Start with a desired distribution

Construct a Markov chain having this limiting distribution

In many contexts we want to compute the expecation E[r(X)]\mathbb{E}[r(X)] for some random variable XX and some function rr.

One way to approximate this is to generate samples x1,,xNx_1, \ldots, x_N from XX and use,

E[r(X)]=limN1Nk=1Nr(xk)1Nk=1Nr(xk)\mathbb{E}[r(X)] = \lim_{N \to \infty} \frac{1}{N} \sum_{k = 1}^{N} r(x_k) \approx \frac{1}{N} \sum_{k = 1}^{N} r(x_k)

Thus, the idea is simulating from a Markov chain having XX as its limiting distribution generates an approximate sample. Average over it as above.

Intuition (Is an approximate sample good enough?)

We know this works in classical statistics.

Strong law of large numbers of samples: If Y1,Y2,,YmY_1, Y_2, \ldots, Y_m and YY are i.i.d random variables from a distribution with finite mean, and if E[r(Y)]\mathbb{E}[r(Y)] exists, then, with probability 1,

limmr(Y1)+r(Y2)++r(Ym)m=E[r(Y)]\lim_{m \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_m)}{m} = \mathbb{E}[r(Y)]

There is a corresponding result for Markov chains.

Strong law of large numbers for Markov chains: If X0,X1,X_0, X_1, \ldots is an ergodic Markov chain with stationary distribution π\pi, and if EXπ[r(X)]\mathbb{E}_{X \sim \pi}[r(X)] exists, then, with probability 1,

limmr(X0)+r(X1)++r(Xm)m=EXπ[r(X)]\lim_{m \to \infty} \frac{r(X_0) + r(X_1) + \ldots + r(X_m)}{m} = \mathbb{E}_{X \sim \pi}[r(X)]

where XX has distribution π\pi.

Note

When using this theorem in practice, one might improve accuracy by throwing away the first sequence X1,,XsX_1, \ldots, X_s for s<ms < m before computing the average. The first sequence is then called the burn-in.

Examples

Example 1 (Toy Example of MCMC)

Consider the Markov chain X0,X1,X_0, X_1, \ldots with states {0,1,2}\{0, 1, 2\} and with,

P=[0.990.01000.90.10.200.8]P = \begin{bmatrix} 0.99 & 0.01 & 0 \newline 0 & 0.9 & 0.1 \newline 0.2 & 0 & 0.8 \end{bmatrix}

One can find that the limiting distribution is v=(2023,223,123)v = (\frac{20}{23}, \frac{2}{23}, \frac{1}{23}).

Consider the function r(x)=x5r(x) = x^5. If XX is a random variable with the limiting distribution,

E[r(X)]=052023+15223+25123=3423=1.4783\mathbb{E}[r(X)] = 0^5 \cdot \frac{20}{23} + 1^5 \cdot \frac{2}{23} + 2^5 \cdot \frac{1}{23} = \frac{34}{23} = 1.4783

If Y1,,YnY_1, \ldots, Y_n are all i.i.d variables with the limiting distribution, we can check numerically that,

limnr(Y1)+r(Y2)++r(Yn)n=1.4783\lim_{n \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_n)}{n} = 1.4783

We also get for X0,X1,X_0, X_1, \ldots that,

limnr(X1)+r(X2)++r(Xn)n=1.4783\lim_{n \to \infty} \frac{r(X_1) + r(X_2) + \ldots + r(X_n)}{n} = 1.4783

but in this case the limit is approached more slowly.

The Metropolis-Hastings Algorithm, Gibbs Sampling, and The Ising Model

Definition 5 (The Metropolis-Hastings Algorithm)

The Metropolis-Hastings algorithm is a method to construct a Markov chain having a desired limiting distribution. Let θ\theta be a random variable with probability mass function π(θ)\pi( \theta). We also assume given a proposal distribution T(θnewθ)T(\theta_{\text{new}} \mid \theta), which, for every given θ\theta, provides a PMF for a new θnew\theta_{\text{new}}.

Finally, we define, for θ\theta and θnew\theta_{\text{new}}, the acceptance probability,

a=min(1,π(θnew)T(θθnew)π(θ)T(θnewθ))a = \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right)

The Metropolis-Hastings algorithm is: Starting with some initial value θ0\theta_0, generate θ1,θ2,\theta_1, \theta_2, \ldots by, at each step, proposing a new θ\theta based on the old using the proposal function and accepting it with probability aa. If it is not accepted, the old value is used again.

If this defines an ergodic Markov chain, its unique stationary distribution is π(θ)\pi(\theta).

Proof

Let PP be the transition matrix of the Markov chain defined by the Metropolis-Hastings algorithm. We need to show that πP=π\pi P = \pi. It suffices to show that the detailed balance equations are satisfied, i.e., that,

π(θ)Pθθnew=π(θnew)Pθnewθ\pi(\theta) P_{\theta \theta_{\text{new}}} = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta}

for all states θ\theta and θnew\theta_{\text{new}}. We have,

Pθθnew=T(θnewθ)a=T(θnewθ)min(1,π(θnew)T(θθnew)π(θ)T(θnewθ))P_{\theta \theta_{\text{new}}} = T(\theta_{\text{new}} \mid \theta) \cdot a = T(\theta_{\text{new}} \mid \theta) \cdot \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right)

and,

Pθnewθ=T(θθnew)a=T(θθnew)min(1,π(θ)T(θnewθ)π(θnew)T(θθnew))P_{\theta_{\text{new}} \theta} = T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} = T(\theta \mid \theta_{\text{new}}) \cdot \min\left(1, \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}\right)

If π(θ)T(θnewθ)π(θnew)T(θθnew)\pi(\theta) T(\theta_{\text{new}} \mid \theta) \leq \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}), then,

π(θ)Pθθnew=π(θ)T(θnewθ)1=π(θnew)T(θθnew)π(θ)T(θnewθ)π(θnew)T(θθnew)=π(θnew)T(θθnew)a=π(θnew)Pθnewθ\begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot 1 \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*}

If π(θ)T(θnewθ)>π(θnew)T(θθnew)\pi(\theta) T(\theta_{\text{new}} \mid \theta) > \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}), then,

π(θ)Pθθnew=π(θ)T(θnewθ)π(θnew)T(θθnew)π(θ)T(θnewθ)=π(θnew)T(θθnew)1=π(θnew)T(θθnew)a=π(θnew)Pθnewθ\begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot 1 \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*}

Thus, in both cases, the detailed balance equations are satisfied, so π\pi is the stationary distribution of the chain.

Definition 6 (Gibbs Sampling)

For any probability model over a vector θ=(θ1,θ2,,θn)\theta = (\theta_1, \theta_2, \ldots, \theta_n), consider a Metropolis-Hastings proposal function changing only one coordinate, with the value of this coordinate simulated from the conditional distribution given the remaining coordinates. The acceptance probability is always 1, so the proposal is always accepted.

Proof

Let θ=(θ1,θ2,,θn)\theta = (\theta_1, \theta_2, \ldots, \theta_n) and θnew=(θ1,,θi1,θi,new,θi+1,,θn)\theta_{\text{new}} = (\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n) differ only in coordinate ii. The proposal distribution is,

T(θnewθ)=P(θi=θi,newθ1,,θi1,θi+1,,θn)T(\theta_{\text{new}} \mid \theta) = P(\theta_i = \theta_{i,\text{new}} \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)

and,

T(θθnew)=P(θi=θiθ1,,θi1,θi+1,,θn)T(\theta \mid \theta_{\text{new}}) = P(\theta_i = \theta_i \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)

Using the definition of conditional probability, we have,

T(θnewθ)=P(θ1,,θi1,θi,new,θi+1,,θn)P(θ1,,θi1,θi+1,,θn) =π(θnew)xπ(θ1,,θi1,x,θi+1,,θn),\begin{align*} T(\theta_{\text{new}} \mid \theta) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n)} {P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \ & = \frac{\pi(\theta_{\text{new}})} {\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} , \end{align*}

and,

T(θθnew)=P(θ1,,θi1,θi,θi+1,,θn)P(θ1,,θi1,θi+1,,θn) =π(θ)xπ(θ1,,θi1,x,θi+1,,θn),\begin{align*} T(\theta \mid \theta_{\text{new}}) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_i, \theta_{i + 1}, \ldots, \theta_n)} {P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \ & = \frac{\pi(\theta)} {\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} , \end{align*}

Thus, we have,

π(θnew)T(θθnew)π(θ)T(θnewθ)=π(θnew)π(θ)xπ(θ1,,θi1,x,θi+1,,θn)π(θ)π(θnew)xπ(θ1,,θi1,x,θi+1,,θn) =π(θnew)π(θ)π(θ)π(θnew)=1,\begin{align*} \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})} {\pi(\theta) T(\theta_{\text{new}} \mid \theta)} & = \frac{ \pi(\theta_{\text{new}}) \frac{\pi(\theta)}{\sum_x \pi(\theta_1,\ldots,\theta_{i-1},x,\theta_{i+1},\ldots,\theta_n)} }{ \pi(\theta) \frac{\pi(\theta_{\text{new}})}{\sum_x \pi(\theta_1,\ldots,\theta_{i-1},x,\theta_{i+1},\ldots,\theta_n)} } \ & = \frac{\pi(\theta_{\text{new}})\pi(\theta)} {\pi(\theta)\pi(\theta_{\text{new}})} = 1 , \end{align*}

Thus, the proposal is always accepted.

Putting together an algorithm updating different coordinates in different steps may create an ergodic Markov chain.

This is called Gibbs sampling. Sometimes the conditional distributions are easy to derive. Then this is an easy-to-use version of Metropolis-Hastings.

Definition 7 (The Ising Model)

Uses a grid of vertices; We will assume an n×nn \times n grid. Two vertices vv and ww are neighbours, denoted vwv \sim w, if they are next to each other in the grid.

Each vertex vv can have value +1+1 or 1-1 (called its “spin”); We denote this by σv=1\sigma_{v} = 1 or σv=1\sigma_{v} = -1.

A configuration σ\sigma consists of a choice of +1+1 or 1-1 for each vertex. Thus, the set Ω\Omega of possible configurations has 2(n2)2^(n^2) elements. We define the energy of a configuration as E(σ)=vwσvσwE(\sigma) = - \sum_{v \sim w} \sigma_v \sigma_w.

The Gibbs distribution is the probability mass function on Ω\Omega defined by,

π(σ)exp(βE(σ))\pi(\sigma) \propto \exp(-\beta E(\sigma))

where β\beta is a parameter of the model; 1β\frac{1}{\beta} is called the temperature.

It turns out that when the temperature is high, samples from the model will show a chaotic pattern of spins, but when the temperature sinks below the phase transition value, samples will show chunks of neighbouring vertices with the same spin, i.e., the system will be “magnetized”.

Intuition (Simulating from the Ising Model using Gibbs Sampling)

For a vertex configuration σ\sigma and a vertex vv, let σv\sigma_{-v} denote the part of σ\sigma that does not involve vv. Prpose a new configuration σ\sigma^{\star} given an old configuration σ\sigma by first choosing a vertex vv, then, let σ\sigma^{\star} be identical to σ\sigma except possibly at vv. Decide the spin at vv using the conditional distribution given σv\sigma_{-v},

π(σσv)=π(σv=1,σv)π(σv)=π(σv=1,σv)π(σv=1,σv)+π(σv=1,σv)=11+π(σv=1,σv)π(σv=1,σv)=11+exp(βE(σv=1,σv)+βE(σv=1,σv))=11+exp(βvwσvσwσv=1βvwσvσwσv=1)=11+exp(2βvwσw)\begin{align*} \pi(\sigma^{\star} \mid \sigma_{-v}) & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_{-v})} \newline & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v}) + \pi(\sigma_v = -1, \sigma_{-v})} \newline & = \frac{1}{1 + \frac{\pi(\sigma_v = -1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v})}} \newline & = \frac{1}{1 + \exp(-\beta E(\sigma_v = -1, \sigma_{-v}) + \beta E(\sigma_v = 1, \sigma_{-v}))} \newline & = \frac{1}{1 + \exp(\beta \sum_{v \sim w} \sigma_v \sigma_w \mid \sigma_v = -1 - \beta \sum_{v \sim w} \sigma_v \sigma_w \mid \sigma_v = 1)} \newline & = \frac{1}{1 + \exp(-2 \beta \sum_{v \sim w} \sigma_w)} \newline \end{align*}

This works, but we will dicuss an even better approach “perfect sampling” later.

Perfect Sampling

So far, we have seen that MCMC means,

  • We want to find the expected value of some expression under some distribution.
  • The distribution is difficult to sample from, so instead we simulate from a Markov chain having the given distribution as its limiting distribution.
  • Using the strong law of large numbers for Markov chains, we can approximate the expected value as an average over expressions computed on your simulated values.

When we sample directly from a distribution, the rate of convergence of the averages is governed by the central limit theorem. When using a Markov chain, it is generally very difficult to find the rate of convergence.

There are some exceptions to this, we will now look at Perfect Sampling, where you prove that you have reached the limiting distribution exactly.

Intuition (Perfect Sampling)

Given an ergodic Markov chain with finite state space of size kk and limiting distribution π\pi. The idea is that, given nn prove that XnX_n actually has reached the limit distribution π\pi.

If we can rpove that the distribution at XnX_n is independent of the starting state X0X_0, then we might be able to conclude that XnX_n has the limiting distribution π\pi.

Construct kk Markov chains that are dependent (“coupled”) but which are marginally Markov chains as above. If they start at the kk possible value at X0X_0 but have identical values at XnX_n, then we are done.

Note

nn cannot be determined as the first value where the kk chains meet, it must be determined independently of such information. Thus, usually one wants to generate chains Xn,Xn+1,,X0X_{-n}, X_{-n + 1}, \ldots, X_0 where X0X_0 has the limiting distribution, and we stepwise increase nn to make all chains coalesce to one chain at time 0.

Intuition (Using same source of randomness for all chains)

Consider the chains Xn(j),,X0(j)X^{(j)}_{-n}, \ldots, X^{(j)}_0 for j=1,2,,kj = 1, 2, \ldots, k. Instead of simulating Xi+1(j)X^{(j)}_{i + 1} based on Xi(j)X^{(j)}_i independently for each j, we define a function gg so that Xi+1(j)=g(Xi(j),Ui)X^{(j)}_{i + 1} = g(X^{(j)}_i, U_i) for all jj where UiUniform(0,1)U_i \sim \mathrm{Uniform}(0, 1).

Thus, if two chains have identical values in XiX_i, they will also be identical in Xi+1X_{i + 1}.

Thus, for a particular nn, if all chains have not converged at X0X_0, we simulate kk chains from X2nX_{-2n} to XnX_{-n}. They might only hit a subset of the kk states at XnX_{-n} and thus might coalesce to one state at X0X_0 using the old simulations. If not, double nn again.

Definition 8 (Monotonicity)

One smart question is, do we need to keep track of all kk chains? We define a partial ordering on as et as a relation xyx \leq y between some pairs xx and yy in the set, such that,

  • If xyx \leq y and yxy \leq x then x=yx = y.

  • If xyx \leq y and yzy \leq z then xzx \leq z.

We will need that our partial ordering has a minimal element (an mm such that mxm \leq x for all xx) and a maximal element (an MM such that xMx \leq M for all xx).

If we have a partial ordering on the state space of the Markov chain and if xyx \leq y implies g(x,U)g(y,U)g(x, U) \leq g(y, U), then gg is monotone.

We can then prove that we only need to keep track of the chain starting at mm and the chain starting at MM.

Proof

If the chains starting at mm and MM coalesce at time 0, then for any other chain starting at xx, we have mxMm \leq x \leq M. By monotonicity, we get that the chain starting at mm is less than or equal to the chain starting at xx at all times, and the chain starting at MM is greater than or equal to the chain starting at xx at all times. Thus, if the chains starting at mm and MM are equal at time 0, the chain starting at xx must also be equal to them at time 0.

Example 2 (Simulating from the Ising Model using Perfect Sampling)

We saw above how to use Gibbs sampling to simulate from the Ising model. We then simulated from the conditional distributions, using,

π(σv=1σv)==11+exp(2βvwσw)\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) = \ldots = \frac{1}{1 + \exp(-2 \beta \sum_{v \sim w} \sigma_w)}

Now, we extend this method into a perfect sampling.

Define the partial ordering on configurations σ,τ\sigma, \tau be defining,

στ    σvτv for all vertices v\sigma \leq \tau \iff \sigma_v \leq \tau_v \text{ for all vertices } v

We have minimal and maximal elements m=(1,,1)m = (-1, \ldots, -1) and M=(1,,1)M = (1, \ldots, 1).

Note

iF στ\sigma \leq \tau, then for all vv π(σv=1σv)π(τv=1τv)\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) \leq \pi(\tau_v^{\star} = 1 \mid \tau_{-v}).

Defining g(σ,U)v=2I(Uπ(σv=1σv))1g(\sigma, U)_v = 2 I (U \leq \pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) ) - 1, makes gg monotone.

Given an Ising model with β>0\beta > 0, start with choosing an integer n>0n > 0 and setting configurations,

σ(n)=m=(1,,1)τ(n)=M=(1,,1)\begin{align*} \sigma^{(-n)} & = m = (-1, \ldots, -1) \newline \tau^{(-n)} & = M = (1, \ldots, 1) \newline \end{align*}
Algorithm (Perfect Sampling for the Ising Model)
  • For i=n+1,,0i = -n +1, \ldots, 0 set σ(i)=σ(i1)\sigma^{(i)} = \sigma^{(i - 1)}, τ(i)=τ(i1)\tau^{(i)} = \tau^{(i - 1)} and then,

    • Looping through all vertices vv:
      • Compute π(σv=1σv(i))\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}^{(i)}) and π(τv=1τv(i))\pi(\tau_v^{\star} = 1 \mid \tau_{-v}^{(i)}).
      • Simulate UUniform(0,1)U \sim \mathrm{Uniform}(0, 1).
      • Update σv(i)=g(σ(i),U)v\sigma_v^{(i)} = g(\sigma^{(i)}, U)_v and τv(i)=g(τ(i),U)v\tau_v^{(i)} = g(\tau^{(i)}, U)_v.
  • If σ(0)=τ(0)\sigma^{(0)} = \tau^{(0)}, this is the result. Otherwise, double nn and repeat.