
Part 5 - Branching Processes and MCMC

MVE550
Date: November 18, 2025
Last modified: November 19, 2025
21 min read

Introduction

In this part, we will introduce branching processes, their properties, some useful theorems, and applications. Then, we will introduce Markov Chain Monte Carlo (MCMC) and some algorithms related to it.

Branching Processes

Intuition: Branching Processes

Many real-world phenomena can be described as developing with a tree-like structure, for example,

  • Growth of cells.

  • Spread of viruses or other pathogens in a population.

  • Nuclear chain reactions.

  • Spread of funny cat videos on the internet.

  • Spread of a surname over generations.

The process by which one node gives rise to “children” can be described as random. We will assume the probabilistic properties of this process are the same for all nodes.

Further, we will assume all nodes are organized into generations; we are only concerned with the size of each generation.

Lastly, the important questions we are interested in are: How large are the generations? How much does the size vary? Will the process die out eventually (i.e., reach a generation with size 0)?

Definitions and Properties

Definition: Galton-Watson Branching Process

A (Galton-Watson) branching process is a discrete Markov chain $Z_0, Z_1, \ldots, Z_n, \ldots$ where,

  • The state space is the non-negative integers.

  • $Z_0 = 1$ (i.e., the process starts with one individual).

  • $Z_n$ is the sum $X_1 + X_2 + \ldots + X_{Z_{n - 1}}$, where $X_i$ are independent random non-negative integers all with the same offspring distribution, $$ Z_n = \sum_{i = 1}^{Z_{n - 1}} X_i $$ Connecting each of the $Z_n$ individuals in generation $n$ with their offspring in generation $n + 1$ we get a tree illustrating the branching process.

The offspring distribution is described by the probability vector $a = (a_0, a_1, \ldots)$ where $a_j = P(X_i = j)$.

We will mainly focus on the interesting cases where we assume $a_0 > 0$ (i.e., there is a positive probability of having no offspring) and $a_0 + a_1 < 1$ (i.e., there is a positive probability of having more than one offspring).
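Generation sizes are easy to simulate directly from this definition. A minimal sketch in Python; the offspring distribution $a = (0.25, 0.25, 0.5)$ is just an assumed example satisfying $a_0 > 0$ and $a_0 + a_1 < 1$:

```python
import random

def simulate_branching(offspring_pmf, n_generations, rng):
    """Simulate generation sizes Z_0, ..., Z_n of a Galton-Watson process.

    offspring_pmf: list where offspring_pmf[j] = P(X = j).
    Returns the list [Z_0, ..., Z_n]; the process starts with Z_0 = 1.
    """
    values = list(range(len(offspring_pmf)))
    z = [1]
    for _ in range(n_generations):
        if z[-1] == 0:                # state 0 is absorbing (extinction)
            z.append(0)
            continue
        # each of the Z_{n-1} individuals draws an independent offspring count
        children = rng.choices(values, weights=offspring_pmf, k=z[-1])
        z.append(sum(children))
    return z

rng = random.Random(0)
traj = simulate_branching([0.25, 0.25, 0.5], 20, rng)
print(traj)
```

Running this with different seeds shows both trajectories that die out quickly and trajectories that grow.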

Definition: Expected Size of Generations in a Branching Process and Classification of Branching Processes
Note

Note that the state 0 is absorbing; this absorption is called extinction. Also, as $a_0 > 0$, all nonzero states are transient. Let $\mu = \mathbb{E}[X_i] = \sum_{j=0}^{\infty} j \cdot a_j$, i.e., the expected number of offspring per individual.

Then, we may compute that, $$ \begin{align*} \mathbb{E}[Z_n] & = \mathbb{E}\left(\sum_{i = 1}^{Z_{n-1}} X_i\right) \newline & = \mathbb{E}\left(\mathbb{E}\left(\sum_{i = 1}^{Z_{n-1}} X_i \mid Z_{n - 1} \right)\right) \newline & = \mathbb{E}[\mu \cdot Z_{n - 1}] \newline & = \mathbb{E}[Z_{n - 1}] \cdot \mu \newline \end{align*} $$ Thus, by induction we directly get that, $$ \mathbb{E}[Z_n] = \mu^n \cdot \underbrace{\mathbb{E}[Z_0]}_{= 1} = \mu^n $$ Based on the value of $\mu$, we classify branching processes into three categories:

  1. Subcritical branching processes: $\mu < 1$. The expected generation size decreases over time, $\lim_{n \to \infty} \mathbb{E}[Z_n] = 0$.

  2. Critical branching processes: $\mu = 1$. The expected generation size remains constant over time, $\lim_{n \to \infty} \mathbb{E}[Z_n] = 1$.

  3. Supercritical branching processes: $\mu > 1$. The expected generation size increases over time, $\lim_{n \to \infty} \mathbb{E}[Z_n] = \infty$.

Note

One can prove that if $\lim_{n \to \infty} \mathbb{E}[Z_n] = 0$ (the subcritical case), then the probability of extinction is 1. We will prove this later.

Definition: Variance of Generation Sizes in a Branching Process

Let $\mu$ be defined as above, and let $\sigma^2 = \mathrm{Var}(X_i)$ be the variance of the number of children.

Using the law of total variance we get, $$ \begin{align*} \mathrm{Var}(Z_n) & = \mathrm{Var}(\mathbb{E}[Z_n \mid Z_{n - 1}]) + \mathbb{E}[\mathrm{Var}(Z_n \mid Z_{n - 1})] \newline & = \mathrm{Var}\left(\mathbb{E}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right) + \mathbb{E}\left[\mathrm{Var}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right] \newline & = \mathrm{Var}(\mu Z_{n - 1}) + \mathbb{E}[\sigma^2 Z_{n - 1}] \newline & = \mu^2 \mathrm{Var}(Z_{n - 1}) + \sigma^2 \mathbb{E}[Z_{n - 1}] \newline \end{align*} $$ From this, we can prove by induction that, for $n \geq 1$, $$ \mathrm{Var}(Z_n) = \sigma^2 \mu^{n - 1} \sum_{k = 0}^{n - 1} \mu^k = \begin{cases} n \sigma^2 & \text{if } \mu = 1 \newline \sigma^2 \mu^{n - 1} \frac{\mu^n - 1}{\mu - 1} & \text{if } \mu \neq 1 \end{cases} $$
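The closed form can be checked numerically against the recursion $\mathrm{Var}(Z_n) = \mu^2 \mathrm{Var}(Z_{n-1}) + \sigma^2 \mathbb{E}[Z_{n-1}]$ derived above. A small sketch; the values of $\mu$ and $\sigma^2$ are arbitrary test inputs:

```python
def variance_by_recursion(mu, sigma2, n):
    """Iterate Var(Z_k) = mu^2 Var(Z_{k-1}) + sigma^2 E[Z_{k-1}], with E[Z_{k-1}] = mu^(k-1)."""
    var = 0.0                          # Var(Z_0) = 0 since Z_0 = 1
    for k in range(1, n + 1):
        var = mu**2 * var + sigma2 * mu**(k - 1)
    return var

def variance_closed_form(mu, sigma2, n):
    """The closed form proved by induction above."""
    if mu == 1:
        return n * sigma2
    return sigma2 * mu**(n - 1) * (mu**n - 1) / (mu - 1)

for mu in (0.8, 1.0, 1.3):
    print(mu, variance_by_recursion(mu, 0.7, 10), variance_closed_form(mu, 0.7, 10))
```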

Definition: Extinction Probability of a Branching Process

Let $X$ denote an offspring variable and $e$ the probability of extinction.

Using first-step analysis, $$ \begin{align*} e & = \sum_{k = 0}^{\infty} P(X = k) P(\text{extinction of } k \text{ processes}) \newline & = \sum_{k = 0}^{\infty} P(X = k) e^k = \mathbb{E}_X[e^X] = G_X(e) \end{align*} $$ where we define the probability generating function (PGF) $G_X(s)$ as, $$ G_X(s) = \mathbb{E}[s^X] $$

Let $e_n$ be the probability of extinction by generation $n$. Using first-step analysis again, we get, $$ \begin{align*} e_n & = \sum_{k = 0}^{\infty} P(X = k) P(\text{extinction by generation } n - 1 \text{ of } k \text{ processes}) \newline & = \sum_{k = 0}^{\infty} P(X = k) e_{n - 1}^k = \mathbb{E}_X[e_{n - 1}^X] = G_X(e_{n - 1}) \end{align*} $$
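The recursion $e_n = G_X(e_{n-1})$ suggests a simple numerical method: iterate $G_X$ starting from $e_0 = 0$ until the sequence stabilizes. A sketch, assuming the offspring pmf $(0.25, 0.25, 0.5)$, for which $G_X(s) = s$ has roots $0.5$ and $1$:

```python
def pgf(pmf, s):
    """Evaluate G_X(s) = sum_k P(X = k) s^k."""
    return sum(p * s**k for k, p in enumerate(pmf))

def extinction_probability(pmf, tol=1e-12):
    """Iterate e_n = G_X(e_{n-1}) starting from e_0 = 0 until it stabilizes."""
    e = 0.0
    while True:
        e_next = pgf(pmf, e)
        if abs(e_next - e) < tol:
            return e_next
        e = e_next

# G(s) = 0.25 + 0.25 s + 0.5 s^2; the process is supercritical (mu = 1.25),
# and the iteration converges to the smallest positive root, 0.5
print(extinction_probability([0.25, 0.25, 0.5]))
```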

Theorem: Extinction Probability Theorem

Let $G_X$ be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation $G_X(s) = s$.

Proof

If $s = x$ is any positive solution to $G_X(s) = s$, then $e \leq x$: We have $0 = e_0 < x$. We have that $G_X$ is a strictly increasing function, as, $$ s_0 < s_1 \implies \sum_{k = 0}^{\infty} P(X = k) s_0^k < \sum_{k = 0}^{\infty} P(X = k) s_1^k \implies G_X(s_0) < G_X(s_1) $$ Thus, applying $G_X$ repeatedly we get $e_n = G_X(e_{n - 1}) < G_X(x) = x$ and thus $e = \lim_{n \to \infty} e_n \leq x$. As $e$ itself satisfies $G_X(e) = e$, it is the smallest positive root.

Probability Generating Functions

Intuition: Probability Generating Functions

For any discrete random variable $X$ taking values in $\{0, 1, 2, \ldots\}$, define the probability generating function $G_X(s)$ as, $$ G_X(s) = \mathbb{E}[s^X] = \sum_{k = 0}^{\infty} P(X = k) s^k $$ The series converges absolutely for $\vert s \vert \leq 1$. We assume $s$ is a real number in $[0, 1]$.

We get a 1-1 correspondence between probability vectors on $\{0, 1, 2, \ldots\}$ and functions represented by a power series with non-negative coefficients summing to 1.

Specifically, if $G_X(s) = G_Y(s)$ for all $s$ for random variables $X$ and $Y$, then $X$ and $Y$ have the same distribution.

Note: Properties of Probability Generating Functions
  • To go from $X$ to $G_X(s)$, we compute the (in)finite sum.

  • To go from $G_X(s)$ to $X$, we use that we have, $$ P(X = k) = \frac{G_X^{(k)}(0)}{k!} $$

  • If $X$ and $Y$ are independent random variables, then, $$ G_{X + Y}(s) = \mathbb{E}[s^{X + Y}] = \mathbb{E}[s^X s^Y] = \mathbb{E}[s^X] \mathbb{E}[s^Y] = G_X(s) G_Y(s) $$

  • $\mathbb{E}[X] = G_X^{\prime}(1)$

  • $\mathbb{E}[X(X - 1)] = G_X^{\prime \prime}(1)$

  • As a consequence, $\mathrm{Var}(X) = G_X^{\prime \prime}(1) + G_X^{\prime}(1) - (G_X^{\prime}(1))^2$
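The independence property says that the PGF of a sum of independent variables is the product of their PGFs; equivalently, the pmf of $X + Y$ is the convolution of the two pmfs, i.e., the product of the two polynomials of coefficients. A minimal sketch with two assumed pmfs:

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists: G_X(s) * G_Y(s)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj     # s^i * s^j contributes to the s^(i+j) term
    return out

pmf_x = [0.5, 0.5]                    # X ~ Bernoulli(1/2)
pmf_y = [0.25, 0.5, 0.25]             # Y ~ Binomial(2, 1/2)
pmf_sum = poly_mul(pmf_x, pmf_y)      # coefficients of G_{X+Y} = pmf of X + Y
print(pmf_sum)                        # [0.125, 0.375, 0.375, 0.125]
```

Here $X + Y \sim \mathrm{Binomial}(3, 1/2)$, so the product indeed recovers the pmf $(1/8, 3/8, 3/8, 1/8)$.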

Further, assume we have a branching process $Z_0, Z_1, \ldots$ with independent offspring variables $X_i$ all with the same distribution as $X$. By writing $G_n(s) = G_{Z_n}(s) = \mathbb{E}[s^{Z_n}]$ and $G_X(s) = \mathbb{E}[s^X]$, we get, $$ \begin{align*} G_n(s) & = \mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k}] \newline & = \mathbb{E}[\mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k} \mid Z_{n - 1}]] \newline & = \mathbb{E}\left[\mathbb{E}\left[\prod_{k = 1}^{Z_{n - 1}} s^{X_k} \mid Z_{n - 1}\right]\right] \newline & = \mathbb{E}[G_X(s)^{Z_{n - 1}}] \newline & = G_{n - 1}(G_X(s)) \newline \end{align*} $$ As $G_0(s) = \mathbb{E}[s^{Z_0}] = s$, it follows that, $$ G_n(s) = \underbrace{G_X(G_X(\ldots G_X(s) \ldots))}_{n \text{ times}} $$
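In particular, $P(Z_n = 0) = G_n(0)$, so the probability of extinction by generation $n$ is obtained by iterating $G_X$ on 0. A small sketch, using an assumed offspring pmf with $a_0 = 0.5$:

```python
def pgf(pmf, s):
    """Evaluate G_X(s) = sum_k P(X = k) s^k."""
    return sum(p * s**k for k, p in enumerate(pmf))

pmf = [0.5, 0.25, 0.25]     # assumed offspring distribution

# e_n = P(Z_n = 0) = G_n(0) = G_X applied n times to 0
e = 0.0
for n in range(1, 6):
    e = pgf(pmf, e)
    print(n, e)
```

The printed values increase towards the extinction probability, illustrating that $e_n = G_X(e_{n-1})$ and $e = \lim_{n \to \infty} e_n$.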

Theorem: Extinction Probability Theorem (Addition)

Let $G_X$ be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation $G_X(s) = s$. Also, if the process is critical ($\mu = 1$) then the extinction probability is 1.

Proof: Of the last sentence

In the critical case, $$ G_X^{\prime}(1) = \mathbb{E}[X] = \mu = 1 $$ As $G_X^{\prime \prime}(s) > 0$ for $s \in (0, 1)$ (using $a_0 + a_1 < 1$), we get that $G_X^{\prime}(s) < 1$ for $s \in (0, 1)$, and $\frac{d}{ds}(G_X(s) - s) < 0$ for $s \in (0, 1)$. As $G_X(1) - 1 = 0$ for any probability generating function, $G_X(s) - s$ is strictly positive on $(0, 1)$, so $G_X(s) = s$ has its smallest positive root at $s = 1$.

Markov Chain Monte Carlo (MCMC)

Intuition: Markov Chain Monte Carlo (MCMC)

So far we have used Markov chains in the following way

  1. Define a Markov chain

  2. Find its limiting distribution, if it exists

Now, we want to,

  1. Start with a desired distribution

  2. Construct a Markov chain having this distribution as its limiting distribution

In many contexts we want to compute the expectation $\mathbb{E}[r(X)]$ for some random variable $X$ and some function $r$.

One way to approximate this is to generate samples $x_1, \ldots, x_N$ from $X$ and use, $$ \mathbb{E}[r(X)] = \lim_{N \to \infty} \frac{1}{N} \sum_{k = 1}^{N} r(x_k) \approx \frac{1}{N} \sum_{k = 1}^{N} r(x_k) $$ Thus, the idea is: simulate from a Markov chain that has the distribution of $X$ as its limiting distribution to generate an approximate sample, then average over it as above.

Intuition: Is an approximate sample good enough?

We know this works in classical statistics.

Strong law of large numbers for i.i.d. samples: If $Y_1, Y_2, \ldots, Y_m$ are i.i.d. random variables with the same distribution as $Y$, and if $\mathbb{E}[r(Y)]$ exists, then, with probability 1, $$ \lim_{m \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_m)}{m} = \mathbb{E}[r(Y)] $$ There is a corresponding result for Markov chains.

Strong law of large numbers for Markov chains: If $X_0, X_1, \ldots$ is an ergodic Markov chain with stationary distribution $\pi$, and if $\mathbb{E}_{X \sim \pi}[r(X)]$ exists, then, with probability 1, $$ \lim_{m \to \infty} \frac{r(X_0) + r(X_1) + \ldots + r(X_m)}{m} = \mathbb{E}_{X \sim \pi}[r(X)] $$ where $X$ has distribution $\pi$.

Note

When using this theorem in practice, one might improve accuracy by discarding the initial segment $X_1, \ldots, X_s$ for some $s < m$ before computing the average. This initial segment is called the burn-in.

Examples

Example: Toy Example of MCMC

Consider the Markov chain $X_0, X_1, \ldots$ with states $\{0, 1, 2\}$ and with, $$ P = \begin{bmatrix} 0.99 & 0.01 & 0 \newline 0 & 0.9 & 0.1 \newline 0.2 & 0 & 0.8 \end{bmatrix} $$ One can find that the limiting distribution is $v = (\frac{20}{23}, \frac{2}{23}, \frac{1}{23})$.

Consider the function $r(x) = x^5$. If $X$ is a random variable with the limiting distribution, $$ \mathbb{E}[r(X)] = 0^5 \cdot \frac{20}{23} + 1^5 \cdot \frac{2}{23} + 2^5 \cdot \frac{1}{23} = \frac{34}{23} = 1.4783 $$ If $Y_1, \ldots, Y_n$ are all i.i.d variables with the limiting distribution, we can check numerically that, $$ \lim_{n \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_n)}{n} = 1.4783 $$ We also get for $X_0, X_1, \ldots$ that, $$ \lim_{n \to \infty} \frac{r(X_1) + r(X_2) + \ldots + r(X_n)}{n} = 1.4783 $$ but in this case the limit is approached more slowly.
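The numbers above can be reproduced: power iteration recovers the limiting distribution, and a simulated chain gives a (slowly converging) estimate of $\mathbb{E}[r(X)]$. A sketch:

```python
import random

P = [[0.99, 0.01, 0.0],
     [0.0,  0.9,  0.1],
     [0.2,  0.0,  0.8]]

# Limiting distribution by power iteration (deterministic)
v = [1/3, 1/3, 1/3]
for _ in range(2000):
    v = [sum(v[i] * P[i][j] for i in range(3)) for j in range(3)]
print(v)                    # approximately (20/23, 2/23, 1/23)

# MCMC estimate of E[r(X)] with r(x) = x^5
rng = random.Random(1)
x, total, n = 0, 0.0, 200_000
for _ in range(n):
    x = rng.choices([0, 1, 2], weights=P[x])[0]
    total += x**5
print(total / n)            # slowly approaches 34/23
```

The slow convergence comes from the chain's long sojourns in state 0 (self-transition probability 0.99).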

The Metropolis-Hastings Algorithm, Gibbs Sampling, and The Ising Model

Definition: The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is a method to construct a Markov chain having a desired limiting distribution. Let $\theta$ be a random variable with probability mass function $\pi(\theta)$. We also assume we are given a proposal distribution $T(\theta_{\text{new}} \mid \theta)$, which, for every given $\theta$, provides a PMF for a new value $\theta_{\text{new}}$.

Finally, we define, for $\theta$ and $\theta_{\text{new}}$, the acceptance probability, $$ a = \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right) $$ The Metropolis-Hastings algorithm is: Starting with some initial value $\theta_0$, generate $\theta_1, \theta_2, \ldots$ by, at each step, proposing a new $\theta$ based on the old using the proposal function and accepting it with probability $a$. If it is not accepted, the old value is used again.

If this defines an ergodic Markov chain, its unique stationary distribution is $\pi(\theta)$.

Proof

Let $P$ be the transition matrix of the Markov chain defined by the Metropolis-Hastings algorithm. We need to show that $\pi P = \pi$. It suffices to show that the detailed balance equations are satisfied, i.e., that, $$ \pi(\theta) P_{\theta \theta_{\text{new}}} = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} $$ for all states $\theta$ and $\theta_{\text{new}}$. We have, $$ P_{\theta \theta_{\text{new}}} = T(\theta_{\text{new}} \mid \theta) \cdot a = T(\theta_{\text{new}} \mid \theta) \cdot \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right) $$ and, $$ P_{\theta_{\text{new}} \theta} = T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} = T(\theta \mid \theta_{\text{new}}) \cdot \min\left(1, \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}\right) $$ If $\pi(\theta) T(\theta_{\text{new}} \mid \theta) \leq \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})$, then, $$ \begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot 1 \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*} $$ If $\pi(\theta) T(\theta_{\text{new}} \mid \theta) > \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})$, then, $$ \begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot 1 \newline & = 
\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*} $$ Thus, in both cases, the detailed balance equations are satisfied, so $\pi$ is the stationary distribution of the chain.
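A minimal sketch of the algorithm on a small discrete state space. The target weights $(1, 2, 3, 2, 1)$ and the uniform independence proposal are assumed choices for illustration; since the proposal is uniform, the $T$-factors cancel in the acceptance probability:

```python
import random

def metropolis_hastings(pi, states, n_steps, rng):
    """Metropolis-Hastings with the proposal T(new | old) = uniform over states.

    pi: unnormalized target weights, indexed by state.
    With a uniform proposal the T-factors cancel, so a = min(1, pi(new)/pi(old)).
    """
    theta = states[0]
    chain = []
    for _ in range(n_steps):
        proposal = rng.choice(states)
        a = min(1.0, pi[proposal] / pi[theta])
        if rng.random() < a:
            theta = proposal          # accept; otherwise keep the old value
        chain.append(theta)
    return chain

# Target pi proportional to (1, 2, 3, 2, 1) on states 0..4
weights = {s: w for s, w in zip(range(5), (1, 2, 3, 2, 1))}
rng = random.Random(0)
chain = metropolis_hastings(weights, list(range(5)), 100_000, rng)
freq2 = chain.count(2) / len(chain)
print(freq2)                          # close to 3/9
```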

Definition: Gibbs Sampling

For any probability model over a vector $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$, consider a Metropolis-Hastings proposal function changing only one coordinate, with the value of this coordinate simulated from the conditional distribution given the remaining coordinates. The acceptance probability is always 1, so the proposal is always accepted.

Proof

Let $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ and $\theta_{\text{new}} = (\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n)$ differ only in coordinate $i$. The proposal distribution is, $$ T(\theta_{\text{new}} \mid \theta) = P(\theta_i = \theta_{i,\text{new}} \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n) $$ and, $$ T(\theta \mid \theta_{\text{new}}) = P(\theta_i = \theta_i \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n) $$ Using the definition of conditional probability, we have, $$ \begin{align*} T(\theta_{\text{new}} \mid \theta) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n)}{P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \newline & = \frac{\pi(\theta_{\text{new}})}{\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} \end{align*} $$ and, $$ \begin{align*} T(\theta \mid \theta_{\text{new}}) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_i, \theta_{i + 1}, \ldots, \theta_n)}{P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \newline & = \frac{\pi(\theta)}{\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} \end{align*} $$ Thus, we have, $$ \begin{align*} \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)} & = \frac{\pi(\theta_{\text{new}}) \frac{\pi(\theta)}{\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)}}{\pi(\theta) \frac{\pi(\theta_{\text{new}})}{\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)}} \newline & = \frac{\pi(\theta_{\text{new}}) \pi(\theta)}{\pi(\theta) \pi(\theta_{\text{new}})} = 1 \end{align*} $$ Thus, the proposal is always accepted.

Putting together an algorithm that updates different coordinates in different steps may create an ergodic Markov chain.

This is called Gibbs sampling. Sometimes the conditional distributions are easy to derive. Then this is an easy-to-use version of Metropolis-Hastings.

Definition: The Ising Model

The Ising model uses a grid of vertices; we will assume an $n \times n$ grid. Two vertices $v$ and $w$ are neighbours, denoted $v \sim w$, if they are next to each other in the grid.

Each vertex $v$ can have value $+1$ or $-1$ (called its “spin”); We denote this by $\sigma_{v} = 1$ or $\sigma_{v} = -1$.

A configuration $\sigma$ consists of a choice of $+1$ or $-1$ for each vertex. Thus, the set $\Omega$ of possible configurations has $2^{n^2}$ elements. We define the energy of a configuration as $E(\sigma) = - \sum_{v \sim w} \sigma_v \sigma_w$.

The Gibbs distribution is the probability mass function on $\Omega$ defined by, $$ \pi(\sigma) \propto \exp(-\beta E(\sigma)) $$ where $\beta$ is a parameter of the model; $\frac{1}{\beta}$ is called the temperature.

It turns out that when the temperature is high, samples from the model show a chaotic pattern of spins, but when the temperature drops below the phase-transition value, samples show chunks of neighbouring vertices with the same spin, i.e., the system is “magnetized”.

Intuition: Simulating from the Ising Model using Gibbs Sampling

For a vertex configuration $\sigma$ and a vertex $v$, let $\sigma_{-v}$ denote the part of $\sigma$ that does not involve $v$. Propose a new configuration $\sigma^{\star}$ given an old configuration $\sigma$ by first choosing a vertex $v$; then, let $\sigma^{\star}$ be identical to $\sigma$ except possibly at $v$. Decide the spin at $v$ using the conditional distribution given $\sigma_{-v}$, $$ \begin{align*} \pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_{-v})} \newline & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v}) + \pi(\sigma_v = -1, \sigma_{-v})} \newline & = \frac{1}{1 + \frac{\pi(\sigma_v = -1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v})}} \newline & = \frac{1}{1 + \exp(-\beta E(\sigma_v = -1, \sigma_{-v}) + \beta E(\sigma_v = 1, \sigma_{-v}))} \newline & = \frac{1}{1 + \exp\left(-2 \beta \sum_{w : w \sim v} \sigma_w\right)} \newline \end{align*} $$ where the last step uses that the two energies differ only in the terms involving $v$, so $E(\sigma_v = -1, \sigma_{-v}) - E(\sigma_v = 1, \sigma_{-v}) = 2 \sum_{w : w \sim v} \sigma_w$. This works, but we will discuss an even better approach, “perfect sampling”, later.
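The conditional probability translates directly into a Gibbs sampler. A sketch for an $n \times n$ grid with free boundaries; the grid size, $\beta$, and fixed sweep order are assumed choices:

```python
import math, random

def cond_prob_up(sigma, v, beta, n):
    """P(sigma_v = +1 | rest) = 1 / (1 + exp(-2 * beta * sum of neighbour spins))."""
    i, j = v
    nbr_sum = sum(sigma[x][y]
                  for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                  if 0 <= x < n and 0 <= y < n)
    return 1.0 / (1.0 + math.exp(-2.0 * beta * nbr_sum))

def gibbs_sweep(sigma, beta, n, rng):
    """One Gibbs sweep: resample every spin from its conditional distribution."""
    for i in range(n):
        for j in range(n):
            p = cond_prob_up(sigma, (i, j), beta, n)
            sigma[i][j] = 1 if rng.random() < p else -1

n, beta = 16, 0.3
rng = random.Random(0)
sigma = [[rng.choice([-1, 1]) for _ in range(n)] for _ in range(n)]
for _ in range(50):
    gibbs_sweep(sigma, beta, n, rng)
```

Since each update draws from the exact conditional distribution, this is a Metropolis-Hastings chain whose proposals are always accepted.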

Perfect Sampling

So far, we have seen that MCMC means,

  • We want to find the expected value of some expression under some distribution.
  • The distribution is difficult to sample from, so instead we simulate from a Markov chain having the given distribution as its limiting distribution.
  • Using the strong law of large numbers for Markov chains, we can approximate the expected value as an average of expressions computed on the simulated values.

When we sample directly from a distribution, the rate of convergence of the averages is governed by the central limit theorem. When using a Markov chain, it is generally very difficult to find the rate of convergence.

There are some exceptions to this. We will now look at Perfect Sampling, where one proves that the limiting distribution has been reached exactly.

Intuition: Perfect Sampling

Suppose we are given an ergodic Markov chain with a finite state space of size $k$ and limiting distribution $\pi$. The idea is, for a given $n$, to prove that $X_n$ has actually reached the limiting distribution $\pi$.

If we can prove that the distribution of $X_n$ is independent of the starting state $X_0$, then we might be able to conclude that $X_n$ has the limiting distribution $\pi$.

Construct $k$ Markov chains that are dependent (“coupled”) but which are marginally Markov chains as above. If they start at the $k$ possible values at $X_0$ but have identical values at $X_n$, then we are done.

Note

$n$ cannot be determined as the first value where the $k$ chains meet; it must be determined independently of such information. Thus, usually one wants to generate chains $X_{-n}, X_{-n + 1}, \ldots, X_0$ where $X_0$ has the limiting distribution, and we stepwise increase $n$ to make all chains coalesce to one chain at time 0.

Intuition: Using same source of randomness for all chains

Consider the chains $X^{(j)}_{-n}, \ldots, X^{(j)}_0$ for $j = 1, 2, \ldots, k$. Instead of simulating $X^{(j)}_{i + 1}$ based on $X^{(j)}_i$ independently for each $j$, we define a function $g$ so that $X^{(j)}_{i + 1} = g(X^{(j)}_i, U_i)$ for all $j$, where $U_i \sim \mathrm{Uniform}(0, 1)$.

Thus, if two chains have identical values in $X_i$, they will also be identical in $X_{i + 1}$.

Thus, for a particular $n$, if the chains have not coalesced by $X_0$, we simulate the $k$ chains from $X_{-2n}$ to $X_{-n}$. They might hit only a subset of the $k$ states at $X_{-n}$, and thus might coalesce to one state at $X_0$ using the old simulations. If not, double $n$ again.
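The doubling scheme can be sketched for a small chain, running one coupled chain from every state and reusing the uniforms $U_{-t}$ across restarts. The 5-state reflecting random walk used here is an assumed example:

```python
import random

K = 4                                  # state space {0, 1, ..., K}, so k = K + 1 chains

def g(x, u):
    """Shared update function: step up w.p. 0.4, down w.p. 0.6 (reflecting at 0 and K)."""
    return min(x + 1, K) if u < 0.4 else max(x - 1, 0)

def cftp(rng):
    """Coupling from the past: run chains from every state, from time -n to 0,
    reusing the same uniforms U_{-t}; double n until all chains coalesce."""
    us = []                            # us[t - 1] = U_{-t}, kept across restarts
    n = 1
    while True:
        while len(us) < n:             # extend randomness backwards in time only
            us.append(rng.random())
        chains = list(range(K + 1))    # one chain per starting state
        for t in range(n, 0, -1):      # times -n, ..., -1
            chains = [g(x, us[t - 1]) for x in chains]
        if len(set(chains)) == 1:      # coalesced: X_0 is an exact stationary draw
            return chains[0]
        n *= 2

rng = random.Random(0)
print([cftp(rng) for _ in range(10)])
```

For this walk, detailed balance gives the stationary distribution $\pi(x) \propto (2/3)^x$, which repeated calls to `cftp` reproduce.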

Definition: Monotonicity

A natural question is: do we need to keep track of all $k$ chains? We define a partial ordering on a set as a relation $x \leq y$ between some pairs $x$ and $y$ in the set, such that,

  • If $x \leq y$ and $y \leq x$ then $x = y$.

  • If $x \leq y$ and $y \leq z$ then $x \leq z$.

We will need that our partial ordering has a minimal element (an $m$ such that $m \leq x$ for all $x$) and a maximal element (an $M$ such that $x \leq M$ for all $x$).

If we have a partial ordering on the state space of the Markov chain and if $x \leq y$ implies $g(x, U) \leq g(y, U)$, then $g$ is monotone.

We can then prove that we only need to keep track of the chain starting at $m$ and the chain starting at $M$.

Proof

If the chains starting at $m$ and $M$ coalesce at time 0, then for any other chain starting at $x$, we have $m \leq x \leq M$. By monotonicity, we get that the chain starting at $m$ is less than or equal to the chain starting at $x$ at all times, and the chain starting at $M$ is greater than or equal to the chain starting at $x$ at all times. Thus, if the chains starting at $m$ and $M$ are equal at time 0, the chain starting at $x$ must also be equal to them at time 0.

Example: Simulating from the Ising Model using Perfect Sampling

We saw above how to use Gibbs sampling to simulate from the Ising model. We then simulated from the conditional distributions, using, $$ \pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) = \ldots = \frac{1}{1 + \exp(-2 \beta \sum_{w : w \sim v} \sigma_w)} $$ Now, we extend this method into a perfect sampling algorithm.

Define the partial ordering on configurations $\sigma, \tau$ by defining, $$ \sigma \leq \tau \iff \sigma_v \leq \tau_v \text{ for all vertices } v $$ We have minimal and maximal elements $m = (-1, \ldots, -1)$ and $M = (1, \ldots, 1)$.

Note

If $\sigma \leq \tau$, then for all $v$, $\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) \leq \pi(\tau_v^{\star} = 1 \mid \tau_{-v})$.

Defining $g(\sigma, U)_v = 2 I (U \leq \pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) ) - 1$ makes $g$ monotone.

Given an Ising model with $\beta > 0$, start with choosing an integer $n > 0$ and setting configurations, $$ \begin{align*} \sigma^{(-n)} & = m = (-1, \ldots, -1) \newline \tau^{(-n)} & = M = (1, \ldots, 1) \newline \end{align*} $$

Algorithm: Perfect Sampling for the Ising Model
  • For $i = -n + 1, \ldots, 0$ set $\sigma^{(i)} = \sigma^{(i - 1)}$, $\tau^{(i)} = \tau^{(i - 1)}$ and then,

    • Looping through all vertices $v$:
      • Compute $\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}^{(i)})$ and $\pi(\tau_v^{\star} = 1 \mid \tau_{-v}^{(i)})$.
      • Simulate $U \sim \mathrm{Uniform}(0, 1)$.
      • Update $\sigma_v^{(i)} = g(\sigma^{(i)}, U)_v$ and $\tau_v^{(i)} = g(\tau^{(i)}, U)_v$.
  • If $\sigma^{(0)} = \tau^{(0)}$, this is the result. Otherwise, double $n$ and repeat.
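The algorithm above can be sketched in Python, tracking only the chains started from $m$ and $M$. Free boundary conditions and a fixed sweep order are assumed choices, and $\beta = 0.2$ (below the critical value) is used so that coalescence is fast:

```python
import math, random

def update(sigma, beta, n, us):
    """One Gibbs sweep at a fixed time, driven by the stored uniforms `us`
    (one uniform per vertex, shared between the two coupled chains)."""
    for i in range(n):
        for j in range(n):
            nbr = sum(sigma[x][y]
                      for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                      if 0 <= x < n and 0 <= y < n)
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * nbr))
            sigma[i][j] = 1 if us[i * n + j] < p_up else -1

def ising_cftp(n, beta, rng):
    """Perfect sampling for the Ising model: sweep the chains started from the
    all -1 and all +1 configurations from time -m to 0, reusing the stored
    randomness, and double m until they coalesce at time 0."""
    sweeps = []                         # sweeps[t - 1] = uniforms used at time -t
    m = 1
    while True:
        while len(sweeps) < m:          # new randomness only for earlier times
            sweeps.append([rng.random() for _ in range(n * n)])
        lo = [[-1] * n for _ in range(n)]   # chain started from m = (-1, ..., -1)
        hi = [[1] * n for _ in range(n)]    # chain started from M = (1, ..., 1)
        for t in range(m, 0, -1):           # times -m, ..., -1
            update(lo, beta, n, sweeps[t - 1])
            update(hi, beta, n, sweeps[t - 1])
        if lo == hi:                    # by monotonicity, all chains coalesced
            return lo
        m *= 2

sample = ising_cftp(8, 0.2, random.Random(0))
```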