In this part, we will introduce branching processes, their properties, some useful theorems, and applications.
Then, we will introduce Markov Chain Monte Carlo (MCMC) and some algorithms related to it.
Branching Processes
Intuition (Branching Processes)
Many real-world phenomena can be described as developing with a tree-like structure, for example,
Growth of cells.
Spread of viruses or other pathogens in a population.
Nuclear chain reactions.
Spread of funny cat videos on the internet.
Spread of a surname over generations. 1This was the original motivation for studying branching processes by Francis Galton and Henry William Watson in the 19th century.
The process with which one node gives rise to “children” can be described as random.
We will assume the probabilistic properties of this process is the same for all nodes.
Further, we will assume all nodes are organized into generatiomns, we are only concerned with the size of each generation.
Lastly, the important questions we are interested in are; How large are the generations? How much does the size vary? Will the process die out eventually (i.e., reach a generation with size 0)?
Definitions and Properties
Definition 1 (Galton-Watson Branching Process)
A (Galton-Watson) branching process is a discrete Markov chain Z0,Z1,…,Zn,… where,
The state space is the non-negative integers.
Z0=1 (i.e., the process starts with one individual).
Zn is the sum X1+X2+…+XZn−1, where Xi are independent random non-negative integers all with the same offspring distribution,
Zn=i=1∑Zn−1Xi
Connecting each of the Zn individuals in generation n with their offspring in generation n+1 we get a tree illustrating the branching process.
The offspring distribution is described by the probability vector a=(a0,a1,…) where aj=P(Xi=j).
We will mainly focus on the interesting cases where we assume a0>0 (i.e., there is a positive probability of having no offspring) and a0+a1<1 (i.e., there is a positive probability of having more than one offspring).
Definition 2 (Expected Size of Generations in a Branching Process and Classification of Branching Processes)Note
Note that the state 0 is abosribing, this absorbtion is called extinction.
Also, as a0>0, all nonzero states are transient.
Let μ=E[Xi]=∑j=0∞j⋅aj, i.e., the expected number of offspring per individual.
From this, we can prove by induction that, for n≥1,
Var(Zn)=σ2μn−1k=0∑n−1μk={nσ2σ2μn−1μ−1μn−1if μ=1if μ=1Definition 4 (Extinction Probability of a Branching Process)
Let X denote an offspring variable and e the probability of extinction.
Using first-step analysis,
e=k=0∑∞P(X=k)P(extinction of k processes)=k=0∑∞P(X=k)ek=EX[eX]=GX(e)
where we define the probability generating function (PGF) GX(s) as,
GX(s)=E[sX]
Let en be the probability of extinction by generation n. Using first-step analysis again, we get,
en=k=0∑∞P(X=k)P(extinction by generation n−1 of k processes)=k=0∑∞P(X=k)en−1k=EX[en−1X]=GX(en−1)Theorem 1 (Extinction Probability Theorem)
Let GX be the probability generating function for the offspring distribution for a branching process.
The probability of eventual extinction is the smallest positive root of the equation GX(s)=s.
Proof
If s=x is any positive solution to GX(s)=s, then e≤x.
We have 0=e0<x. We have that GX is an increasing function, as,
Thus, applying GX repeatedly we get en<x and thus e=limn→∞en≤x.
Probability Generating Functions
Intuition (Probability Generating Functions)
For any discrete random variable X taking values in {0,1,2,…}, define the probability function GX(s) as,
GX(s)=E[sX]=k=0∑∞P(X=k)sk
The series converges absolutely for ∥s∥≤1. We assume s is a real number in [0,1].
We get a 1-1 correspondence between probability vectors on {0,1,2,…} and functions represented by a series where the coefficients sum to 1.
Specifically, if GX(s)=GY(s) for all s for random variables X and Y, then X and Y have the same distribution.
Note (Properties of Probability Generating Functions)
To go from X to GX(s), we compute the (in)finite sum.
To go from GX(s) to X, we use that we have,
P(X=k)=k!GX(k)(0)
If X and Y are independent random variables, then,
GX+Y(s)=E[sX+Y]=E[sXsY]=E[sX]E[sY]=:GX(s)GY(s)
E[X]=GX′(1)
E[X(X−1)]=GX′′(1)
As a consequence, Var(X)=GX′′(1)+GX′(1)−(GX′(1))2
Further, assume we have a branching process Z0,Z1,… with independent offspring variables Xi all with the same distribution as X.
By writing Gn(s)=GZn(s)=E[sZn] and GX(s)=E[sX], we get,
Gn(s)=n timesGX(GX(…GX(s)…))Theorem 2 (Extinction Probability Theorem (Addition))
Let GX be the probability generating function for the offspring distribution for a branching process.
The probability of eventual extinction is the smallest positive root of the equation GX(s)=s.
Also, if the process is critical (μ=1) then the extinction probability is 1.
Proof (Of the last sentence)
In the critical case,
GX′(1)=E[X]=μ=1
As GX′′(s)>0 for s∈(0,1), we get that GX′(s)<1 for s∈(0,1), and dsd(G(s)−s)<0 for s∈(0,1).
As G(1)−1=0 for any probability generating function, we get that G(s)=s has its smallest positive root at s=1.
Markov Chain Monte Carlo (MCMC)
Intuition (Markov Chain Monte Carlo (MCMC))
So far we have used Markov chains in the following way
Define a Markov chain
Find its limiting distribution, if it exists
Now, we want to,
Start with a desired distribution
Construct a Markov chain having this limiting distribution
In many contexts we want to compute the expecation E[r(X)] for some random variable X and some function r.
One way to approximate this is to generate samples x1,…,xN from X and use,
E[r(X)]=N→∞limN1k=1∑Nr(xk)≈N1k=1∑Nr(xk)
Thus, the idea is simulating from a Markov chain having X as its limiting distribution generates an approximate sample. Average over it as above.
Intuition (Is an approximate sample good enough?)
We know this works in classical statistics.
Strong law of large numbers of samples: If Y1,Y2,…,Ym and Y are i.i.d random variables from a distribution with finite mean, and if E[r(Y)] exists, then, with probability 1,
m→∞limmr(Y1)+r(Y2)+…+r(Ym)=E[r(Y)]
There is a corresponding result for Markov chains.
Strong law of large numbers for Markov chains: If X0,X1,… is an ergodic Markov chain with stationary distribution π, and if EX∼π[r(X)] exists, then, with probability 1,
m→∞limmr(X0)+r(X1)+…+r(Xm)=EX∼π[r(X)]
where X has distribution π.
Note
When using this theorem in practice, one might improve accuracy by throwing away the first sequence X1,…,Xs for s<m before computing the average. The first sequence is then called the burn-in.
Examples
Example 1 (Toy Example of MCMC)
Consider the Markov chain X0,X1,… with states {0,1,2} and with,
P=0.9900.20.010.9000.10.8
One can find that the limiting distribution is v=(2320,232,231).
Consider the function r(x)=x5. If X is a random variable with the limiting distribution,
E[r(X)]=05⋅2320+15⋅232+25⋅231=2334=1.4783
If Y1,…,Yn are all i.i.d variables with the limiting distribution, we can check numerically that,
n→∞limnr(Y1)+r(Y2)+…+r(Yn)=1.4783
We also get for X0,X1,… that,
n→∞limnr(X1)+r(X2)+…+r(Xn)=1.4783
but in this case the limit is approached more slowly.
The Metropolis-Hastings Algorithm, Gibbs Sampling, and The Ising Model
Definition 5 (The Metropolis-Hastings Algorithm)
The Metropolis-Hastings algorithm is a method to construct a Markov chain having a desired limiting distribution.
Let θ be a random variable with probability mass function π(θ).
We also assume given a proposal distribution T(θnew∣θ), which, for every given θ, provides a PMF for a new θnew.
Finally, we define, for θ and θnew, the acceptance probability,
a=min(1,π(θ)T(θnew∣θ)π(θnew)T(θ∣θnew))
The Metropolis-Hastings algorithm is: Starting with some initial value θ0, generate θ1,θ2,… by, at each step, proposing a new θ based on the old using the proposal function and accepting it with probability a.
If it is not accepted, the old value is used again.
If this defines an ergodic Markov chain, its unique stationary distribution is π(θ).
Proof
Let P be the transition matrix of the Markov chain defined by the Metropolis-Hastings algorithm.
We need to show that πP=π.
It suffices to show that the detailed balance equations are satisfied, i.e., that,
Thus, in both cases, the detailed balance equations are satisfied, so π is the stationary distribution of the chain.
Definition 6 (Gibbs Sampling)
For any probability model over a vector θ=(θ1,θ2,…,θn), consider a Metropolis-Hastings proposal function changing only one coordinate, with the value of this coordinate simulated from the conditional distribution given the remaining coordinates.
The acceptance probability is always 1, so the proposal is always accepted.
Proof
Let θ=(θ1,θ2,…,θn) and θnew=(θ1,…,θi−1,θi,new,θi+1,…,θn) differ only in coordinate i.
The proposal distribution is,
T(θnew∣θ)=P(θi=θi,new∣θ1,…,θi−1,θi+1,…,θn)
and,
T(θ∣θnew)=P(θi=θi∣θ1,…,θi−1,θi+1,…,θn)
Using the definition of conditional probability, we have,
Putting together an algorithm updating different coordinates in different steps may create an ergodic Markov chain.
This is called Gibbs sampling.
Sometimes the conditional distributions are easy to derive. Then this is an easy-to-use version of Metropolis-Hastings.
Definition 7 (The Ising Model)
Uses a grid of vertices; We will assume an n×n grid.
Two vertices v and w are neighbours, denoted v∼w, if they are next to each other in the grid.
Each vertex v can have value +1 or −1 (called its “spin”); We denote this by σv=1 or σv=−1.
A configuration σ consists of a choice of +1 or −1 for each vertex. Thus, the set Ω of possible configurations has 2(n2) elements.
We define the energy of a configuration as E(σ)=−∑v∼wσvσw.
The Gibbs distribution is the probability mass function on Ω defined by,
π(σ)∝exp(−βE(σ))
where β is a parameter of the model; β1 is called the temperature.
It turns out that when the temperature is high, samples from the model will show a chaotic pattern of spins, but when the temperature sinks below the phase transition value, samples will show chunks of neighbouring vertices with the same spin, i.e., the system will be “magnetized”.
Intuition (Simulating from the Ising Model using Gibbs Sampling)
For a vertex configuration σ and a vertex v, let σ−v denote the part of σ that does not involve v.
Prpose a new configuration σ⋆ given an old configuration σ by first choosing a vertex v, then, let σ⋆ be identical to σ except possibly at v.
Decide the spin at v using the conditional distribution given σ−v,
This works, but we will dicuss an even better approach “perfect sampling” later.
Perfect Sampling
So far, we have seen that MCMC means,
We want to find the expected value of some expression under some distribution.
The distribution is difficult to sample from, so instead we simulate from a Markov chain having the given distribution as its limiting distribution.
Using the strong law of large numbers for Markov chains, we can approximate the expected value as an average over expressions computed on your simulated values.
When we sample directly from a distribution, the rate of convergence of the averages is governed by the central limit theorem.
When using a Markov chain, it is generally very difficult to find the rate of convergence.
There are some exceptions to this, we will now look at Perfect Sampling, where you prove that you have reached the limiting distribution exactly.
Intuition (Perfect Sampling)
Given an ergodic Markov chain with finite state space of size k and limiting distribution π.
The idea is that, given n prove that Xn actually has reached the limit distribution π.
If we can rpove that the distribution at Xn is independent of the starting state X0, then we might be able to conclude that Xn has the limiting distribution π.
Construct k Markov chains that are dependent (“coupled”) but which are marginally Markov chains as above.
If they start at the k possible value at X0 but have identical values at Xn, then we are done.
Note
n cannot be determined as the first value where the k chains meet, it must be determined independently of such information.
Thus, usually one wants to generate chains X−n,X−n+1,…,X0 where X0 has the limiting distribution, and we stepwise increase n to make all chains coalesce to one chain at time 0.
Intuition (Using same source of randomness for all chains)
Consider the chains X−n(j),…,X0(j) for j=1,2,…,k.
Instead of simulating Xi+1(j) based on Xi(j) independently for each j, we define a function g so that Xi+1(j)=g(Xi(j),Ui) for all j where Ui∼Uniform(0,1).
Thus, if two chains have identical values in Xi, they will also be identical in Xi+1.
Thus, for a particular n, if all chains have not converged at X0, we simulate k chains from X−2n to X−n.
They might only hit a subset of the k states at X−n and thus might coalesce to one state at X0 using the old simulations. If not, double n again.
Definition 8 (Monotonicity)
One smart question is, do we need to keep track of all k chains?
We define a partial ordering on as et as a relation x≤y between some pairs x and y in the set, such that,
If x≤y and y≤x then x=y.
If x≤y and y≤z then x≤z.
We will need that our partial ordering has a minimal element (an m such that m≤x for all x) and a maximal element (an M such that x≤M for all x).
If we have a partial ordering on the state space of the Markov chain and if x≤y implies g(x,U)≤g(y,U), then g is monotone.
We can then prove that we only need to keep track of the chain starting at m and the chain starting at M.
Proof
If the chains starting at m and M coalesce at time 0, then for any other chain starting at x, we have m≤x≤M.
By monotonicity, we get that the chain starting at m is less than or equal to the chain starting at x at all times, and the chain starting at M is greater than or equal to the chain starting at x at all times.
Thus, if the chains starting at m and M are equal at time 0, the chain starting at x must also be equal to them at time 0.
Example 2 (Simulating from the Ising Model using Perfect Sampling)
We saw above how to use Gibbs sampling to simulate from the Ising model. We then simulated from the conditional distributions, using,
π(σv⋆=1∣σ−v)=…=1+exp(−2β∑v∼wσw)1
Now, we extend this method into a perfect sampling.
Define the partial ordering on configurations σ,τ be defining,
σ≤τ⟺σv≤τv for all vertices v
We have minimal and maximal elements m=(−1,…,−1) and M=(1,…,1).
Note
iF σ≤τ, then for all vπ(σv⋆=1∣σ−v)≤π(τv⋆=1∣τ−v).
Defining g(σ,U)v=2I(U≤π(σv⋆=1∣σ−v))−1, makes g monotone.
Given an Ising model with β>0, start with choosing an integer n>0 and setting configurations,
σ(−n)τ(−n)=m=(−1,…,−1)=M=(1,…,1)Algorithm (Perfect Sampling for the Ising Model)
For i=−n+1,…,0 set σ(i)=σ(i−1), τ(i)=τ(i−1) and then,
Looping through all vertices v:
Compute π(σv⋆=1∣σ−v(i)) and π(τv⋆=1∣τ−v(i)).
Simulate U∼Uniform(0,1).
Update σv(i)=g(σ(i),U)v and τv(i)=g(τ(i),U)v.
If σ(0)=τ(0), this is the result. Otherwise, double n and repeat.