Overview

Part 16 - Summary

Table of Contents

Theory

Definition 1 (Stochastic Process)

A stochastic process is a collection of random variables, {Xt,tI}\{X_t, t \in \mathcal{I} \}. where I\mathcal{I} is the index set of the process and S\mathcal{S} is the common state space of the random variables XtX_t.

Definition 2 (Markov Property)

A process fulfills the Markov property if, for any t0It_0 \in \mathcal{I}, whenever Xt0X_{t_0} is known, XtX_t (with t>t0t > t_0) is independent of the values for XsX_s for all s<t0s < t_0.

Recall (Conditional Probability and Independence)

Given the events AA and BB, the conditional probability of AA given BB is defined as,

P(AB)=P(AB)P(B)P(A \mid B) = \frac{P(A \cap B)}{P(B)}

The events AA and BB are independent if P(AB)=P(A)P(B)P(A \cap B) = P(A) P(B), or equivalently, if P(AB)=P(A)P(A \mid B) = P(A).

The law of total probability states that if B1,B2,,BnB_1, B_2, \ldots, B_n is a sequence of events that partitions S. Then,

P(A)=i=1nP(ABi)=i=1nP(ABi)P(Bi)P(A) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i) P(B_i)

Thus, Bayes’ law follow directly from the definition of conditional probability and the law of total probability,

P(BA)=P(AB)P(B)P(A)P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}
Definition 3 (Law of Total Expectation)E[Y]=E[E[YX]]\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid X]]
Definition 4 (Law of Total Variance)

Recall that, by definition,

Var(Y)E[(YE[Y])2]=E[Y2](E[Y])2\mathrm{Var}(Y) \coloneqq \mathbb{E}\left[(Y - \mathbb{E}[Y])^2\right] = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2

Similarly, we have for the conditional variance,

Var(YX)EYX=x[(YE[YX])2X]\mathrm{Var}(Y \mid X) \coloneqq \mathbb{E}_{Y \mid X = x}\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right]

With these definitions, we can state the law of total variance as,

Var(Y)=E[Var(YX)]+Var(E[YX])\mathrm{Var}(Y) = \mathbb{E}[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(\mathbb{E}[Y \mid X])
Definition 5 (Markov Chain)

Let SS be a discrete set (not necessarily finite), called the state space. A Markov Chain is a sequence of random variables X0,X1,X2,X_0, X_1, X_2, \ldots taking values in SS, with the property,

π(Xn+1X0,X1,,Xn)=π(Xn+1Xn)\pi(X_{n+1} \mid X_0, X_1, \ldots, X_n) = \pi(X_{n+1} \mid X_n)

for all n1n \geq 1.

The chain is said to be time-homogeneous if, for all n0n \geq 0 1We generally assume this unless otherwise stated.,

π(Xn+1Xn)=π(X1X0)\pi(X_{n+1} \mid X_n) = \pi(X_1 \mid X_0)

The transition matrix is defined with,

Pij=π(Xn+1=jXn=i)P_{ij} = \pi(X_{n+1} = j \mid X_n = i)

Further, a stochastic matrix is a square matrix PP with non-negative entries, satisfying P1=1P \mathbf{1} = \mathbf{1}, where 1\mathbf{1} is a row vector of ones (i.e., all rows sum to one).

Lastly, all transition matrices are stochastic matrices, and all stochastic matrices can be used as transition matrices.

Definition 6 (Limiting Distribution)

A limiting distribution for a Markov chain with transition matrix PP is a probability vector vv such that,

limn(Pn)ij=vj\lim_{n \to \infty} (P^n)_{ij} = v_j

for all states ii and jj.

An equivalent formulation is, the limit limn(Pn)ij\lim_{n \to \infty} (P^n)_{ij} exists and does not depend on ii.

Another equivalent formulation is, limnPn\lim_{n \to \infty} P^n is a stochastic matrix with all rows identical.

Further, a Markov chain has either no or one unique limiting distribution. If a limiting distribution exists, its probabilities correspond to the proportion of time steps the chain spends at each state in the long run.

Definition 7 (Stationary Distribution)

A stationary distribution for a Markov chain is a distribution that is unchanged when applying one step of the Markov chain. If PP is the transition matrix, then a probability vector vv represents a stationary distribution if and only if,

vP=vvP = v

A Markov chain can have zero, one, or many stationary distributions. Limiting distributions are always stationary distributions (but not necessarily vice versa).

Definition 8 (Regular Transition Matrices)

A stochastic matrix PP is positive if all entries are strictly positive, i.e., Pij>0P_{ij} > 0 for all ii and jj. A stochastic matrix PP is regular if PnP^n is positive for some n>0n > 0.

Theorem 1 (Limit Theorem for Regular Markov Chains)

If the transition matrix PP is regular, the limiting distribution exists. There are no other stationary distributions and the limiting distribution is positive, i.e., all states have positive probability in the limiting distribution.

Theorem 2 (Random Walks on Undirected Graphs)

Random walks in a Markov chains (i.e., undircted graphs) have the stationary distribution vv with,

vi=deg(i)Sv_i = \frac{\mathrm{deg}(i)}{S}

where deg(i)\mathrm{deg}(i) is the degree of node ii (i.e., the number of edges going into node ii) and SS is the sum over all nodes of their degrees. Further, weighted random walks in a Markov chains (i.e., weighted undircted graphs) have the stationary distribution vv with,

vi=w(i)ev_i = \frac{w(i)}{e}

where w(i)w(i) is the sum of the weights of the edges going into node ii, and ee is the total sum over all nodes of their w(i)w(i) values.

Intuition (Moving Between States)

State jj is accessible from state ii if (Pn)ij>0(P^n)_{ij} > 0 for some n0n \geq 0. States ii and jj communicate if ii is accessible from jj and jj is accessible from ii.

“Communication” is transitive, i.e., if ii communicates with jj and jj communicates with kk, then ii communicates with kk.

Communication is an equivalence relation, subdividing all states into communication classes. Communication classes can be found for example by drawing transition graphs.

Lastly, a Markov chain is irreducible if it has exactly one communication class.

Definition 9 (Recurrence and Transience)

Let TjT_j be the first passage time to state jj,

Tj=min{n>0:Xn=j}T_j = \min \{n > 0 : X_n = j \}

Let fjf_j be the probability that a chain starting at jj will return to jj,

fj=P(Tj<X0=j)f_j = P(T_j < \infty \mid X_0 = j)

A state jj is recurrent if a chain starting at jj will eventually revisit jj, i.e., if fj=1f_j = 1.

A state jj is transient if a chain starting at jj has a positive probability of never revisiting jj, i.e., if fj<1f_j < 1.

Note

The expected number of visits at jj when the chain starts at ii is given by n=0(Pn)ij\sum_{n = 0}^{\infty} (P^n)_{ij}.

jj is recurrent if and only if n=0(Pn)jj=\sum_{n = 0}^{\infty} (P^n)_{jj} = \infty.

jj is transient if and only if n=0(Pn)jj<\sum_{n = 0}^{\infty} (P^n)_{jj} < \infty.

Note (Communication Classes)

The states of a communication class are either all recurrent or all transient. The states of a finite irreducible Markov chain are all recurrent.

If a state is recurrent, only states inside its communication class are accessible from it. If no states outside a finite communication class are accessible from it, then the class consists of recurrent states.

Theorem 3 (Limit Theorem for Finite Irreducible Markov Chains)
Recall

In a finite irreducible Markov chain, all states are recurrent.

Let μj=E[TjX0=j]\mu_j = \mathbb{E}[T_j \mid X_0 = j] be the expected return time to state jj. Then μj<\mu_j < \infty for all states jj and the vector vv with vj=1μjv_j = \frac{1}{\mu_j} is the unique stationary distribution.

Further,

vj=limn1nm=0n1(Pm)ijv_j = \lim_{n \to \infty} \frac{1}{n} \sum_{m = 0}^{n - 1} (P^m)_{ij}
Note

All finite regular Markov chains are finite irreducible Markov chains, but not vice versa. The overall conclusion here is weaker than that for finite regular Markov chains. Not all finite irreducible Markov chains have limiting distributions.

Definition 10 (Periodicity)

The period of a state ii is the greatest common divisor of all n>0n > 0 such that (Pn)ii>0(P^n)_{ii} > 0. All states of a communication class have the same period.

A Markov chain is periodic if it is irreducible and all states have period greater than 1.

A Markov chain is aperiodic if it is irreducible and all states have period 1.

Definition 11 (Ergodic Markov Chains)

A Markov chain is ergodic if,

  1. It is irreducible.

  2. It is aperiodic.

  3. All states are positive recurrent (i..e, have finite expected return times (which always is true if the state space is finite)).

Theorem 4 (Fundamental Limit Theorem for Ergodic Markov Chains)

There exists a unique stationary distribution vv which is the limiting distribution of the chain.

Note

We can also show that a finite Markov chain is ergodic if and only if its transition matrix is regular.

Definition 12 (Time Reversibility)

Let PP be the transition matrix of an irreducible Markov chain with stationary distribution vv. The chain is time reversible if, when running from its stationary distribution, it looks the same moving forward as backwards, i.e.,

π(Xk=i,Xk+1=j)=π(Xk=j,Xk+1=i)\pi(X_k = i, X_{k + 1} = j) = \pi(X_k = j, X_{k + 1} = i)

This may also be written as viPij=vjPjiv_i P_{ij} = v_j P_{ji} for all states ii and jj. It is called the detailed balance equations.

Note

If xx is a probability vector satisfying xiPij=xjPjix_i P_{ij} = x_j P_{ji} for all states ii and jj, then xx is the stationary distribution of the chain, and the chain is time reversible.

Proof

We have,

(xP)j=ixiPij=ixjPji=xjiPji=xj\begin{align*} (xP)_j & = \sum_{i} x_i P_{ij} \newline & = \sum_i x_j P_{ji} \newline & = x_j \sum_i P_{ji} \newline & = x_j \end{align*}

Thus, xP=xxP = x, so xx is the stationary distribution.

If a Markov chain is defined as a random walk on a weighted undirected graph, then it is time reversible.

Proof

Let w(i,j)w(i, j) be the weight of the edge between nodes ii and jj (0 if no edge exists). Let w(i)=kw(i,k)w(i) = \sum_{k} w(i, k) be the sum of weights of edges going into node ii. The transition matrix is given by Pij=w(i,j)w(i)P_{ij} = \frac{w(i, j)}{w(i)}. Let vi=w(i)ev_i = \frac{w(i)}{e}, where ee is the total sum over all nodes of their w(i)w(i) values. Then,

viPij=w(i)ew(i,j)w(i)=w(i,j)e=w(j,i)e=w(j)ew(j,i)w(j)=vjPji\begin{align*} v_i P_{ij} & = \frac{w(i)}{e} \cdot \frac{w(i, j)}{w(i)} \newline & = \frac{w(i, j)}{e} \newline & = \frac{w(j, i)}{e} \newline & = \frac{w(j)}{e} \cdot \frac{w(j, i)}{w(j)} \newline & = v_j P_{ji} \end{align*}

Thus, the detailed balance equations are satisfied, so the chain is time reversible.

If a finite Markov chain is time reversible, it can be represented as a random walk on a weighted undirected graph.

Proof

Let PP be the transition matrix of a finite time reversible Markov chain with stationary distribution vv. Let w(i,j)=viPijw(i, j) = v_i P_{ij} be the weight of the edge between nodes ii and jj. Then,

jw(i,j)=jviPij=vijPij=vi\sum_{j} w(i, j) = \sum_{j} v_i P_{ij} = v_i \sum_{j} P_{ij} = v_i

Thus, the sum of weights of edges going into node ii is viv_i. The transition matrix is given by,

Pij=w(i,j)kw(i,k)=viPijvi=PijP_{ij} = \frac{w(i, j)}{\sum_{k} w(i, k)} = \frac{v_i P_{ij}}{v_i} = P_{ij}

Thus, the Markov chain can be represented as a random walk on a weighted undirected graph.

Definition 13 (Absorbing Chains)

State ii is absorbing if Pii=1P_{ii} = 1a A Markov chain is an absorbing chain if it has at least one absorbing state.

By reordering the states, the transition matrix for an absorbing chain can be written in the block form,

P=[QR0I]P = \begin{bmatrix} Q & R \newline 0 & I \end{bmatrix}

where II is the identity matrix, 00 is a matrix of zeroes, and QQ corresponds to transient states. We can prove by induction that,

Pn=[Qn(I+Q+Q2++Qn1)R0I]P^n = \begin{bmatrix} Q^n & (I + Q + Q^2 + \ldots + Q^{n - 1}) R \newline 0 & I \end{bmatrix}

Taking the limit and using limnQn=0\lim_{n \to \infty} Q^n = 0 (since all states corresponding to QQ are transient), we have,

limnPn=[0(IQ)1R0I]=[0FR0I]\lim_{n \to \infty} P^n = \begin{bmatrix} 0 & (I - Q)^{-1} R \newline 0 & I \end{bmatrix} = \begin{bmatrix} 0 & FR \newline 0 & I \end{bmatrix}

where F=(IQ)1=limn(I+Q+Q2++Qn1)F = (I - Q)^{-1} = \lim_{n \to \infty} (I + Q + Q^2 + \ldots + Q^{n - 1}) is called the fundamental matrix.

The probability to be absorbed in a particular absorbing state given a start in a transient state is given by the entries of FRFR.

Further, the expected number of visits in state jj for a chain that starts in the transient state ii is given by FijF_{ij} (as this is equal to n=0(Pn)ij\sum_{n = 0}^{\infty} (P^n)_{ij}).

Thus, the expected number of steps until absorption is given by the vector F1TF \mathbf{1}^T, where 1\mathbf{1} is a column vector of ones.

Note

Given an irreducible Markov chain, to compute the expected number of steps needed to go from state ii to the first visit to state jj, one can change the chain into one where state jj is absorbing, and compute the expected number of steps until absorption when starting at state ii.

Definition 14 (Galton-Watson Branching Process)

A (Galton-Watson) branching process is a discrete Markov chain Z0,Z1,,Zn,Z_0, Z_1, \ldots, Z_n, \ldots where,

  • The state space is the non-negative integers.

  • Z0=1Z_0 = 1 (i.e., the process starts with one individual).

  • ZnZ_n is the sum X1+X2++XZn1X_1 + X_2 + \ldots + X_{Z_{n - 1}}, where XiX_i are independent random non-negative integers all with the same offspring distribution,

Zn=i=1Zn1XiZ_n = \sum_{i = 1}^{Z_{n - 1}} X_i

Connecting each of the ZnZ_n individuals in generation nn with their offspring in generation n+1n + 1 we get a tree illustrating the branching process.

The offspring distribution is described by the probability vector a=(a0,a1,)a = (a_0, a_1, \ldots) where aj=P(Xi=j)a_j = P(X_i = j).

We will mainly focus on the interesting cases where we assume a0>0a_0 > 0 (i.e., there is a positive probability of having no offspring) and a0+a1<1a_0 + a_1 < 1 (i.e., there is a positive probability of having more than one offspring).

Definition 15 (Expected Size of Generations in a Branching Process and Classification of Branching Processes)
Note

Note that the state 0 is abosribing, this absorbtion is called extinction. Also, as a0>0a_0 > 0, all nonzero states are transient. Let μ=E[Xi]=j=0jaj\mu = \mathbb{E}[X_i] = \sum_{j=0}^{\infty} j \cdot a_j, i.e., the expected number of offspring per individual.

Then, we may compute that,

E[Zn]=E(i=1Zn1)Xi=E(E(i=1Zn1XiZn1))==E[Zn1]μ\begin{align*} \mathbb{E}[Z_n] & = \mathbb{E}\left(\sum_{i = 1}^{Z_{n-1}}\right) X_i \newline & = \mathbb{E}\left(\mathbb{E}\left(\sum_{i = 1}^{Z_{n-1}} X_i \mid Z_{n - 1} \right)\right) & = \ldots & = \mathbb{E}[Z_{n - 1}] \cdot \mu \newline \end{align*}

Thus, we directly get that,

E[Zn]=μnE[Z0]1=μn\mathbb{E}[Z_n] = \mu^n \cdot \underbrace{\mathbb{E}[Z_0]}_{\coloneqq 1} = \mu^n

Based on the value of μ\mu, we classify branching processes into three categories:

  1. Subcritical branching processes: μ<1\mu < 1. The expected generation size decreases over time, limnE[Zn]=0\lim_{n \to \infty} \mathbb{E}[Z_n] = 0.

  2. Critical branching processes: μ=1\mu = 1. The expected generation size remains constant over time, limnE[Zn]=1\lim_{n \to \infty} \mathbb{E}[Z_n] = 1.

  3. Supercritical branching processes: μ>1\mu > 1. The expected generation size increases over time, limnE[Zn]=\lim_{n \to \infty} \mathbb{E}[Z_n] = \infty.

Note

One can prove that if limnZn=0\lim_{n \to \infty} Z_n = 0, then the probability of extinction is 1. We will do this later.

Definition 16 (Variance of Generation Sizes in a Branching Process)

Let μ\mu be defined as above, and let σ2=Var(Xi)\sigma^2 = \mathrm{Var}(X_i) be the variance of the number of children.

Using the law of total variance we get,

Var(Zn)=Var(E[ZnZn1])+E[Var(ZnZn1)]=Var(E(i=1Zn1XiZn1))+E[Var(i=1Zn1XiZn1)]=Var(μZn1)+E[σ2Zn1]=μ2Var(Zn1)+σ2E[Zn1]\begin{align*} \mathrm{Var}(Z_n) & = \mathrm{Var}(\mathbb{E}[Z_n \mid Z_{n - 1}]) + \mathbb{E}[\mathrm{Var}(Z_n \mid Z_{n - 1})] \newline & = \mathrm{Var}\left(\mathbb{E}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right) + \mathbb{E}\left[\mathrm{Var}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right] \newline & = \mathrm{Var}(\mu Z_{n - 1}) + \mathbb{E}[\sigma^2 Z_{n - 1}] \newline & = \mu^2 \mathrm{Var}(Z_{n - 1}) + \sigma^2 \mathbb{E}[Z_{n - 1}] \newline \end{align*}

From this, we can prove by induction that, for n1n \geq 1,

Var(Zn)=σ2μn1k=0n1μk={nσ2if μ=1σ2μn1μn1μ1if μ1\mathrm{Var}(Z_n) = \sigma^2 \mu^{n - 1} \sum_{k = 0}^{n - 1} \mu^k = \begin{cases} n \sigma^2 & \text{if } \mu = 1 \newline \sigma^2 \mu^{n - 1} \frac{\mu^n - 1}{\mu - 1} & \text{if } \mu \neq 1 \end{cases}
Theorem 5 (Extinction Probability Theorem)

Let GXG_X be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation GX(s)=sG_X(s) = s.

Intuition (Probability Generating Functions)

For any discrete random variable XX taking values in {0,1,2,}\{0, 1, 2, \ldots\}, define the probability function GX(s)G_X(s) as,

GX(s)=E[sX]=k=0P(X=k)skG_X(s) = \mathbb{E}[s^X] = \sum_{k = 0}^{\infty} P(X = k) s^k

The series converges absolutely for s1\Vert s \Vert \leq 1. We assume ss is a real number in [0,1][0, 1].

We get a 1-1 correspondence between probability vectors on {0,1,2,}\{0, 1, 2, \ldots\} and functions represented by a series where the coefficients sum to 1.

Specifically, if GX(s)=GY(s)G_X(s) = G_Y(s) for all ss for random variables XX and YY, then XX and YY have the same distribution.

Note (Properties of Probability Generating Functions)
  • To go from XX to GX(s)G_X(s), we compute the (in)finite sum.

  • To go from GX(s)G_X(s) to XX, we use that we have,

P(X=k)=GX(k)(0)k!P(X = k) = \frac{G_X^{(k)}(0)}{k!}
  • If XX and YY are independent random variables, then,
GX+Y(s)=E[sX+Y]=E[sXsY]=E[sX]E[sY]GX(s)GY(s)G_{X + Y}(s) = \mathbb{E}[s^{X + Y}] = \mathbb{E}[s^X s^Y] = \mathbb{E}[s^X] \mathbb{E}[s^Y] \eqqcolon G_X(s) G_Y(s)
  • E[X]=GX(1)\mathbb{E}[X] = G_X^{\prime}(1)

  • E[X(X1)]=GX(1)\mathbb{E}[X(X - 1)] = G_X^{\prime \prime}(1)

  • As a consequence, Var(X)=GX(1)+GX(1)(GX(1))2\mathrm{Var}(X) = G_X^{\prime \prime}(1) + G_X^{\prime}(1) - (G_X^{\prime}(1))^2

Further, assume we have a branching process Z0,Z1,Z_0, Z_1, \ldots with independent offspring variables XiX_i all with the same distribution as XX. By writing Gn(s)=GZn(s)=E[sZn]G_n(s) = G_{Z_n}(s) = \mathbb{E}[s^{Z_n}] and GX(s)=E[sX]G_X(s) = \mathbb{E}[s^X], we get,

Gn(s)=E[sk=1Zn1Xk]=E[E[sk=1Zn1XkZn1]]=E[E[k=1Zn1sXkZn1]]=E[G(s)Zn1]=Gn1(GX(s))\begin{align*} G_n(s) & = \mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k}] \newline & = \mathbb{E}[\mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k} \mid Z_{n - 1}]] \newline & = \mathbb{E}\left[\mathbb{E}\left[\prod_{k = 1}^{Z_{n - 1}} s^{X_k} \mid Z_{n - 1}\right]\right] \newline & = \mathbb{E}[G(s)^{Z_{n - 1}}] \newline & = G_{n - 1}(G_X(s)) \newline \end{align*}

As G0(s)=E[sZ0]=sG_0(s) = \mathbb{E}[s^{Z_0}] = s, it follows that,

Gn(s)=GX(GX(GX(s)))n timesG_n(s) = \underbrace{G_X(G_X(\ldots G_X(s) \ldots))}_{n \text{ times}}
Theorem 6 (Extinction Probability Theorem (Addition))

Let GXG_X be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation GX(s)=sG_X(s) = s. Also, if the process is critical (μ=1\mu = 1) then the extinction probability is 1.

Proof (Of the last sentence)

In the critical case,

GX(1)=E[X]=μ=1G_X^{\prime}(1) = \mathbb{E}[X] = \mu = 1

As GX(s)>0G_X^{\prime \prime}(s) > 0 for s(0,1)s \in (0, 1), we get that GX(s)<1G_X^{\prime}(s) < 1 for s(0,1)s \in (0, 1), and dds(G(s)s)<0\frac{d}{ds}(G(s) - s) < 0 for s(0,1)s \in (0, 1). As G(1)1=0G(1) - 1 = 0 for any probability generating function, we get that G(s)=sG(s) = s has its smallest positive root at s=1s = 1.

Theorem 7 (Strong Law of Large Numbers)

Strong law of large numbers of samples: If Y1,Y2,,YmY_1, Y_2, \ldots, Y_m and YY are i.i.d random variables from a distribution with finite mean, and if E[r(Y)]\mathbb{E}[r(Y)] exists, then, with probability 1,

limmr(Y1)+r(Y2)++r(Ym)m=E[r(Y)]\lim_{m \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_m)}{m} = \mathbb{E}[r(Y)]
Theorem 8 (Strong Law of Large Numbers for Markov Chains)

Strong law of large numbers for Markov chains: If X0,X1,X_0, X_1, \ldots is an ergodic Markov chain with stationary distribution π\pi, and if EXπ[r(X)]\mathbb{E}_{X \sim \pi}[r(X)] exists, then, with probability 1,

limmr(X0)+r(X1)++r(Xm)m=EXπ[r(X)]\lim_{m \to \infty} \frac{r(X_0) + r(X_1) + \ldots + r(X_m)}{m} = \mathbb{E}_{X \sim \pi}[r(X)]

where XX has distribution π\pi.

Note

When using this theorem in practice, one might improve accuracy by throwing away the first sequence X1,,XsX_1, \ldots, X_s for s<ms < m before computing the average. The first sequence is then called the burn-in.

Definition 17 (The Metropolis-Hastings Algorithm)

The Metropolis-Hastings algorithm is a method to construct a Markov chain having a desired limiting distribution. Let θ\theta be a random variable with probability mass function π(θ)\pi( \theta). We also assume given a proposal distribution T(θnewθ)T(\theta_{\text{new}} \mid \theta), which, for every given θ\theta, provides a PMF for a new θnew\theta_{\text{new}}.

Finally, we define, for θ\theta and θnew\theta_{\text{new}}, the acceptance probability,

a=min(1,π(θnew)T(θθnew)π(θ)T(θnewθ))a = \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right)

The Metropolis-Hastings algorithm is: Starting with some initial value θ0\theta_0, generate θ1,θ2,\theta_1, \theta_2, \ldots by, at each step, proposing a new θ\theta based on the old using the proposal function and accepting it with probability aa. If it is not accepted, the old value is used again.

If this defines an ergodic Markov chain, its unique stationary distribution is π(θ)\pi(\theta).

Proof

Let PP be the transition matrix of the Markov chain defined by the Metropolis-Hastings algorithm. We need to show that πP=π\pi P = \pi. It suffices to show that the detailed balance equations are satisfied, i.e., that,

π(θ)Pθθnew=π(θnew)Pθnewθ\pi(\theta) P_{\theta \theta_{\text{new}}} = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta}

for all states θ\theta and θnew\theta_{\text{new}}. We have,

Pθθnew=T(θnewθ)a=T(θnewθ)min(1,π(θnew)T(θθnew)π(θ)T(θnewθ))P_{\theta \theta_{\text{new}}} = T(\theta_{\text{new}} \mid \theta) \cdot a = T(\theta_{\text{new}} \mid \theta) \cdot \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right)

and,

Pθnewθ=T(θθnew)a=T(θθnew)min(1,π(θ)T(θnewθ)π(θnew)T(θθnew))P_{\theta_{\text{new}} \theta} = T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} = T(\theta \mid \theta_{\text{new}}) \cdot \min\left(1, \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}\right)

If π(θ)T(θnewθ)π(θnew)T(θθnew)\pi(\theta) T(\theta_{\text{new}} \mid \theta) \leq \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}), then,

π(θ)Pθθnew=π(θ)T(θnewθ)1=π(θnew)T(θθnew)π(θ)T(θnewθ)π(θnew)T(θθnew)=π(θnew)T(θθnew)a=π(θnew)Pθnewθ\begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot 1 \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*}

If π(θ)T(θnewθ)>π(θnew)T(θθnew)\pi(\theta) T(\theta_{\text{new}} \mid \theta) > \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}), then,

π(θ)Pθθnew=π(θ)T(θnewθ)π(θnew)T(θθnew)π(θ)T(θnewθ)=π(θnew)T(θθnew)1=π(θnew)T(θθnew)a=π(θnew)Pθnewθ\begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot 1 \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*}

Thus, in both cases, the detailed balance equations are satisfied, so π\pi is the stationary distribution of the chain.

Definition 18 (Gibbs Sampling)

For any probability model over a vector θ=(θ1,θ2,,θn)\theta = (\theta_1, \theta_2, \ldots, \theta_n), consider a Metropolis-Hastings proposal function changing only one coordinate, with the value of this coordinate simulated from the conditional distribution given the remaining coordinates. The acceptance probability is always 1, so the proposal is always accepted.

Proof

Let θ=(θ1,θ2,,θn)\theta = (\theta_1, \theta_2, \ldots, \theta_n) and θnew=(θ1,,θi1,θi,new,θi+1,,θn)\theta_{\text{new}} = (\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n) differ only in coordinate ii. The proposal distribution is,

T(θnewθ)=P(θi=θi,newθ1,,θi1,θi+1,,θn)T(\theta_{\text{new}} \mid \theta) = P(\theta_i = \theta_{i,\text{new}} \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)

and,

T(θθnew)=P(θi=θiθ1,,θi1,θi+1,,θn)T(\theta \mid \theta_{\text{new}}) = P(\theta_i = \theta_i \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)

Using the definition of conditional probability, we have,

T(θnewθ)=P(θ1,,θi1,θi,new,θi+1,,θn)P(θ1,,θi1,θi+1,,θn) =π(θnew)xπ(θ1,,θi1,x,θi+1,,θn),\begin{align*} T(\theta_{\text{new}} \mid \theta) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n)} {P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \ & = \frac{\pi(\theta_{\text{new}})} {\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} , \end{align*}

and,

T(θθnew)=P(θ1,,θi1,θi,θi+1,,θn)P(θ1,,θi1,θi+1,,θn) =π(θ)xπ(θ1,,θi1,x,θi+1,,θn),\begin{align*} T(\theta \mid \theta_{\text{new}}) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_i, \theta_{i + 1}, \ldots, \theta_n)} {P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \ & = \frac{\pi(\theta)} {\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} , \end{align*}

Thus, we have,

π(θnew)T(θθnew)π(θ)T(θnewθ)=π(θnew)π(θ)xπ(θ1,,θi1,x,θi+1,,θn)π(θ)π(θnew)xπ(θ1,,θi1,x,θi+1,,θn) =π(θnew)π(θ)π(θ)π(θnew)=1,\begin{align*} \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})} {\pi(\theta) T(\theta_{\text{new}} \mid \theta)} & = \frac{ \pi(\theta_{\text{new}}) \frac{\pi(\theta)}{\sum_x \pi(\theta_1,\ldots,\theta_{i-1},x,\theta_{i+1},\ldots,\theta_n)} }{ \pi(\theta) \frac{\pi(\theta_{\text{new}})}{\sum_x \pi(\theta_1,\ldots,\theta_{i-1},x,\theta_{i+1},\ldots,\theta_n)} } \ & = \frac{\pi(\theta_{\text{new}})\pi(\theta)} {\pi(\theta)\pi(\theta_{\text{new}})} = 1 , \end{align*}

Thus, the proposal is always accepted.

Putting together an algorithm updating different coordinates in different steps may create an ergodic Markov chain.

This is called Gibbs sampling. Sometimes the conditional distributions are easy to derive. Then this is an easy-to-use version of Metropolis-Hastings.

Definition 19 (The Ising Model)

Uses a grid of vertices; We will assume an n×nn \times n grid. Two vertices vv and ww are neighbours, denoted vwv \sim w, if they are next to each other in the grid.

Each vertex vv can have value +1+1 or 1-1 (called its “spin”); We denote this by σv=1\sigma_{v} = 1 or σv=1\sigma_{v} = -1.

A configuration σ\sigma consists of a choice of +1+1 or 1-1 for each vertex. Thus, the set Ω\Omega of possible configurations has 2(n2)2^{(n^2)} elements. We define the energy of a configuration as E(σ)=vwσvσwE(\sigma) = - \sum_{v \sim w} \sigma_v \sigma_w.

The Gibbs distribution is the probability mass function on Ω\Omega defined by,

π(σ)exp(βE(σ))\pi(\sigma) \propto \exp(-\beta E(\sigma))

where β\beta is a parameter of the model; 1β\frac{1}{\beta} is called the temperature.

It turns out that when the temperature is high, samples from the model will show a chaotic pattern of spins, but when the temperature sinks below the phase transition value, samples will show chunks of neighbouring vertices with the same spin, i.e., the system will be “magnetized”.

Intuition (Simulating from the Ising Model using Gibbs Sampling)

For a vertex configuration σ\sigma and a vertex vv, let σv\sigma_{-v} denote the part of σ\sigma that does not involve vv. Prpose a new configuration σ\sigma^{\star} given an old configuration σ\sigma by first choosing a vertex vv, then, let σ\sigma^{\star} be identical to σ\sigma except possibly at vv. Decide the spin at vv using the conditional distribution given σv\sigma_{-v},

π(σσv)=π(σv=1,σv)π(σv)=π(σv=1,σv)π(σv=1,σv)+π(σv=1,σv)=11+π(σv=1,σv)π(σv=1,σv)=11+exp(βE(σv=1,σv)+βE(σv=1,σv))=11+exp(βvwσvσwσv=1βvwσvσwσv=1)=11+exp(2βvwσw)\begin{align*} \pi(\sigma^{\star} \mid \sigma_{-v}) & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_{-v})} \newline & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v}) + \pi(\sigma_v = -1, \sigma_{-v})} \newline & = \frac{1}{1 + \frac{\pi(\sigma_v = -1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v})}} \newline & = \frac{1}{1 + \exp(-\beta E(\sigma_v = -1, \sigma_{-v}) + \beta E(\sigma_v = 1, \sigma_{-v}))} \newline & = \frac{1}{1 + \exp(\beta \sum_{v \sim w} \sigma_v \sigma_w \mid \sigma_v = -1 - \beta \sum_{v \sim w} \sigma_v \sigma_w \mid \sigma_v = 1)} \newline & = \frac{1}{1 + \exp(-2 \beta \sum_{v \sim w} \sigma_w)} \newline \end{align*}
Intuition (Perfect Sampling)

Given an ergodic Markov chain with finite state space of size kk and limiting distribution π\pi. The idea is that, given nn prove that XnX_n actually has reached the limit distribution π\pi.

If we can prove that the distribution at XnX_n is independent of the starting state X0X_0, then we might be able to conclude that XnX_n has the limiting distribution π\pi.

Construct kk Markov chains that are dependent (“coupled”) but which are marginally Markov chains as above. If they start at the kk possible value at X0X_0 but have identical values at XnX_n, then we are done.

Note

nn cannot be determined as the first value where the kk chains meet, it must be determined independently of such information. Thus, usually one wants to generate chains Xn,Xn+1,,X0X_{-n}, X_{-n + 1}, \ldots, X_0 where X0X_0 has the limiting distribution, and we stepwise increase nn to make all chains coalesce to one chain at time 0.

Intuition (Using same source of randomness for all chains)

Consider the chains Xn(j),,X0(j)X^{(j)}_{-n}, \ldots, X^{(j)}_0 for j=1,2,,kj = 1, 2, \ldots, k. Instead of simulating Xi+1(j)X^{(j)}_{i + 1} based on Xi(j)X^{(j)}_i independently for each j, we define a function gg so that Xi+1(j)=g(Xi(j),Ui)X^{(j)}_{i + 1} = g(X^{(j)}_i, U_i) for all jj where UiUniform(0,1)U_i \sim \mathrm{Uniform}(0, 1).

Thus, if two chains have identical values in XiX_i, they will also be identical in Xi+1X_{i + 1}.

Thus, for a particular nn, if all chains have not converged at X0X_0, we simulate kk chains from X2nX_{-2n} to XnX_{-n}. They might only hit a subset of the kk states at XnX_{-n} and thus might coalesce to one state at X0X_0 using the old simulations. If not, double nn again.

Definition 20 (Counting Process)

A counting process {Nt}tI\{N_t\}_{t \in \mathcal{I}} is a stochastic process where I=R0+\mathcal{I} = \mathbb{R}_0^+, where the state space is the non-negative integers, and where 0st0 \leq s \leq t implies NsNtN_s \leq N_t.

Generally, when s<t,NtNss < t, N_t - N_s counts the number of “events” (in the loose meaning “things happening”) in (s,t](s, t].

A realization of NtN_t is then a function of tt that is a right-continuous step function.

Definition 21 (Poisson Process I)

A Poisson process {Nt}t0\{N_t\}_{t \geq 0} with parameter λ>0\lambda > 0 is a counting process, fulfilling,

  1. N0=0N_0 = 0.
  2. NtPoisson(λt)N_t \sim \mathrm{Poisson}(\lambda t) for all t>0t > 0.
  3. Stationary Increments: Nt+sNsN_{t + s} - N_s has the same distribution as NtN_t for all s>0,t>0s > 0, t > 0.
  4. Independent Increments: NtNsN_t - N_s and NrNqN_r - N_q are independent, when 0q<rs<t0 \leq q < r \leq s < t.
Note

It is not obvious that such a process necessarily exists. Further, E[Nt]=λt\mathbb{E}[N_t] = \lambda t, thus what one is counting occurs with a rate of λ\lambda items per time unit.

Definition 22 (Poisson Process II)

Let X1,X2,X_1, X_2, \ldots be a sequence of i.i.d. exponential random variables with parameter λ\lambda. Let N0=0N_0 = 0 and, for t>0t > 0,

Ntmax{n:X1++Xnt}.N_t \coloneqq \max\{n : X_1 + \ldots + X_n \leq t\}.

Then, {Nt}t0\{N_t\}_{t \geq 0} is a Poisson process with parameter λ\lambda.

Proof

Firstly, if we start with a Poisson process (definition I), and let X1,X2,X_1, X_2, \ldots be the inter-arrival times, i.e., the time between event 1 and event 2, event 2 and event 3, etc., then these are i.i.d. exponential random variables with parameter λ\lambda. Thus, NtN_t will be defined as above.

Conversely, if we construct NtN_t as above, all properties of definition I can be verified.

Let {Mt}t0\{M_t\}_{t \geq 0} be defined as MtNt+sNsM_t \coloneqq N_{t + s} - N_s for some s>0s > 0.

  • It is a counting process by definition.
  • N0=0N_0 = 0 by definition.
  • Stationary increments follow from the memoryless property of the exponential distribution: Mk=Ns+t+kNs+t=max{n:XNs+t+1++XNs+t+nk}M_k = N_{s + t + k} - N_{s + t} = \max\{n : X_{N_{s + t} + 1} + \ldots + X_{N_{s + t} + n} \leq k\}, which has the same distribution as NkN_k.
  • Independent increments follow from the independence of the XiX_i’s.
  • Mk=Mk+sNs=max{n:XNs+1++XNs+nk}M_k = M_{k + s} - N_s = \max\{n : X_{N_s + 1} + \ldots + X_{N_s + n} \leq k\}, which has a Poisson(λk)\mathrm{Poisson}(\lambda k) distribution.

We call SnX1++XnS_n \coloneqq X_1 + \ldots + X_n the arrival times of the process. Let M=min{X1,,Xn}M = \min\{X_1, \ldots, X_n\}, where, independently for each ii, XiExponential(λi)X_i \sim \mathrm{Exponential}(\lambda_i), then,

MExponential(λ1++λn),P(M=Xk)=λkλ1++λn.M \sim \mathrm{Exponential}(\lambda_1 + \ldots + \lambda_n), \quad P(M = X_k) = \frac{\lambda_k}{\lambda_1 + \ldots + \lambda_n}.
Proof

Since the XiX_i’s are independent,

P(X1>t)P(X2>t)P(Xn>t)=eλ1teλ2teλnt=e(λ1++λn)t\begin{align*} P(X_1 > t) P(X_2 > t) \ldots P(X_n > t) & = e^{-\lambda_1 t} e^{-\lambda_2 t} \ldots e^{-\lambda_n t} \newline & = e^{-(\lambda_1 + \ldots + \lambda_n) t} \newline \end{align*}

Thus, MExponential(λ1++λn)M \sim \mathrm{Exponential}(\lambda_1 + \ldots + \lambda_n).

Further,

P(M=X1)=P(X2X1,,XnX1)=E[I[X2X1,,XnX1]]=E[E[I[X2X1,,XnX1]X1]]=E[P(X2X1,,XnX1X1)]=E[eλ2X1eλnX1X1]=E[e(λ2++λn)X1]=0e(λ2++λn)xλ1eλ1x dx=λ10e(λ1+λ2++λn)x dx=λ1λ1+λ2++λn[1λ1+λ2++λne(λ1+λ2++λn)x]0=1=λ1λ1+λ2++λn \begin{align*} P(M = X_1) & = P(X_2 \geq X_1, \ldots, X_n \geq X_1) \newline & = \mathbb{E}[\mathbb{I}[X_2 \geq X_1, \ldots, X_n \geq X_1]] \newline & = \mathbb{E}[\mathbb{E}[\mathbb{I}[X_2 \geq X_1, \ldots, X_n \geq X_1] \mid X_1]] \newline & = \mathbb{E}[P(X_2 \geq X_1, \ldots, X_n \geq X_1 \mid X_1)] \newline & = \mathbb{E}[e^{-\lambda_2 X_1} \ldots e^{-\lambda_n X_1} \mid X_1] \newline & = \mathbb{E}[e^{-(\lambda_2 + \ldots + \lambda_n) X_1}] \newline & = \int_{0}^{\infty} e^{-(\lambda_2 + \ldots + \lambda_n) x} \lambda_1 e^{-\lambda_1 x} \ dx \newline & = \lambda_1 \int_{0}^{\infty} e^{-(\lambda_1 + \lambda_2 + \ldots + \lambda_n) x} \ dx \newline & = \frac{\lambda_1}{\lambda_1 + \lambda_2 + \ldots + \lambda_n} \underbrace{\left[-\frac{1}{\lambda_1 + \lambda_2 + \ldots + \lambda_n} e^{-(\lambda_1 + \lambda_2 + \ldots + \lambda_n) x}\right]_{0}^{\infty}}_{= 1} \newline & = \frac{\lambda_1}{\lambda_1 + \lambda_2 + \ldots + \lambda_n} \ _\blacksquare \end{align*}
Definition 23 (Poisson Process III)

A Poisson process {Nt}t0\{N_t\}_{t \geq 0} with parameter λ>0\lambda > 0 is a counting process, fulfilling,

  • N0=0N_0 = 0.
  • The process has stationary and independent increments.
P(Nh=0)=1λh+o(h)P(Nh=1)=λh+o(h)P(Nh>1)=o(h)\begin{align*} P(N_h = 0) & = 1 - \lambda h + o(h) \newline P(N_h = 1) & = \lambda h + o(h) \newline P(N_h > 1) & = o(h) \newline \end{align*}

for all h>0h > 0, where o(h)o(h) is a function such that limh0o(h)h=0\lim_{h \to 0} \frac{o(h)}{h} = 0.

Lemma 1 (Superposition of Poisson Processes)

Let {Nt(1)}t0,,{Nt(n)}t0\{N_t^{(1)}\}_{t \geq 0}, \ldots, \{N_t^{(n)}\}_{t \geq 0} be independent Poisson processes with parameters λp1,,λpn\lambda p_1, \ldots, \lambda p_n, respectively, where p=(p1,,pn)p = (p_1, \ldots, p_n) is a probability vector. If c=(c1,,cn)c = (c_1, \ldots, c_n) are the counts after time tt (i.e., ci=Nt(i)c_i = N_t^{(i)}), then, the conditional distribution of (Nt(1),,Nt(n))(N_t^{(1)}, \ldots, N_t^{(n)}), an equivalent model is,

cMultinomial(Nt,p),c \sim \mathrm{Multinomial}(N_t, p),

where {Nt}t0\{N_t\}_{t \geq 0} is a Poisson process with parameter λ\lambda.

Proof

Using the model with independent Poisson processes, the probability of observing the count vector cc after time tt is (denoting N=c1++cnN = c_1 + \ldots + c_n),

i=1nPoisson(ci;λpit)=i=1neλpit(λpit)cici!=eλt(λt)Ni=1npicici!=eλt(λt)NN!N!c1!c2!cn!p1c1p2c2pncn=Poisson(N;λt)Multinomial(c;N,p)\begin{align*} \prod_{i = 1}^n \mathrm{Poisson}(c_i; \lambda p_i t) & = \prod_{i = 1}^n e^{-\lambda p_i t} \frac{(\lambda p_i t)^{c_i}}{c_i!} \newline & = e^{-\lambda t} (\lambda t)^N \prod_{i = 1}^n \frac{p_i^{c_i}}{c_i!} \newline & = e^{-\lambda t} \frac{(\lambda t)^N}{N!} \cdot \frac{N!}{c_1! c_2! \ldots c_n!} p_1^{c_1} p_2^{c_2} \ldots p_n^{c_n} \newline & = \mathrm{Poisson}(N; \lambda t) \cdot \mathrm{Multinomial}(c; N, p) \newline \end{align*}

The process for NN inherits independent and stationary increments from the sub-processes, so it follows it also a Poisson process with parameter λ\lambda.

Lemma 2 (Uniformly Distributed Arrival Times)

Let {Nt}t0\{N_t\}_{t \geq 0} be a Poisson process with parameter λ\lambda. If we fix that Nt=kN_t = k, and we select uniformly randomly one of the kk arrivals, then its arrival time is uniformly distributed on the interval [0,t][0, t].

Proof

Let S1,S2,,SkS_1, S_2, \ldots, S_k be the arrival times given that Nt=kN_t = k.

P(Sksk uniformly random in 1,,n,Nt=n)=1nk=1nP(SksNt=n)=1nk=1nj=0k1P(Ns=jNt=n)=1nk=1nj=0k1P(Ns=j)P(Nts=nj)P(Nt=n)=1nk=1nj=0k1eλs(λs)jj!eλ(ts)(λ(ts))nj(nj)!eλt(λt)nn!=1nj=0n1(nj)n!j!(nj)!(st)j(1st)nj=[j=0n1n!j!(nj1)!(st)j(1st)nj1](1st)=1st\begin{align*} P(S_k \geq s \mid k \text{ uniformly random in } {1, \ldots, n}, N_t = n) & = \frac{1}{n} \sum_{k = 1}^{n} P(S_k \geq s \mid N_t = n) \newline & = \frac{1}{n} \sum_{k = 1}^{n} \sum_{j = 0}^{k - 1} P(N_s = j \mid N_t = n) \newline & = \frac{1}{n} \sum_{k = 1}^{n} \sum_{j = 0}^{k - 1} \frac{P(N_s = j) P(N_{t - s} = n - j)}{P(N_t = n)} \newline & = \frac{1}{n} \sum_{k = 1}^{n} \sum_{j = 0}^{k - 1} \frac{e^{-\lambda s} \frac{(\lambda s)^j}{j!} e^{-\lambda (t - s)} \frac{(\lambda (t - s))^{n - j}}{(n - j)!}}{e^{-\lambda t} \frac{(\lambda t)^n}{n!}} \newline & = \frac{1}{n} \sum_{j = 0}^{n - 1} (n - j) \frac{n!}{j! (n - j)!} \left(\frac{s}{t}\right)^j \left(1 - \frac{s}{t}\right)^{n - j} \newline & = \left[\sum_{j = 0}^{n - 1} \frac{n!}{j! (n - j - 1)!} \left(\frac{s}{t}\right)^j \left(1 - \frac{s}{t}\right)^{n - j - 1}\right] \cdot \left(1 - \frac{s}{t}\right) \newline & = 1 - \frac{s}{t} \end{align*}
Definition 24 (Spatial Poisson Process)

A collection of random variables {NA}ARd\{N_A\}_{A \subseteq \mathbb{R}^d} is a spatial Poisson process with parameter λ\lambda if,

  • For each bounded set ARdA \subseteq \mathbb{R}^d, NAN_A has a Poisson distribution with parameter λA\lambda |A|.
  • Whenever AB,NANBA \subseteq B, N_A \leq N_B (i.e., spatial counting process).
  • Whenever AA and BB are disjoint, NAN_A and NBN_B are independent.
Definition 25 (Non-Homogeneous Poisson Process)

A counting process {Nt}t0\{N_t\}_{t \geq 0} is a non-homogeneous Poisson process with intensity function λ(t)\lambda(t) if,

  • N0=0N_0 = 0.
  • For 0<s<t0 < s < t,
NtNsPoisson(stλ(x) dx).N_t - N_s \sim \mathrm{Poisson}\left(\int_{s}^{t} \lambda(x) \ dx\right).
  • It has independent increments.
Definition 26 (Continuous-Time Markov Chain)

A continuous-time stochastic process {Xt}t0\{X_t\}_{t \geq 0} with discrete state space S\mathcal{S} is a continuous-time Markov chain if,

P(Xt+s=jXs=i,Xu,0u<s)=P(Xt+s=jXs=i),P(X_{t + s} = j \mid X_s = i, X_u, 0 \leq u < s) = P(X_{t + s} = j \mid X_s = i),

where s,t0s, t \geq 0 and i,j,XuSi, j, X_u \in \mathcal{S}.

The process is time-homogeneous if for s,t0s, t \geq 0 and all i,jSi, j \in \mathcal{S},

P(Xt+s=jXs=i)=P(Xt=jX0=i).P(X_{t + s} = j \mid X_s = i) = P(X_t = j \mid X_0 = i).

We then define the transition function as the matrix function P(t)P(t) with entries of the matrix given by,

P(t)ij=P(Xt=jX0=i).P(t)_{ij} = P(X_t = j \mid X_0 = i).
Theorem 9 (Chapman-Kolmogorov Equations)

For the transition function P(t)P(t) we have,

  • P(s+t)=P(s)P(t)P(s + t) = P(s) P(t)
  • P(0)=IP(0) = I (identity matrix)
Proof

For the first property; If they are same for an arbitrary element i,ji, j of the matrix, then they are equal.

P(s+t)ij=P(Xs+t=jXs=i)=kP(Xs+t=j,Xs=kX0=i)=kP(Xs=kX0=i)P(Xs+t=jXs=k,X0=i)=kP(Xs=kX0=i)P(Xs+t=jXs=k)=kP(Xs=kX0=i)P(Xt=jX0=k)=kP(s)ikP(t)kj=(P(s)P(t))ij \begin{align*} P(s + t)_{ij} & = P(X_{s + t} = j \mid X_s = i) \newline & = \sum_k P(X_{s + t} = j, X_s = k \mid X_0 = i) \newline & = \sum_k P(X_s = k \mid X_0 = i) P(X_{s + t} = j \mid X_s = k, X_0 = i) \newline & = \sum_k P(X_s = k \mid X_0 = i) P(X_{s + t} = j \mid X_s = k) \newline & = \sum_k P(X_s = k \mid X_0 = i) P(X_t = j \mid X_0 = k) \newline & = \sum_k P(s)_{ik} P(t)_{kj} \newline & = (P(s) P(t))_{ij} \ _\blacksquare \end{align*}

The second property follows directly from the definition of the transition function.

Intuition (Holding Times)

Define TiT_i as the time the continuous-time Markov chain {Xt}t0\{X_t\}_{t \geq 0} that always starts in state ii stays in ii before moving to a different state, so that for any s>0s > 0,

P(Ti>s)=P(Xu=i,0us)P(T_i > s) = P(X_u = i, 0 \leq u \leq s)

The distribution of TiT_i is memoryless and thus exponential. We define qiq_i such that,

TiExponential(qi)T_i \sim \mathrm{Exponential}(q_i)

Remember that this means that the average time the process stays in ii is 1qi\frac{1}{q_i}. The rate of transition out of the state is qiq_i.

Note

We can have qi=0q_i = 0, meaning that the state ii is absorbing,

P(Ti>s)=1.P(T_i > s) = 1.
Intuition (Embedded Markov Chain)

We can define a new stochastic process by listing the states the chain visits. This will be a discrete time Markov chain called the embedded Markov chain with transition matrix P~\tilde{P}.

Note

The transition matrix P~\tilde{P} has zeros along its diagonal! Why? Because the process cannot stay in the same state when it makes a transition.

Further, note that the continuous-time Markov chain is completely determined by the expected holding times (1q1,,1qk)(\frac{1}{q_1}, \ldots, \frac{1}{q_k}) and the transition matrix of the embedded Markov chain P~\tilde{P}.

Intuition (Constructing a Continuous-Time Markov Chain)

A way to describe a continuous-time Markov chain is to describe k×(k1)k \times (k - 1) independent “alarm clocks”.

For states ii and jj so that iji \neq j, let qijq_{ij} be the parameter of an Exponentially distributed random variable representing the time until an “alarm clock” rings.

When in state ii, wait until the first alarm clock qiq_i rings, then move to the state given by the index jj of that alarm clock. This defines a continuous-time Markov chain.

The time until the first alarm clock qiq_i rings is Exponentially distributed with parameter given by,

qi=qi1+qi2++qi,i1+qi,i+1++qik.q_i = q_{i1} + q_{i2} + \ldots + q_{i, i - 1} + q_{i, i + 1} + \ldots + q_{ik}.

i.e., the parameter of the holding time distribution at ii.

The chain is completely described by the rates qijq_{ij}.

Further, we saw also that the chain is completely described by the pijp_{ij} and the qiq_i. The relationship is described by the equation above, and,

pij=qijqi1+qi2++qi,i1+qi,i+1++qik=qijqi.p_{ij} = \frac{q_{ij}}{q_{i1} + q_{i2} + \ldots + q_{i, i - 1} + q_{i, i + 1} + \ldots + q_{ik}} = \frac{q_{ij}}{q_i}.

for iji \neq j.

Intuition (Infinitesimal Generator Matrix)

To relate P(t)P(t) to the qijq_{ij}’s, we first relate them to P(0)P^{\prime}(0).

Assuming P(t)P(t) is differentiable, we can show that,

P(0)=[q1q12q13q1kq21q2q23q2kqk1qk2qk3qk]=Q,P^{\prime}(0) = \begin{bmatrix} -q_1 & q_{12} & q_{13} & \ldots & q_{1k} \newline q_{21} & -q_2 & q_{23} & \ldots & q_{2k} \newline \vdots & \vdots & \vdots & \ddots & \vdots \newline q_{k1} & q_{k2} & q_{k3} & \ldots & -q_k \newline \end{bmatrix} = Q,
Proof (Lazy “proof”)

Firstly, let’s show that Q12=P(0)12=q12Q_{12} = P^{\prime}(0)_{12} = q_{12}.

Q12=limh0P(Xh=2X0=1)h=limh0P(h)12h=limh0P(h)12P(0)12hP(0)12\begin{align*} Q_{12} & = \lim_{h \to 0} \frac{P(X_h = 2 \mid X_0 = 1)}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{12}}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{12} - P(0)_{12}}{h} \newline & \eqqcolon P^{\prime}(0)_{12} \newline \end{align*}

Secondly, show that Q1=P(0)11-Q_1 = P^{\prime}(0)_{11}.

Q1=[Q12+Q13++Q1k]=limh0P(h)12+P(h)13++P(h)1kh=limh01P(h)11h=limh0P(h)111h=limh0P(h)11P(0)11hP(0)11\begin{align*} -Q_1 & = -[Q_{12} + Q_{13} + \ldots + Q_{1k}] \newline & = - \lim_{h \to 0} \frac{P(h)_{12} + P(h)_{13} + \ldots + P(h)_{1k}}{h} \newline & = - \lim_{h \to 0} \frac{1 - P(h)_{11}}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{11} - 1}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{11} - P(0)_{11}}{h} \newline & \eqqcolon P^{\prime}(0)_{11} \newline \end{align*}

The other elements can be shown similarly _\blacksquare.

Note that the rows of P(0)P^{\prime}(0), i.e., QQ, sum to zero!

Theorem 10 (Kolmogorov Forward and Backward Equations)

The Kolmogorov Forward and Backward equations describe how the transition function P(t)P(t) evolves over time.

We get that for all t0t \geq 0,

P(t)=P(t)Q=QP(t).P^{\prime}(t) = P(t) Q = Q P(t).

Thus, what this means in terms of the components of P(t)P(t),

P(t)ij=Pij(t)qj+kjPik(t)qkjP(t)ij=qiPij(t)+kiqikPkj(t)\begin{align*} P^{\prime}(t)_{ij} & = -P_{ij}(t) q_j + \sum_{k \neq j} P_{ik}(t) q_{kj} \newline P^{\prime}(t)_{ij} & = -q_i P_{ij}(t) + \sum_{k \neq i} q_{ik} P_{kj}(t) \newline \end{align*}

Thus, the above equations define a set of differential equations which the components of the matrix function P(t)P(t) must satisfy.

ProofP(t)=limh0P(t+h)P(t)h=limh0P(t)P(h)P(t)h=limh0P(t)P(h)Ih=P(t)limh0P(h)Ih=P(t)limh0P(h)P(0)hP(0)=Q=P(t)Q \begin{align*} P^{\prime}(t) & = \lim_{h \to 0} \frac{P(t + h) - P(t)}{h} \newline & = \lim_{h \to 0} \frac{P(t) P(h) - P(t)}{h} \newline & = \lim_{h \to 0} P(t) \frac{P(h) - I}{h} \newline & = P(t) \lim_{h \to 0} \frac{P(h) - I}{h} \newline & = P(t) \underbrace{\lim_{h \to 0} \frac{P(h) - P(0)}{h}}_{P^{\prime}(0) = Q} \newline & = P(t) Q \ _\blacksquare \end{align*}

Since we arbitrarily chose to write P(t+h)P(t + h) as P(t)P(h)P(t) P(h), we could have equally written it as P(h)P(t)P(h) P(t), leading to the other equation, thus completing the proof.

Intuition (Matrix Exponential Solution)

The solution to the Kolmogorov Forward and Backward equations is given by the matrix exponential. For any square matrix AA we can define the matrix exponential as,

eAn=01n!An=I+A+12A2+16A3+e^A \coloneqq \sum_{n = 0}^{\infty} \frac{1}{n!} A^n = I + A + \frac{1}{2} A^2 + \frac{1}{6} A^3 + \ldots

The series converges for all square matrices AA (but we will not prove this here).

Some important properties of the matrix exponential are:

  • e0=Ie^0 = I
  • eAeA=Ie^A e^{-A} = I
  • e(s+t)A=esAetAe^{(s + t) A} = e^{sA} e^{tA}
  • If AB=BAAB = BA, then eA+B=eAeB=eBeAe^{A + B} = e^A e^B = e^B e^A
  • tetA=AetA=etAA\frac{\partial}{\partial t} e^{tA} = A e^{tA} = e^{tA} A

Note that P(t)=etQP(t) = e^{tQ} is the unique solution to the differential equations given by the Kolmogorov Forward and Backward equations with initial condition P(0)=IP(0) = I.

Further, note that P(t)=QP(t)P^\prime(t) = Q P(t) for all t0t \geq 0 and P(0)=IP(0) = I.

In R one can use the expm package to compute matrix exponentials.

Definition 27 (Multinomial Distribution)

A vector x=(x1,x2,,xk)x = (x_1, x_2, \ldots, x_k) of non-negative integers has a Multinomial distribution with parameters nn and pp, where n>0n > 0 is an integer and pp is a probability vector of length kk, if i=1kxi=n\sum_{i = 1}^k x_i = n and the probability mass function is given by,

π(xn,p)=n!x1!x2!xk!p1x1p2x2pkxk.\pi(x \mid n, p) = \frac{n!}{x_1! x_2! \ldots x_k!} p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k}.

We write xMultinomial(n,p)x \sim \mathrm{Multinomial}(n, p).

Definition 28 (Dirichlet Distribution)

A vector p=(p1,p2,,pk)p = (p_1, p_2, \ldots, p_k) of non-negative real numbers satisfying i=1kpi=1\sum_{i = 1}^k p_i = 1 has a Dirichlet distribution with parameter vector α=(α1,α2,,αk)\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k), if it has probability density function,

π(pα)=Γ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)p1α11p2α21pkαk1,\pi(p \mid \alpha) = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1},

where αi>0\alpha_i > 0 for all i=1,2,,ki = 1, 2, \ldots, k. We write pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha).

Theorem 11 (Multinomial-Dirichlet Conjugacy)

Let x=(x1,x2,,xk)x = (x_1, x_2, \ldots, x_k) be a vector of counts with xMultinomial(n,p)x \sim \mathrm{Multinomial}(n, p) and prior pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha). Then the posterior distribution of pp given xx is,

pxDirichlet(α1+x1,α2+x2,,αk+xk).p \mid x \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k).

Further, if pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha), then E[pi]=αij=1kαj\mathbb{E}[p_i] = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j} for i=1,2,,ki = 1, 2, \ldots, k.

Proof (Multinomial-Dirichlet Conjugacy)

Let π(xp)=Multinomial(x;n,p)\pi(x \mid p) = \mathrm{Multinomial}(x; n, p) and π(p)=Dirichlet(p;α)\pi(p) = \mathrm{Dirichlet}(p; \alpha), respectively. Then,

π(px)π(xp)π(p)=Multinomial(x;n,p)Dirichlet(p;α)pp1x1p2x2pkxkp1α11p2α21pkαk1pp1α1+x11p2α2+x21pkαk+xk1=Dirichlet(p;α1+x1,α2+x2,,αk+xk) \begin{align*} \pi(p \mid x) & \propto \pi(x \mid p) \pi(p) \newline & = \mathrm{Multinomial}(x; n, p) \mathrm{Dirichlet}(p; \alpha) \newline & \propto_p p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k} \cdot p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \newline & \propto_p p_1^{\alpha_1 + x_1 - 1} p_2^{\alpha_2 + x_2 - 1} \ldots p_k^{\alpha_k + x_k - 1} \newline & = \mathrm{Dirichlet}(p; \alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k) \ _\blacksquare \end{align*}
Proof (Expectation of Dirichlet Distribution)

Let pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha). Then,

E[pi]=piΓ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)p1α11p2α21pkαk1 dp1dp2dpk=Γ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)p1α11p2α21piαipkαk1 dp1dp2dpk=Γ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)Γ(α1)Γ(α2)Γ(αi+1)Γ(αk)Γ(α1+α2++αk+1)=Γ(αi+1)Γ(αi)Γ(α1+α2++αk)Γ(α1+α2++αk+1)=αi1α1+α2++αk=αij=1kαj \begin{align*} \mathbb{E}[p_i] & = \int \cdots \int p_i \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \int \cdots \int p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_i^{\alpha_i} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \cdot \frac{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_i + 1) \ldots \Gamma(\alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \frac{\Gamma(\alpha_i + 1)}{\Gamma(\alpha_i)} \cdot \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \alpha_i \cdot \frac{1}{\alpha_1 + \alpha_2 + \ldots + \alpha_k} \newline & = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j} \ _\blacksquare \end{align*}
Intuition (Predictions for The Multinomial-Dirichlet Model)

If pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha) and xpMultinomial(n,p)x \mid p \sim \mathrm{Multinomial}(n, p), then the predictive distribution is given by,

π(x)=n!x1!x2!xk!Γ(α1+x1)Γ(α1)Γ(α2+x2)Γ(α2)Γ(αk+xk)Γ(αk)Γ(i=1kαi)Γ(n+i=1kαi).\pi(x) = \frac{n!}{x_1! x_2! \ldots x_k!} \cdot \frac{\Gamma(\alpha_1 + x_1)}{\Gamma(\alpha_1)} \cdot \frac{\Gamma(\alpha_2 + x_2)}{\Gamma(\alpha_2)} \cdots \frac{\Gamma(\alpha_k + x_k)}{\Gamma(\alpha_k)} \cdot \frac{\Gamma(\sum_{i = 1}^k \alpha_i)}{\Gamma(n + \sum_{i = 1}^k \alpha_i)}.

For example, if eie_i is the vector with 1 at position ii and zeros elsewhere, then π(x=ei)=αij=1kαj\pi(x = e_i) = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j}. Further, if xnewx_{\text{new}} is a vector of new counts, then, as pxMultinomial(x+α)p \mid x \sim \mathrm{Multinomial}(x + \alpha), we get,

π(xnew=eix)=αi+xij=1kαj+n.\pi(x_{\text{new}} = e_i \mid x) = \frac{\alpha_i + x_i}{\sum_{j = 1}^k \alpha_j + n}.

The αi\alpha_i in the prior are often called pseudocounts, as they can be interpreted as prior counts added to the observed counts.

Definition 29 (Hidden Markov Model)

A Hidden Markov Model (HMMs) consists of:

  • A Markov chain X0,,Xn,X_0, \ldots, X_n, \ldots and,
  • another sequence Y0,,Yn,Y_0, \ldots, Y_n, \ldots such that,
P(YkY0,,Yk1,X0,,Xn)=P(YkXk).P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid X_k).

In some models we may instead have,

P(YkY0,,Yk1,X0,,Xn)=P(YkYk1,Xk).P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid Y_{k - 1}, X_k).

Generally, Y0,,YnY_0, \ldots, Y_n are called observations and X0,,XnX_0, \ldots, X_n are called hidden states. The XiX_i’s represent the “underlying process” that the observed values YiY_i’s depend on. Further, generally the XkX_k have a finite state space.

Intuition (Inference in HMMs)

When the parameters of the HMM are known, we want to know about the values of the hidden variables XiX_i, for example,

  • What is the most likely sequence X0,X1,,XnX_0, X_1, \ldots, X_n given the data?
  • What is the probability distribution for a single XiX_i given the data? However, when the parameters of the HMM are unknown, we need to infer these from some data.

If data with all XiX_i and YiY_i known is available, inference for parameters is based on counts of transitions. The inference for the Markov chain is exactly as for the Markov chains we have looked at before. The inference for the emission probabilities, i.e., the parameters of P(YkXk)P(Y_k \mid X_k), can be done independently of the inference for the Markov chain.

Intuition (Bayesian Inference for Poisson Processes)

For a homogeneous Poisson process, we setup up a prior π(λ)\pi(\lambda) for its parameter λ\lambda, and find the posterior given the observations.

The likelihood for observing yy events in an interval of length tt is given by,

Poisson(y;λt)=exp(λt)(λt)yy!λexp(λt)λy.\mathrm{Poisson}(y; \lambda t) = \exp(-\lambda t) \frac{(\lambda t)^y}{y!} \propto_{\lambda} \exp(-\lambda t) \lambda^y.

A convenient prior to use is λGamma(α,β)\lambda \sim \mathrm{Gamma}(\alpha, \beta).

In this case, the posterior becomes,

λDGamma(α+y,β+t),\lambda \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + y, \beta + t),

and the predictive distribution for the number of observations yny_n in a different interval of length uu becomes,

π(yn)=NegativeBinomial(yn;α,ββ+u).\pi(y_n) = \mathrm{Negative-Binomial}\left(y_n; \alpha, \frac{\beta}{\beta + u}\right).
Intuition (Bayesian Inference for Continuous-Time Markov Chains)

Lastyl, recall that a continuous-time Markov chain with finite state space is completely characterized by its holding times parameter vector qq and the transition matrix P~\tilde{P} of its embedded Markov chain.

Parametrizing instead with the “alarm clock” parameters qijq_{ij} gives an equivalent description of the process.

The two parts of the data can be considered independently are,

  • We learn about P~\tilde{P} from the counts of transitions between states, and
  • We learn about qq from the observed lengths of stays the process has in each state.

For P~\tilde{P} the situation is analogous to the one for discrete-time Markov chains, except that the diagonal of P~\tilde{P} must be zero, so the prior must exclude the possibility of non-zero diagonal elements.

For example, for P~1\tilde{P}_1, the first row of P~\tilde{P}, we might use the prior Dirichlet(0,1,,1)\mathrm{Dirichlet}(0, 1, \ldots, 1), i.e., a Dirichlet prior with a zero pseudocount for the first element.

The holding times in state ii are distributed as Exponential(qi)\mathrm{Exponential}(q_i). If we have observed a total holding time of hh over nn intervals, that data has likelihood proportional to ehqiqine^{-hq_i}q_i^n. Using qiGamma(α,β)q_i \sim \mathrm{Gamma}(\alpha, \beta) as prior results in the posterior qiDGamma(α+n,β+h)q_i \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + n, \beta + h).

Definition 30 (Stationary Distribution for Continuous-Time Markov Chains)

In the context of continuous-time Markov chains, vv is a stationary distribution if and only if 0=vQ0 = v Q, where QQ is the generator matrix of the chain.

Proof (Stationary Distribution and Generator Matrix)

Firstly, let vQ=0vQ = 0,

ddt(vP(t))vddtP(t)vQP(t)=0\begin{align*} \frac{d}{dt} (v P(t)) & \coloneqq v \frac{d}{dt} P(t) \newline & \eqqcolon v Q P(t) \newline & = 0 \newline \end{align*}

as vQ=0vQ = 0 by assumption. Thus, vP(t)v P(t) is constant in tt, i.e., vP(t)=vP(0)=vI=vv P(t) = v P(0) = v I = v for all t0t \geq 0.

Conversely, let vv be a stationary distribution, i.e., v=vP(t)v = v P(t) for all t0t \geq 0. Then,

ddt(vP(t))vddtP(t)vQP(t)=0\begin{align*} \frac{d}{dt} (v P(t)) & \coloneqq v \frac{d}{dt} P(t) \newline & \eqqcolon v Q P(t) \newline & = 0 \newline \end{align*}

as vP(t)v P(t) is constant in tt by assumption. In particular, for t=0t = 0 we get vQP(0)=vQI=vQ=0v Q P(0) = v Q I = v Q = 0. Thus, the proof is complete _\blacksquare.

Intuition (Long-Term Behavior of Continuous-Time Markov Chains)

A continuous-time Markov chain is irreducible if for all ii and jj there exists a t>0t > 0 such that Pij(t)>0P_{ij}(t) > 0, i.e., it is possible to get to any state from any state.

However, periodic continuous-time Markov chains do not exist. If Pij(t)>0P_{ij}(t) > 0 for some t>0t > 0, then Pij(t)>0P_{ij}(t) > 0 for all t>0t > 0.

Theorem 12 (Fundamental Limit Theorem for Continuous-Time Markov Chains)

Let {Xt}t0\{X_t\}_{t \geq 0} be a finite, irreducible, continuous-time Markov chain with transition function P(t)P(t). Then there exists a unique positive stationary distribution vector vv which is also the limiting distribution.

The limiting distribution of such a chain can be found as the unique vv satisfying vQ=0v Q = 0

Definition 31 (Absorbing State)

An absorbing state is one where the rate of leaving it is zero.

Assume {Xt}t0\{X_t\}_{t \geq 0} is a continuous-time Markov chain with kk states. Assume the last state is absorbing and the rest are not (If the chain is irreducible they are then transient).

WE have that qk=0q_k = 0 and the entire last row must consist of zeros. We thus get,

Q=[V00].Q = \begin{bmatrix} V & \star \newline \mathbf{0} & 0 \newline \end{bmatrix}.

Let FF be the (k1)×(k1)(k - 1) \times (k - 1) matrix such that FijF_{ij} (with i<k,j<ki < k, j < k) is the expected time spent in state jj when the chain starts in ii. We can show that F=V1F = -V^{-1}.

Proof (Expected Time in Transient States Before Absorption)

Generally, we can define DD as the matrix with (1q1,,1qk)(\frac{1}{q_1}, \ldots, \frac{1}{q_k}) along its diagonal, with all other entries zero. If there are no absorbing states,

P~=DQ+Ik.\tilde{P} = DQ + I_k.

Write A_A\_ for a square matrix without its last row and column, so, e.g., Q_=VQ\_ = V.

If the last state is absorbing, i.e., qk=0q_k = 0, we get,

P~_=D_Q_+Ik1.\tilde{P}\_ = D\_ Q\_ + I_{k - 1}.

Further, let FF^{\prime} be the matrix where FijF^{\prime}_{ij} is the expected number of stays in state jj before absorbtion when starting in state ii. As the lengths of stays and changes in states are independent, we get Fij=Fij1qjF^{\prime}_{ij} = F_{ij} \frac{1}{q_j} and thus,

F=FD_.F = F^{\prime} D\_.

By the theory for discrete-time Markov chains with absorbing states, we have,

F=(Ik1P~_)1.F^{\prime} = (I_{k - 1} - \tilde{P}\_)^{-1}.

Thus,

F=FD_=(Ik1P~_)1D_=(Ik1(D_Q_+Ik1))1D_=(D_Q_)1D_=Q_1D_1D_=Q_1V1 \begin{align*} F & = F^{\prime} D\_ \newline & = (I_{k - 1} - \tilde{P}\_)^{-1} D\_ \newline & = (I_{k - 1} - (D\_ Q\_ + I_{k - 1}))^{-1} D\_ \newline & = (-D\_ Q\_)^{-1} D\_ \newline & = -Q\_^{-1} D\_^{-1} D\_ \newline & = -Q\_^{-1} \newline & \eqqcolon -V^{-1} \ _\blacksquare \end{align*}

FF is called the fundamental matrix (similar to the discrete-time case).

Note

If the chain starts in state ii, the expected time until absorption is the sum of the ii-th row of FF. Thus, the expected times until absorption are given by the matrix product F1F1 of FF with a column of 1s.

Intuition (Stationary Distribution of the Embedded Markov Chain)

The embedded chain of a continuous-time Markov chain: The discrete-time Markov chain where holding times are ignored.

The stationary distribution for the embedded chain and for the continuous-time chain are generally not the same!

However, there is a simple relationship: A probability vector π\pi is a stationary distribution for a continuous-time Markov chain if and only if ψ\psi is a stationary distribution for the embedded chain, where ψj=Cπjqj\psi_j = C \pi_j q_j for a constant CC making the entries sum to 1.

Proof (Stationary Distribution of the Embedded Markov Chain)

Using notation as above, we have P~=DQ+I\tilde{P} = DQ + I. For any vector vv, we get,

vP~=v(DQ+I)=vDQ+v.v \tilde{P} = v (DQ + I) = v DQ + v.

so vP~=vv \tilde{P} = v if and only if vDQ=0v DQ = 0.

Definition 32 (Global Balance Equations)

Let v=(v1,v2,,vn)v = (v_1, v_2, \ldots, v_n) be the stationary distribution of a continuous-time Markov chain with generator matrix QQ. At vv, the flow into a state must be equal to the flow out of the state, by the definition of stationary distribution. Which are exactly the equations we get from vQ=0v Q = 0,

(v1,v2,,vn)[q1q12q1nq21q2q2nqn1qn2qn](v_1, v_2, \ldots, v_n) \begin{bmatrix} -q_1 & q_{12} & \cdots & q_{1n} \newline q_{21} & -q_2 & \cdots & q_{2n} \newline \vdots & \vdots & \ddots & \vdots \newline q_{n1} & q_{n2} & \cdots & -q_n \newline \end{bmatrix}

Thus, because vQ=0vQ = 0, for each state jj we have,

ijviqij=vjqj.\sum_{i \neq j} v_i q_{ij} = v_j q_j.

This is called the global balance equations.

One can generalize this to: If AA is a set of states, then the long term rates of movement into and out of AA are the same,

iAjAviqij=iAjAviqij.\sum_{i \in A} \sum_{j \notin A} v_i q_{ij} = \sum_{i \notin A} \sum_{j \in A} v_i q_{ij}.
Definition 33 (Local Balance Equations and Time-Reversible Chains)

A stronger condition than global balance is local balance. The flow between every pair of states is balanced.

Formally, an irreducible continuous-time Markov chain with stationary distribution vv is said to be time-reversible if, for all states ii and jj,

viqij=vjqji.v_i q_{ij} = v_j q_{ji}.
Note

The rate of observed changes from ii to jj is the same as the rate of observed change from jj to ii. Thus, this is also called time-reversibility, as the process looks the same when observed backward in time.

Further, If a probability vector vv satisfies the local balance equations, then it is a stationary distribution of the chain.

Definition 34 (Markov Process on a Tree)

Firstly, a tree is a connected (undirected) graph with no cycles. Thus, a transition graph gives rise to an undirected graph by connecting all nodes between which there is some rate of transition. Further, assume that the transition graph of an irreducible continuous-time Markov chain is (or gives rise to) a tree.

In a tree, any edge between two states divides all states into two groups (disjoint sets), each on each side of the edge. Thus, the flow must be balanced between across each edge.

In this scenario, the global balance condition then implies the local balance condition over this edge.

Definition 35 (Birth-and-Death Process)

A birth-and-death process is a continuous-time Markov chain where the state space is the set of non-negative integers and transitions only occur to neighboring states. The process is necessarily time-reversible, as the transition graph is a tree (in fact a line, if we further assume irreducibility).

We denote the births from ii to i+1i + 1 with λi\lambda_i, and the rate of deaths from ii to i1i - 1 with μi\mu_i. Thus, the generator matrix has the form,

Q=[λ0λ000μ1(λ1+μ1)λ100μ2(λ2+μ2)λ200μ3(λ3+μ3)]Q = \begin{bmatrix} -\lambda_0 & \lambda_0 & 0 & 0 & \cdots \newline \mu_1 & -(\lambda_1 + \mu_1) & \lambda_1 & 0 & \cdots \newline 0 & \mu_2 & -(\lambda_2 + \mu_2) & \lambda_2 & \cdots \newline 0 & 0 & \mu_3 & -(\lambda_3 + \mu_3) & \cdots \newline \vdots & \vdots & \vdots & \vdots & \ddots \newline \end{bmatrix}

where λi,μi>0\lambda_i, \mu_i > 0 for all i0i \geq 0.

Provided that k=1i=1kλi1μi<\sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_{i}} < \infty, (i.e., the stationary distribution exists), the stationary distribution is given by,

vk=v0i=1kλi1μi,v0=(1+k=1i=1kλi1μi)1.\begin{align*} v_k & = v_0 \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}, \newline v_0 & = \left(1 + \sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}\right)^{-1}. \newline \end{align*}
Proof (Stationary Distribution of Birth-and-Death Processes)

Using the local balance equations, we have,

vkλk=vk+1μk+1,\begin{align*} v_k \lambda_k & = v_{k + 1} \mu_{k + 1}, \newline \end{align*}

which gives,

vk+1=vkλkμk+1=v0i=1k+1λi1μi.\begin{align*} v_{k + 1} & = v_k \frac{\lambda_k}{\mu_{k + 1}} \newline & = v_0 \prod_{i = 1}^{k + 1} \frac{\lambda_{i - 1}}{\mu_i}. \newline \end{align*}

Further, as vv is a probability vector, we have,

1=k=0vk=v0(1+k=1i=1kλi1μi).v0=(1+k=1i=1kλi1μi)1.\begin{align*} 1 & = \sum_{k = 0}^{\infty} v_k \newline & = v_0 \left(1 + \sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}\right). \newline & \Rightarrow v_0 = \left(1 + \sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}\right)^{-1}. \newline \end{align*}

Thus, the proof is complete. _\blacksquare

Definition 36 (Brownian Motion)

Brownian motion is a continuous-time stochastic process {Bt}t0\{B_t\}_{t \geq 0} with the following properties:

  1. B0=0B_0 = 0.

  2. For t>0t >0, BtN(0,t)B_t \sim \mathcal{N}(0, t) (i.e., normally distributed with mean 00 and variance tt).

  3. For s,t>0s, t > 0, Bt+sBsN(0,t)B_{t + s} - B_s \sim \mathcal{N}(0, t) (i.e., the increments are stationary).

  4. For 0q<rs<t0 \leq q < r \leq s < t, BtBsB_t - B_s is independent of BrBqB_r - B_q (i.e., the increments are independent).

  5. The function tBtt \mapsto B_t is continuous with probability 11 (almost surely).

Intuition (Brownian Motion as Limit of Random Walks)

A random walk is a discrete-time Markov chain S0,S1,S2,S_0, S_1, S_2, \ldots where S0=0S_0 = 0 and,

Sn=Y1+Y2++Yn,S_n = Y_1 + Y_2 + \ldots + Y_n,

and Y1,Y2,Y_1, Y_2, \ldots are i.i.d. random variables. Assume E[Yi]=0\mathbb{E}[Y_i] = 0.

Further, if we assume Var(Yi)=1\mathrm{Var}(Y_i) = 1, we get Var(Sn)=n\mathrm{Var}(S_n) = n.

Interpolating between the values SnS_n we can make this into a continuous-time process StS_t where Var(St)t\mathrm{Var}(S_t) \approx t.

We may scale with an s>0s > 0 to get processes St(s)=SstsS_t^{(s)} = \frac{S_{st}}{\sqrt{s}} where we get limsVar(St(s))=t\lim_{s \to \infty} \mathrm{Var}(S_t^{(s)}) = t.

It turns out that the processes St(s)S_t^{(s)} when ss \to \infty are exactly Brownian motion, no matter what type of YiY_i we start with.

This is the Donsker’s invariance principle, which is a generalization of the central limit theorem to stochastic processes [^2].

Definition 37 (Gaussian Process)

A Gaussian process is a continuous-time stochastic process {Xt}t0\{X_t\}_{t \geq 0} with the property that for all n1n \geq 1 and 0t1<t2<<tn0 \leq t_1 < t_2 < \ldots < t_n, Xt1,Xt2,,XtnX_{t_1}, X_{t_2}, \ldots, X_{t_n} have a multivariate normal distribution.

Thus, a Gaussian process is completely determined by its mean function E[Xt]\mathbb{E}[X_t] and its covariance function Cov(Xs,Xt)\mathrm{Cov}(X_s, X_t).

Intuition (Brownian Motion as a Gaussian Process)

Brownian motion is a Gaussian process, as we can show that any a1Bt1+a2Bt2++anBtna_1 B_{t_1} + a_2 B_{t_2} + \ldots + a_n B_{t_n} is normally distributed.

A Gaussian process {Xt}t0\{X_t\}_{t \geq 0} is a Brownian motion if and only if,

  1. X0=0X_0 = 0.

  2. E[Xt]=0\mathbb{E}[X_t] = 0 for all tt.

  3. Cov(Xs,Xt)=min(s,t)\mathrm{Cov}(X_s, X_t) = \min(s, t) for all s,ts, t.

  4. The function tXtt \mapsto X_t is continuous with probability 11 (almost surely).

Intuition (Transformations of Brownian Motion)

The following transformations of Brownian motion also yield Brownian motion:

  • {Bt}t0\{-B_t\}_{t \geq 0}, negating the process yields another (reflected) Brownian motion.
  • {Bt+sBs}t0\{B_{t + s} - B_s\}_{t \geq 0} for any fixed s0s \geq 0, shifting the time origin yields another Brownian motion.
  • {1aBat}t0\{\frac{1}{\sqrt{a}} B_{a t}\}_{t \geq 0} for any fixed a>0a > 0, scaling time and space yields another Brownian motion.
  • The process {Xt}t0\{X_t\}_{t \geq 0} where X0=0X_0 = 0 and Xt=tB1tX_t = t B_{\frac{1}{t}} for t>0t > 0, time inversion yields another Brownian motion.
Intuition (Stopping Times)

We saw above that, for any fixed tt (Bt+sBs)t0(B_{t + s} - B_s)_{t \geq 0} is a Brownian motion. Does this phenomenon also hold if we start the chain anew from TT when TT is random? It depends.

If TT is the largest value less than 1 where BT=0B_T = 0, then (BT+sBT)s0(B_{T + s} - B_T)_{s \geq 0} is not a Brownian motion.

If TT is the smallest value where BT=aB_T = a for some constant aa, then (BT+sBT)s0(B_{T + s} - B_T)_{s \geq 0} is a Brownian motion. The reason is that the event T=tT = t can be determined based on BrB_r where 0rt0 \leq r \leq t.

Random TT’s that have this property are called stopping times. For these BT+sBTB_{T + s} - B_T is a Brownian motion.

Intuition (The Distribution of the First Hitting Time)

Given that a0a \neq 0, what is the distribution of the first hitting time Ta=min{t:Bt=a}T_a = \min\{t : B_t = a\}?

We will prove that,

1TaGamma(12,a22).\frac{1}{T_a} \sim \mathrm{Gamma}\left(\frac{1}{2}, \frac{a^2}{2}\right).

Assuming that a>0a > 0 and using that TaT_a is a stopping time we get for any t>0t > 0 that P(B1t>aTa<1t)=P(B1tTa>0)=12P(B_{\frac{1}{t}} > a \mid T_a < \frac{1}{t}) = P(B_{\frac{1}{t} - T_a} > 0) = \frac{1}{2}.

We also have,

P(B1t>aTa<1t)=P(B1t>a,Ta<1t)P(Ta<1t)=P(B1t>a)P(Ta<1t).\begin{align*} P(B_{\frac{1}{t}} > a \mid T_a < \frac{1}{t}) & = \frac{P(B_{\frac{1}{t}} > a, T_a < \frac{1}{t})}{P(T_a < \frac{1}{t})} \newline & = \frac{P(B_{\frac{1}{t}} > a)}{P(T_a < \frac{1}{t})}. \end{align*}

Further, it follows that P(Ta<1t)=2P(B1t>a)P(T_a < \frac{1}{t}) = 2 P(B_{\frac{1}{t}} > a) and thus,

P(1Tat)=2P(B1t>a)1=2P(B1at)1\begin{align*} P\left(\frac{1}{T_a} \leq t\right) & = 2 P\left(B_{\frac{1}{t}} > a\right) -1 \newline & = 2 P(B_1 \leq a \sqrt{t}) - 1 \newline \end{align*}

Taking the derivative with respect to tt we get the Gamma density,

π1/Ta(t)=212πexp(12(at)2)a2t1/2.\pi_{1/T_a}(t) = 2 \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{1}{2} \left(a \sqrt{t}\right)^2\right) \frac{a}{2} t^{-1/2}.
Intuition (Maximum of Brownian Motion)

We can define Mtmax0stBsM_t \coloneqq \max_{0 \leq s \leq t} B_s.

We may compute for a>0a > 0 (using the results above),

P(Mt>a)=P(Ta<t)=2P(Bt>a)=P(Bt>a).\begin{align*} P(M_t > a) & = P(T_a < t) \newline & = 2 P(B_t > a) \newline & = P(|B_t| > a). \end{align*}

Thus, MtM_t has the same d istribution as Bt|B_t|, i.e., the absolute value of BtB_t.

Intuition (Zeroes of Brownian Motion)

Let LL be the last zero in (0,1)(0, 1) of a Brownian motion. (i.e., L=max {t:0<t<1,Bt=0}L = \max \ \{t : 0 < t < 1, B_t = 0\}. Then,

LBeta(12,12).L \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right).
Proof (Outline of Proof)P(L>s)=P(L>sBs=t) N(t;0,s) dt=20P(L>sBs=t) N(t;0,s) dt=20P(M1s>t) N(t;0,s) dt=202P(B1s>t) N(t;0,s) dt=40tN(r;0,1s) N(t;0,s) dr dt==1πs11x(1x) dx=s1Beta(x;12,12) dx.\begin{align*} P(L > s) & = \int_{-\infty}^{\infty} P(L > s \mid B_s = t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 2 \int_{-\infty}^{0} P(L > s \mid B_s = t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 2 \int_{-\infty}^{0} P(M_{1 - s} > -t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 2 \int_{0}^{\infty} 2 P(B_{1 - s} > t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 4 \int_{0}^{\infty} \int_{t}^{\infty} \mathcal{N}(r; 0, 1 - s) \ \mathcal{N}(t; 0, s) \ dr \ dt \newline & = \ldots \newline & = \frac{1}{\pi} \int_s^1 \frac{1}{\sqrt{x(1 - x)}} \ dx \newline & = \int_s^1 \mathrm{Beta}\left(x; \frac{1}{2}, \frac{1}{2}\right) \ dx. \newline \end{align*}

We can then, let LtL_t be the last zero in (0,t)(0, t). Then,

LttBeta(12,12).\frac{L_t}{t} \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right).
Note

The probability that a Brownian motion has at least one zero in (r,t)(r, t) for 0r<t0 \leq r < t is 1P(Lt<r)1 - P(L_t < r).

Further, the cumulative distribution for the Beta(12,12)\mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right) density can be computed with the arcsin function,

P(Lt<r)=0rtBeta(s;12,12) ds=2πarcsin(rt).P(L_t < r) = \int_0^{\frac{r}{t}} \mathrm{Beta}\left(s; \frac{1}{2}, \frac{1}{2}\right) \ ds = \frac{2}{\pi} \arcsin\left(\sqrt{\frac{r}{t}}\right).
Definition 38 (Brownian Bridge)

Define a Gaussian process XtX_t by conditioning a Brownian motion BtB_t on B1=0B_1 = 0. Then XtX_t is a Brownian bridge.

If 0<s<t<10 < s < t < 1, then (Bs,Bt,B1)(B_s, B_t, B_1) is a multivariate normal with,

E[(Bs,Bt,B1)]=(0,0,0)Var((Bs,Bt,B1))=Σ=(ssssttst1).\begin{align*} \mathbb{E}[(B_s, B_t, B_1)] & = (0, 0, 0) \newline \mathrm{Var}((B_s, B_t, B_1)) & = \Sigma = \begin{pmatrix} s & s & s \newline s & t & t \newline s & t & 1 \newline \end{pmatrix}. \newline \end{align*}

Conditioning on B1=0B_1 = 0 and using the properties of the multivariate normal we get that E[Xt]=0\mathbb{E}[X_t] = 0 and,

Cov(Xs,Xt)=s(1t)=sst.\mathrm{Cov}(X_s, X_t) = s(1 - t) = s - st.

It follows that this is identical to the Brownian bridge defined above.

Definition 39 (Brownian Motion with Drift)

For any real μ>0\mu > 0 and σ>0\sigma > 0 we can define the Gaussian process XtX_t as,

Xt=μt+σBt.X_t = \mu t + \sigma B_t.

This is a Brownian motion with a draft, and is often a more useful model than standard Brownian motion.

Note

Note that XtX_t is Normal with expectation μt\mu t and variance σ2t\sigma^2 t.

This is a Gaussian process with continuous paths and stationary and independent increments.

Definition 40 (Geometric Brownian Motion)

The stochastic process,

Gt=G0eμt+σBt,G_t = G_0 e^{\mu t + \sigma B_t},

where G0>0G_0 > 0 is called a geometric Brownian motion with drift parameter μ\mu and variance parameter σ2\sigma^2.

log(Gt)\log(G_t) is a Gaussian process with expectation log(G0)+μt\log(G_0) + \mu t and variance σ2t\sigma^2 t.

Note

One can show that,

E[Gt]=G0et(μ+12σ2)Var(Gt)=G02e2t(μ+σ2)(eσ2t1).\begin{align*} \mathbb{E}[G_t] & = G_0 e^{t(\mu + \frac{1}{2} \sigma^2)} \newline \mathrm{Var}(G_t) & = G_0^2 e^{2 t(\mu + \sigma^2)} (e^{\sigma^2 t} - 1). \newline \end{align*}
Proof

Using that log(Gt)N(log(G0)+μt,σ2t)\log(G_t) \sim \mathcal{N}(\log(G_0) + \mu t, \sigma^2 t) we have,

E[Gt]=E[elog(Gt)]=ex N(x;log(G0)+μt,σ2t) dx=12πσ2texp((x(log(G0)+μt))22σ2t+x) dx=12πσ2texp((x(log(G0)+μtσ2t))22σ2t+log(G0)+μt+12σ2t) dx=elog(G0)+μt+12σ2t12πσ2texp((x(log(G0)+μtσ2t))22σ2t) dx=elog(G0)+μt+12σ2t=G0et(μ+12σ2).\begin{align*} \mathbb{E}[G_t] & = \mathbb{E}[e^{\log(G_t)}] \newline & = \int_{-\infty}^{\infty} e^x \ \mathcal{N}(x; \log(G_0) + \mu t, \sigma^2 t) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t))^2}{2 \sigma^2 t} + x\right) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t - \sigma^2 t))^2}{2 \sigma^2 t} + \log(G_0) + \mu t + \frac{1}{2} \sigma^2 t\right) \ dx \newline & = e^{\log(G_0) + \mu t + \frac{1}{2} \sigma^2 t} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t - \sigma^2 t))^2}{2 \sigma^2 t}\right) \ dx \newline & = e^{\log(G_0) + \mu t + \frac{1}{2} \sigma^2 t} \newline & = G_0 e^{t(\mu + \frac{1}{2} \sigma^2)}. \newline \end{align*}

Similarly, we can compute E[Gt2]\mathbb{E}[G_t^2] and use Var(Gt)=E[Gt2](E[Gt])2\mathrm{Var}(G_t) = \mathbb{E}[G_t^2] - (\mathbb{E}[G_t])^2 to get the variance,

E[Gt2]=E[e2log(Gt)]=e2x N(x;log(G0)+μt,σ2t) dx=12πσ2texp((x(log(G0)+μt))22σ2t+2x) dx=12πσ2texp((x(log(G0)+μt2σ2t))22σ2t+log(G0)+μt+2σ2t) dx=elog(G0)+μt+2σ2t12πσ2texp((x(log(G0)+μt2σ2t))22σ2t) dx=elog(G0)+μt+2σ2t=G02et(2μ+2σ2).\begin{align*} \mathbb{E}[G_t^2] & = \mathbb{E}[e^{2 \log(G_t)}] \newline & = \int_{-\infty}^{\infty} e^{2x} \ \mathcal{N}(x; \log(G_0) + \mu t, \sigma^2 t) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t))^2}{2 \sigma^2 t} + 2x\right) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t - 2 \sigma^2 t))^2}{2 \sigma^2 t} + \log(G_0) + \mu t + 2 \sigma^2 t\right) \ dx \newline & = e^{\log(G_0) + \mu t + 2 \sigma^2 t} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t - 2 \sigma^2 t))^2}{2 \sigma^2 t}\right) \ dx \newline & = e^{\log(G_0) + \mu t + 2 \sigma^2 t} \newline & = G_0^2 e^{t(2 \mu + 2 \sigma^2)}. \newline \end{align*}

Thus,

Var(Gt)=G02et(2μ+2σ2)(G0et(μ+12σ2))2=G02e2t(μ+σ2)(eσ2t1).\begin{align*} \mathrm{Var}(G_t) & = G_0^2 e^{t(2 \mu + 2 \sigma^2)} - \left(G_0 e^{t(\mu + \frac{1}{2} \sigma^2)}\right)^2 \newline & = G_0^2 e^{2 t(\mu + \sigma^2)} (e^{\sigma^2 t} - 1). \newline \end{align*}

_\blacksquare

Intuition (Stock Options)

A (European) stock option is a right (but not an obligation) to buy a stock at a given time tt in the future for a given price KK.

How much can you expect to earn from a stock option at that future time?

We get that (we will derive this),

E[max(GtK,0)]=G0et(μ+σ22)P(B1>βσtt)KP(B1>βt),\mathbb{E}[\max(G_t - K, 0)] = G_0 e^{t(\mu + \frac{\sigma^2}{2})} P\left(B_1 > \frac{\beta - \sigma t}{\sqrt{t}}\right) - K P\left(B_1 > \frac{\beta}{\sqrt{t}}\right),

where β=log(K/G0)μtσ\beta = \frac{\log(K / G_0) - \mu t}{\sigma}.

Derivation

Firstly, we need to prove the algebraic identity,

eσxN(x;0,t)=eσ2t2N(x;σt,t).e^{\sigma x} \mathcal{N}(x; 0, t) = e^{\frac{\sigma^2 t}{2}} \mathcal{N}(x; \sigma t, t).
Proof

We have,

eσxN(x;0,t)=12πtexp(x22t+σx)=12πtexp((xσt)22t+σ2t2)=eσ2t2N(x;σt,t). \begin{align*} e^{\sigma x} \mathcal{N}(x; 0, t) & = \frac{1}{\sqrt{2 \pi t}} \exp\left(-\frac{x^2}{2 t} + \sigma x\right) \newline & = \frac{1}{\sqrt{2 \pi t}} \exp\left(-\frac{(x - \sigma t)^2}{2 t} + \frac{\sigma^2 t}{2}\right) \newline & = e^{\frac{\sigma^2 t}{2}} \mathcal{N}(x; \sigma t, t). \ _\blacksquare \end{align*}

Then, defining β=log(K/G0)μtσ\beta = \frac{\log(K / G_0) - \mu t}{\sigma} we get,

E[max(GtK,0)]E[max(G0eμt+σBtK,0)]=max(G0eμt+σxK,0) N(x;0,t) dx=β(G0eμt+σxK) N(x;0,t) dx=G0eμtβeσx N(x;0,t) dxKβN(x;0,t) dx=G0et(μ+σ22)βN(x;σt,t) dxKβN(x;0,t) dx=G0et(μ+σ22)P(B1>βσtt)KP(B1>βt). \begin{align*} \mathbb{E}[\max(G_t - K, 0)] & \coloneqq \mathbb{E}[\max(G_0 e^{\mu t + \sigma B_t} - K, 0)] \newline & = \int_{-\infty}^{\infty} \max(G_0 e^{\mu t + \sigma x} - K, 0) \ \mathcal{N}(x; 0, t) \ dx \newline & = \int_{\beta}^{\infty} (G_0 e^{\mu t + \sigma x} - K) \ \mathcal{N}(x; 0, t) \ dx \newline & = G_0 e^{\mu t} \int_{\beta}^{\infty} e^{\sigma x} \ \mathcal{N}(x; 0, t) \ dx - K \int_{\beta}^{\infty} \mathcal{N}(x; 0, t) \ dx \newline & = G_0 e^{t(\mu + \frac{\sigma^2}{2})} \int_{\beta}^{\infty} \mathcal{N}(x; \sigma t, t) \ dx - K \int_{\beta}^{\infty} \mathcal{N}(x; 0, t) \ dx \newline & = G_0 e^{t(\mu + \frac{\sigma^2}{2})} P\left(B_1 > \frac{\beta - \sigma t}{\sqrt{t}}\right) - K P\left(B_1 > \frac{\beta}{\sqrt{t}}\right). \ _\blacksquare \end{align*}
Definition 41 (Martingale)

A stochastic process {Yt}t0\{Y_t\}_{t \geq 0} is a martingale if for t0t \geq 0,

  • E[YtYr,0rs]=Ys\mathbb{E}[Y_t \mid Y_r, 0 \leq r \leq s] = Y_s for 0st0 \leq s \leq t.
  • E[Yt]<\mathbb{E}[|Y_t|] < \infty. A Brownian motion BtB_t is a martingale.
Proof

We have,

E[BtBr,0rs]=E[BtBs+BsBr,0rs]=E[BtBsBr,0rs]+E[BsBr,0rs]=0+Bs=Bs.\begin{align*} \mathbb{E}[B_t \mid B_r, 0 \leq r \leq s] & = \mathbb{E}[B_t - B_s + B_s \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[B_t - B_s \mid B_r, 0 \leq r \leq s] + \mathbb{E}[B_s \mid B_r, 0 \leq r \leq s] \newline & = 0 + B_s \newline & = B_s. \end{align*}

and,

E[Bt]=x N(x;0,t) dx=20x N(x;0,t) dx=20x 12πtexp(x22t) dx=212πtt=2tπ<.\begin{align*} \mathbb{E}[|B_t|] & = \int_{-\infty}^{\infty} |x| \ \mathcal{N}(x; 0, t) \ dx \newline & = 2 \int_{0}^{\infty} x \ \mathcal{N}(x; 0, t) \ dx \newline & = 2 \int_{0}^{\infty} x \ \frac{1}{\sqrt{2 \pi t}} \exp\left(-\frac{x^2}{2 t}\right) \ dx \newline & = 2 \cdot \frac{1}{\sqrt{2 \pi t}} \cdot t \newline & = \sqrt{\frac{2 t}{\pi}} < \infty. \newline \end{align*}

Further, one can define a martingale with respect to other stochastic processes.

{Yt}t0\{Y_t\}_{t \geq 0} is a martingale with respect to {Xt}t0\{X_t\}_{t \geq 0} if for t0t \geq 0,

  • E[YtXr,0rs]=Ys\mathbb{E}[Y_t \mid X_r, 0 \leq r \leq s] = Y_s for 0st0 \leq s \leq t.
  • E[Yt]<\mathbb{E}[|Y_t|] < \infty.
Derivation (Geometric Brownian Motion can be a Martingale)

Let GtG0eμt+σBtG_t \coloneqq G_0 e^{\mu t + \sigma B_t} be a geometric Brownian motion. We can derive that,

E[GtBr,0rs]E[G0eμt+σBtBr,0rs]=E[G0eμ(ts)+σ(BtBs)+μs+σBseμs+σBsBr,0rs]=E[Gts]eμs+σBs=G0e(ts)(μ+σ22)eμs+σBs.=Gse(ts)(μ+σ22).\begin{align*} \mathbb{E}[G_t \mid B_r, 0 \leq r \leq s] & \coloneqq \mathbb{E}[G_0 e^{\mu t + \sigma B_t} \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[G_0 e^{\mu(t - s) + \sigma (B_t - B_s) + \mu s + \sigma B_s} e^{\mu s + \sigma B_s} \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[G_{t - s}] e^{\mu s + \sigma B_s} \newline & = G_0 e^{(t - s)(\mu + \frac{\sigma^2}{2})} e^{\mu s + \sigma B_s}. \newline & = G_s e^{(t - s)(\mu + \frac{\sigma^2}{2})}. \newline \end{align*}

We see that GtG_t is a martingale with respect to BtB_t if and only if μ+σ22=0\mu + \frac{\sigma^2}{2} = 0, i.e., μ=σ22\mu = -\frac{\sigma^2}{2}.

Examples

Example 1 (Toy Example of MCMC)

Consider the Markov chain X0,X1,X_0, X_1, \ldots with states {0,1,2}\{0, 1, 2\} and with,

P=[0.990.01000.90.10.200.8]P = \begin{bmatrix} 0.99 & 0.01 & 0 \newline 0 & 0.9 & 0.1 \newline 0.2 & 0 & 0.8 \end{bmatrix}

One can find that the limiting distribution is v=(2023,223,123)v = (\frac{20}{23}, \frac{2}{23}, \frac{1}{23}).

Consider the function r(x)=x5r(x) = x^5. If XX is a random variable with the limiting distribution,

E[r(X)]=052023+15223+25123=3423=1.4783\mathbb{E}[r(X)] = 0^5 \cdot \frac{20}{23} + 1^5 \cdot \frac{2}{23} + 2^5 \cdot \frac{1}{23} = \frac{34}{23} = 1.4783

If Y1,,YnY_1, \ldots, Y_n are all i.i.d variables with the limiting distribution, we can check numerically that,

limnr(Y1)+r(Y2)++r(Yn)n=1.4783\lim_{n \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_n)}{n} = 1.4783

We also get for X0,X1,X_0, X_1, \ldots that,

limnr(X1)+r(X2)++r(Xn)n=1.4783\lim_{n \to \infty} \frac{r(X_1) + r(X_2) + \ldots + r(X_n)}{n} = 1.4783

but in this case the limit is approached more slowly.

Example 2 (Counting Zeroes)

How many zeroes does BtB_t have in the interval (0,s)(0, s)?

If LsL_s is the last zero in (0,s)(0, s), we saw that,

LssBeta(12,12),\frac{L_s}{s} \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right),

which means that, with probability one, there exists a zero in the interval (0,s)(0, s) and for the last zero LsL_s we have 0<Ls<s0 < L_s < s.

Repeating the argument, we have that 0<LLs<Ls0 < L_{L_s} < L_s with probability one, and thus there exists another zero in (0,Ls)(0,s)(0, L_s) \subset (0, s).

The conclusion is that, with probability one, there is an infinite number of zeroes in (0,s)(0, s) for any s>0s > 0.

Example 3 (Simulating from the Ising Model using Perfect Sampling)

We saw above how to use Gibbs sampling to simulate from the Ising model. We then simulated from the conditional distributions, using,

π(σv=1σv)==11+exp(2βvwσw)\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) = \ldots = \frac{1}{1 + \exp(-2 \beta \sum_{v \sim w} \sigma_w)}

Now, we extend this method into a perfect sampling.

Define the partial ordering on configurations σ,τ\sigma, \tau be defining,

στ    σvτv for all vertices v\sigma \leq \tau \iff \sigma_v \leq \tau_v \text{ for all vertices } v

We have minimal and maximal elements m=(1,,1)m = (-1, \ldots, -1) and M=(1,,1)M = (1, \ldots, 1).

Note

iF στ\sigma \leq \tau, then for all vv π(σv=1σv)π(τv=1τv)\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) \leq \pi(\tau_v^{\star} = 1 \mid \tau_{-v}).

Defining g(σ,U)v=2I(Uπ(σv=1σv))1g(\sigma, U)_v = 2 I (U \leq \pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) ) - 1, makes gg monotone.

Given an Ising model with β>0\beta > 0, start with choosing an integer n>0n > 0 and setting configurations,

σ(n)=m=(1,,1)τ(n)=M=(1,,1)\begin{align*} \sigma^{(-n)} & = m = (-1, \ldots, -1) \newline \tau^{(-n)} & = M = (1, \ldots, 1) \newline \end{align*}
Algorithm (Perfect Sampling for the Ising Model)
  • For i=n+1,,0i = -n +1, \ldots, 0 set σ(i)=σ(i1)\sigma^{(i)} = \sigma^{(i - 1)}, τ(i)=τ(i1)\tau^{(i)} = \tau^{(i - 1)} and then,

    • Looping through all vertices vv:
      • Compute π(σv=1σv(i))\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}^{(i)}) and π(τv=1τv(i))\pi(\tau_v^{\star} = 1 \mid \tau_{-v}^{(i)}).
      • Simulate UUniform(0,1)U \sim \mathrm{Uniform}(0, 1).
      • Update σv(i)=g(σ(i),U)v\sigma_v^{(i)} = g(\sigma^{(i)}, U)_v and τv(i)=g(τ(i),U)v\tau_v^{(i)} = g(\tau^{(i)}, U)_v.
  • If σ(0)=τ(0)\sigma^{(0)} = \tau^{(0)}, this is the result. Otherwise, double nn and repeat.

Example 4 (Using a Binomial Likelihood)

Assume the offspring process is Binomial(N,p)\mathrm{Binomial}(N, p) for some parameter pp and a fixed (known) NN. By definition, we get the likelihood,

π(y1,y2,,ynp)i=1nBinomial(yi;N,p)\pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \prod_{i = 1}^n \mathrm{Binomial}(y_i; N, p)

A possibility is to use a prior pBeta(α,β)p \sim \mathrm{Beta}(\alpha, \beta). Writing S=i=1nyiS = \sum_{i = 1}^n y_i, we get the posterior,

pDBeta(α+S,β+nNS).p \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + S, \beta + nN - S).

where D={y1,y2,,yn}\mathcal{D} = \{y_1, y_2, \ldots, y_n\} is the data.

More generally, if π(p)=f(p)\pi(p) = f(p) for any positive function integrating to 1 on [0,1][0, 1], we get the posterior,

π(pD)pBeta(p;S+1,nNS+1)f(p).\pi(p \mid \mathcal{D}) \propto_p \mathrm{Beta}(p; S + 1, nN - S + 1) f(p).

We can then for example compute numerically the posterior probability that the branching process is supercritical, i.e., that P(p>1ND)P(p > \frac{1}{N} \mid \mathcal{D}), with,

1N1π(pD) dp=1N1Beta(p;S+1,nNS+1)f(p) dp01Beta(p;S+1,nNS+1)f(p) dp.\int_{\frac{1}{N}}^1 \pi(p \mid \mathcal{D}) \ dp = \frac{\int_{\frac{1}{N}}^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}{\int_0^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}.
Example 5 (Using a Multinomial Likelihood)

Assume there is a maximum of NN offspring and that now p=(p0,p1,,pN)p = (p_0, p_1, \ldots, p_N) is an (unknown) probability vector such that pip_i is the probability of ii offspring. By definition, we get the likelihood,

π(y1,y2,,ynp)Multinomial(c;p),\pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \mathrm{Multinomial}(c; p),

where c=(c0,c1,,cN)c = (c_0, c_1, \ldots, c_N) is the vector of counts in the data of cases with 0,,N0, \ldots, N offspring, respectively.

If we use a prior pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha), where α=(α0,α1,,αN)\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_N) is a vector of pseudocounts, we get the posterior,

pDDirichlet(α+c),p \mid \mathcal{D} \sim \mathrm{Dirichlet}(\alpha + c),

with expecation,

E[piD]=αi+cij=0N(αj+cj).\mathbb{E}[p_i \mid \mathcal{D}] = \frac{\alpha_i + c_i}{\sum_{j = 0}^N (\alpha_j + c_j)}.
Example 6 (Simplest Birth-and-Death Process)

The simplest example of a birth-and-death process is when λi=λ\lambda_i = \lambda and μi=μ\mu_i = \mu for all i0i \geq 0. In this case, the stationary distribution is given by,

vk=v0(λμ)k=v0(λμ)k,v0=(k=0(λμ)k)1=11+λμ+(λμ)2+=111λμ=1λμ.\begin{align*} v_k & = v_0 \left(\frac{\lambda}{\mu}\right)^k = v_0 \left(\frac{\lambda}{\mu}\right)^k, \newline v_0 & = \left(\sum_{k = 0}^{\infty} \left(\frac{\lambda}{\mu}\right)^k\right)^{-1} = \frac{1}{1 + \frac{\lambda}{\mu} + \left(\frac{\lambda}{\mu}\right)^2 + \ldots} = \frac{1}{\frac{1}{1 - \frac{\lambda}{\mu}}} = 1 - \frac{\lambda}{\mu}. \end{align*}

Which is a geometric distribution with parameter 1λμ1 - \frac{\lambda}{\mu}, Geometric(1λμ)\mathrm{Geometric}\left(1 - \frac{\lambda}{\mu}\right). Thus, the long-term average value of XtX_t is,

E[Xt]λμ1λμ=λμλ.E[X_t] \coloneqq \frac{\frac{\lambda}{\mu}}{1 - \frac{\lambda}{\mu}} = \frac{\lambda}{\mu - \lambda}.
Example 7 (Probability Computation with Brownian Motion)

Show that B1+B3+2B7N(0,50)B_1 + B_3 + 2 B_7 \sim \mathcal{N}(0, 50).

Solution

We can write,

B1+B3+2B7=B1+(B3B1)+B1+2(B7B3)+2B3=2B1+(B3B1)+2(B7B3)+2(B3B1)+2B1=4B1+2(B3B1)+2(B7B3).\begin{align*} B_1 + B_3 + 2 B_7 & = B_1 + (B_3 - B_1) + B_1 + 2 (B_7 - B_3) + 2 B_3 \newline & = 2 B_1 + (B_3 - B_1) + 2 (B_7 - B_3) + 2(B_3 - B_1) + 2 B_1 \newline & = 4 B_1 + 2 (B_3 - B_1) + 2 (B_7 - B_3). \end{align*}

Looking at each term separately, we have,

E[4B1]=0Var(4B1)=42Var(B1)=42\begin{align*} \mathbb{E}[4 B_1] & = 0 \newline \mathrm{Var}(4 B_1) & = 4^2 \mathrm{Var}(B_1) = 4^2 \end{align*}

and,

E[2(B3B1)]=0Var(2(B3B1))=22Var(B3B1)=222\begin{align*} \mathbb{E}[2 (B_3 - B_1)] & = 0 \newline \mathrm{Var}(2 (B_3 - B_1)) & = 2^2 \mathrm{Var}(B_3 - B_1) = 2^2 \cdot 2 \end{align*}

and,

E[2(B7B3)]=0Var(2(B7B3))=22Var(B7B3)=224\begin{align*} \mathbb{E}[2 (B_7 - B_3)] & = 0 \newline \mathrm{Var}(2 (B_7 - B_3)) & = 2^2 \mathrm{Var}(B_7 - B_3) = 2^2 \cdot 4 \end{align*}

Since the increments are independent, we have,

E[B1+B3+2B7]=0Var(B1+B3+2B7)=42+222+224=50.\begin{align*} \mathbb{E}[B_1 + B_3 + 2 B_7] & = 0 \newline \mathrm{Var}(B_1 + B_3 + 2 B_7) & = 4^2 + 2^2 \cdot 2 + 2^2 \cdot 4 = 50. \newline \end{align*}

Thus, B1+B3+2B7N(0,50)B_1 + B_3 + 2 B_7 \sim \mathcal{N}(0, 50). _\blacksquare

Example 8 (Conditional Distribution with Brownian Motion)

Show that P(B2>0B1=1)=0.8413P(B_2 > 0 \mid B_1 = 1) = 0.8413.

Solution

We can rewrite the conditional probability as,

P(B2B101B1=1).P(B_2 - B_1 0 - 1 \mid B_1 = 1).

Since we have independent increments, B2B1B_2 - B_1 is independent of B1B_1.

P(B2B11)P(B_2 - B_1 -1)

Further, B2B1N(0,1)B_2 - B_1 \sim \mathcal{N}(0, 1). Thus, we have,

P(B2B11)=P(Z>1)=0.8413,P(B_2 - B_1 -1) = P\left(Z > -1\right) = 0.8413,

where ZN(0,1)Z \sim \mathcal{N}(0, 1). _\blacksquare (we can compute this with 1 - pnorm(-1, 0, 1) in R).

Example 9 (Covariance Computation with Brownian Motion)

Show that Cov(Bs,Bt)=min(s,t)\mathrm{Cov}(B_s, B_t) = \min(s, t).

Solution

Without loss of generality, assume that sts \leq t. We can write,

Cov(Bs,Bt)=Cov(Bs,BtBs+Bs)=Cov(Bs,BtBs)+Cov(Bs,Bs)=0+Var(Bs)=s\begin{align*} \mathrm{Cov}(B_s, B_t) & = \mathrm{Cov}(B_s, B_t - B_s + B_s) \newline & = \mathrm{Cov}(B_s, B_t - B_s) + \mathrm{Cov}(B_s, B_s) \newline & = 0 + \mathrm{Var}(B_s) \newline & = s \newline \end{align*}

Conversely, if tst \leq s, we would get Cov(Bs,Bt)=t\mathrm{Cov}(B_s, B_t) = t. Thus, we have,

Cov(Bs,Bt)=min(s,t) .\mathrm{Cov}(B_s, B_t) = \min(s, t) \ _\blacksquare .
Example 10

What is the probability that M3>5M_3 > 5?

Solution

We have,

P(M35)=P(B3>5)=2P(B35)=2P(Z53)=0.046.\begin{align*} P(M_3 5) & = P(|B_3| > 5) \newline & = 2 P(B_3 5) \newline & = 2 P\left(Z \frac{5}{\sqrt{3}}\right) \newline & = 0.046. \end{align*}

where ZN(0,1)Z \sim \mathcal{N}(0, 1). _\blacksquare (we can compute this with 2 * (1 - pnorm(5 / sqrt(3), 0, 1)) in R).

Example 11

Find tt such that P(Mt4)=0.9P(M_t \leq 4) = 0.9.

Solution

We have,

P(Mt4)=P(Bt4)=12P(Bt4)=12P(Z4t).\begin{align*} P(M_t \leq 4) & = P(|B_t| \leq 4) \newline & = 1 - 2 P(B_t 4) \newline & = 1 - 2 P\left(Z \frac{4}{\sqrt{t}}\right). \newline \end{align*}

Setting this equal to 0.90.9 we get,

P(Z4t)=0.05.P\left(Z \frac{4}{\sqrt{t}}\right) = 0.05.

Looking up in the standard normal table (or using qnorm(0.95, 0, 1) in R) we get 4t=1.645\frac{4}{\sqrt{t}} = 1.645 and thus t=5.92t = 5.92. _\blacksquare

Example 12 (Stock Prices)

From what we have seen so far, one can model the price of a stock with:

  • Use a continuous-time stochastic model.
  • Consider the factor with which it changes, not the differences in prices.
  • Consider the log-normal distribution for such factors.
  • Use a parameter for the trend of the price (drift), and one for the volatility (variance) of the price.
  • Make a Markov assumption (although, we have to reflect on this choice).

This leads to using a geometric Brownian motion as a model,

Gt=G0eμt+σBt.G_t = G_0 e^{\mu t + \sigma B_t}.

In this context σ\sigma is called the volatility of the stock.

Example 13

A stock price is modeled with G0=67.3G_0 = 67.3, μ=0.08\mu = 0.08, and σ=0.3\sigma = 0.3. What is the probability that the price is above 100 after 3 years?

Solution

We want to compute,

P(G3>100)P(G0eμ3+σB3>100)=P(eμ3+σB3>100G0)=P(μ3+σB3>log(100G0))=P(B3>log(100G0)μ3σ)\begin{align*} P(G_3 > 100) & \coloneqq P\left(G_0 e^{\mu \cdot 3 + \sigma B_3} > 100\right) \newline & = P\left(e^{\mu \cdot 3 + \sigma B_3} > \frac{100}{G_0}\right) \newline & = P\left(\mu \cdot 3 + \sigma B_3 > \log\left(\frac{100}{G_0}\right)\right) \newline & = P\left(B_3 > \frac{\log\left(\frac{100}{G_0}\right) - \mu \cdot 3}{\sigma}\right) \newline \end{align*}

Since B3N(0,3)B_3 \sim \mathcal{N}(0, 3) we can compute this as 1 - pnorm((log(100 / 67.3) - 0.08 * 3) / 0.3, mean = 0, sd = sqrt(3)) in R.

Example 14

A stock price is modeled with G0=67.3G_0 = 67.3, μ=0.08\mu = 0.08, and σ=0.3\sigma = 0.3. What is the expected payoff from an option to buy the stock at 100 in 3 years?

Solution

We want to compute,

E[max(G3100,0)]=67.3e3(0.08+0.322)P(B1β0.333)100P(B1>β3)\begin{align*} \mathbb{E}[\max(G_3 - 100, 0)] & = 67.3 e^{3(0.08 + \frac{0.3^2}{2})} P\left(B_1 \frac{\beta - 0.3 \cdot 3}{\sqrt{3}}\right) - 100 P\left(B_1 > \frac{\beta}{\sqrt{3}}\right) \newline \end{align*}

where,

β=log(100/67.3)0.0830.31.213.\beta = \frac{\log(100 / 67.3) - 0.08 \cdot 3}{0.3} \approx 1.213.

Thus, we can compute this as 67.3 * exp(3 * (0.08 + 0.3^2 / 2)) * (1 - pnorm((1.213 - 0.3 * 3) / sqrt(3), mean = 0, sd = 1)) - 100 * (1 - pnorm(1.213 / sqrt(3), mean = 0, sd = 1)) in R.

Example 15

Let YtBt2tY_t \coloneqq B_t^2 - t for t0t \geq 0. Then, {Yt}t0\{Y_t\}_{t \geq 0} is a martingale with respect to the Brownian motion {Bt}t0\{B_t\}_{t \geq 0}.

Proof

We have,

E[YtBr,0rs]=E[Bt2tBr,0rs]=E[(BtBs+Bs)2tBr,0rs]=E[(BtBs)2Br,0rs]+2BsE[BtBsBr,0rs]+Bs2t=(ts)+0+Bs2t=Bs2s=Ys.\begin{align*} \mathbb{E}[Y_t \mid B_r, 0 \leq r \leq s] & = \mathbb{E}[B_t^2 - t \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[(B_t - B_s + B_s)^2 - t \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[(B_t - B_s)^2 \mid B_r, 0 \leq r \leq s] + 2 B_s \mathbb{E}[B_t - B_s \mid B_r, 0 \leq r \leq s] + B_s^2 - t \newline & = (t - s) + 0 + B_s^2 - t \newline & = B_s^2 - s \newline & = Y_s. \end{align*}

Further,

E[Yt]=E[Bt2t]E[Bt2]+t=Var(Bt)+(E[Bt])2+t=t+0+t=2t<.\begin{align*} \mathbb{E}[|Y_t|] & = \mathbb{E}[|B_t^2 - t|] \newline & \leq \mathbb{E}[B_t^2] + t \newline & = \mathrm{Var}(B_t) + (\mathbb{E}[B_t])^2 + t \newline & = t + 0 + t \newline & = 2 t < \infty. \newline \end{align*}

Thus, {Yt}t0\{Y_t\}_{t \geq 0} is a martingale with respect to {Bt}t0\{B_t\}_{t \geq 0}. _\blacksquare

Example 16 (Discounting Future Values of Stocks & The Black-Scholes Formula for Option Pricing)

When making investments, there is always a range of choices, some of which are sometimes called “risk-free”. Such investments may pay a fixed interest.

When interests are compounded frequently, a reasonable model is that an investment of G0G_0 has a value G0ertG_0 e^{rt} after time tt, where rr is the “risk-free” investment rate of return.

A common way to take this alternative into account is to instead “discount” all other investments with the factor erte^{-rt}.

For example, the discounted value of a stock can be modeled as,

ertGt=ertG0eμt+σBt=G0e(μr)t+σBt.e^{-rt} G_t = e^{-rt} G_0 e^{\mu t + \sigma B_t} = G_0 e^{(\mu - r) t + \sigma B_t}.

A possible assumption about the trend μ\mu of a stock price is that the discounted value behaves as a martingale with respect to the Brownian motion BtB_t.

Thus, we get,

μr+σ22=0,i.e.,μ=rσ22.\mu - r + \frac{\sigma^2}{2} = 0, \quad \text{i.e.,} \quad \mu = r - \frac{\sigma^2}{2}.

The Black-Scholes formula for option pricing is based on,

  • Assuming the discounted stock price is a martingale with respect to the Brownian motion.
  • Using discounting when computing the value of the option.

From this, we get,

ertE[max(GtK,0)]=G0P(B1>βσtt)KertP(B1>βt),e^{-rt} \mathbb{E}[\max(G_t - K, 0)] = G_0 P\left(B_1 > \frac{\beta - \sigma t}{\sqrt{t}}\right) - K e^{-rt} P\left(B_1 > \frac{\beta}{\sqrt{t}}\right),

where β=log(K/G0)(rσ22)tσ\beta = \frac{\log(K / G_0) - (r - \frac{\sigma^2}{2}) t}{\sigma}.