A stochastic process is a collection of random variables, {Xt,t∈I}.
where I is the index set of the process and S is the common state space of the random variables Xt.
Definition 2 (Markov Property)
A process fulfills the Markov property if, for any t0∈I, whenever Xt0 is known, Xt (with t>t0) is independent of the values for Xs for all s<t0.
Recall (Conditional Probability and Independence)
Given the events A and B, the conditional probability of A given B is defined as,
P(A∣B)=P(B)P(A∩B)
The events A and B are independent if P(A∩B)=P(A)P(B), or equivalently, if P(A∣B)=P(A).
The law of total probability states that if B1,B2,…,Bn is a sequence of events that partitions S. Then,
P(A)=i=1∑nP(A∩Bi)=i=1∑nP(A∣Bi)P(Bi)
Thus, Bayes’ law follow directly from the definition of conditional probability and the law of total probability,
P(B∣A)=P(A)P(A∣B)P(B)Definition 3 (Law of Total Expectation)E[Y]=E[E[Y∣X]]Definition 4 (Law of Total Variance)
Recall that, by definition,
Var(Y):=E[(Y−E[Y])2]=E[Y2]−(E[Y])2
Similarly, we have for the conditional variance,
Var(Y∣X):=EY∣X=x[(Y−E[Y∣X])2∣X]
With these definitions, we can state the law of total variance as,
Let S be a discrete set (not necessarily finite), called the state space. A Markov Chain is a sequence of random variables X0,X1,X2,… taking values in S, with the property,
π(Xn+1∣X0,X1,…,Xn)=π(Xn+1∣Xn)
for all n≥1.
The chain is said to be time-homogeneous if, for all n≥01We generally assume this unless otherwise stated.,
π(Xn+1∣Xn)=π(X1∣X0)
The transition matrix is defined with,
Pij=π(Xn+1=j∣Xn=i)
Further, a stochastic matrix is a square matrix P with non-negative entries, satisfying P1=1, where 1 is a row vector of ones (i.e., all rows sum to one).
Lastly, all transition matrices are stochastic matrices, and all stochastic matrices can be used as transition matrices.
Definition 6 (Limiting Distribution)
A limiting distribution for a Markov chain with transition matrix P is a probability vector v such that,
n→∞lim(Pn)ij=vj
for all states i and j.
An equivalent formulation is, the limit limn→∞(Pn)ij exists and does not depend on i.
Another equivalent formulation is, limn→∞Pn is a stochastic matrix with all rows identical.
Further, a Markov chain has either no or one unique limiting distribution.
If a limiting distribution exists, its probabilities correspond to the proportion of time steps the chain spends at each state in the long run.
Definition 7 (Stationary Distribution)
A stationary distribution for a Markov chain is a distribution that is unchanged when applying one step of the Markov chain.
If P is the transition matrix, then a probability vector v represents a stationary distribution if and only if,
vP=v
A Markov chain can have zero, one, or many stationary distributions.
Limiting distributions are always stationary distributions (but not necessarily vice versa).
Definition 8 (Regular Transition Matrices)
A stochastic matrix P is positive if all entries are strictly positive, i.e., Pij>0 for all i and j.
A stochastic matrix P is regular if Pn is positive for some n>0.
Theorem 1 (Limit Theorem for Regular Markov Chains)
If the transition matrix P is regular, the limiting distribution exists.
There are no other stationary distributions and the limiting distribution is positive, i.e., all states have positive probability in the limiting distribution.
Theorem 2 (Random Walks on Undirected Graphs)
Random walks in a Markov chains (i.e., undircted graphs) have the stationary distribution v with,
vi=Sdeg(i)
where deg(i) is the degree of node i (i.e., the number of edges going into node i) and S is the sum over all nodes of their degrees.
Further, weighted random walks in a Markov chains (i.e., weighted undircted graphs) have the stationary distribution v with,
vi=ew(i)
where w(i) is the sum of the weights of the edges going into node i, and e is the total sum over all nodes of their w(i) values.
Intuition (Moving Between States)
State j is accessible from state i if (Pn)ij>0 for some n≥0.
States i and j communicate if i is accessible from j and j is accessible from i.
“Communication” is transitive, i.e., if i communicates with j and j communicates with k, then i communicates with k.
Communication is an equivalence relation, subdividing all states into communication classes.
Communication classes can be found for example by drawing transition graphs.
Lastly, a Markov chain is irreducible if it has exactly one communication class.
Definition 9 (Recurrence and Transience)
Let Tj be the first passage time to state j,
Tj=min{n>0:Xn=j}
Let fj be the probability that a chain starting at j will return to j,
fj=P(Tj<∞∣X0=j)
A state j is recurrent if a chain starting at j will eventually revisit j, i.e., if fj=1.
A state j is transient if a chain starting at j has a positive probability of never revisiting j, i.e., if fj<1.
Note
The expected number of visits at j when the chain starts at i is given by ∑n=0∞(Pn)ij.
j is recurrent if and only if ∑n=0∞(Pn)jj=∞.
j is transient if and only if ∑n=0∞(Pn)jj<∞.
Note (Communication Classes)
The states of a communication class are either all recurrent or all transient.
The states of a finite irreducible Markov chain are all recurrent.
If a state is recurrent, only states inside its communication class are accessible from it.
If no states outside a finite communication class are accessible from it, then the class consists of recurrent states.
Theorem 3 (Limit Theorem for Finite Irreducible Markov Chains)Recall
In a finite irreducible Markov chain, all states are recurrent.
Let μj=E[Tj∣X0=j] be the expected return time to state j.
Then μj<∞ for all states j and the vector v with vj=μj1 is the unique stationary distribution.
Further,
vj=n→∞limn1m=0∑n−1(Pm)ijNote
All finite regular Markov chains are finite irreducible Markov chains, but not vice versa.
The overall conclusion here is weaker than that for finite regular Markov chains. Not all finite irreducible Markov chains have limiting distributions.
Definition 10 (Periodicity)
The period of a state i is the greatest common divisor of all n>0 such that (Pn)ii>0.
All states of a communication class have the same period.
A Markov chain is periodic if it is irreducible and all states have period greater than 1.
A Markov chain is aperiodic if it is irreducible and all states have period 1.
Definition 11 (Ergodic Markov Chains)
A Markov chain is ergodic if,
It is irreducible.
It is aperiodic.
All states are positive recurrent (i..e, have finite expected return times (which always is true if the state space is finite)).
Theorem 4 (Fundamental Limit Theorem for Ergodic Markov Chains)
There exists a unique stationary distribution v which is the limiting distribution of the chain.
Note
We can also show that a finite Markov chain is ergodic if and only if its transition matrix is regular.
Definition 12 (Time Reversibility)
Let P be the transition matrix of an irreducible Markov chain with stationary distribution v.
The chain is time reversible if, when running from its stationary distribution, it looks the same moving forward as backwards, i.e.,
π(Xk=i,Xk+1=j)=π(Xk=j,Xk+1=i)
This may also be written as viPij=vjPji for all states i and j.
It is called the detailed balance equations.
Note
If x is a probability vector satisfying xiPij=xjPji for all states i and j, then x is the stationary distribution of the chain, and the chain is time reversible.
Proof
We have,
(xP)j=i∑xiPij=i∑xjPji=xji∑Pji=xj
Thus, xP=x, so x is the stationary distribution.
If a Markov chain is defined as a random walk on a weighted undirected graph, then it is time reversible.
Proof
Let w(i,j) be the weight of the edge between nodes i and j (0 if no edge exists).
Let w(i)=∑kw(i,k) be the sum of weights of edges going into node i.
The transition matrix is given by Pij=w(i)w(i,j).
Let vi=ew(i), where e is the total sum over all nodes of their w(i) values.
Then,
Thus, the detailed balance equations are satisfied, so the chain is time reversible.
If a finite Markov chain is time reversible, it can be represented as a random walk on a weighted undirected graph.
Proof
Let P be the transition matrix of a finite time reversible Markov chain with stationary distribution v.
Let w(i,j)=viPij be the weight of the edge between nodes i and j.
Then,
j∑w(i,j)=j∑viPij=vij∑Pij=vi
Thus, the sum of weights of edges going into node i is vi.
The transition matrix is given by,
Pij=∑kw(i,k)w(i,j)=viviPij=Pij
Thus, the Markov chain can be represented as a random walk on a weighted undirected graph.
Definition 13 (Absorbing Chains)
State i is absorbing if Pii=1a
A Markov chain is an absorbing chain if it has at least one absorbing state.
By reordering the states, the transition matrix for an absorbing chain can be written in the block form,
P=[Q0RI]
where I is the identity matrix, 0 is a matrix of zeroes, and Q corresponds to transient states.
We can prove by induction that,
Pn=[Qn0(I+Q+Q2+…+Qn−1)RI]
Taking the limit and using limn→∞Qn=0 (since all states corresponding to Q are transient), we have,
n→∞limPn=[00(I−Q)−1RI]=[00FRI]
where F=(I−Q)−1=limn→∞(I+Q+Q2+…+Qn−1) is called the fundamental matrix.
The probability to be absorbed in a particular absorbing state given a start in a transient state is given by the entries of FR.
Further, the expected number of visits in state j for a chain that starts in the transient state i is given by Fij (as this is equal to ∑n=0∞(Pn)ij).
Thus, the expected number of steps until absorption is given by the vector F1T, where 1 is a column vector of ones.
Note
Given an irreducible Markov chain, to compute the expected number of steps needed to go from state i to the first visit to state j, one can change the chain into one where state j is absorbing, and compute the expected number of steps until absorption when starting at state i.
Definition 14 (Galton-Watson Branching Process)
A (Galton-Watson) branching process is a discrete Markov chain Z0,Z1,…,Zn,… where,
The state space is the non-negative integers.
Z0=1 (i.e., the process starts with one individual).
Zn is the sum X1+X2+…+XZn−1, where Xi are independent random non-negative integers all with the same offspring distribution,
Zn=i=1∑Zn−1Xi
Connecting each of the Zn individuals in generation n with their offspring in generation n+1 we get a tree illustrating the branching process.
The offspring distribution is described by the probability vector a=(a0,a1,…) where aj=P(Xi=j).
We will mainly focus on the interesting cases where we assume a0>0 (i.e., there is a positive probability of having no offspring) and a0+a1<1 (i.e., there is a positive probability of having more than one offspring).
Definition 15 (Expected Size of Generations in a Branching Process and Classification of Branching Processes)Note
Note that the state 0 is abosribing, this absorbtion is called extinction.
Also, as a0>0, all nonzero states are transient.
Let μ=E[Xi]=∑j=0∞j⋅aj, i.e., the expected number of offspring per individual.
From this, we can prove by induction that, for n≥1,
Var(Zn)=σ2μn−1k=0∑n−1μk={nσ2σ2μn−1μ−1μn−1if μ=1if μ=1Theorem 5 (Extinction Probability Theorem)
Let GX be the probability generating function for the offspring distribution for a branching process.
The probability of eventual extinction is the smallest positive root of the equation GX(s)=s.
Intuition (Probability Generating Functions)
For any discrete random variable X taking values in {0,1,2,…}, define the probability function GX(s) as,
GX(s)=E[sX]=k=0∑∞P(X=k)sk
The series converges absolutely for ∥s∥≤1. We assume s is a real number in [0,1].
We get a 1-1 correspondence between probability vectors on {0,1,2,…} and functions represented by a series where the coefficients sum to 1.
Specifically, if GX(s)=GY(s) for all s for random variables X and Y, then X and Y have the same distribution.
Note (Properties of Probability Generating Functions)
To go from X to GX(s), we compute the (in)finite sum.
To go from GX(s) to X, we use that we have,
P(X=k)=k!GX(k)(0)
If X and Y are independent random variables, then,
GX+Y(s)=E[sX+Y]=E[sXsY]=E[sX]E[sY]=:GX(s)GY(s)
E[X]=GX′(1)
E[X(X−1)]=GX′′(1)
As a consequence, Var(X)=GX′′(1)+GX′(1)−(GX′(1))2
Further, assume we have a branching process Z0,Z1,… with independent offspring variables Xi all with the same distribution as X.
By writing Gn(s)=GZn(s)=E[sZn] and GX(s)=E[sX], we get,
Gn(s)=n timesGX(GX(…GX(s)…))Theorem 6 (Extinction Probability Theorem (Addition))
Let GX be the probability generating function for the offspring distribution for a branching process.
The probability of eventual extinction is the smallest positive root of the equation GX(s)=s.
Also, if the process is critical (μ=1) then the extinction probability is 1.
Proof (Of the last sentence)
In the critical case,
GX′(1)=E[X]=μ=1
As GX′′(s)>0 for s∈(0,1), we get that GX′(s)<1 for s∈(0,1), and dsd(G(s)−s)<0 for s∈(0,1).
As G(1)−1=0 for any probability generating function, we get that G(s)=s has its smallest positive root at s=1.
Theorem 7 (Strong Law of Large Numbers)
Strong law of large numbers of samples: If Y1,Y2,…,Ym and Y are i.i.d random variables from a distribution with finite mean, and if E[r(Y)] exists, then, with probability 1,
m→∞limmr(Y1)+r(Y2)+…+r(Ym)=E[r(Y)]Theorem 8 (Strong Law of Large Numbers for Markov Chains)
Strong law of large numbers for Markov chains: If X0,X1,… is an ergodic Markov chain with stationary distribution π, and if EX∼π[r(X)] exists, then, with probability 1,
m→∞limmr(X0)+r(X1)+…+r(Xm)=EX∼π[r(X)]
where X has distribution π.
Note
When using this theorem in practice, one might improve accuracy by throwing away the first sequence X1,…,Xs for s<m before computing the average. The first sequence is then called the burn-in.
Definition 17 (The Metropolis-Hastings Algorithm)
The Metropolis-Hastings algorithm is a method to construct a Markov chain having a desired limiting distribution.
Let θ be a random variable with probability mass function π(θ).
We also assume given a proposal distribution T(θnew∣θ), which, for every given θ, provides a PMF for a new θnew.
Finally, we define, for θ and θnew, the acceptance probability,
a=min(1,π(θ)T(θnew∣θ)π(θnew)T(θ∣θnew))
The Metropolis-Hastings algorithm is: Starting with some initial value θ0, generate θ1,θ2,… by, at each step, proposing a new θ based on the old using the proposal function and accepting it with probability a.
If it is not accepted, the old value is used again.
If this defines an ergodic Markov chain, its unique stationary distribution is π(θ).
Proof
Let P be the transition matrix of the Markov chain defined by the Metropolis-Hastings algorithm.
We need to show that πP=π.
It suffices to show that the detailed balance equations are satisfied, i.e., that,
Thus, in both cases, the detailed balance equations are satisfied, so π is the stationary distribution of the chain.
Definition 18 (Gibbs Sampling)
For any probability model over a vector θ=(θ1,θ2,…,θn), consider a Metropolis-Hastings proposal function changing only one coordinate, with the value of this coordinate simulated from the conditional distribution given the remaining coordinates.
The acceptance probability is always 1, so the proposal is always accepted.
Proof
Let θ=(θ1,θ2,…,θn) and θnew=(θ1,…,θi−1,θi,new,θi+1,…,θn) differ only in coordinate i.
The proposal distribution is,
T(θnew∣θ)=P(θi=θi,new∣θ1,…,θi−1,θi+1,…,θn)
and,
T(θ∣θnew)=P(θi=θi∣θ1,…,θi−1,θi+1,…,θn)
Using the definition of conditional probability, we have,
Putting together an algorithm updating different coordinates in different steps may create an ergodic Markov chain.
This is called Gibbs sampling.
Sometimes the conditional distributions are easy to derive. Then this is an easy-to-use version of Metropolis-Hastings.
Definition 19 (The Ising Model)
Uses a grid of vertices; We will assume an n×n grid.
Two vertices v and w are neighbours, denoted v∼w, if they are next to each other in the grid.
Each vertex v can have value +1 or −1 (called its “spin”); We denote this by σv=1 or σv=−1.
A configuration σ consists of a choice of +1 or −1 for each vertex. Thus, the set Ω of possible configurations has 2(n2) elements.
We define the energy of a configuration as E(σ)=−∑v∼wσvσw.
The Gibbs distribution is the probability mass function on Ω defined by,
π(σ)∝exp(−βE(σ))
where β is a parameter of the model; β1 is called the temperature.
It turns out that when the temperature is high, samples from the model will show a chaotic pattern of spins, but when the temperature sinks below the phase transition value, samples will show chunks of neighbouring vertices with the same spin, i.e., the system will be “magnetized”.
Intuition (Simulating from the Ising Model using Gibbs Sampling)
For a vertex configuration σ and a vertex v, let σ−v denote the part of σ that does not involve v.
Prpose a new configuration σ⋆ given an old configuration σ by first choosing a vertex v, then, let σ⋆ be identical to σ except possibly at v.
Decide the spin at v using the conditional distribution given σ−v,
Given an ergodic Markov chain with finite state space of size k and limiting distribution π.
The idea is that, given n prove that Xn actually has reached the limit distribution π.
If we can prove that the distribution at Xn is independent of the starting state X0, then we might be able to conclude that Xn has the limiting distribution π.
Construct k Markov chains that are dependent (“coupled”) but which are marginally Markov chains as above.
If they start at the k possible value at X0 but have identical values at Xn, then we are done.
Note
n cannot be determined as the first value where the k chains meet, it must be determined independently of such information.
Thus, usually one wants to generate chains X−n,X−n+1,…,X0 where X0 has the limiting distribution, and we stepwise increase n to make all chains coalesce to one chain at time 0.
Intuition (Using same source of randomness for all chains)
Consider the chains X−n(j),…,X0(j) for j=1,2,…,k.
Instead of simulating Xi+1(j) based on Xi(j) independently for each j, we define a function g so that Xi+1(j)=g(Xi(j),Ui) for all j where Ui∼Uniform(0,1).
Thus, if two chains have identical values in Xi, they will also be identical in Xi+1.
Thus, for a particular n, if all chains have not converged at X0, we simulate k chains from X−2n to X−n.
They might only hit a subset of the k states at X−n and thus might coalesce to one state at X0 using the old simulations. If not, double n again.
Definition 20 (Counting Process)
A counting process {Nt}t∈I is a stochastic process where I=R0+, where the state space is the non-negative integers, and where 0≤s≤t implies Ns≤Nt.
Generally, when s<t,Nt−Ns counts the number of “events” (in the loose meaning “things happening”) in (s,t].
A realization of Nt is then a function of t that is a right-continuous step function.
Definition 21 (Poisson Process I)
A Poisson process {Nt}t≥0 with parameter λ>0 is a counting process, fulfilling,
N0=0.
Nt∼Poisson(λt) for all t>0.
Stationary Increments: Nt+s−Ns has the same distribution as Nt for all s>0,t>0.
Independent Increments: Nt−Ns and Nr−Nq are independent, when 0≤q<r≤s<t.
Note
It is not obvious that such a process necessarily exists. Further, E[Nt]=λt, thus what one is counting occurs with a rate of λ items per time unit.
Definition 22 (Poisson Process II)
Let X1,X2,… be a sequence of i.i.d. exponential random variables with parameter λ.
Let N0=0 and, for t>0,
Nt:=max{n:X1+…+Xn≤t}.
Then, {Nt}t≥0 is a Poisson process with parameter λ.
Proof
Firstly, if we start with a Poisson process (definition I), and let X1,X2,… be the inter-arrival times, i.e., the time between event 1 and event 2, event 2 and event 3, etc., then these are i.i.d. exponential random variables with parameter λ.
Thus, Nt will be defined as above.
Conversely, if we construct Nt as above, all properties of definition I can be verified.
Let {Mt}t≥0 be defined as Mt:=Nt+s−Ns for some s>0.
It is a counting process by definition.
N0=0 by definition.
Stationary increments follow from the memoryless property of the exponential distribution: Mk=Ns+t+k−Ns+t=max{n:XNs+t+1+…+XNs+t+n≤k}, which has the same distribution as Nk.
Independent increments follow from the independence of the Xi’s.
Mk=Mk+s−Ns=max{n:XNs+1+…+XNs+n≤k}, which has a Poisson(λk) distribution.
We call Sn:=X1+…+Xn the arrival times of the process.
Let M=min{X1,…,Xn}, where, independently for each i, Xi∼Exponential(λi), then,
P(M=X1)=P(X2≥X1,…,Xn≥X1)=E[I[X2≥X1,…,Xn≥X1]]=E[E[I[X2≥X1,…,Xn≥X1]∣X1]]=E[P(X2≥X1,…,Xn≥X1∣X1)]=E[e−λ2X1…e−λnX1∣X1]=E[e−(λ2+…+λn)X1]=∫0∞e−(λ2+…+λn)xλ1e−λ1xdx=λ1∫0∞e−(λ1+λ2+…+λn)xdx=λ1+λ2+…+λnλ1=1[−λ1+λ2+…+λn1e−(λ1+λ2+…+λn)x]0∞=λ1+λ2+…+λnλ1■Definition 23 (Poisson Process III)
A Poisson process {Nt}t≥0 with parameter λ>0 is a counting process, fulfilling,
N0=0.
The process has stationary and independent increments.
P(Nh=0)P(Nh=1)P(Nh>1)=1−λh+o(h)=λh+o(h)=o(h)
for all h>0, where o(h) is a function such that limh→0ho(h)=0.
Lemma 1 (Superposition of Poisson Processes)
Let {Nt(1)}t≥0,…,{Nt(n)}t≥0 be independent Poisson processes with parameters λp1,…,λpn, respectively, where p=(p1,…,pn) is a probability vector.
If c=(c1,…,cn) are the counts after time t (i.e., ci=Nt(i)), then, the conditional distribution of (Nt(1),…,Nt(n)), an equivalent model is,
c∼Multinomial(Nt,p),
where {Nt}t≥0 is a Poisson process with parameter λ.
Proof
Using the model with independent Poisson processes, the probability of observing the count vector c after time t is (denoting N=c1+…+cn),
The process for N inherits independent and stationary increments from the sub-processes, so it follows it also a Poisson process with parameter λ.
Lemma 2 (Uniformly Distributed Arrival Times)
Let {Nt}t≥0 be a Poisson process with parameter λ.
If we fix that Nt=k, and we select uniformly randomly one of the k arrivals, then its arrival time is uniformly distributed on the interval [0,t].
Proof
Let S1,S2,…,Sk be the arrival times given that Nt=k.
P(Sk≥s∣k uniformly random in 1,…,n,Nt=n)=n1k=1∑nP(Sk≥s∣Nt=n)=n1k=1∑nj=0∑k−1P(Ns=j∣Nt=n)=n1k=1∑nj=0∑k−1P(Nt=n)P(Ns=j)P(Nt−s=n−j)=n1k=1∑nj=0∑k−1e−λtn!(λt)ne−λsj!(λs)je−λ(t−s)(n−j)!(λ(t−s))n−j=n1j=0∑n−1(n−j)j!(n−j)!n!(ts)j(1−ts)n−j=[j=0∑n−1j!(n−j−1)!n!(ts)j(1−ts)n−j−1]⋅(1−ts)=1−tsDefinition 24 (Spatial Poisson Process)
A collection of random variables {NA}A⊆Rd is a spatial Poisson process with parameter λ if,
For each bounded set A⊆Rd, NA has a Poisson distribution with parameter λ∣A∣.
The second property follows directly from the definition of the transition function.
Intuition (Holding Times)
Define Ti as the time the continuous-time Markov chain {Xt}t≥0 that always starts in state i stays in i before moving to a different state, so that for any s>0,
P(Ti>s)=P(Xu=i,0≤u≤s)
The distribution of Ti is memoryless and thus exponential.
We define qi such that,
Ti∼Exponential(qi)
Remember that this means that the average time the process stays in i is qi1.
The rate of transition out of the state is qi.
Note
We can have qi=0, meaning that the state i is absorbing,
P(Ti>s)=1.Intuition (Embedded Markov Chain)
We can define a new stochastic process by listing the states the chain visits.
This will be a discrete time Markov chain called the embedded Markov chain with transition matrix P~.
Note
The transition matrix P~ has zeros along its diagonal! Why?
Because the process cannot stay in the same state when it makes a transition.
Further, note that the continuous-time Markov chain is completely determined by the expected holding times (q11,…,qk1) and the transition matrix of the embedded Markov chain P~.
Intuition (Constructing a Continuous-Time Markov Chain)
A way to describe a continuous-time Markov chain is to describe k×(k−1) independent “alarm clocks”.
For states i and j so that i=j, let qij be the parameter of an Exponentially distributed random variable representing the time until an “alarm clock” rings.
When in state i, wait until the first alarm clock qi rings, then move to the state given by the index j of that alarm clock.
This defines a continuous-time Markov chain.
The time until the first alarm clock qi rings is Exponentially distributed with parameter given by,
qi=qi1+qi2+…+qi,i−1+qi,i+1+…+qik.
i.e., the parameter of the holding time distribution at i.
The chain is completely described by the rates qij.
Further, we saw also that the chain is completely described by the pij and the qi.
The relationship is described by the equation above, and,
Since we arbitrarily chose to write P(t+h) as P(t)P(h), we could have equally written it as P(h)P(t), leading to the other equation, thus completing the proof.
Intuition (Matrix Exponential Solution)
The solution to the Kolmogorov Forward and Backward equations is given by the matrix exponential.
For any square matrix A we can define the matrix exponential as,
eA:=n=0∑∞n!1An=I+A+21A2+61A3+…
The series converges for all square matrices A (but we will not prove this here).
Some important properties of the matrix exponential are:
e0=I
eAe−A=I
e(s+t)A=esAetA
If AB=BA, then eA+B=eAeB=eBeA
∂t∂etA=AetA=etAA
Note that P(t)=etQ is the unique solution to the differential equations given by the Kolmogorov Forward and Backward equations with initial condition P(0)=I.
Further, note that P′(t)=QP(t) for all t≥0 and P(0)=I.
In R one can use the expm package to compute matrix exponentials.
Definition 27 (Multinomial Distribution)
A vector x=(x1,x2,…,xk) of non-negative integers has a Multinomial distribution with parameters n and p, where n>0 is an integer and p is a probability vector of length k, if ∑i=1kxi=n and the probability mass function is given by,
π(x∣n,p)=x1!x2!…xk!n!p1x1p2x2…pkxk.
We write x∼Multinomial(n,p).
Definition 28 (Dirichlet Distribution)
A vector p=(p1,p2,…,pk) of non-negative real numbers satisfying ∑i=1kpi=1 has a Dirichlet distribution with parameter vector α=(α1,α2,…,αk), if it has probability density function,
where αi>0 for all i=1,2,…,k.
We write p∼Dirichlet(α).
Theorem 11 (Multinomial-Dirichlet Conjugacy)
Let x=(x1,x2,…,xk) be a vector of counts with x∼Multinomial(n,p) and prior p∼Dirichlet(α).
Then the posterior distribution of p given x is,
p∣x∼Dirichlet(α1+x1,α2+x2,…,αk+xk).
Further, if p∼Dirichlet(α), then E[pi]=∑j=1kαjαi for i=1,2,…,k.
Proof (Multinomial-Dirichlet Conjugacy)
Let π(x∣p)=Multinomial(x;n,p) and π(p)=Dirichlet(p;α), respectively.
Then,
π(p∣x)∝π(x∣p)π(p)=Multinomial(x;n,p)Dirichlet(p;α)∝pp1x1p2x2…pkxk⋅p1α1−1p2α2−1…pkαk−1∝pp1α1+x1−1p2α2+x2−1…pkαk+xk−1=Dirichlet(p;α1+x1,α2+x2,…,αk+xk)■Proof (Expectation of Dirichlet Distribution)
Let p∼Dirichlet(α).
Then,
E[pi]=∫⋯∫piΓ(α1)Γ(α2)…Γ(αk)Γ(α1+α2+…+αk)p1α1−1p2α2−1…pkαk−1dp1dp2…dpk=Γ(α1)Γ(α2)…Γ(αk)Γ(α1+α2+…+αk)∫⋯∫p1α1−1p2α2−1…piαi…pkαk−1dp1dp2…dpk=Γ(α1)Γ(α2)…Γ(αk)Γ(α1+α2+…+αk)⋅Γ(α1+α2+…+αk+1)Γ(α1)Γ(α2)…Γ(αi+1)…Γ(αk)=Γ(αi)Γ(αi+1)⋅Γ(α1+α2+…+αk+1)Γ(α1+α2+…+αk)=αi⋅α1+α2+…+αk1=∑j=1kαjαi■Intuition (Predictions for The Multinomial-Dirichlet Model)
If p∼Dirichlet(α) and x∣p∼Multinomial(n,p), then the predictive distribution is given by,
For example, if ei is the vector with 1 at position i and zeros elsewhere, then π(x=ei)=∑j=1kαjαi.
Further, if xnew is a vector of new counts, then, as p∣x∼Multinomial(x+α), we get,
π(xnew=ei∣x)=∑j=1kαj+nαi+xi.
The αi in the prior are often called pseudocounts, as they can be interpreted as prior counts added to the observed counts.
Definition 29 (Hidden Markov Model)
A Hidden Markov Model (HMMs) consists of:
A Markov chain X0,…,Xn,… and,
another sequence Y0,…,Yn,… such that,
P(Yk∣Y0,…,Yk−1,X0,…,Xn)=P(Yk∣Xk).
In some models we may instead have,
P(Yk∣Y0,…,Yk−1,X0,…,Xn)=P(Yk∣Yk−1,Xk).
Generally, Y0,…,Yn are called observations and X0,…,Xn are called hidden states.
The Xi’s represent the “underlying process” that the observed values Yi’s depend on.
Further, generally the Xk have a finite state space.
Intuition (Inference in HMMs)
When the parameters of the HMM are known, we want to know about the values of the hidden variables Xi, for example,
What is the most likely sequence X0,X1,…,Xn given the data?
What is the probability distribution for a single Xi given the data?
However, when the parameters of the HMM are unknown, we need to infer these from some data.
If data with all Xi and Yi known is available, inference for parameters is based on counts of transitions.
The inference for the Markov chain is exactly as for the Markov chains we have looked at before.
The inference for the emission probabilities, i.e., the parameters of P(Yk∣Xk), can be done independently of the inference for the Markov chain.
Intuition (Bayesian Inference for Poisson Processes)
For a homogeneous Poisson process, we setup up a prior π(λ) for its parameter λ, and find the posterior given the observations.
The likelihood for observing y events in an interval of length t is given by,
Poisson(y;λt)=exp(−λt)y!(λt)y∝λexp(−λt)λy.
A convenient prior to use is λ∼Gamma(α,β).
In this case, the posterior becomes,
λ∣D∼Gamma(α+y,β+t),
and the predictive distribution for the number of observations yn in a different interval of length u becomes,
π(yn)=Negative−Binomial(yn;α,β+uβ).Intuition (Bayesian Inference for Continuous-Time Markov Chains)
Lastyl, recall that a continuous-time Markov chain with finite state space is completely characterized by its holding times parameter vector q and the transition matrix P~ of its embedded Markov chain.
Parametrizing instead with the “alarm clock” parameters qij gives an equivalent description of the process.
The two parts of the data can be considered independently are,
We learn about P~ from the counts of transitions between states, and
We learn about q from the observed lengths of stays the process has in each state.
For P~ the situation is analogous to the one for discrete-time Markov chains, except that the diagonal of P~ must be zero, so the prior must exclude the possibility of non-zero diagonal elements.
For example, for P~1, the first row of P~, we might use the prior Dirichlet(0,1,…,1), i.e., a Dirichlet prior with a zero pseudocount for the first element.
The holding times in state i are distributed as Exponential(qi).
If we have observed a total holding time of h over n intervals, that data has likelihood proportional to e−hqiqin.
Using qi∼Gamma(α,β) as prior results in the posterior qi∣D∼Gamma(α+n,β+h).
Definition 30 (Stationary Distribution for Continuous-Time Markov Chains)
In the context of continuous-time Markov chains, v is a stationary distribution if and only if 0=vQ, where Q is the generator matrix of the chain.
Proof (Stationary Distribution and Generator Matrix)
Firstly, let vQ=0,
dtd(vP(t)):=vdtdP(t)=:vQP(t)=0
as vQ=0 by assumption.
Thus, vP(t) is constant in t, i.e., vP(t)=vP(0)=vI=v for all t≥0.
Conversely, let v be a stationary distribution, i.e., v=vP(t) for all t≥0.
Then,
dtd(vP(t)):=vdtdP(t)=:vQP(t)=0
as vP(t) is constant in t by assumption.
In particular, for t=0 we get vQP(0)=vQI=vQ=0.
Thus, the proof is complete ■.
Intuition (Long-Term Behavior of Continuous-Time Markov Chains)
A continuous-time Markov chain is irreducible if for all i and j there exists a t>0 such that Pij(t)>0, i.e., it is possible to get to any state from any state.
However, periodic continuous-time Markov chains do not exist. If Pij(t)>0 for some t>0, then Pij(t)>0 for all t>0.
Theorem 12 (Fundamental Limit Theorem for Continuous-Time Markov Chains)
Let {Xt}t≥0 be a finite, irreducible, continuous-time Markov chain with transition function P(t).
Then there exists a unique positive stationary distribution vector v which is also the limiting distribution.
The limiting distribution of such a chain can be found as the unique v satisfying vQ=0
Definition 31 (Absorbing State)
An absorbing state is one where the rate of leaving it is zero.
Assume {Xt}t≥0 is a continuous-time Markov chain with k states.
Assume the last state is absorbing and the rest are not (If the chain is irreducible they are then transient).
WE have that qk=0 and the entire last row must consist of zeros. We thus get,
Q=[V0⋆0].
Let F be the (k−1)×(k−1) matrix such that Fij (with i<k,j<k) is the expected time spent in state j when the chain starts in i.
We can show that F=−V−1.
Proof (Expected Time in Transient States Before Absorption)
Generally, we can define D as the matrix with (q11,…,qk1) along its diagonal, with all other entries zero.
If there are no absorbing states,
P~=DQ+Ik.
Write A_ for a square matrix without its last row and column, so, e.g., Q_=V.
If the last state is absorbing, i.e., qk=0, we get,
P~_=D_Q_+Ik−1.
Further, let F′ be the matrix where Fij′ is the expected number of stays in state j before absorbtion when starting in state i.
As the lengths of stays and changes in states are independent, we get Fij′=Fijqj1 and thus,
F=F′D_.
By the theory for discrete-time Markov chains with absorbing states, we have,
F is called the fundamental matrix (similar to the discrete-time case).
Note
If the chain starts in state i, the expected time until absorption is the sum of the i-th row of F.
Thus, the expected times until absorption are given by the matrix product F1 of F with a column of 1s.
Intuition (Stationary Distribution of the Embedded Markov Chain)
The embedded chain of a continuous-time Markov chain: The discrete-time Markov chain where holding times are ignored.
The stationary distribution for the embedded chain and for the continuous-time chain are generally not the same!
However, there is a simple relationship: A probability vector π is a stationary distribution for a continuous-time Markov chain if and only if ψ is a stationary distribution for the embedded chain, where ψj=Cπjqj for a constant C making the entries sum to 1.
Proof (Stationary Distribution of the Embedded Markov Chain)
Using notation as above, we have P~=DQ+I. For any vector v, we get,
vP~=v(DQ+I)=vDQ+v.
so vP~=v if and only if vDQ=0.
Definition 32 (Global Balance Equations)
Let v=(v1,v2,…,vn) be the stationary distribution of a continuous-time Markov chain with generator matrix Q.
At v, the flow into a state must be equal to the flow out of the state, by the definition of stationary distribution.
Which are exactly the equations we get from vQ=0,
One can generalize this to: If A is a set of states, then the long term rates of movement into and out of A are the same,
i∈A∑j∈/A∑viqij=i∈/A∑j∈A∑viqij.Definition 33 (Local Balance Equations and Time-Reversible Chains)
A stronger condition than global balance is local balance. The flow between every pair of states is balanced.
Formally, an irreducible continuous-time Markov chain with stationary distribution v is said to be time-reversible if, for all states i and j,
viqij=vjqji.Note
The rate of observed changes from i to j is the same as the rate of observed change from j to i.
Thus, this is also called time-reversibility, as the process looks the same when observed backward in time.
Further, If a probability vector v satisfies the local balance equations, then it is a stationary distribution of the chain.
Definition 34 (Markov Process on a Tree)
Firstly, a tree is a connected (undirected) graph with no cycles.
Thus, a transition graph gives rise to an undirected graph by connecting all nodes between which there is some rate of transition.
Further, assume that the transition graph of an irreducible continuous-time Markov chain is (or gives rise to) a tree.
In a tree, any edge between two states divides all states into two groups (disjoint sets), each on each side of the edge.
Thus, the flow must be balanced between across each edge.
In this scenario, the global balance condition then implies the local balance condition over this edge.
Definition 35 (Birth-and-Death Process)
A birth-and-death process is a continuous-time Markov chain where the state space is the set of non-negative integers and transitions only occur to neighboring states.
The process is necessarily time-reversible, as the transition graph is a tree (in fact a line, if we further assume irreducibility).
We denote the births from i to i+1 with λi, and the rate of deaths from i to i−1 with μi.
Thus, the generator matrix has the form,
Brownian motion is a continuous-time stochastic process {Bt}t≥0 with the following properties:
B0=0.
For t>0, Bt∼N(0,t) (i.e., normally distributed with mean 0 and variance t).
For s,t>0, Bt+s−Bs∼N(0,t) (i.e., the increments are stationary).
For 0≤q<r≤s<t, Bt−Bs is independent of Br−Bq (i.e., the increments are independent).
The function t↦Bt is continuous with probability 1 (almost surely).
Intuition (Brownian Motion as Limit of Random Walks)
A random walk is a discrete-time Markov chain S0,S1,S2,… where S0=0 and,
Sn=Y1+Y2+…+Yn,
and Y1,Y2,… are i.i.d. random variables. Assume E[Yi]=0.
Further, if we assume Var(Yi)=1, we get Var(Sn)=n.
Interpolating between the values Sn we can make this into a continuous-time process St where Var(St)≈t.
We may scale with an s>0 to get processes St(s)=sSst where we get lims→∞Var(St(s))=t.
It turns out that the processes St(s) when s→∞ are exactly Brownian motion, no matter what type of Yi we start with.
This is the Donsker’s invariance principle, which is a generalization of the central limit theorem to stochastic processes [^2].
Definition 37 (Gaussian Process)
A Gaussian process is a continuous-time stochastic process {Xt}t≥0 with the property that for all n≥1 and 0≤t1<t2<…<tn, Xt1,Xt2,…,Xtn have a multivariate normal distribution.
Thus, a Gaussian process is completely determined by its mean function E[Xt] and its covariance function Cov(Xs,Xt).
Intuition (Brownian Motion as a Gaussian Process)
Brownian motion is a Gaussian process, as we can show that any a1Bt1+a2Bt2+…+anBtn is normally distributed.
A Gaussian process {Xt}t≥0 is a Brownian motion if and only if,
X0=0.
E[Xt]=0 for all t.
Cov(Xs,Xt)=min(s,t) for all s,t.
The function t↦Xt is continuous with probability 1 (almost surely).
Intuition (Transformations of Brownian Motion)
The following transformations of Brownian motion also yield Brownian motion:
{−Bt}t≥0, negating the process yields another (reflected) Brownian motion.
{Bt+s−Bs}t≥0 for any fixed s≥0, shifting the time origin yields another Brownian motion.
{a1Bat}t≥0 for any fixed a>0, scaling time and space yields another Brownian motion.
The process {Xt}t≥0 where X0=0 and Xt=tBt1 for t>0, time inversion yields another Brownian motion.
Intuition (Stopping Times)
We saw above that, for any fixed t(Bt+s−Bs)t≥0 is a Brownian motion.
Does this phenomenon also hold if we start the chain anew from T when T is random? It depends.
If T is the largest value less than 1 where BT=0, then (BT+s−BT)s≥0 is not a Brownian motion.
If T is the smallest value where BT=a for some constant a, then (BT+s−BT)s≥0 is a Brownian motion.
The reason is that the event T=t can be determined based on Br where 0≤r≤t.
Random T’s that have this property are called stopping times. For these BT+s−BT is a Brownian motion.
Intuition (The Distribution of the First Hitting Time)
Given that a=0, what is the distribution of the first hitting time Ta=min{t:Bt=a}?
We will prove that,
Ta1∼Gamma(21,2a2).
Assuming that a>0 and using that Ta is a stopping time we get for any t>0 that P(Bt1>a∣Ta<t1)=P(Bt1−Ta>0)=21.
Further, it follows that P(Ta<t1)=2P(Bt1>a) and thus,
P(Ta1≤t)=2P(Bt1>a)−1=2P(B1≤at)−1
Taking the derivative with respect to t we get the Gamma density,
π1/Ta(t)=22π1exp(−21(at)2)2at−1/2.Intuition (Maximum of Brownian Motion)
We can define Mt:=max0≤s≤tBs.
We may compute for a>0 (using the results above),
P(Mt>a)=P(Ta<t)=2P(Bt>a)=P(∣Bt∣>a).
Thus, Mt has the same d istribution as ∣Bt∣, i.e., the absolute value of Bt.
Intuition (Zeroes of Brownian Motion)
Let L be the last zero in (0,1) of a Brownian motion. (i.e., L=max{t:0<t<1,Bt=0}.
Then,
L∼Beta(21,21).Proof (Outline of Proof)P(L>s)=∫−∞∞P(L>s∣Bs=t)N(t;0,s)dt=2∫−∞0P(L>s∣Bs=t)N(t;0,s)dt=2∫−∞0P(M1−s>−t)N(t;0,s)dt=2∫0∞2P(B1−s>t)N(t;0,s)dt=4∫0∞∫t∞N(r;0,1−s)N(t;0,s)drdt=…=π1∫s1x(1−x)1dx=∫s1Beta(x;21,21)dx.
We can then, let Lt be the last zero in (0,t). Then,
tLt∼Beta(21,21).Note
The probability that a Brownian motion has at least one zero in (r,t) for 0≤r<t is 1−P(Lt<r).
Further, the cumulative distribution for the Beta(21,21) density can be computed with the arcsin function,
We see that Gt is a martingale with respect to Bt if and only if μ+2σ2=0, i.e., μ=−2σ2.
Examples
Example 1 (Toy Example of MCMC)
Consider the Markov chain X0,X1,… with states {0,1,2} and with,
P=0.9900.20.010.9000.10.8
One can find that the limiting distribution is v=(2320,232,231).
Consider the function r(x)=x5. If X is a random variable with the limiting distribution,
E[r(X)]=05⋅2320+15⋅232+25⋅231=2334=1.4783
If Y1,…,Yn are all i.i.d variables with the limiting distribution, we can check numerically that,
n→∞limnr(Y1)+r(Y2)+…+r(Yn)=1.4783
We also get for X0,X1,… that,
n→∞limnr(X1)+r(X2)+…+r(Xn)=1.4783
but in this case the limit is approached more slowly.
Example 2 (Counting Zeroes)
How many zeroes does Bt have in the interval (0,s)?
If Ls is the last zero in (0,s), we saw that,
sLs∼Beta(21,21),
which means that, with probability one, there exists a zero in the interval (0,s) and for the last zero Ls we have 0<Ls<s.
Repeating the argument, we have that 0<LLs<Ls with probability one, and thus there exists another zero in (0,Ls)⊂(0,s).
The conclusion is that, with probability one, there is an infinite number of zeroes in (0,s) for any s>0.
Example 3 (Simulating from the Ising Model using Perfect Sampling)
We saw above how to use Gibbs sampling to simulate from the Ising model. We then simulated from the conditional distributions, using,
π(σv⋆=1∣σ−v)=…=1+exp(−2β∑v∼wσw)1
Now, we extend this method into a perfect sampling.
Define the partial ordering on configurations σ,τ be defining,
σ≤τ⟺σv≤τv for all vertices v
We have minimal and maximal elements m=(−1,…,−1) and M=(1,…,1).
Note
iF σ≤τ, then for all vπ(σv⋆=1∣σ−v)≤π(τv⋆=1∣τ−v).
Defining g(σ,U)v=2I(U≤π(σv⋆=1∣σ−v))−1, makes g monotone.
Given an Ising model with β>0, start with choosing an integer n>0 and setting configurations,
σ(−n)τ(−n)=m=(−1,…,−1)=M=(1,…,1)Algorithm (Perfect Sampling for the Ising Model)
For i=−n+1,…,0 set σ(i)=σ(i−1), τ(i)=τ(i−1) and then,
Looping through all vertices v:
Compute π(σv⋆=1∣σ−v(i)) and π(τv⋆=1∣τ−v(i)).
Simulate U∼Uniform(0,1).
Update σv(i)=g(σ(i),U)v and τv(i)=g(τ(i),U)v.
If σ(0)=τ(0), this is the result. Otherwise, double n and repeat.
Example 4 (Using a Binomial Likelihood)
Assume the offspring process is Binomial(N,p) for some parameter p and a fixed (known) N.
By definition, we get the likelihood,
π(y1,y2,…,yn∣p):=i=1∏nBinomial(yi;N,p)
A possibility is to use a prior p∼Beta(α,β). Writing S=∑i=1nyi, we get the posterior,
p∣D∼Beta(α+S,β+nN−S).
where D={y1,y2,…,yn} is the data.
More generally, if π(p)=f(p) for any positive function integrating to 1 on [0,1], we get the posterior,
π(p∣D)∝pBeta(p;S+1,nN−S+1)f(p).
We can then for example compute numerically the posterior probability that the branching process is supercritical, i.e., that P(p>N1∣D), with,
∫N11π(p∣D)dp=∫01Beta(p;S+1,nN−S+1)f(p)dp∫N11Beta(p;S+1,nN−S+1)f(p)dp.Example 5 (Using a Multinomial Likelihood)
Assume there is a maximum of N offspring and that now p=(p0,p1,…,pN) is an (unknown) probability vector such that pi is the probability of i offspring.
By definition, we get the likelihood,
π(y1,y2,…,yn∣p):=Multinomial(c;p),
where c=(c0,c1,…,cN) is the vector of counts in the data of cases with 0,…,N offspring, respectively.
If we use a prior p∼Dirichlet(α), where α=(α0,α1,…,αN) is a vector of pseudocounts, we get the posterior,
Thus, {Yt}t≥0 is a martingale with respect to {Bt}t≥0. ■
Example 16 (Discounting Future Values of Stocks & The Black-Scholes Formula for Option Pricing)
When making investments, there is always a range of choices, some of which are sometimes called “risk-free”.
Such investments may pay a fixed interest.
When interests are compounded frequently, a reasonable model is that an investment of G0 has a value G0ert after time t, where r is the “risk-free” investment rate of return.
A common way to take this alternative into account is to instead “discount” all other investments with the factor e−rt.
For example, the discounted value of a stock can be modeled as,
e−rtGt=e−rtG0eμt+σBt=G0e(μ−r)t+σBt.
A possible assumption about the trend μ of a stock price is that the discounted value behaves as a martingale with respect to the Brownian motion Bt.
Thus, we get,
μ−r+2σ2=0,i.e.,μ=r−2σ2.
The Black-Scholes formula for option pricing is based on,
Assuming the discounted stock price is a martingale with respect to the Brownian motion.
Using discounting when computing the value of the option.