Theory
Definition: Stochastic Process
A stochastic process is a collection of random variables, $\{X_t, t \in \mathcal{I} \}$, where $\mathcal{I}$ is the index set of the process and $\mathcal{S}$ is the common state space of the random variables $X_t$.
Definition: Markov Property
A process fulfills the Markov property if, for any $t_0 \in \mathcal{I}$, whenever $X_{t_0}$ is known, $X_t$ (with $t > t_0$) is independent of the values of $X_s$ for all $s < t_0$.
Recall: Conditional Probability and Independence
Given the events $A$ and $B$, the conditional probability of $A$ given $B$ is defined as, $$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$ The events $A$ and $B$ are independent if $P(A \cap B) = P(A) P(B)$, or equivalently, if $P(A \mid B) = P(A)$.
The law of total probability states that if $B_1, B_2, \ldots, B_n$ is a sequence of events that partitions the sample space $S$, then, $$ P(A) = \sum_{i=1}^{n} P(A \cap B_i) = \sum_{i=1}^{n} P(A \mid B_i) P(B_i) $$ Thus, Bayes’ law follows directly from the definition of conditional probability and the law of total probability, $$ P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)} $$
Definition: Law of Total Expectation
$$ \mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid X]] $$
Definition: Law of Total Variance
Recall that, by definition, $$ \mathrm{Var}(Y) \coloneqq \mathbb{E}\left[(Y - \mathbb{E}[Y])^2\right] = \mathbb{E}[Y^2] - (\mathbb{E}[Y])^2 $$ Similarly, we have for the conditional variance, $$ \mathrm{Var}(Y \mid X) \coloneqq \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right] $$ With these definitions, we can state the law of total variance as, $$ \mathrm{Var}(Y) = \mathbb{E}[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(\mathbb{E}[Y \mid X]) $$
Definition: Markov Chain
Let $S$ be a discrete set (not necessarily finite), called the state space. A Markov Chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ taking values in $S$, with the property, $$ \pi(X_{n+1} \mid X_0, X_1, \ldots, X_n) = \pi(X_{n+1} \mid X_n) $$ for all $n \geq 1$.
The chain is said to be time-homogeneous if, for all $n \geq 0$ (we generally assume this unless otherwise stated), $$ \pi(X_{n+1} \mid X_n) = \pi(X_1 \mid X_0) $$ The transition matrix is defined with, $$ P_{ij} = \pi(X_{n+1} = j \mid X_n = i) $$ Further, a stochastic matrix is a square matrix $P$ with non-negative entries, satisfying $P \mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is a column vector of ones (i.e., all rows sum to one).
Lastly, all transition matrices are stochastic matrices, and all stochastic matrices can be used as transition matrices.
Definition: Limiting Distribution
A limiting distribution for a Markov chain with transition matrix $P$ is a probability vector $v$ such that, $$ \lim_{n \to \infty} (P^n)_{ij} = v_j $$ for all states $i$ and $j$.
An equivalent formulation is, the limit $\lim_{n \to \infty} (P^n)_{ij}$ exists and does not depend on $i$.
Another equivalent formulation is, $\lim_{n \to \infty} P^n$ is a stochastic matrix with all rows identical.
Further, a Markov chain has at most one limiting distribution. If a limiting distribution exists, its probabilities correspond to the proportion of time steps the chain spends at each state in the long run.
Definition: Stationary Distribution
A stationary distribution for a Markov chain is a distribution that is unchanged when applying one step of the Markov chain. If $P$ is the transition matrix, then a probability vector $v$ represents a stationary distribution if and only if, $$ vP = v $$ A Markov chain can have zero, one, or many stationary distributions. Limiting distributions are always stationary distributions (but not necessarily vice versa).
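As a quick numerical illustration (a sketch, not part of the notes; the transition matrix below is made up), a stationary distribution can be computed by solving $vP = v$ together with $\sum_i v_i = 1$, and compared with the rows of $P^n$ for large $n$:

```python
import numpy as np

# Arbitrary example transition matrix (rows sum to one).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.4, 0.4]])

# Solve v P = v and sum(v) = 1, i.e. (P^T - I) v^T = 0 plus a normalization row.
k = P.shape[0]
A = np.vstack([P.T - np.eye(k), np.ones(k)])
b = np.zeros(k + 1)
b[-1] = 1.0
v, *_ = np.linalg.lstsq(A, b, rcond=None)
print("stationary distribution:", v)

# Since this example matrix is regular, the rows of P^n converge to v.
print("row of P^100:", np.linalg.matrix_power(P, 100)[0])
```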
Definition: Regular Transition Matrices
A stochastic matrix $P$ is positive if all entries are strictly positive, i.e., $P_{ij} > 0$ for all $i$ and $j$. A stochastic matrix $P$ is regular if $P^n$ is positive for some $n > 0$.
Theorem: Limit Theorem for Regular Markov Chains
If the transition matrix $P$ is regular, the limiting distribution exists. There are no other stationary distributions and the limiting distribution is positive, i.e., all states have positive probability in the limiting distribution.
Theorem: Random Walks on Undirected Graphs
A random walk on an undirected graph defines a Markov chain with stationary distribution $v$ given by, $$ v_i = \frac{\mathrm{deg}(i)}{S} $$ where $\mathrm{deg}(i)$ is the degree of node $i$ (i.e., the number of edges incident to node $i$) and $S$ is the sum over all nodes of their degrees. Further, a weighted random walk on a weighted undirected graph defines a Markov chain with stationary distribution $v$ given by, $$ v_i = \frac{w(i)}{e} $$ where $w(i)$ is the sum of the weights of the edges incident to node $i$, and $e$ is the total sum over all nodes of their $w(i)$ values.
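A small numerical check of this formula (the adjacency matrix below is a made-up example):

```python
import numpy as np

# Adjacency matrix of a small undirected graph (made-up example).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])

deg = A.sum(axis=1)        # deg(i)
S = deg.sum()              # sum of all degrees
v = deg / S                # claimed stationary distribution

# Transition matrix of the random walk: P_ij = A_ij / deg(i).
P = A / deg[:, None]

print("v  :", v)
print("v P:", v @ P)       # equals v, so v is stationary
```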
Intuition: Moving Between States
State $j$ is accessible from state $i$ if $(P^n)_{ij} > 0$ for some $n \geq 0$. States $i$ and $j$ communicate if $i$ is accessible from $j$ and $j$ is accessible from $i$.
“Communication” is transitive, i.e., if $i$ communicates with $j$ and $j$ communicates with $k$, then $i$ communicates with $k$.
Communication is an equivalence relation, subdividing all states into communication classes. Communication classes can be found for example by drawing transition graphs.
Lastly, a Markov chain is irreducible if it has exactly one communication class.
Definition: Recurrence and Transience
Let $T_j$ be the first passage time to state $j$, $$ T_j = \min \{ n > 0 : X_n = j \} $$ Let $f_j$ be the probability that a chain starting at $j$ will return to $j$, $$ f_j = P(T_j < \infty \mid X_0 = j) $$ A state $j$ is recurrent if a chain starting at $j$ will eventually revisit $j$, i.e., if $f_j = 1$.
A state $j$ is transient if a chain starting at $j$ has a positive probability of never revisiting $j$, i.e., if $f_j < 1$.
Note
The expected number of visits at $j$ when the chain starts at $i$ is given by $\sum_{n = 0}^{\infty} (P^n)_{ij}$.
$j$ is recurrent if and only if $\sum_{n = 0}^{\infty} (P^n)_{jj} = \infty$.
$j$ is transient if and only if $\sum_{n = 0}^{\infty} (P^n)_{jj} < \infty$.
Note: Communication Classes
The states of a communication class are either all recurrent or all transient. The states of a finite irreducible Markov chain are all recurrent.
If a state is recurrent, only states inside its communication class are accessible from it. If no states outside a finite communication class are accessible from it, then the class consists of recurrent states.
Theorem: Limit Theorem for Finite Irreducible Markov Chains
Recall
In a finite irreducible Markov chain, all states are recurrent.
Let $\mu_j = \mathbb{E}[T_j \mid X_0 = j]$ be the expected return time to state $j$. Then $\mu_j < \infty$ for all states $j$ and the vector $v$ with $v_j = \frac{1}{\mu_j}$ is the unique stationary distribution.
Further, $$ v_j = \lim_{n \to \infty} \frac{1}{n} \sum_{m = 0}^{n - 1} (P^m)_{ij} $$
Note
All finite regular Markov chains are finite irreducible Markov chains, but not vice versa. The overall conclusion here is weaker than that for finite regular Markov chains. Not all finite irreducible Markov chains have limiting distributions.
Definition: Periodicity
The period of a state $i$ is the greatest common divisor of all $n > 0$ such that $(P^n)_{ii} > 0$. All states of a communication class have the same period.
A Markov chain is periodic if it is irreducible and all states have period greater than 1.
A Markov chain is aperiodic if it is irreducible and all states have period 1.
Definition: Ergodic Markov Chains
A Markov chain is ergodic if,
- It is irreducible.
- It is aperiodic.
- All states are positive recurrent (i.e., have finite expected return times, which is always the case if the state space is finite).
Theorem: Fundamental Limit Theorem for Ergodic Markov Chains
There exists a unique stationary distribution $v$ which is the limiting distribution of the chain.
Note
We can also show that a finite Markov chain is ergodic if and only if its transition matrix is regular.
Definition: Time Reversibility
Let $P$ be the transition matrix of an irreducible Markov chain with stationary distribution $v$. The chain is time reversible if, when running from its stationary distribution, it looks the same moving forward as moving backward, i.e., $$ \pi(X_k = i, X_{k + 1} = j) = \pi(X_k = j, X_{k + 1} = i) $$ This may also be written as $v_i P_{ij} = v_j P_{ji}$ for all states $i$ and $j$. These are called the detailed balance equations.
Note
If $x$ is a probability vector satisfying $x_i P_{ij} = x_j P_{ji}$ for all states $i$ and $j$, then $x$ is the stationary distribution of the chain, and the chain is time reversible.
Proof
We have, $$ \begin{align*} (xP)_j & = \sum_{i} x_i P_{ij} \newline & = \sum_i x_j P_{ji} \newline & = x_j \sum_i P_{ji} \newline & = x_j \end{align*} $$ Thus, $xP = x$, so $x$ is the stationary distribution.
If a Markov chain is defined as a random walk on a weighted undirected graph, then it is time reversible.
Proof
Let $w(i, j)$ be the weight of the edge between nodes $i$ and $j$ (0 if no edge exists). Let $w(i) = \sum_{k} w(i, k)$ be the sum of weights of the edges incident to node $i$. The transition matrix is given by $P_{ij} = \frac{w(i, j)}{w(i)}$. Let $v_i = \frac{w(i)}{e}$, where $e$ is the total sum over all nodes of their $w(i)$ values. Then, $$ \begin{align*} v_i P_{ij} & = \frac{w(i)}{e} \cdot \frac{w(i, j)}{w(i)} \newline & = \frac{w(i, j)}{e} \newline & = \frac{w(j, i)}{e} \newline & = \frac{w(j)}{e} \cdot \frac{w(j, i)}{w(j)} \newline & = v_j P_{ji} \end{align*} $$ Thus, the detailed balance equations are satisfied, so the chain is time reversible.
If a finite Markov chain is time reversible, it can be represented as a random walk on a weighted undirected graph.
Proof
Let $P$ be the transition matrix of a finite time reversible Markov chain with stationary distribution $v$. Let $w(i, j) = v_i P_{ij}$ be the weight of the edge between nodes $i$ and $j$. By time reversibility, $w(i, j) = v_i P_{ij} = v_j P_{ji} = w(j, i)$, so the weights are symmetric and define a weighted undirected graph. Then, $$ \sum_{j} w(i, j) = \sum_{j} v_i P_{ij} = v_i \sum_{j} P_{ij} = v_i $$ Thus, the sum of weights of the edges incident to node $i$ is $v_i$. The transition matrix of the random walk on this graph is therefore, $$ \frac{w(i, j)}{\sum_{k} w(i, k)} = \frac{v_i P_{ij}}{v_i} = P_{ij} $$ Thus, the Markov chain can be represented as a random walk on a weighted undirected graph.
Definition: Absorbing Chains
State $i$ is absorbing if $P_{ii} = 1$. A Markov chain is an absorbing chain if it has at least one absorbing state.
By reordering the states, the transition matrix for an absorbing chain can be written in the block form, $$ P = \begin{bmatrix} Q & R \newline 0 & I \end{bmatrix} $$ where $I$ is the identity matrix, $0$ is a matrix of zeroes, and $Q$ corresponds to transient states. We can prove by induction that, $$ P^n = \begin{bmatrix} Q^n & (I + Q + Q^2 + \ldots + Q^{n - 1}) R \newline 0 & I \end{bmatrix} $$
Taking the limit and using $\lim_{n \to \infty} Q^n = 0$ (since all states corresponding to $Q$ are transient), we have, $$ \lim_{n \to \infty} P^n = \begin{bmatrix} 0 & (I - Q)^{-1} R \newline 0 & I \end{bmatrix} = \begin{bmatrix} 0 & FR \newline 0 & I \end{bmatrix} $$ where $F = (I - Q)^{-1} = \lim_{n \to \infty} (I + Q + Q^2 + \ldots + Q^{n - 1})$ is called the fundamental matrix.
The probability to be absorbed in a particular absorbing state given a start in a transient state is given by the entries of $FR$.
Further, the expected number of visits in state $j$ for a chain that starts in the transient state $i$ is given by $F_{ij}$ (as this is equal to $\sum_{n = 0}^{\infty} (P^n)_{ij}$).
Thus, the expected number of steps until absorption is given by the vector $F \mathbf{1}$, where $\mathbf{1}$ is a column vector of ones.
Note
Given an irreducible Markov chain, to compute the expected number of steps needed to go from state $i$ to the first visit to state $j$, one can change the chain into one where state $j$ is absorbing, and compute the expected number of steps until absorption when starting at state $i$.
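A sketch putting the last few results together (the absorbing chain below is an arbitrary example, not from the notes): it computes the fundamental matrix $F$, the absorption probabilities $FR$, and the expected number of steps until absorption $F\mathbf{1}$.

```python
import numpy as np

# Transition matrix in block form: transient states {0, 1}, absorbing states {2, 3}.
P = np.array([[0.2, 0.5, 0.2, 0.1],
              [0.3, 0.3, 0.1, 0.3],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

Q = P[:2, :2]                        # transient -> transient block
R = P[:2, 2:]                        # transient -> absorbing block

F = np.linalg.inv(np.eye(2) - Q)     # fundamental matrix F = (I - Q)^{-1}
print("F (expected visits):\n", F)
print("FR (absorption probabilities):\n", F @ R)
print("expected steps until absorption:", F @ np.ones(2))
```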
Definition: Galton-Watson Branching Process
A (Galton-Watson) branching process is a discrete Markov chain $Z_0, Z_1, \ldots, Z_n, \ldots$ where,
- The state space is the non-negative integers.
- $Z_0 = 1$ (i.e., the process starts with one individual).
- $Z_n$ is the sum $X_1 + X_2 + \ldots + X_{Z_{n - 1}}$, where $X_i$ are independent random non-negative integers all with the same offspring distribution, $$ Z_n = \sum_{i = 1}^{Z_{n - 1}} X_i $$ Connecting each of the $Z_n$ individuals in generation $n$ with their offspring in generation $n + 1$ we get a tree illustrating the branching process.
The offspring distribution is described by the probability vector $a = (a_0, a_1, \ldots)$ where $a_j = P(X_i = j)$.
We will mainly focus on the interesting cases where we assume $a_0 > 0$ (i.e., there is a positive probability of having no offspring) and $a_0 + a_1 < 1$ (i.e., there is a positive probability of having more than one offspring).
Definition: Expected Size of Generations in a Branching Process and Classification of Branching Processes
Note
Note that the state 0 is absorbing; this absorption is called extinction. Also, as $a_0 > 0$, all nonzero states are transient. Let $\mu = \mathbb{E}[X_i] = \sum_{j=0}^{\infty} j \cdot a_j$, i.e., the expected number of offspring per individual.
Then, we may compute that, $$ \begin{align*} \mathbb{E}[Z_n] & = \mathbb{E}\left[\sum_{i = 1}^{Z_{n-1}} X_i\right] \newline & = \mathbb{E}\left[\mathbb{E}\left[\sum_{i = 1}^{Z_{n-1}} X_i \mid Z_{n - 1} \right]\right] \newline & = \mathbb{E}[\mu Z_{n - 1}] \newline & = \mathbb{E}[Z_{n - 1}] \cdot \mu \newline \end{align*} $$ Thus, we directly get that, $$ \mathbb{E}[Z_n] = \mu^n \cdot \underbrace{\mathbb{E}[Z_0]}_{= 1} = \mu^n $$ Based on the value of $\mu$, we classify branching processes into three categories:
- Subcritical branching processes: $\mu < 1$. The expected generation size decreases over time, $\lim_{n \to \infty} \mathbb{E}[Z_n] = 0$.
- Critical branching processes: $\mu = 1$. The expected generation size remains constant over time, $\lim_{n \to \infty} \mathbb{E}[Z_n] = 1$.
- Supercritical branching processes: $\mu > 1$. The expected generation size increases over time, $\lim_{n \to \infty} \mathbb{E}[Z_n] = \infty$.
Note
One can prove that if $\lim_{n \to \infty} \mathbb{E}[Z_n] = 0$, then the probability of extinction is 1. We will do this later.
Definition: Variance of Generation Sizes in a Branching Process
Let $\mu$ be defined as above, and let $\sigma^2 = \mathrm{Var}(X_i)$ be the variance of the number of children.
Using the law of total variance we get, $$ \begin{align*} \mathrm{Var}(Z_n) & = \mathrm{Var}(\mathbb{E}[Z_n \mid Z_{n - 1}]) + \mathbb{E}[\mathrm{Var}(Z_n \mid Z_{n - 1})] \newline & = \mathrm{Var}\left(\mathbb{E}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right) + \mathbb{E}\left[\mathrm{Var}\left(\sum_{i = 1}^{Z_{n - 1}} X_i \mid Z_{n - 1}\right)\right] \newline & = \mathrm{Var}(\mu Z_{n - 1}) + \mathbb{E}[\sigma^2 Z_{n - 1}] \newline & = \mu^2 \mathrm{Var}(Z_{n - 1}) + \sigma^2 \mathbb{E}[Z_{n - 1}] \newline \end{align*} $$ From this, we can prove by induction that, for $n \geq 1$, $$ \mathrm{Var}(Z_n) = \sigma^2 \mu^{n - 1} \sum_{k = 0}^{n - 1} \mu^k = \begin{cases} n \sigma^2 & \text{if } \mu = 1 \newline \sigma^2 \mu^{n - 1} \frac{\mu^n - 1}{\mu - 1} & \text{if } \mu \neq 1 \end{cases} $$
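As a sanity check of the formulas for $\mathbb{E}[Z_n]$ and $\mathrm{Var}(Z_n)$, here is a simulation sketch assuming a Poisson offspring distribution (an arbitrary choice, for which $\sigma^2 = \mu$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.3                       # offspring mean; Poisson offspring => sigma^2 = mu
n, reps = 5, 20_000

def simulate_generation(n, mu):
    """Simulate Z_n for a Galton-Watson process with Poisson(mu) offspring."""
    z = 1
    for _ in range(n):
        z = rng.poisson(mu, size=z).sum() if z > 0 else 0
    return z

samples = np.array([simulate_generation(n, mu) for _ in range(reps)])

sigma2 = mu
print("E[Z_n]  : empirical", samples.mean(), "theory", mu**n)
print("Var(Z_n): empirical", samples.var(),
      "theory", sigma2 * mu**(n - 1) * (mu**n - 1) / (mu - 1))
```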
Theorem: Extinction Probability Theorem
Let $G_X$ be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation $G_X(s) = s$.
Intuition: Probability Generating Functions
For any discrete random variable $X$ taking values in $\{0, 1, 2, \ldots\}$, define the probability generating function $G_X(s)$ as, $$ G_X(s) = \mathbb{E}[s^X] = \sum_{k = 0}^{\infty} P(X = k) s^k $$ The series converges absolutely for $|s| \leq 1$. We assume $s$ is a real number in $[0, 1]$.
We get a 1-1 correspondence between probability vectors on $\{0, 1, 2, \ldots\}$ and functions represented by a series where the coefficients sum to 1.
Specifically, if $G_X(s) = G_Y(s)$ for all $s$ for random variables $X$ and $Y$, then $X$ and $Y$ have the same distribution.
Note: Properties of Probability Generating Functions
- To go from $X$ to $G_X(s)$, we compute the (in)finite sum.
- To go from $G_X(s)$ to $X$, we use that we have, $$ P(X = k) = \frac{G_X^{(k)}(0)}{k!} $$
- If $X$ and $Y$ are independent random variables, then, $$ G_{X + Y}(s) = \mathbb{E}[s^{X + Y}] = \mathbb{E}[s^X s^Y] = \mathbb{E}[s^X] \mathbb{E}[s^Y] \eqqcolon G_X(s) G_Y(s) $$
- $\mathbb{E}[X] = G_X^{\prime}(1)$
- $\mathbb{E}[X(X - 1)] = G_X^{\prime \prime}(1)$
- As a consequence, $\mathrm{Var}(X) = G_X^{\prime \prime}(1) + G_X^{\prime}(1) - (G_X^{\prime}(1))^2$
Further, assume we have a branching process $Z_0, Z_1, \ldots$ with independent offspring variables $X_i$ all with the same distribution as $X$. By writing $G_n(s) = G_{Z_n}(s) = \mathbb{E}[s^{Z_n}]$ and $G_X(s) = \mathbb{E}[s^X]$, we get, $$ \begin{align*} G_n(s) & = \mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k}] \newline & = \mathbb{E}[\mathbb{E}[s^{\sum_{k = 1}^{Z_{n - 1}} X_k} \mid Z_{n - 1}]] \newline & = \mathbb{E}\left[\mathbb{E}\left[\prod_{k = 1}^{Z_{n - 1}} s^{X_k} \mid Z_{n - 1}\right]\right] \newline & = \mathbb{E}[G(s)^{Z_{n - 1}}] \newline & = G_{n - 1}(G_X(s)) \newline \end{align*} $$ As $G_0(s) = \mathbb{E}[s^{Z_0}] = s$, it follows that, $$ G_n(s) = \underbrace{G_X(G_X(\ldots G_X(s) \ldots))}_{n \text{ times}} $$
Theorem: Extinction Probability Theorem (Addition)
Let $G_X$ be the probability generating function for the offspring distribution for a branching process. The probability of eventual extinction is the smallest positive root of the equation $G_X(s) = s$. Also, if the process is critical ($\mu = 1$) then the extinction probability is 1.
Proof: Of the last sentence
In the critical case, $$ G_X^{\prime}(1) = \mathbb{E}[X] = \mu = 1 $$ As $G_X^{\prime \prime}(s) > 0$ for $s \in (0, 1)$, we get that $G_X^{\prime}(s) < 1$ for $s \in (0, 1)$, and thus $\frac{d}{ds}(G_X(s) - s) < 0$ for $s \in (0, 1)$, i.e., $G_X(s) - s$ is strictly decreasing on $(0, 1)$. As $G_X(1) - 1 = 0$ for any probability generating function, we get that $G_X(s) - s > 0$ on $(0, 1)$, so $G_X(s) = s$ has its smallest positive root at $s = 1$.
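Numerically, the extinction probability can be found by fixed-point iteration $s \mapsto G_X(s)$ starting from $s = 0$ (the iterates are $G_n(0) = P(Z_n = 0)$, which increase to the smallest positive root); the offspring distribution below is a made-up example:

```python
import numpy as np

# Made-up offspring distribution: P(X=0)=0.25, P(X=1)=0.4, P(X=2)=0.35.
a = np.array([0.25, 0.4, 0.35])

def G(s):
    """Probability generating function G_X(s) = sum_k a_k s^k."""
    return sum(ak * s**k for k, ak in enumerate(a))

mu = sum(k * ak for k, ak in enumerate(a))   # mean offspring = G_X'(1)

# Fixed-point iteration converges to the smallest positive root of G_X(s) = s.
s = 0.0
for _ in range(1000):
    s = G(s)

print("mean offspring mu      =", mu)        # 1.1 (supercritical)
print("extinction probability =", s)         # approx 0.714 for this example
```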
Theorem: Strong Law of Large Numbers
Strong law of large numbers: If $Y_1, Y_2, \ldots$ and $Y$ are i.i.d. random variables and $\mathbb{E}[r(Y)]$ exists, then, with probability 1, $$ \lim_{m \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_m)}{m} = \mathbb{E}[r(Y)] $$
Theorem: Strong Law of Large Numbers for Markov Chains
Strong law of large numbers for Markov chains: If $X_0, X_1, \ldots$ is an ergodic Markov chain with stationary distribution $\pi$, and if $\mathbb{E}_{X \sim \pi}[r(X)]$ exists, then, with probability 1, $$ \lim_{m \to \infty} \frac{r(X_0) + r(X_1) + \ldots + r(X_m)}{m} = \mathbb{E}_{X \sim \pi}[r(X)] $$ where $X$ has distribution $\pi$.
Note
When using this theorem in practice, one might improve accuracy by discarding an initial segment $X_0, \ldots, X_s$ (for some $s < m$) before computing the average over the remaining values. This initial segment is called the burn-in.
Definition: The Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm is a method to construct a Markov chain having a desired limiting distribution. Let $\theta$ be a random variable with probability mass function $\pi(\theta)$. We also assume we are given a proposal distribution $T(\theta_{\text{new}} \mid \theta)$, which, for every given $\theta$, provides a PMF for a new $\theta_{\text{new}}$.
Finally, we define, for $\theta$ and $\theta_{\text{new}}$, the acceptance probability, $$ a = \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right) $$ The Metropolis-Hastings algorithm is: Starting with some initial value $\theta_0$, generate $\theta_1, \theta_2, \ldots$ by, at each step, proposing a new $\theta$ based on the old using the proposal function and accepting it with probability $a$. If it is not accepted, the old value is used again.
If this defines an ergodic Markov chain, its unique stationary distribution is $\pi(\theta)$.
Proof
Let $P$ be the transition matrix of the Markov chain defined by the Metropolis-Hastings algorithm. We need to show that $\pi P = \pi$. It suffices to show that the detailed balance equations are satisfied, i.e., that, $$ \pi(\theta) P_{\theta \theta_{\text{new}}} = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} $$ for all states $\theta$ and $\theta_{\text{new}}$. We have, $$ P_{\theta \theta_{\text{new}}} = T(\theta_{\text{new}} \mid \theta) \cdot a = T(\theta_{\text{new}} \mid \theta) \cdot \min\left(1, \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}\right) $$ and, $$ P_{\theta_{\text{new}} \theta} = T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} = T(\theta \mid \theta_{\text{new}}) \cdot \min\left(1, \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}\right) $$ If $\pi(\theta) T(\theta_{\text{new}} \mid \theta) \leq \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})$, then, $$ \begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot 1 \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot \frac{\pi(\theta) T(\theta_{\text{new}} \mid \theta)}{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*} $$ If $\pi(\theta) T(\theta_{\text{new}} \mid \theta) > \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})$, then, $$ \begin{align*} \pi(\theta) P_{\theta \theta_{\text{new}}} & = \pi(\theta) T(\theta_{\text{new}} \mid \theta) \cdot \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})}{\pi(\theta) T(\theta_{\text{new}} \mid \theta)} \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot 1 \newline & = \pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}}) \cdot a^{\prime} \newline & = \pi(\theta_{\text{new}}) P_{\theta_{\text{new}} \theta} \end{align*} $$ Thus, in both cases, the detailed balance equations are satisfied, so $\pi$ is the stationary distribution of the chain.
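A minimal sketch of the algorithm on a small discrete state space (the target $\pi$ and the random-walk proposal below are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up target pmf on {0, ..., 9}; only ratios of pi enter the algorithm.
pi = np.array([1, 2, 3, 4, 5, 5, 4, 3, 2, 1], dtype=float)

def propose(theta):
    """Random-walk proposal: step +1 or -1, clamped to the state space."""
    step = rng.choice([-1, 1])
    return min(max(theta + step, 0), len(pi) - 1)

def metropolis_hastings(n_steps, theta0=0):
    theta, chain = theta0, []
    for _ in range(n_steps):
        theta_new = propose(theta)
        # The proposal is symmetric between distinct states (clamping at the
        # ends only adds self-proposals), so the acceptance probability
        # reduces to min(1, pi_new / pi_old).
        a = min(1.0, pi[theta_new] / pi[theta])
        if rng.random() < a:
            theta = theta_new
        chain.append(theta)
    return np.array(chain)

chain = metropolis_hastings(200_000)
empirical = np.bincount(chain, minlength=len(pi)) / len(chain)
print("target   :", pi / pi.sum())
print("empirical:", empirical)
```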
Definition: Gibbs Sampling
For any probability model over a vector $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$, consider a Metropolis-Hastings proposal function changing only one coordinate, with the value of this coordinate simulated from the conditional distribution given the remaining coordinates. The acceptance probability is always 1, so the proposal is always accepted.
Proof
Let $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ and $\theta_{\text{new}} = (\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n)$ differ only in coordinate $i$. The proposal distribution is, $$ T(\theta_{\text{new}} \mid \theta) = P(\theta_i = \theta_{i,\text{new}} \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n) $$ and, $$ T(\theta \mid \theta_{\text{new}}) = P(\theta_i = \theta_i \mid \theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n) $$ Using the definition of conditional probability, we have, $$ \begin{align*} T(\theta_{\text{new}} \mid \theta) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i,\text{new}}, \theta_{i + 1}, \ldots, \theta_n)} {P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \newline & = \frac{\pi(\theta_{\text{new}})} {\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} \end{align*} $$ and, $$ \begin{align*} T(\theta \mid \theta_{\text{new}}) & = \frac{P(\theta_1, \ldots, \theta_{i - 1}, \theta_i, \theta_{i + 1}, \ldots, \theta_n)} {P(\theta_1, \ldots, \theta_{i - 1}, \theta_{i + 1}, \ldots, \theta_n)} \newline & = \frac{\pi(\theta)} {\sum_x \pi(\theta_1, \ldots, \theta_{i - 1}, x, \theta_{i + 1}, \ldots, \theta_n)} \end{align*} $$ Thus, we have, $$ \begin{align*} \frac{\pi(\theta_{\text{new}}) T(\theta \mid \theta_{\text{new}})} {\pi(\theta) T(\theta_{\text{new}} \mid \theta)} & = \frac{ \pi(\theta_{\text{new}}) \frac{\pi(\theta)}{\sum_x \pi(\theta_1,\ldots,\theta_{i-1},x,\theta_{i+1},\ldots,\theta_n)} }{ \pi(\theta) \frac{\pi(\theta_{\text{new}})}{\sum_x \pi(\theta_1,\ldots,\theta_{i-1},x,\theta_{i+1},\ldots,\theta_n)} } \newline & = \frac{\pi(\theta_{\text{new}})\pi(\theta)} {\pi(\theta)\pi(\theta_{\text{new}})} = 1 \end{align*} $$ Thus, the acceptance probability is 1, so the proposal is always accepted.
Putting together an algorithm that updates different coordinates in different steps may create an ergodic Markov chain.
This is called Gibbs sampling. Sometimes the conditional distributions are easy to derive. Then this is an easy-to-use version of Metropolis-Hastings.
Definition: The Ising Model
The Ising model uses a grid of vertices; we will assume an $n \times n$ grid. Two vertices $v$ and $w$ are neighbours, denoted $v \sim w$, if they are next to each other in the grid.
Each vertex $v$ can have value $+1$ or $-1$ (called its “spin”); We denote this by $\sigma_{v} = 1$ or $\sigma_{v} = -1$.
A configuration $\sigma$ consists of a choice of $+1$ or $-1$ for each vertex. Thus, the set $\Omega$ of possible configurations has $2^{(n^2)}$ elements. We define the energy of a configuration as $E(\sigma) = - \sum_{v \sim w} \sigma_v \sigma_w$.
The Gibbs distribution is the probability mass function on $\Omega$ defined by, $$ \pi(\sigma) \propto \exp(-\beta E(\sigma)) $$ where $\beta$ is a parameter of the model; $\frac{1}{\beta}$ is called the temperature.
It turns out that when the temperature is high, samples from the model will show a chaotic pattern of spins, but when the temperature drops below the phase transition value, samples will show chunks of neighbouring vertices with the same spin, i.e., the system will be “magnetized”.
Intuition: Simulating from the Ising Model using Gibbs Sampling
For a vertex configuration $\sigma$ and a vertex $v$, let $\sigma_{-v}$ denote the part of $\sigma$ that does not involve $v$. Propose a new configuration $\sigma^{\star}$ given an old configuration $\sigma$ by first choosing a vertex $v$, then letting $\sigma^{\star}$ be identical to $\sigma$ except possibly at $v$. Decide the spin at $v$ using the conditional distribution given $\sigma_{-v}$, $$ \begin{align*} \pi(\sigma_v = 1 \mid \sigma_{-v}) & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_{-v})} \newline & = \frac{\pi(\sigma_v = 1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v}) + \pi(\sigma_v = -1, \sigma_{-v})} \newline & = \frac{1}{1 + \frac{\pi(\sigma_v = -1, \sigma_{-v})}{\pi(\sigma_v = 1, \sigma_{-v})}} \newline & = \frac{1}{1 + \exp(-\beta E(\sigma_v = -1, \sigma_{-v}) + \beta E(\sigma_v = 1, \sigma_{-v}))} \newline & = \frac{1}{1 + \exp\left(-2 \beta \sum_{w : w \sim v} \sigma_w\right)} \newline \end{align*} $$ where the last step uses that the two energies differ only in the terms involving $v$, so $E(\sigma_v = -1, \sigma_{-v}) - E(\sigma_v = 1, \sigma_{-v}) = 2 \sum_{w : w \sim v} \sigma_w$.
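A sketch of a Gibbs sampler for the Ising model using the conditional probability derived above (grid size, $\beta$, and number of sweeps are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

n, beta, sweeps = 30, 0.6, 200
sigma = rng.choice([-1, 1], size=(n, n))   # random initial configuration

def neighbour_sum(sigma, i, j):
    """Sum of spins of the grid neighbours of vertex (i, j)."""
    total = 0
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        ni, nj = i + di, j + dj
        if 0 <= ni < n and 0 <= nj < n:
            total += sigma[ni, nj]
    return total

for _ in range(sweeps):
    for i in range(n):
        for j in range(n):
            # Conditional probability that the spin at (i, j) is +1.
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * neighbour_sum(sigma, i, j)))
            sigma[i, j] = 1 if rng.random() < p_plus else -1

# At this beta (low temperature) large same-spin chunks appear.
print("mean spin (magnetization):", sigma.mean())
```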
Intuition: Perfect Sampling
Consider an ergodic Markov chain with finite state space of size $k$ and limiting distribution $\pi$. The idea is, for a given $n$, to prove that $X_n$ has actually reached the limiting distribution $\pi$.
If we can prove that the distribution at $X_n$ is independent of the starting state $X_0$, then we might be able to conclude that $X_n$ has the limiting distribution $\pi$.
Construct $k$ Markov chains that are dependent (“coupled”) but which are marginally Markov chains as above. If they start at the $k$ possible values at $X_0$ but have identical values at $X_n$, then we are done.
Note
$n$ cannot be determined as the first value where the $k$ chains meet; it must be determined independently of such information. Thus, usually one wants to generate chains $X_{-n}, X_{-n + 1}, \ldots, X_0$ where $X_0$ has the limiting distribution, and we stepwise increase $n$ to make all chains coalesce to one chain at time 0.
Intuition: Using same source of randomness for all chains
Consider the chains $X^{(j)}_{-n}, \ldots, X^{(j)}_0$ for $j = 1, 2, \ldots, k$. Instead of simulating $X^{(j)}_{i + 1}$ based on $X^{(j)}_i$ independently for each j, we define a function $g$ so that $X^{(j)}_{i + 1} = g(X^{(j)}_i, U_i)$ for all $j$ where $U_i \sim \mathrm{Uniform}(0, 1)$.
Thus, if two chains have identical values in $X_i$, they will also be identical in $X_{i + 1}$.
Thus, for a particular $n$, if the chains have not all coalesced at $X_0$, we simulate $k$ chains from $X_{-2n}$ to $X_{-n}$. They might only hit a subset of the $k$ states at $X_{-n}$ and thus might coalesce to one state at $X_0$ using the old simulations. If not, double $n$ again.
Definition: Counting Process
A counting process $\{N_t\}_{t \in \mathcal{I}}$ is a stochastic process where $\mathcal{I} = \mathbb{R}_0^+$, where the state space is the non-negative integers, and where $0 \leq s \leq t$ implies $N_s \leq N_t$.
Generally, when $s < t, N_t - N_s$ counts the number of “events” (in the loose meaning “things happening”) in $(s, t]$.
A realization of $N_t$ is then a function of $t$ that is a right-continuous step function.
Definition: Poisson Process I
A Poisson process $\{N_t\}_{t \geq 0}$ with parameter $\lambda > 0$ is a counting process, fulfilling,
- $N_0 = 0$.
- $N_t \sim \mathrm{Poisson}(\lambda t)$ for all $t > 0$.
- Stationary Increments: $N_{t + s} - N_s$ has the same distribution as $N_t$ for all $s > 0, t > 0$.
- Independent Increments: $N_t - N_s$ and $N_r - N_q$ are independent, when $0 \leq q < r \leq s < t$.
Note
It is not obvious that such a process necessarily exists. Further, $\mathbb{E}[N_t] = \lambda t$, thus what one is counting occurs with a rate of $\lambda$ items per time unit.
Definition: Poisson Process II
Let $X_1, X_2, \ldots$ be a sequence of i.i.d. exponential random variables with parameter $\lambda$. Let $N_0 = 0$ and, for $t > 0$, $$ N_t \coloneqq \max\{n : X_1 + \ldots + X_n \leq t\}. $$ Then, $\{N_t\}_{t \geq 0}$ is a Poisson process with parameter $\lambda$.
Proof
Firstly, if we start with a Poisson process (definition I), and let $X_1, X_2, \ldots$ be the inter-arrival times, i.e., the time between event 1 and event 2, event 2 and event 3, etc., then these are i.i.d. exponential random variables with parameter $\lambda$. Thus, $N_t$ will be defined as above.
Conversely, if we construct $N_t$ as above, all properties of definition I can be verified.
Let $\{M_t\}_{t \geq 0}$ be defined as $M_t \coloneqq N_{t + s} - N_s$ for some $s > 0$.
- It is a counting process by definition.
- $N_0 = 0$ by definition.
- Stationary increments follow from the memoryless property of the exponential distribution: $M_k = N_{s + t + k} - N_{s + t} = \max\{n : X_{N_{s + t} + 1} + \ldots + X_{N_{s + t} + n} \leq k\}$, which has the same distribution as $N_k$.
- Independent increments follow from the independence of the $X_i$‘s.
- $M_k = N_{k + s} - N_s = \max\{n : X_{N_s + 1} + \ldots + X_{N_s + n} \leq k\}$, which has a $\mathrm{Poisson}(\lambda k)$ distribution.
We call $S_n \coloneqq X_1 + \ldots + X_n$ the arrival times of the process. Let $M = \min\{X_1, \ldots, X_n\}$, where, independently for each $i$, $X_i \sim \mathrm{Exponential}(\lambda_i)$, then, $$ M \sim \mathrm{Exponential}(\lambda_1 + \ldots + \lambda_n), \quad P(M = X_k) = \frac{\lambda_k}{\lambda_1 + \ldots + \lambda_n}. $$
Proof
Since the $X_i$‘s are independent, $$ \begin{align*} P(M > t) & = P(X_1 > t) P(X_2 > t) \ldots P(X_n > t) \newline & = e^{-\lambda_1 t} e^{-\lambda_2 t} \ldots e^{-\lambda_n t} \newline & = e^{-(\lambda_1 + \ldots + \lambda_n) t} \newline \end{align*} $$ Thus, $M \sim \mathrm{Exponential}(\lambda_1 + \ldots + \lambda_n)$.
Further, $$ \begin{align*} P(M = X_1) & = P(X_2 \geq X_1, \ldots, X_n \geq X_1) \newline & = \mathbb{E}[\mathbb{I}[X_2 \geq X_1, \ldots, X_n \geq X_1]] \newline & = \mathbb{E}[\mathbb{E}[\mathbb{I}[X_2 \geq X_1, \ldots, X_n \geq X_1] \mid X_1]] \newline & = \mathbb{E}[P(X_2 \geq X_1, \ldots, X_n \geq X_1 \mid X_1)] \newline & = \mathbb{E}[e^{-\lambda_2 X_1} \ldots e^{-\lambda_n X_1}] \newline & = \mathbb{E}[e^{-(\lambda_2 + \ldots + \lambda_n) X_1}] \newline & = \int_{0}^{\infty} e^{-(\lambda_2 + \ldots + \lambda_n) x} \lambda_1 e^{-\lambda_1 x} \ dx \newline & = \lambda_1 \int_{0}^{\infty} e^{-(\lambda_1 + \lambda_2 + \ldots + \lambda_n) x} \ dx \newline & = \lambda_1 \underbrace{\left[-\frac{1}{\lambda_1 + \lambda_2 + \ldots + \lambda_n} e^{-(\lambda_1 + \lambda_2 + \ldots + \lambda_n) x}\right]_{0}^{\infty}}_{= \frac{1}{\lambda_1 + \lambda_2 + \ldots + \lambda_n}} \newline & = \frac{\lambda_1}{\lambda_1 + \lambda_2 + \ldots + \lambda_n} \ _\blacksquare \end{align*} $$
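A simulation sketch of definition II (made-up parameter values), checking that the resulting $N_t$ has mean and variance $\lambda t$, as a $\mathrm{Poisson}(\lambda t)$ variable should:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t, reps = 2.0, 3.0, 50_000

def N_t(lam, t):
    """Count arrivals up to time t using i.i.d. Exponential(lam) inter-arrival times."""
    total, count = 0.0, 0
    while True:
        total += rng.exponential(1.0 / lam)   # numpy parametrizes by the mean 1/lam
        if total > t:
            return count
        count += 1

samples = np.array([N_t(lam, t) for _ in range(reps)])
print("empirical mean", samples.mean(), "vs lambda*t =", lam * t)
print("empirical var ", samples.var(),  "vs lambda*t =", lam * t)
```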
Definition: Poisson Process III
A Poisson process $\{N_t\}_{t \geq 0}$ with parameter $\lambda > 0$ is a counting process, fulfilling,
- $N_0 = 0$.
- The process has stationary and independent increments, and, as $h \to 0$, $$ \begin{align*} P(N_h = 0) & = 1 - \lambda h + o(h) \newline P(N_h = 1) & = \lambda h + o(h) \newline P(N_h > 1) & = o(h) \newline \end{align*} $$ where $o(h)$ denotes a function such that $\lim_{h \to 0} \frac{o(h)}{h} = 0$.
Lemma: Superposition of Poisson Processes
Let $\{N_t^{(1)}\}_{t \geq 0}, \ldots, \{N_t^{(n)}\}_{t \geq 0}$ be independent Poisson processes with parameters $\lambda p_1, \ldots, \lambda p_n$, respectively, where $p = (p_1, \ldots, p_n)$ is a probability vector, and let $c = (c_1, \ldots, c_n)$ be the counts after time $t$ (i.e., $c_i = N_t^{(i)}$). An equivalent model is obtained by letting $\{N_t\}_{t \geq 0}$ be a Poisson process with parameter $\lambda$ and letting the conditional distribution of $c = (N_t^{(1)}, \ldots, N_t^{(n)})$ given $N_t$ be $$ c \mid N_t \sim \mathrm{Multinomial}(N_t, p). $$
Proof
Using the model with independent Poisson processes, the probability of observing the count vector $c$ after time $t$ is (denoting $N = c_1 + \ldots + c_n$), $$ \begin{align*} \prod_{i = 1}^n \mathrm{Poisson}(c_i; \lambda p_i t) & = \prod_{i = 1}^n e^{-\lambda p_i t} \frac{(\lambda p_i t)^{c_i}}{c_i!} \newline & = e^{-\lambda t} (\lambda t)^N \prod_{i = 1}^n \frac{p_i^{c_i}}{c_i!} \newline & = e^{-\lambda t} \frac{(\lambda t)^N}{N!} \cdot \frac{N!}{c_1! c_2! \ldots c_n!} p_1^{c_1} p_2^{c_2} \ldots p_n^{c_n} \newline & = \mathrm{Poisson}(N; \lambda t) \cdot \mathrm{Multinomial}(c; N, p) \newline \end{align*} $$ The process for $N$ inherits independent and stationary increments from the sub-processes, so it follows that it is also a Poisson process with parameter $\lambda$.
Lemma: Uniformly Distributed Arrival Times
Let $\{N_t\}_{t \geq 0}$ be a Poisson process with parameter $\lambda$. If we fix that $N_t = k$, and we select uniformly randomly one of the $k$ arrivals, then its arrival time is uniformly distributed on the interval $[0, t]$.
Proof
Let $S_1, S_2, \ldots, S_n$ be the arrival times given that $N_t = n$. $$ \begin{align*} P(S_k \geq s \mid k \text{ uniformly random in } \{1, \ldots, n\}, N_t = n) & = \frac{1}{n} \sum_{k = 1}^{n} P(S_k \geq s \mid N_t = n) \newline & = \frac{1}{n} \sum_{k = 1}^{n} \sum_{j = 0}^{k - 1} P(N_s = j \mid N_t = n) \newline & = \frac{1}{n} \sum_{k = 1}^{n} \sum_{j = 0}^{k - 1} \frac{P(N_s = j) P(N_{t - s} = n - j)}{P(N_t = n)} \newline & = \frac{1}{n} \sum_{k = 1}^{n} \sum_{j = 0}^{k - 1} \frac{e^{-\lambda s} \frac{(\lambda s)^j}{j!} e^{-\lambda (t - s)} \frac{(\lambda (t - s))^{n - j}}{(n - j)!}}{e^{-\lambda t} \frac{(\lambda t)^n}{n!}} \newline & = \frac{1}{n} \sum_{j = 0}^{n - 1} (n - j) \frac{n!}{j! (n - j)!} \left(\frac{s}{t}\right)^j \left(1 - \frac{s}{t}\right)^{n - j} \newline & = \left[\sum_{j = 0}^{n - 1} \frac{(n - 1)!}{j! (n - j - 1)!} \left(\frac{s}{t}\right)^j \left(1 - \frac{s}{t}\right)^{n - j - 1}\right] \cdot \left(1 - \frac{s}{t}\right) \newline & = 1 - \frac{s}{t} \end{align*} $$
Definition: Spatial Poisson Process
A collection of random variables $\{N_A\}_{A \subseteq \mathbb{R}^d}$ is a spatial Poisson process with parameter $\lambda$ if,
- For each bounded set $A \subseteq \mathbb{R}^d$, $N_A$ has a Poisson distribution with parameter $\lambda |A|$.
- Whenever $A \subseteq B, N_A \leq N_B$ (i.e., spatial counting process).
- Whenever $A$ and $B$ are disjoint, $N_A$ and $N_B$ are independent.
Definition: Non-Homogeneous Poisson Process
A counting process $\{N_t\}_{t \geq 0}$ is a non-homogeneous Poisson process with intensity function $\lambda(t)$ if,
- $N_0 = 0$.
- For $0 < s < t$, $$ N_t - N_s \sim \mathrm{Poisson}\left(\int_{s}^{t} \lambda(x) \ dx\right). $$
- It has independent increments.
Definition: Continuous-Time Markov Chain
A continuous-time stochastic process $\{X_t\}_{t \geq 0}$ with discrete state space $\mathcal{S}$ is a continuous-time Markov chain if, $$ P(X_{t + s} = j \mid X_s = i, X_u, 0 \leq u < s) = P(X_{t + s} = j \mid X_s = i), $$ where $s, t \geq 0$ and $i, j, X_u \in \mathcal{S}$.
The process is time-homogeneous if for $s, t \geq 0$ and all $i, j \in \mathcal{S}$, $$ P(X_{t + s} = j \mid X_s = i) = P(X_t = j \mid X_0 = i). $$ We then define the transition function as the matrix function $P(t)$ with entries of the matrix given by, $$ P(t)_{ij} = P(X_t = j \mid X_0 = i). $$
Theorem: Chapman-Kolmogorov Equations
For the transition function $P(t)$ we have,
- $P(s + t) = P(s) P(t)$
- $P(0) = I$ (identity matrix)
Proof
For the first property, it suffices to show that the two sides agree in an arbitrary entry $(i, j)$ of the matrix. $$ \begin{align*} P(s + t)_{ij} & = P(X_{s + t} = j \mid X_0 = i) \newline & = \sum_k P(X_{s + t} = j, X_s = k \mid X_0 = i) \newline & = \sum_k P(X_s = k \mid X_0 = i) P(X_{s + t} = j \mid X_s = k, X_0 = i) \newline & = \sum_k P(X_s = k \mid X_0 = i) P(X_{s + t} = j \mid X_s = k) \newline & = \sum_k P(X_s = k \mid X_0 = i) P(X_t = j \mid X_0 = k) \newline & = \sum_k P(s)_{ik} P(t)_{kj} \newline & = (P(s) P(t))_{ij} \ _\blacksquare \end{align*} $$ The second property follows directly from the definition of the transition function.
Intuition: Holding Times
Define $T_i$ as the time a continuous-time Markov chain $\{X_t\}_{t \geq 0}$ started in state $i$ stays in $i$ before moving to a different state, so that for any $s > 0$, $$ P(T_i > s) = P(X_u = i, 0 \leq u \leq s) $$ The distribution of $T_i$ is memoryless and thus exponential. We define $q_i$ such that, $$ T_i \sim \mathrm{Exponential}(q_i) $$ Remember that this means that the average time the process stays in $i$ is $\frac{1}{q_i}$. The rate of transition out of the state is $q_i$.
Note
We can have $q_i = 0$, meaning that the state $i$ is absorbing, $$ P(T_i > s) = 1. $$
Intuition: Embedded Markov Chain
We can define a new stochastic process by listing the states the chain visits. This will be a discrete time Markov chain called the embedded Markov chain with transition matrix $\tilde{P}$.
Note
The transition matrix $\tilde{P}$ has zeros along its diagonal! Why? Because the process cannot stay in the same state when it makes a transition.
Further, note that the continuous-time Markov chain is completely determined by the expected holding times $(\frac{1}{q_1}, \ldots, \frac{1}{q_k})$ and the transition matrix of the embedded Markov chain $\tilde{P}$.
Intuition: Constructing a Continuous-Time Markov Chain
A way to describe a continuous-time Markov chain is to describe $k \times (k - 1)$ independent “alarm clocks”.
For states $i$ and $j$ so that $i \neq j$, let $q_{ij}$ be the parameter of an Exponentially distributed random variable representing the time until an “alarm clock” rings.
When in state $i$, wait until the first of these alarm clocks rings, then move to the state given by the index $j$ of that alarm clock. This defines a continuous-time Markov chain.
The time until the first alarm clock rings is Exponentially distributed with parameter given by, $$ q_i = q_{i1} + q_{i2} + \ldots + q_{i, i - 1} + q_{i, i + 1} + \ldots + q_{ik}. $$ i.e., the parameter of the holding time distribution at $i$.
The chain is completely described by the rates $q_{ij}$.
Further, we saw also that the chain is completely described by the $p_{ij}$ and the $q_i$. The relationship is described by the equation above, and, $$ p_{ij} = \frac{q_{ij}}{q_{i1} + q_{i2} + \ldots + q_{i, i - 1} + q_{i, i + 1} + \ldots + q_{ik}} = \frac{q_{ij}}{q_i}. $$ for $i \neq j$.
Intuition: Infinitesimal Generator Matrix
To relate $P(t)$ to the $q_{ij}$‘s, we first relate them to $P^{\prime}(0)$.
Assuming $P(t)$ is differentiable, we can show that, $$ P^{\prime}(0) = \begin{bmatrix} -q_1 & q_{12} & q_{13} & \ldots & q_{1k} \newline q_{21} & -q_2 & q_{23} & \ldots & q_{2k} \newline \vdots & \vdots & \vdots & \ddots & \vdots \newline q_{k1} & q_{k2} & q_{k3} & \ldots & -q_k \newline \end{bmatrix} = Q, $$
Proof: Lazy “proof”
Firstly, let’s show that $Q_{12} = P^{\prime}(0)_{12} = q_{12}$. $$ \begin{align*} Q_{12} & = \lim_{h \to 0} \frac{P(X_h = 2 \mid X_0 = 1)}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{12}}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{12} - P(0)_{12}}{h} \newline & \eqqcolon P^{\prime}(0)_{12} \newline \end{align*} $$ Secondly, show that $-Q_1 = P^{\prime}(0)_{11}$. $$ \begin{align*} -Q_1 & = -[Q_{12} + Q_{13} + \ldots + Q_{1k}] \newline & = - \lim_{h \to 0} \frac{P(h)_{12} + P(h)_{13} + \ldots + P(h)_{1k}}{h} \newline & = - \lim_{h \to 0} \frac{1 - P(h)_{11}}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{11} - 1}{h} \newline & = \lim_{h \to 0} \frac{P(h)_{11} - P(0)_{11}}{h} \newline & \eqqcolon P^{\prime}(0)_{11} \newline \end{align*} $$ The other elements can be shown similarly $_\blacksquare$.
Note that the rows of $P^{\prime}(0)$, i.e., $Q$, sum to zero!
Theorem: Kolmogorov Forward and Backward Equations
The Kolmogorov Forward and Backward equations describe how the transition function $P(t)$ evolves over time.
We get that for all $t \geq 0$, $$ P^{\prime}(t) = P(t) Q = Q P(t). $$ In terms of the components of $P(t)$, this means, $$ \begin{align*} P^{\prime}(t)_{ij} & = -P_{ij}(t) q_j + \sum_{k \neq j} P_{ik}(t) q_{kj} \newline P^{\prime}(t)_{ij} & = -q_i P_{ij}(t) + \sum_{k \neq i} q_{ik} P_{kj}(t) \newline \end{align*} $$ where the first line is the forward equations and the second line is the backward equations. Thus, the above equations define a set of differential equations which the components of the matrix function $P(t)$ must satisfy.
Proof
$$ \begin{align*} P^{\prime}(t) & = \lim_{h \to 0} \frac{P(t + h) - P(t)}{h} \newline & = \lim_{h \to 0} \frac{P(t) P(h) - P(t)}{h} \newline & = \lim_{h \to 0} P(t) \frac{P(h) - I}{h} \newline & = P(t) \lim_{h \to 0} \frac{P(h) - I}{h} \newline & = P(t) \underbrace{\lim_{h \to 0} \frac{P(h) - P(0)}{h}}_{P^{\prime}(0) = Q} \newline & = P(t) Q \ _\blacksquare \end{align*} $$ Since we arbitrarily chose to write $P(t + h)$ as $P(t) P(h)$, we could have equally written it as $P(h) P(t)$, leading to the other equation, thus completing the proof.
Intuition: Matrix Exponential Solution
The solution to the Kolmogorov Forward and Backward equations is given by the matrix exponential. For any square matrix $A$ we can define the matrix exponential as, $$ e^A \coloneqq \sum_{n = 0}^{\infty} \frac{1}{n!} A^n = I + A + \frac{1}{2} A^2 + \frac{1}{6} A^3 + \ldots $$ The series converges for all square matrices $A$ (but we will not prove this here).
Some important properties of the matrix exponential are:
- $e^0 = I$
- $e^A e^{-A} = I$
- $e^{(s + t) A} = e^{sA} e^{tA}$
- If $AB = BA$, then $e^{A + B} = e^A e^B = e^B e^A$
- $\frac{\partial}{\partial t} e^{tA} = A e^{tA} = e^{tA} A$
Note that $P(t) = e^{tQ}$ is the unique solution to the differential equations given by the Kolmogorov Forward and Backward equations with initial condition $P(0) = I$.
Further, note that $P^\prime(t) = Q P(t)$ for all $t \geq 0$ and $P(0) = I$.
In R one can use the expm package to compute matrix exponentials.
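In Python, `scipy.linalg.expm` can be used in the same way; a sketch with a made-up generator matrix $Q$:

```python
import numpy as np
from scipy.linalg import expm

# Made-up generator matrix: off-diagonal rates q_ij >= 0, rows sum to zero.
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 0.5,  0.5, -1.0]])

t = 0.7
P_t = expm(t * Q)                          # transition function P(t) = e^{tQ}
print("P(t):\n", P_t)
print("row sums:", P_t.sum(axis=1))        # each row sums to one
print("P(0) = I:", np.allclose(expm(0 * Q), np.eye(3)))
```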
Definition: Multinomial Distribution
A vector $x = (x_1, x_2, \ldots, x_k)$ of non-negative integers has a Multinomial distribution with parameters $n$ and $p$, where $n > 0$ is an integer and $p$ is a probability vector of length $k$, if $\sum_{i = 1}^k x_i = n$ and the probability mass function is given by, $$ \pi(x \mid n, p) = \frac{n!}{x_1! x_2! \ldots x_k!} p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k}. $$ We write $x \sim \mathrm{Multinomial}(n, p)$.
Definition: Dirichlet Distribution
A vector $p = (p_1, p_2, \ldots, p_k)$ of non-negative real numbers satisfying $\sum_{i = 1}^k p_i = 1$ has a Dirichlet distribution with parameter vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k)$, if it has probability density function, $$ \pi(p \mid \alpha) = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1}, $$ where $\alpha_i > 0$ for all $i = 1, 2, \ldots, k$. We write $p \sim \mathrm{Dirichlet}(\alpha)$.
Theorem: Multinomial-Dirichlet Conjugacy
Let $x = (x_1, x_2, \ldots, x_k)$ be a vector of counts with $x \sim \mathrm{Multinomial}(n, p)$ and prior $p \sim \mathrm{Dirichlet}(\alpha)$. Then the posterior distribution of $p$ given $x$ is, $$ p \mid x \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k). $$ Further, if $p \sim \mathrm{Dirichlet}(\alpha)$, then $\mathbb{E}[p_i] = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j}$ for $i = 1, 2, \ldots, k$.
Proof: Multinomial-Dirichlet Conjugacy
Let $\pi(x \mid p) = \mathrm{Multinomial}(x; n, p)$ and $\pi(p) = \mathrm{Dirichlet}(p; \alpha)$, respectively. Then, $$ \begin{align*} \pi(p \mid x) & \propto \pi(x \mid p) \pi(p) \newline & = \mathrm{Multinomial}(x; n, p) \mathrm{Dirichlet}(p; \alpha) \newline & \propto_p p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k} \cdot p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \newline & \propto_p p_1^{\alpha_1 + x_1 - 1} p_2^{\alpha_2 + x_2 - 1} \ldots p_k^{\alpha_k + x_k - 1} \newline & = \mathrm{Dirichlet}(p; \alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k) \ _\blacksquare \end{align*} $$
Proof: Expectation of Dirichlet Distribution
Let $p \sim \mathrm{Dirichlet}(\alpha)$. Then, $$ \begin{align*} \mathbb{E}[p_i] & = \int \cdots \int p_i \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \int \cdots \int p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_i^{\alpha_i} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \cdot \frac{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_i + 1) \ldots \Gamma(\alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \frac{\Gamma(\alpha_i + 1)}{\Gamma(\alpha_i)} \cdot \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \alpha_i \cdot \frac{1}{\alpha_1 + \alpha_2 + \ldots + \alpha_k} \newline & = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j} \ _\blacksquare \end{align*} $$
Intuition: Predictions for The Multinomial-Dirichlet Model
If $p \sim \mathrm{Dirichlet}(\alpha)$ and $x \mid p \sim \mathrm{Multinomial}(n, p)$, then the predictive distribution is given by, $$ \pi(x) = \frac{n!}{x_1! x_2! \ldots x_k!} \cdot \frac{\Gamma(\alpha_1 + x_1)}{\Gamma(\alpha_1)} \cdot \frac{\Gamma(\alpha_2 + x_2)}{\Gamma(\alpha_2)} \cdots \frac{\Gamma(\alpha_k + x_k)}{\Gamma(\alpha_k)} \cdot \frac{\Gamma(\sum_{i = 1}^k \alpha_i)}{\Gamma(n + \sum_{i = 1}^k \alpha_i)}. $$ For example, if $e_i$ is the vector with 1 at position $i$ and zeros elsewhere, then $\pi(x = e_i) = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j}$. Further, if $x_{\text{new}}$ is a vector of new counts, then, as $p \mid x \sim \mathrm{Dirichlet}(\alpha + x)$, we get, $$ \pi(x_{\text{new}} = e_i \mid x) = \frac{\alpha_i + x_i}{\sum_{j = 1}^k \alpha_j + n}. $$ The $\alpha_i$ in the prior are often called pseudocounts, as they can be interpreted as prior counts added to the observed counts.
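A small sketch of the conjugate update and the predictive probability for the next single observation (the counts and pseudocounts below are made up):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])      # Dirichlet pseudocounts (made-up prior)
x = np.array([8, 3, 1])                # observed Multinomial counts (made-up)
n = x.sum()

alpha_post = alpha + x                 # posterior: Dirichlet(alpha + x)
print("posterior mean of p:", alpha_post / alpha_post.sum())

# Predictive probability that the next single observation falls in category i.
pred = (alpha + x) / (alpha.sum() + n)
print("P(x_new = e_i | x):", pred)
```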
Definition: Hidden Markov Model
A Hidden Markov Model (HMM) consists of:
- A Markov chain $X_0, \ldots, X_n, \ldots$ and,
- another sequence $Y_0, \ldots, Y_n, \ldots$ such that, $$ P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid X_k). $$ In some models we may instead have, $$ P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid Y_{k - 1}, X_k). $$ Generally, $Y_0, \ldots, Y_n$ are called observations and $X_0, \ldots, X_n$ are called hidden states. The $X_i$‘s represent the “underlying process” that the observed values $Y_i$‘s depend on. Further, generally the $X_k$ have a finite state space.
Intuition: Inference in HMMs
When the parameters of the HMM are known, we want to know about the values of the hidden variables $X_i$, for example,
- What is the most likely sequence $X_0, X_1, \ldots, X_n$ given the data?
- What is the probability distribution for a single $X_i$ given the data?
However, when the parameters of the HMM are unknown, we need to infer these from some data.
If data with all $X_i$ and $Y_i$ known is available, inference for parameters is based on counts of transitions. The inference for the Markov chain is exactly as for the Markov chains we have looked at before. The inference for the emission probabilities, i.e., the parameters of $P(Y_k \mid X_k)$, can be done independently of the inference for the Markov chain.
Intuition: Bayesian Inference for Poisson Processes
For a homogeneous Poisson process, we set up a prior $\pi(\lambda)$ for its parameter $\lambda$, and find the posterior given the observations.
The likelihood for observing $y$ events in an interval of length $t$ is given by, $$ \mathrm{Poisson}(y; \lambda t) = \exp(-\lambda t) \frac{(\lambda t)^y}{y!} \propto_{\lambda} \exp(-\lambda t) \lambda^y. $$ A convenient prior to use is $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$.
In this case, the posterior becomes, $$ \lambda \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + y, \beta + t), $$ and the predictive distribution for the number of observations $y_n$ in a different interval of length $u$ becomes, $$ \pi(y_n) = \mathrm{Negative-Binomial}\left(y_n; \alpha, \frac{\beta}{\beta + u}\right). $$
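A sketch of this conjugate update, with a Monte Carlo check of the negative-binomial predictive (all numbers are made up; `scipy.stats.nbinom` is parametrized so that $\mathrm{NegativeBinomial}(\alpha, \beta/(\beta+u))$ matches the Gamma-Poisson mixture above):

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(4)

alpha, beta = 2.0, 1.0      # Gamma prior for lambda (made-up values)
y, t = 7, 3.0               # observed: y events in an interval of length t

# Conjugate posterior.
alpha_post, beta_post = alpha + y, beta + t
print("posterior: Gamma(", alpha_post, ",", beta_post, ")")

# Predictive for the number of events in an interval of length u, mixing the
# Poisson over the Gamma(alpha, beta) prior; plugging in (alpha_post, beta_post)
# instead gives the posterior predictive.
u = 2.0
pred = nbinom(alpha, beta / (beta + u))

# Monte Carlo check of the same mixture.
lams = rng.gamma(alpha, 1.0 / beta, size=200_000)   # numpy uses (shape, scale)
mc = rng.poisson(lams * u)
print("P(y_n = 3): closed form", pred.pmf(3), "Monte Carlo", (mc == 3).mean())
```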
Intuition: Bayesian Inference for Continuous-Time Markov Chains
Lastly, recall that a continuous-time Markov chain with finite state space is completely characterized by its holding times parameter vector $q$ and the transition matrix $\tilde{P}$ of its embedded Markov chain.
Parametrizing instead with the “alarm clock” parameters $q_{ij}$ gives an equivalent description of the process.
The two parts of the data that can be considered independently are,
- We learn about $\tilde{P}$ from the counts of transitions between states, and
- We learn about $q$ from the observed lengths of stays the process has in each state.
For $\tilde{P}$ the situation is analogous to the one for discrete-time Markov chains, except that the diagonal of $\tilde{P}$ must be zero, so the prior must exclude the possibility of non-zero diagonal elements.
For example, for $\tilde{P}_1$, the first row of $\tilde{P}$, we might use the prior $\mathrm{Dirichlet}(0, 1, \ldots, 1)$, i.e., a Dirichlet prior with a zero pseudocount for the first element.
The holding times in state $i$ are distributed as $\mathrm{Exponential}(q_i)$. If we have observed a total holding time of $h$ over $n$ intervals, that data has likelihood proportional to $e^{-hq_i}q_i^n$. Using $q_i \sim \mathrm{Gamma}(\alpha, \beta)$ as prior results in the posterior $q_i \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + n, \beta + h)$.
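A sketch of these two posterior updates for a single state (the observed counts, holding times, and priors below are made up):

```python
import numpy as np

# Made-up data for state 1 of a CTMC with 3 states.
transition_counts = np.array([0, 4, 2])   # observed jumps 1 -> j (diagonal stays 0)
h, n_stays = 5.3, 6                       # total holding time in state 1, over 6 stays

# Row 1 of the embedded chain: Dirichlet prior with a zero pseudocount on the diagonal.
prior = np.array([0.0, 1.0, 1.0])
post = prior + transition_counts
print("posterior mean of row 1 of P~:", post / post.sum())

# Holding-time rate q_1: Gamma(alpha, beta) prior gives Gamma(alpha + n, beta + h).
alpha, beta = 1.0, 1.0
print("posterior for q_1: Gamma(", alpha + n_stays, ",", beta + h, ")")
```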
Definition: Stationary Distribution for Continuous-Time Markov Chains
In the context of continuous-time Markov chains, $v$ is a stationary distribution if and only if $0 = v Q$, where $Q$ is the generator matrix of the chain.
Proof: Stationary Distribution and Generator Matrix
Firstly, let $vQ = 0$, $$ \begin{align*} \frac{d}{dt} (v P(t)) & \coloneqq v \frac{d}{dt} P(t) \newline & \eqqcolon v Q P(t) \newline & = 0 \newline \end{align*} $$ as $vQ = 0$ by assumption. Thus, $v P(t)$ is constant in $t$, i.e., $v P(t) = v P(0) = v I = v$ for all $t \geq 0$.
Conversely, let $v$ be a stationary distribution, i.e., $v = v P(t)$ for all $t \geq 0$. Then, $$ \begin{align*} \frac{d}{dt} (v P(t)) & \coloneqq v \frac{d}{dt} P(t) \newline & \eqqcolon v Q P(t) \newline & = 0 \newline \end{align*} $$ as $v P(t)$ is constant in $t$ by assumption. In particular, for $t = 0$ we get $v Q P(0) = v Q I = v Q = 0$. Thus, the proof is complete $_\blacksquare$.
Intuition: Long-Term Behavior of Continuous-Time Markov Chains
A continuous-time Markov chain is irreducible if for all $i$ and $j$ there exists a $t > 0$ such that $P_{ij}(t) > 0$, i.e., it is possible to get to any state from any state.
However, periodic continuous-time Markov chains do not exist. If $P_{ij}(t) > 0$ for some $t > 0$, then $P_{ij}(t) > 0$ for all $t > 0$.
Theorem: Fundamental Limit Theorem for Continuous-Time Markov Chains
Let $\{X_t\}_{t \geq 0}$ be a finite, irreducible, continuous-time Markov chain with transition function $P(t)$. Then there exists a unique positive stationary distribution vector $v$ which is also the limiting distribution.
The limiting distribution of such a chain can be found as the unique probability vector $v$ satisfying $v Q = 0$.
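A numerical sketch (with a made-up generator matrix): solve $vQ = 0$ together with $\sum_i v_i = 1$, and cross-check against $P(t) = e^{tQ}$ for a large $t$:

```python
import numpy as np
from scipy.linalg import expm

# Made-up irreducible generator matrix (rows sum to zero).
Q = np.array([[-3.0,  2.0,  1.0],
              [ 1.0, -1.5,  0.5],
              [ 0.5,  0.5, -1.0]])

k = Q.shape[0]
A = np.vstack([Q.T, np.ones(k)])           # v Q = 0  and  sum(v) = 1
b = np.zeros(k + 1)
b[-1] = 1.0
v, *_ = np.linalg.lstsq(A, b, rcond=None)
print("stationary / limiting distribution:", v)

# All rows of P(t) approach v as t grows.
print("row of P(50):", expm(50 * Q)[0])
```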
Definition: Absorbing State
An absorbing state is one where the rate of leaving it is zero.
Assume $\{X_t\}_{t \geq 0}$ is a continuous-time Markov chain with $k$ states. Assume the last state is absorbing and the rest are not (if the absorbing state is accessible from each of the other states, those states are then transient).
We have that $q_k = 0$ and the entire last row of $Q$ must consist of zeros. We thus get, $$ Q = \begin{bmatrix} V & \star \newline \mathbf{0} & 0 \newline \end{bmatrix}. $$ Let $F$ be the $(k - 1) \times (k - 1)$ matrix such that $F_{ij}$ (with $i < k, j < k$) is the expected time spent in state $j$ when the chain starts in $i$. We can show that $F = -V^{-1}$.
Proof: Expected Time in Transient States Before Absorption
Generally, we can define $D$ as the matrix with $(\frac{1}{q_1}, \ldots, \frac{1}{q_k})$ along its diagonal, with all other entries zero. If there are no absorbing states, $$ \tilde{P} = DQ + I_k. $$ Write $A\_$ for a square matrix without its last row and column, so, e.g., $Q\_ = V$.
If the last state is absorbing, i.e., $q_k = 0$, we get, $$ \tilde{P}\_ = D\_ Q\_ + I_{k - 1}. $$ Further, let $F^{\prime}$ be the matrix where $F^{\prime}_{ij}$ is the expected number of stays in state $j$ before absorption when starting in state $i$. As the lengths of stays and changes in states are independent, we get $F_{ij} = F^{\prime}_{ij} \frac{1}{q_j}$ and thus, $$ F = F^{\prime} D\_. $$ By the theory for discrete-time Markov chains with absorbing states, we have, $$ F^{\prime} = (I_{k - 1} - \tilde{P}\_)^{-1}. $$ Thus, $$ \begin{align*} F & = F^{\prime} D\_ \newline & = (I_{k - 1} - \tilde{P}\_)^{-1} D\_ \newline & = (I_{k - 1} - (D\_ Q\_ + I_{k - 1}))^{-1} D\_ \newline & = (-D\_ Q\_)^{-1} D\_ \newline & = -Q\_^{-1} D\_^{-1} D\_ \newline & = -Q\_^{-1} \newline & \eqqcolon -V^{-1} \ _\blacksquare \end{align*} $$
$F$ is called the fundamental matrix (similar to the discrete-time case).
Note
If the chain starts in state $i$, the expected time until absorption is the sum of the $i$-th row of $F$. Thus, the expected times until absorption are given by the matrix product $F1$ of $F$ with a column of 1s.
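A sketch computing $F = -V^{-1}$ and the expected times until absorption for a made-up generator with one absorbing state:

```python
import numpy as np

# Made-up generator with the last state absorbing (last row all zeros).
Q = np.array([[-2.0,  1.5,  0.5],
              [ 1.0, -3.0,  2.0],
              [ 0.0,  0.0,  0.0]])

V = Q[:2, :2]                      # generator restricted to the transient states
F = -np.linalg.inv(V)              # expected time spent in each transient state
print("F:\n", F)
print("expected time until absorption:", F @ np.ones(2))
```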
Intuition: Stationary Distribution of the Embedded Markov Chain
The embedded chain of a continuous-time Markov chain: The discrete-time Markov chain where holding times are ignored.
The stationary distribution for the embedded chain and for the continuous-time chain are generally not the same!
However, there is a simple relationship: A probability vector $\pi$ is a stationary distribution for a continuous-time Markov chain if and only if $\psi$ is a stationary distribution for the embedded chain, where $\psi_j = C \pi_j q_j$ for a constant $C$ making the entries sum to 1.
Proof: Stationary Distribution of the Embedded Markov Chain
Using notation as above, we have $\tilde{P} = DQ + I$. For any vector $v$, we get, $$ v \tilde{P} = v (DQ + I) = v DQ + v. $$ so $v \tilde{P} = v$ if and only if $v DQ = 0$. Taking $v = \psi$ with $\psi_j = C \pi_j q_j$, we have $(\psi D)_j = C \pi_j$, so $\psi DQ = C \pi Q$, which is zero if and only if $\pi Q = 0$, i.e., if and only if $\pi$ is a stationary distribution for the continuous-time chain.
Definition: Global Balance Equations
Let $v = (v_1, v_2, \ldots, v_n)$ be the stationary distribution of a continuous-time Markov chain with generator matrix $Q$. At $v$, the flow into a state must be equal to the flow out of the state, by the definition of stationary distribution. These are exactly the equations we get from $v Q = 0$, $$ (v_1, v_2, \ldots, v_n) \begin{bmatrix} -q_1 & q_{12} & \cdots & q_{1n} \newline q_{21} & -q_2 & \cdots & q_{2n} \newline \vdots & \vdots & \ddots & \vdots \newline q_{n1} & q_{n2} & \cdots & -q_n \newline \end{bmatrix} = \mathbf{0} $$ Thus, because $vQ = 0$, for each state $j$ we have, $$ \sum_{i \neq j} v_i q_{ij} = v_j q_j. $$ These are called the global balance equations.
One can generalize this to: If $A$ is a set of states, then the long term rates of movement into and out of $A$ are the same, $$ \sum_{i \in A} \sum_{j \notin A} v_i q_{ij} = \sum_{i \notin A} \sum_{j \in A} v_i q_{ij}. $$
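As a sketch in R (the generator below is an arbitrary 3-state example), a stationary distribution can be found by solving $vQ = 0$ together with the normalization $\sum_i v_i = 1$, and the global balance equation can then be checked directly:

```r
# Arbitrary generator matrix for a 3-state chain (rows sum to zero).
Q <- matrix(c(-3,  2,  1,
               1, -2,  1,
               2,  2, -4), nrow = 3, byrow = TRUE)

A <- cbind(Q, 1)                        # append the normalization sum(v) = 1
v <- qr.solve(t(A), c(rep(0, 3), 1))    # least-squares solution of v Q = 0, sum(v) = 1
v                                       # stationary distribution
sum(v[-1] * Q[-1, 1]) - v[1] * (-Q[1, 1])   # global balance at state 1 (should be ~0)
```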
Definition: Local Balance Equations and Time-Reversible Chains
A stronger condition than global balance is local balance: the flow between every pair of states must be balanced.
Formally, an irreducible continuous-time Markov chain with stationary distribution $v$ is said to be time-reversible if, for all states $i$ and $j$, $$ v_i q_{ij} = v_j q_{ji}. $$
Note
The rate of observed changes from $i$ to $j$ is the same as the rate of observed changes from $j$ to $i$. Thus, this is also called time-reversibility, as the process looks the same when observed backward in time.
Further, if a probability vector $v$ satisfies the local balance equations, then it is a stationary distribution of the chain.
Definition: Markov Process on a Tree
Firstly, a tree is a connected (undirected) graph with no cycles. A transition graph gives rise to an undirected graph by connecting all nodes between which there is some positive rate of transition. Now, assume that the transition graph of an irreducible continuous-time Markov chain is (or gives rise to) a tree.
In a tree, any edge between two states divides all states into two groups (disjoint sets), one on each side of the edge. Thus, the flow must be balanced across each edge.
In this scenario, the global balance condition then implies the local balance condition over this edge.
Definition: Birth-and-Death Process
A birth-and-death process is a continuous-time Markov chain where the state space is the set of non-negative integers and transitions only occur to neighboring states. The process is necessarily time-reversible, as the transition graph is a tree (in fact a line, if we further assume irreducibility).
We denote the rate of births from $i$ to $i + 1$ with $\lambda_i$, and the rate of deaths from $i$ to $i - 1$ with $\mu_i$. Thus, the generator matrix has the form, $$ Q = \begin{bmatrix} -\lambda_0 & \lambda_0 & 0 & 0 & \cdots \newline \mu_1 & -(\lambda_1 + \mu_1) & \lambda_1 & 0 & \cdots \newline 0 & \mu_2 & -(\lambda_2 + \mu_2) & \lambda_2 & \cdots \newline 0 & 0 & \mu_3 & -(\lambda_3 + \mu_3) & \cdots \newline \vdots & \vdots & \vdots & \vdots & \ddots \newline \end{bmatrix} $$ where $\lambda_i > 0$ for all $i \geq 0$ and $\mu_i > 0$ for all $i \geq 1$.
Provided that $\sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_{i}} < \infty$, (i.e., the stationary distribution exists), the stationary distribution is given by, $$ \begin{align*} v_k & = v_0 \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}, \newline v_0 & = \left(1 + \sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}\right)^{-1}. \newline \end{align*} $$
Proof: Stationary Distribution of Birth-and-Death Processes
Using the local balance equations, we have, $$ \begin{align*} v_k \lambda_k & = v_{k + 1} \mu_{k + 1}, \newline \end{align*} $$ which gives, $$ \begin{align*} v_{k + 1} & = v_k \frac{\lambda_k}{\mu_{k + 1}} \newline & = v_0 \prod_{i = 1}^{k + 1} \frac{\lambda_{i - 1}}{\mu_i}. \newline \end{align*} $$ Further, as $v$ is a probability vector, we have, $$ \begin{align*} 1 & = \sum_{k = 0}^{\infty} v_k \newline & = v_0 \left(1 + \sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}\right). \newline & \Rightarrow v_0 = \left(1 + \sum_{k = 1}^{\infty} \prod_{i = 1}^{k} \frac{\lambda_{i - 1}}{\mu_i}\right)^{-1}. \newline \end{align*} $$ Thus, the proof is complete. $_\blacksquare$
Definition: Brownian Motion
Brownian motion is a continuous-time stochastic process $\{B_t\}_{t \geq 0}$ with the following properties:
- $B_0 = 0$.
- For $t > 0$, $B_t \sim \mathcal{N}(0, t)$ (i.e., normally distributed with mean $0$ and variance $t$).
- For $s, t > 0$, $B_{t + s} - B_s \sim \mathcal{N}(0, t)$ (i.e., the increments are stationary).
- For $0 \leq q < r \leq s < t$, $B_t - B_s$ is independent of $B_r - B_q$ (i.e., the increments are independent).
- The function $t \mapsto B_t$ is continuous with probability $1$ (almost surely).
Intuition: Brownian Motion as Limit of Random Walks
A random walk is a discrete-time Markov chain $S_0, S_1, S_2, \ldots$ where $S_0 = 0$ and, $$ S_n = Y_1 + Y_2 + \ldots + Y_n, $$ and $Y_1, Y_2, \ldots$ are i.i.d. random variables. Assume $\mathbb{E}[Y_i] = 0$.
Further, if we assume $\mathrm{Var}(Y_i) = 1$, we get $\mathrm{Var}(S_n) = n$.
Interpolating between the values $S_n$ we can make this into a continuous-time process $S_t$ where $\mathrm{Var}(S_t) \approx t$.
We may scale with an $s > 0$ to get processes $S_t^{(s)} = \frac{S_{st}}{\sqrt{s}}$ where we get $\lim_{s \to \infty} \mathrm{Var}(S_t^{(s)}) = t$.
It turns out that, as $s \to \infty$, the processes $S_t^{(s)}$ converge (in distribution) to Brownian motion, no matter which distribution the $Y_i$ have.
This is Donsker's invariance principle, a generalization of the central limit theorem to stochastic processes.
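A rough way to see this numerically in R (the step distribution and the value of $s$ below are arbitrary choices) is to build a scaled random walk from non-normal increments and check that its endpoint looks like $B_1 \sim \mathcal{N}(0, 1)$:

```r
set.seed(1)
s <- 10000                                     # scaling parameter
Y <- sample(c(-1, 1), s, replace = TRUE)       # i.i.d. increments with mean 0, variance 1
S <- c(0, cumsum(Y))
t_grid <- seq(0, 1, length.out = 500)
B_approx <- S[floor(s * t_grid) + 1] / sqrt(s) # S_t^(s) = S_{st} / sqrt(s)
plot(t_grid, B_approx, type = "l", xlab = "t", ylab = "approximate Brownian path")

# The endpoint S_1^(s) should be approximately N(0, 1):
ends <- replicate(2000, sum(sample(c(-1, 1), s, replace = TRUE)) / sqrt(s))
c(mean(ends), var(ends))                       # close to 0 and 1
```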
Definition: Gaussian Process
A Gaussian process is a continuous-time stochastic process $\{X_t\}_{t \geq 0}$ with the property that for all $n \geq 1$ and $0 \leq t_1 < t_2 < \ldots < t_n$, $X_{t_1}, X_{t_2}, \ldots, X_{t_n}$ have a multivariate normal distribution.
Thus, a Gaussian process is completely determined by its mean function $\mathbb{E}[X_t]$ and its covariance function $\mathrm{Cov}(X_s, X_t)$.
Intuition: Brownian Motion as a Gaussian Process
Brownian motion is a Gaussian process, as we can show that any $a_1 B_{t_1} + a_2 B_{t_2} + \ldots + a_n B_{t_n}$ is normally distributed.
A Gaussian process $\{X_t\}_{t \geq 0}$ is a Brownian motion if and only if,
- $X_0 = 0$.
- $\mathbb{E}[X_t] = 0$ for all $t$.
- $\mathrm{Cov}(X_s, X_t) = \min(s, t)$ for all $s, t$.
- The function $t \mapsto X_t$ is continuous with probability $1$ (almost surely).
Intuition: Transformations of Brownian Motion
The following transformations of Brownian motion also yield Brownian motion:
- $\{-B_t\}_{t \geq 0}$, negating the process yields another (reflected) Brownian motion.
- $\{B_{t + s} - B_s\}_{t \geq 0}$ for any fixed $s \geq 0$, shifting the time origin yields another Brownian motion.
- $\{\frac{1}{\sqrt{a}} B_{a t}\}_{t \geq 0}$ for any fixed $a > 0$, scaling time and space yields another Brownian motion.
- The process $\{X_t\}_{t \geq 0}$ where $X_0 = 0$ and $X_t = t B_{\frac{1}{t}}$ for $t > 0$, time inversion yields another Brownian motion.
Intuition: Stopping Times
We saw above that, for any fixed $s$, $(B_{t + s} - B_s)_{t \geq 0}$ is a Brownian motion. Does this phenomenon also hold if we start the process anew from a time $T$ when $T$ is random? It depends.
If $T$ is the largest value less than 1 where $B_T = 0$, then $(B_{T + s} - B_T)_{s \geq 0}$ is not a Brownian motion.
If $T$ is the smallest value where $B_T = a$ for some constant $a$, then $(B_{T + s} - B_T)_{s \geq 0}$ is a Brownian motion. The reason is that the event $\{T \leq t\}$ can be determined based on $B_r$ where $0 \leq r \leq t$.
Random times $T$ that have this property are called stopping times. For these, $(B_{T + s} - B_T)_{s \geq 0}$ is a Brownian motion.
Intuition: The Distribution of the First Hitting Time
Given that $a \neq 0$, what is the distribution of the first hitting time $T_a = \min\{t : B_t = a\}$?
We will prove that, $$ \frac{1}{T_a} \sim \mathrm{Gamma}\left(\frac{1}{2}, \frac{a^2}{2}\right). $$
Assuming that $a > 0$ and using that $T_a$ is a stopping time we get for any $t > 0$ that $P(B_{\frac{1}{t}} > a \mid T_a < \frac{1}{t}) = P(B_{\frac{1}{t} - T_a} > 0) = \frac{1}{2}$.
We also have, $$ \begin{align*} P(B_{\frac{1}{t}} > a \mid T_a < \frac{1}{t}) & = \frac{P(B_{\frac{1}{t}} > a, T_a < \frac{1}{t})}{P(T_a < \frac{1}{t})} \newline & = \frac{P(B_{\frac{1}{t}} > a)}{P(T_a < \frac{1}{t})}. \end{align*} $$ Further, it follows that $P(T_a < \frac{1}{t}) = 2 P(B_{\frac{1}{t}} > a)$ and thus, $$ \begin{align*} P\left(\frac{1}{T_a} \leq t\right) & = P\left(T_a \geq \frac{1}{t}\right) = 1 - 2 P\left(B_{\frac{1}{t}} > a\right) \newline & = 2 P\left(B_{\frac{1}{t}} \leq a\right) - 1 \newline & = 2 P(B_1 \leq a \sqrt{t}) - 1. \newline \end{align*} $$ Taking the derivative with respect to $t$ we get the Gamma density, $$ \pi_{1/T_a}(t) = 2 \frac{1}{\sqrt{2 \pi}} \exp\left(-\frac{1}{2} \left(a \sqrt{t}\right)^2\right) \frac{a}{2} t^{-1/2}. $$
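The relation $P(T_a < t) = 2 P(B_t > a)$ and the Gamma distribution of $\frac{1}{T_a}$ can be cross-checked numerically in R (the values of $a$ and $t$ below are arbitrary):

```r
a <- 1.5
t <- 2
2 * (1 - pnorm(a / sqrt(t)))                    # P(T_a < t) = 2 P(B_t > a)
1 - pgamma(1 / t, shape = 1/2, rate = a^2 / 2)  # P(1/T_a > 1/t), with 1/T_a ~ Gamma(1/2, a^2/2)
```

The two numbers agree, as expected.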
Intuition: Maximum of Brownian Motion
We can define $M_t \coloneqq \max_{0 \leq s \leq t} B_s$.
We may compute for $a > 0$ (using the results above), $$ \begin{align*} P(M_t > a) & = P(T_a < t) \newline & = 2 P(B_t > a) \newline & = P(|B_t| > a). \end{align*} $$ Thus, $M_t$ has the same distribution as $|B_t|$, i.e., the absolute value of $B_t$.
Intuition: Zeroes of Brownian Motion
Let $L$ be the last zero in $(0, 1)$ of a Brownian motion, i.e., $L = \max \ \{t : 0 < t < 1, B_t = 0\}$. Then, $$ L \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right). $$
Proof: Outline of Proof
$$ \begin{align*} P(L > s) & = \int_{-\infty}^{\infty} P(L > s \mid B_s = t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 2 \int_{-\infty}^{0} P(L > s \mid B_s = t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 2 \int_{-\infty}^{0} P(M_{1 - s} > -t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 2 \int_{0}^{\infty} 2 P(B_{1 - s} > t) \ \mathcal{N}(t; 0, s) \ dt \newline & = 4 \int_{0}^{\infty} \int_{t}^{\infty} \mathcal{N}(r; 0, 1 - s) \ \mathcal{N}(t; 0, s) \ dr \ dt \newline & = \ldots \newline & = \frac{1}{\pi} \int_s^1 \frac{1}{\sqrt{x(1 - x)}} \ dx \newline & = \int_s^1 \mathrm{Beta}\left(x; \frac{1}{2}, \frac{1}{2}\right) \ dx. \newline \end{align*} $$ More generally, let $L_t$ be the last zero in $(0, t)$. Then, by the same argument (rescaling time), $$ \frac{L_t}{t} \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right). $$
Note
The probability that a Brownian motion has at least one zero in $(r, t)$ for $0 \leq r < t$ is $1 - P(L_t < r)$.
Further, the cumulative distribution for the $\mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right)$ density can be computed with the arcsin function, $$ P(L_t < r) = \int_0^{\frac{r}{t}} \mathrm{Beta}\left(s; \frac{1}{2}, \frac{1}{2}\right) \ ds = \frac{2}{\pi} \arcsin\left(\sqrt{\frac{r}{t}}\right). $$
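In R (with arbitrary values $r = 1$ and $t = 4$), the probability of at least one zero in $(r, t)$ can be computed from the arcsine formula, or equivalently with the Beta CDF:

```r
r <- 1
t <- 4
p_no_zero <- (2 / pi) * asin(sqrt(r / t))   # P(L_t < r): no zero in (r, t)
1 - p_no_zero                               # probability of at least one zero in (r, t)
pbeta(r / t, 1/2, 1/2)                      # same as p_no_zero, via the Beta(1/2, 1/2) CDF
```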
Definition: Brownian Bridge
Define a Gaussian process $X_t$ by conditioning a Brownian motion $B_t$ on $B_1 = 0$. Then $X_t$ is a Brownian bridge.
If $0 < s < t < 1$, then $(B_s, B_t, B_1)$ is a multivariate normal with, $$ \begin{align*} \mathbb{E}[(B_s, B_t, B_1)] & = (0, 0, 0) \newline \mathrm{Var}((B_s, B_t, B_1)) & = \Sigma = \begin{pmatrix} s & s & s \newline s & t & t \newline s & t & 1 \newline \end{pmatrix}. \newline \end{align*} $$ Conditioning on $B_1 = 0$ and using the properties of the multivariate normal we get that $\mathbb{E}[X_t] = 0$ and, $$ \mathrm{Cov}(X_s, X_t) = s(1 - t) = s - st. $$ This characterizes the Brownian bridge as the Gaussian process with mean zero and covariance function $\mathrm{Cov}(X_s, X_t) = s(1 - t)$ for $s \leq t$.
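As a small numerical check in R (the values $s = 0.3$ and $t = 0.7$ are arbitrary), conditioning the multivariate normal $(B_s, B_t)$ on $B_1 = 0$ indeed gives covariance $s(1 - t)$:

```r
s <- 0.3
t <- 0.7
Sigma <- matrix(c(s, s, s,
                  s, t, t,
                  s, t, 1), nrow = 3, byrow = TRUE)
A <- 1:2   # indices of (B_s, B_t)
b <- 3     # index of B_1
cond_cov <- Sigma[A, A] - outer(Sigma[A, b], Sigma[b, A]) / Sigma[b, b]
cond_cov   # off-diagonal entry equals s * (1 - t) = 0.09
```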
Definition: Brownian Motion with Drift
For any real $\mu$ and $\sigma > 0$ we can define the Gaussian process $X_t$ as, $$ X_t = \mu t + \sigma B_t. $$ This is a Brownian motion with drift, and is often a more useful model than standard Brownian motion.
Note
Note that $X_t$ is Normal with expectation $\mu t$ and variance $\sigma^2 t$.
This is a Gaussian process with continuous paths and stationary and independent increments.
Definition: Geometric Brownian Motion
The stochastic process, $$ G_t = G_0 e^{\mu t + \sigma B_t}, $$ where $G_0 > 0$, is called a geometric Brownian motion with drift parameter $\mu$ and variance parameter $\sigma^2$.
$\log(G_t)$ is a Gaussian process with expectation $\log(G_0) + \mu t$ and variance $\sigma^2 t$.
Note
One can show that, $$ \begin{align*} \mathbb{E}[G_t] & = G_0 e^{t(\mu + \frac{1}{2} \sigma^2)} \newline \mathrm{Var}(G_t) & = G_0^2 e^{t(2\mu + \sigma^2)} (e^{\sigma^2 t} - 1). \newline \end{align*} $$
Proof
Using that $\log(G_t) \sim \mathcal{N}(\log(G_0) + \mu t, \sigma^2 t)$ we have, $$ \begin{align*} \mathbb{E}[G_t] & = \mathbb{E}[e^{\log(G_t)}] \newline & = \int_{-\infty}^{\infty} e^x \ \mathcal{N}(x; \log(G_0) + \mu t, \sigma^2 t) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t))^2}{2 \sigma^2 t} + x\right) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t + \sigma^2 t))^2}{2 \sigma^2 t} + \log(G_0) + \mu t + \frac{1}{2} \sigma^2 t\right) \ dx \newline & = e^{\log(G_0) + \mu t + \frac{1}{2} \sigma^2 t} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t + \sigma^2 t))^2}{2 \sigma^2 t}\right) \ dx \newline & = e^{\log(G_0) + \mu t + \frac{1}{2} \sigma^2 t} \newline & = G_0 e^{t(\mu + \frac{1}{2} \sigma^2)}. \newline \end{align*} $$ Similarly, we can compute $\mathbb{E}[G_t^2]$ and use $\mathrm{Var}(G_t) = \mathbb{E}[G_t^2] - (\mathbb{E}[G_t])^2$ to get the variance, $$ \begin{align*} \mathbb{E}[G_t^2] & = \mathbb{E}[e^{2 \log(G_t)}] \newline & = \int_{-\infty}^{\infty} e^{2x} \ \mathcal{N}(x; \log(G_0) + \mu t, \sigma^2 t) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t))^2}{2 \sigma^2 t} + 2x\right) \ dx \newline & = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t + 2 \sigma^2 t))^2}{2 \sigma^2 t} + 2\log(G_0) + 2\mu t + 2 \sigma^2 t\right) \ dx \newline & = e^{2\log(G_0) + 2\mu t + 2 \sigma^2 t} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi \sigma^2 t}} \exp\left(-\frac{(x - (\log(G_0) + \mu t + 2 \sigma^2 t))^2}{2 \sigma^2 t}\right) \ dx \newline & = e^{2\log(G_0) + 2\mu t + 2 \sigma^2 t} \newline & = G_0^2 e^{t(2 \mu + 2 \sigma^2)}. \newline \end{align*} $$ Thus, $$ \begin{align*} \mathrm{Var}(G_t) & = G_0^2 e^{t(2 \mu + 2 \sigma^2)} - \left(G_0 e^{t(\mu + \frac{1}{2} \sigma^2)}\right)^2 \newline & = G_0^2 e^{t(2\mu + \sigma^2)} (e^{\sigma^2 t} - 1). \newline \end{align*} $$ $_\blacksquare$
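A Monte Carlo sanity check of these two formulas in R (the parameter values are arbitrary):

```r
set.seed(1)
G0 <- 1; mu <- 0.1; sigma <- 0.4; t <- 2
Gt <- G0 * exp(mu * t + sigma * rnorm(1e6, mean = 0, sd = sqrt(t)))
c(mean(Gt), G0 * exp(t * (mu + sigma^2 / 2)))                            # ~ E[G_t]
c(var(Gt),  G0^2 * exp(t * (2 * mu + sigma^2)) * (exp(sigma^2 * t) - 1)) # ~ Var(G_t)
```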
Intuition: Stock Options
A (European) stock option is a right (but not an obligation) to buy a stock at a given time $t$ in the future for a given price $K$.
How much can you expect to earn from a stock option at that future time?
We get that (we will derive this), $$ \mathbb{E}[\max(G_t - K, 0)] = G_0 e^{t(\mu + \frac{\sigma^2}{2})} P\left(B_1 > \frac{\beta - \sigma t}{\sqrt{t}}\right) - K P\left(B_1 > \frac{\beta}{\sqrt{t}}\right), $$ where $\beta = \frac{\log(K / G_0) - \mu t}{\sigma}$.
Derivation
Firstly, we need to prove the algebraic identity, $$ e^{\sigma x} \mathcal{N}(x; 0, t) = e^{\frac{\sigma^2 t}{2}} \mathcal{N}(x; \sigma t, t). $$
Proof
We have, $$ \begin{align*} e^{\sigma x} \mathcal{N}(x; 0, t) & = \frac{1}{\sqrt{2 \pi t}} \exp\left(-\frac{x^2}{2 t} + \sigma x\right) \newline & = \frac{1}{\sqrt{2 \pi t}} \exp\left(-\frac{(x - \sigma t)^2}{2 t} + \frac{\sigma^2 t}{2}\right) \newline & = e^{\frac{\sigma^2 t}{2}} \mathcal{N}(x; \sigma t, t). \ _\blacksquare \end{align*} $$ Then, defining $\beta = \frac{\log(K / G_0) - \mu t}{\sigma}$ we get, $$ \begin{align*} \mathbb{E}[\max(G_t - K, 0)] & \coloneqq \mathbb{E}[\max(G_0 e^{\mu t + \sigma B_t} - K, 0)] \newline & = \int_{-\infty}^{\infty} \max(G_0 e^{\mu t + \sigma x} - K, 0) \ \mathcal{N}(x; 0, t) \ dx \newline & = \int_{\beta}^{\infty} (G_0 e^{\mu t + \sigma x} - K) \ \mathcal{N}(x; 0, t) \ dx \newline & = G_0 e^{\mu t} \int_{\beta}^{\infty} e^{\sigma x} \ \mathcal{N}(x; 0, t) \ dx - K \int_{\beta}^{\infty} \mathcal{N}(x; 0, t) \ dx \newline & = G_0 e^{t(\mu + \frac{\sigma^2}{2})} \int_{\beta}^{\infty} \mathcal{N}(x; \sigma t, t) \ dx - K \int_{\beta}^{\infty} \mathcal{N}(x; 0, t) \ dx \newline & = G_0 e^{t(\mu + \frac{\sigma^2}{2})} P\left(B_1 > \frac{\beta - \sigma t}{\sqrt{t}}\right) - K P\left(B_1 > \frac{\beta}{\sqrt{t}}\right). \ _\blacksquare \end{align*} $$
Definition: Martingale
A stochastic process $\{Y_t\}_{t \geq 0}$ is a martingale if for $t \geq 0$,
- $\mathbb{E}[Y_t \mid Y_r, 0 \leq r \leq s] = Y_s$ for $0 \leq s \leq t$.
- $\mathbb{E}[|Y_t|] < \infty$.

A Brownian motion $B_t$ is a martingale.
Proof
We have, $$ \begin{align*} \mathbb{E}[B_t \mid B_r, 0 \leq r \leq s] & = \mathbb{E}[B_t - B_s + B_s \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[B_t - B_s \mid B_r, 0 \leq r \leq s] + \mathbb{E}[B_s \mid B_r, 0 \leq r \leq s] \newline & = 0 + B_s \newline & = B_s. \end{align*} $$ and, $$ \begin{align*} \mathbb{E}[|B_t|] & = \int_{-\infty}^{\infty} |x| \ \mathcal{N}(x; 0, t) \ dx \newline & = 2 \int_{0}^{\infty} x \ \mathcal{N}(x; 0, t) \ dx \newline & = 2 \int_{0}^{\infty} x \ \frac{1}{\sqrt{2 \pi t}} \exp\left(-\frac{x^2}{2 t}\right) \ dx \newline & = 2 \cdot \frac{1}{\sqrt{2 \pi t}} \cdot t \newline & = \sqrt{\frac{2 t}{\pi}} < \infty. \newline \end{align*} $$
Further, one can define a martingale with respect to other stochastic processes.
$\{Y_t\}_{t \geq 0}$ is a martingale with respect to $\{X_t\}_{t \geq 0}$ if for $t \geq 0$,
- $\mathbb{E}[Y_t \mid X_r, 0 \leq r \leq s] = Y_s$ for $0 \leq s \leq t$.
- $\mathbb{E}[|Y_t|] < \infty$.
Derivation: Geometric Brownian Motion can be a Martingale
Let $G_t \coloneqq G_0 e^{\mu t + \sigma B_t}$ be a geometric Brownian motion. We can derive that, $$ \begin{align*} \mathbb{E}[G_t \mid B_r, 0 \leq r \leq s] & \coloneqq \mathbb{E}[G_0 e^{\mu t + \sigma B_t} \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[G_0 e^{\mu(t - s) + \sigma (B_t - B_s)} \mid B_r, 0 \leq r \leq s] \ e^{\mu s + \sigma B_s} \newline & = \mathbb{E}[G_{t - s}] e^{\mu s + \sigma B_s} \newline & = G_0 e^{(t - s)(\mu + \frac{\sigma^2}{2})} e^{\mu s + \sigma B_s} \newline & = G_s e^{(t - s)(\mu + \frac{\sigma^2}{2})}. \newline \end{align*} $$ We see that $G_t$ is a martingale with respect to $B_t$ if and only if $\mu + \frac{\sigma^2}{2} = 0$, i.e., $\mu = -\frac{\sigma^2}{2}$.
Examples
Example: Toy Example of MCMC
Consider the Markov chain $X_0, X_1, \ldots$ with states $\{0, 1, 2\}$ and with, $$ P = \begin{bmatrix} 0.99 & 0.01 & 0 \newline 0 & 0.9 & 0.1 \newline 0.2 & 0 & 0.8 \end{bmatrix} $$ One can find that the limiting distribution is $v = (\frac{20}{23}, \frac{2}{23}, \frac{1}{23})$.
Consider the function $r(x) = x^5$. If $X$ is a random variable with the limiting distribution, $$ \mathbb{E}[r(X)] = 0^5 \cdot \frac{20}{23} + 1^5 \cdot \frac{2}{23} + 2^5 \cdot \frac{1}{23} = \frac{34}{23} = 1.4783 $$ If $Y_1, \ldots, Y_n$ are all i.i.d variables with the limiting distribution, we can check numerically that, $$ \lim_{n \to \infty} \frac{r(Y_1) + r(Y_2) + \ldots + r(Y_n)}{n} = 1.4783 $$ We also get for $X_0, X_1, \ldots$ that, $$ \lim_{n \to \infty} \frac{r(X_1) + r(X_2) + \ldots + r(X_n)}{n} = 1.4783 $$ but in this case the limit is approached more slowly.
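A short R simulation of this chain (state space coded as 0, 1, 2) illustrates the ergodic average converging to $\frac{34}{23}$:

```r
set.seed(1)
P <- matrix(c(0.99, 0.01, 0,
              0,    0.9,  0.1,
              0.2,  0,    0.8), nrow = 3, byrow = TRUE)
n <- 1e5
x <- integer(n)
x[1] <- 0
for (i in 2:n) x[i] <- sample(0:2, 1, prob = P[x[i - 1] + 1, ])
mean(x^5)   # close to 34/23 = 1.4783 (convergence is slow, as noted above)
```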
Example: Counting Zeroes
How many zeroes does $B_t$ have in the interval $(0, s)$?
If $L_s$ is the last zero in $(0, s)$, we saw that, $$ \frac{L_s}{s} \sim \mathrm{Beta}\left(\frac{1}{2}, \frac{1}{2}\right), $$ which means that, with probability one, there exists a zero in the interval $(0, s)$ and for the last zero $L_s$ we have $0 < L_s < s$.
Repeating the argument, we have that $0 < L_{L_s} < L_s$ with probability one, and thus there exists another zero in $(0, L_s) \subset (0, s)$.
The conclusion is that, with probability one, there is an infinite number of zeroes in $(0, s)$ for any $s > 0$.
Example: Simulating from the Ising Model using Perfect Sampling
We saw above how to use Gibbs sampling to simulate from the Ising model. We then simulated from the conditional distributions, using, $$ \pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) = \ldots = \frac{1}{1 + \exp(-2 \beta \sum_{v \sim w} \sigma_w)} $$ Now, we extend this method to a perfect sampling algorithm.
Define the partial ordering on configurations $\sigma, \tau$ by defining, $$ \sigma \leq \tau \iff \sigma_v \leq \tau_v \text{ for all vertices } v $$ We have minimal and maximal elements $m = (-1, \ldots, -1)$ and $M = (1, \ldots, 1)$.
Note
If $\sigma \leq \tau$, then for all $v$, $\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) \leq \pi(\tau_v^{\star} = 1 \mid \tau_{-v})$.
Defining $g(\sigma, U)_v = 2 I (U \leq \pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}) ) - 1$ makes $g$ monotone.
Given an Ising model with $\beta > 0$, start with choosing an integer $n > 0$ and setting configurations, $$ \begin{align*} \sigma^{(-n)} & = m = (-1, \ldots, -1) \newline \tau^{(-n)} & = M = (1, \ldots, 1) \newline \end{align*} $$
Algorithm: Perfect Sampling for the Ising Model
- For $i = -n + 1, \ldots, 0$: set $\sigma^{(i)} = \sigma^{(i - 1)}$, $\tau^{(i)} = \tau^{(i - 1)}$ and then, looping through all vertices $v$:
  - Compute $\pi(\sigma_v^{\star} = 1 \mid \sigma_{-v}^{(i)})$ and $\pi(\tau_v^{\star} = 1 \mid \tau_{-v}^{(i)})$.
  - Simulate $U \sim \mathrm{Uniform}(0, 1)$.
  - Update $\sigma_v^{(i)} = g(\sigma^{(i)}, U)_v$ and $\tau_v^{(i)} = g(\tau^{(i)}, U)_v$.
- If $\sigma^{(0)} = \tau^{(0)}$, this is the result. Otherwise, double $n$ and repeat.
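A minimal R sketch of this algorithm on a small grid (the grid size, $\beta$, the free boundary conditions, and all function names below are illustrative choices, not prescribed by the notes). The randomness for the most recent sweeps is reused when $n$ is doubled, as coupling from the past requires:

```r
set.seed(1)
L <- 8; beta <- 0.4

neighbors <- function(i, j) {
  nb <- rbind(c(i - 1, j), c(i + 1, j), c(i, j - 1), c(i, j + 1))
  nb[nb[, 1] >= 1 & nb[, 1] <= L & nb[, 2] >= 1 & nb[, 2] <= L, , drop = FALSE]
}

# One full Gibbs sweep, using the same uniforms U (an L x L matrix) whatever the start state.
sweep_chain <- function(sigma, U) {
  for (i in 1:L) for (j in 1:L) {
    s  <- sum(sigma[neighbors(i, j)])
    p1 <- 1 / (1 + exp(-2 * beta * s))           # pi(sigma_v = 1 | rest)
    sigma[i, j] <- if (U[i, j] <= p1) 1 else -1  # monotone update g(sigma, U)_v
  }
  sigma
}

W <- list()   # W[[j]] holds the uniforms for the sweep ending j - 1 steps before time 0
n <- 1
repeat {
  while (length(W) < n) W[[length(W) + 1]] <- matrix(runif(L * L), L, L)
  lo <- matrix(-1, L, L); hi <- matrix(1, L, L)  # start from m and M at time -n
  for (j in n:1) { lo <- sweep_chain(lo, W[[j]]); hi <- sweep_chain(hi, W[[j]]) }
  if (all(lo == hi)) break                       # coalescence: lo is a perfect sample
  n <- 2 * n
}
mean(lo)   # average spin of the perfect sample
```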
Example: Using a Binomial Likelihood
Assume the offspring distribution is $\mathrm{Binomial}(N, p)$ for some parameter $p$ and a fixed (known) $N$. By definition, we get the likelihood, $$ \pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \prod_{i = 1}^n \mathrm{Binomial}(y_i; N, p) $$ A possibility is to use a prior $p \sim \mathrm{Beta}(\alpha, \beta)$. Writing $S = \sum_{i = 1}^n y_i$, we get the posterior, $$ p \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + S, \beta + nN - S), $$ where $\mathcal{D} = \{y_1, y_2, \ldots, y_n\}$ is the data.
More generally, if $\pi(p) = f(p)$ for any positive function integrating to 1 on $[0, 1]$, we get the posterior, $$ \pi(p \mid \mathcal{D}) \propto_p \mathrm{Beta}(p; S + 1, nN - S + 1) f(p). $$ We can then for example compute numerically the posterior probability that the branching process is supercritical, i.e., that $P(p > \frac{1}{N} \mid \mathcal{D})$, with, $$ \int_{\frac{1}{N}}^1 \pi(p \mid \mathcal{D}) \ dp = \frac{\int_{\frac{1}{N}}^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}{\int_0^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}. $$
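A numerical sketch of this computation in R (the data $n$ and $S$, the value of $N$, and the prior $f(p) = 2p$ are all illustrative):

```r
N <- 4; n <- 10; S <- 18                       # hypothetical data: n individuals, S offspring in total
f <- function(p) 2 * p                         # hypothetical prior density on [0, 1]
post_unnorm <- function(p) dbeta(p, S + 1, n * N - S + 1) * f(p)
num <- integrate(post_unnorm, 1 / N, 1)$value
den <- integrate(post_unnorm, 0, 1)$value
num / den                                      # P(p > 1/N | D)
```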
Example: Using a Multinomial Likelihood
Assume there is a maximum of $N$ offspring and that now $p = (p_0, p_1, \ldots, p_N)$ is an (unknown) probability vector such that $p_i$ is the probability of $i$ offspring. By definition, we get the likelihood, $$ \pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \mathrm{Multinomial}(c; p), $$ where $c = (c_0, c_1, \ldots, c_N)$ is the vector of counts in the data of cases with $0, \ldots, N$ offspring, respectively.
If we use a prior $p \sim \mathrm{Dirichlet}(\alpha)$, where $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_N)$ is a vector of pseudocounts, we get the posterior, $$ p \mid \mathcal{D} \sim \mathrm{Dirichlet}(\alpha + c), $$ with expectation, $$ \mathbb{E}[p_i \mid \mathcal{D}] = \frac{\alpha_i + c_i}{\sum_{j = 0}^N (\alpha_j + c_j)}. $$
Example: Simplest Birth-and-Death Process
The simplest example of a birth-and-death process is when $\lambda_i = \lambda$ and $\mu_i = \mu$ for all $i$, with $\lambda < \mu$. In this case, the stationary distribution is given by, $$ \begin{align*} v_k & = v_0 \left(\frac{\lambda}{\mu}\right)^k, \newline v_0 & = \left(\sum_{k = 0}^{\infty} \left(\frac{\lambda}{\mu}\right)^k\right)^{-1} = \frac{1}{1 + \frac{\lambda}{\mu} + \left(\frac{\lambda}{\mu}\right)^2 + \ldots} = \frac{1}{\frac{1}{1 - \frac{\lambda}{\mu}}} = 1 - \frac{\lambda}{\mu}. \end{align*} $$ This is a geometric distribution (starting at $0$) with parameter $1 - \frac{\lambda}{\mu}$, i.e., $\mathrm{Geometric}\left(1 - \frac{\lambda}{\mu}\right)$. Thus, the long-term average value of $X_t$ is, $$ \mathbb{E}[X_t] = \frac{\frac{\lambda}{\mu}}{1 - \frac{\lambda}{\mu}} = \frac{\lambda}{\mu - \lambda}. $$
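In R (with arbitrary rates $\lambda = 2 < \mu = 3$), the geometric form of the stationary distribution and the long-run mean can be checked directly:

```r
lambda <- 2; mu <- 3
k <- 0:50
v <- (1 - lambda / mu) * (lambda / mu)^k   # v_k = v_0 (lambda / mu)^k
sum(v)                                     # ~ 1 (the mass beyond k = 50 is negligible here)
sum(k * v)                                 # ~ lambda / (mu - lambda) = 2
```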
Example: Probability Computation with Brownian Motion
Show that $B_1 + B_3 + 2 B_7 \sim \mathcal{N}(0, 50)$.
Solution
We can write, $$ \begin{align*} B_1 + B_3 + 2 B_7 & = B_1 + (B_3 - B_1) + B_1 + 2 (B_7 - B_3) + 2 B_3 \newline & = 2 B_1 + (B_3 - B_1) + 2 (B_7 - B_3) + 2(B_3 - B_1) + 2 B_1 \newline & = 4 B_1 + 3 (B_3 - B_1) + 2 (B_7 - B_3). \end{align*} $$ Looking at each term separately, we have, $$ \begin{align*} \mathbb{E}[4 B_1] & = 0 \newline \mathrm{Var}(4 B_1) & = 4^2 \mathrm{Var}(B_1) = 4^2 \end{align*} $$ and, $$ \begin{align*} \mathbb{E}[3 (B_3 - B_1)] & = 0 \newline \mathrm{Var}(3 (B_3 - B_1)) & = 3^2 \mathrm{Var}(B_3 - B_1) = 3^2 \cdot 2 \end{align*} $$ and, $$ \begin{align*} \mathbb{E}[2 (B_7 - B_3)] & = 0 \newline \mathrm{Var}(2 (B_7 - B_3)) & = 2^2 \mathrm{Var}(B_7 - B_3) = 2^2 \cdot 4 \end{align*} $$ Since the increments are independent, we have, $$ \begin{align*} \mathbb{E}[B_1 + B_3 + 2 B_7] & = 0 \newline \mathrm{Var}(B_1 + B_3 + 2 B_7) & = 4^2 + 3^2 \cdot 2 + 2^2 \cdot 4 = 50. \newline \end{align*} $$ Thus, $B_1 + B_3 + 2 B_7 \sim \mathcal{N}(0, 50)$. $_\blacksquare$
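A quick Monte Carlo check of this variance in R, building $B_3$ and $B_7$ from independent increments:

```r
set.seed(1)
n <- 1e6
B1 <- rnorm(n, 0, 1)
B3 <- B1 + rnorm(n, 0, sqrt(2))   # independent increment B_3 - B_1 ~ N(0, 2)
B7 <- B3 + rnorm(n, 0, sqrt(4))   # independent increment B_7 - B_3 ~ N(0, 4)
var(B1 + B3 + 2 * B7)             # close to 50
```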
Example: Conditional Distribution with Brownian Motion
Show that $P(B_2 > 0 \mid B_1 = 1) = 0.8413$.
Solution
We can rewrite the conditional probability as,
$$
P(B_2 - B_1 > 0 - 1 \mid B_1 = 1).
$$
Since we have independent increments, $B_2 - B_1$ is independent of $B_1$.
$$
P(B_2 - B_1 > -1)
$$
Further, $B_2 - B_1 \sim \mathcal{N}(0, 1)$.
Thus, we have,
$$
P(B_2 - B_1 > -1) = P\left(Z > -1\right) = 0.8413,
$$
where $Z \sim \mathcal{N}(0, 1)$. $_\blacksquare$ (we can compute this with 1 - pnorm(-1, 0, 1) in R).
Example: Covariance Computation with Brownian Motion
Show that $\mathrm{Cov}(B_s, B_t) = \min(s, t)$.
Solution
Without loss of generality, assume that $s \leq t$. We can write, $$ \begin{align*} \mathrm{Cov}(B_s, B_t) & = \mathrm{Cov}(B_s, B_t - B_s + B_s) \newline & = \mathrm{Cov}(B_s, B_t - B_s) + \mathrm{Cov}(B_s, B_s) \newline & = 0 + \mathrm{Var}(B_s) \newline & = s \newline \end{align*} $$ Conversely, if $t \leq s$, we would get $\mathrm{Cov}(B_s, B_t) = t$. Thus, we have, $$ \mathrm{Cov}(B_s, B_t) = \min(s, t) \ _\blacksquare . $$
Example
What is the probability that $M_3 > 5$?
Solution
We have,
$$
\begin{align*}
P(M_3 > 5) & = P(|B_3| > 5) \newline
& = 2 P(B_3 > 5) \newline
& = 2 P\left(Z > \frac{5}{\sqrt{3}}\right) \newline
& \approx 0.0039.
\end{align*}
$$
where $Z \sim \mathcal{N}(0, 1)$. $_\blacksquare$ (we can compute this with 2 * (1 - pnorm(5 / sqrt(3), 0, 1)) in R).
Example
Find $t$ such that $P(M_t \leq 4) = 0.9$.
Solution
We have,
$$
\begin{align*}
P(M_t \leq 4) & = P(|B_t| \leq 4) \newline
& = 1 - 2 P(B_t > 4) \newline
& = 1 - 2 P\left(Z > \frac{4}{\sqrt{t}}\right). \newline
\end{align*}
$$
Setting this equal to $0.9$ we get,
$$
P\left(Z > \frac{4}{\sqrt{t}}\right) = 0.05.
$$
Looking up in the standard normal table (or using qnorm(0.95, 0, 1) in R) we get $\frac{4}{\sqrt{t}} = 1.645$ and thus $t = 5.92$. $_\blacksquare$
Example: Stock Prices
From what we have seen so far, one can model the price of a stock with:
- Use a continuous-time stochastic model.
- Consider the factor with which it changes, not the differences in prices.
- Consider the log-normal distribution for such factors.
- Use a parameter for the trend of the price (drift), and one for the volatility (variance) of the price.
- Make a Markov assumption (although we should reflect on this choice).
This leads to using a geometric Brownian motion as a model, $$ G_t = G_0 e^{\mu t + \sigma B_t}. $$ In this context $\sigma$ is called the volatility of the stock.
Example
A stock price is modeled with $G_0 = 67.3$, $\mu = 0.08$, and $\sigma = 0.3$. What is the probability that the price is above 100 after 3 years?
Solution
We want to compute,
$$
\begin{align*}
P(G_3 > 100) & \coloneqq P\left(G_0 e^{\mu \cdot 3 + \sigma B_3} > 100\right) \newline
& = P\left(e^{\mu \cdot 3 + \sigma B_3} > \frac{100}{G_0}\right) \newline
& = P\left(\mu \cdot 3 + \sigma B_3 > \log\left(\frac{100}{G_0}\right)\right) \newline
& = P\left(B_3 > \frac{\log\left(\frac{100}{G_0}\right) - \mu \cdot 3}{\sigma}\right) \newline
\end{align*}
$$
Since $B_3 \sim \mathcal{N}(0, 3)$ we can compute this as 1 - pnorm((log(100 / 67.3) - 0.08 * 3) / 0.3, mean = 0, sd = sqrt(3)) in R.
Example
A stock price is modeled with $G_0 = 67.3$, $\mu = 0.08$, and $\sigma = 0.3$. What is the expected payoff from an option to buy the stock at 100 in 3 years?
Solution
We want to compute,
$$
\begin{align*}
\mathbb{E}[\max(G_3 - 100, 0)] & = 67.3 e^{3(0.08 + \frac{0.3^2}{2})} P\left(B_1 > \frac{\beta - 0.3 \cdot 3}{\sqrt{3}}\right) - 100 P\left(B_1 > \frac{\beta}{\sqrt{3}}\right) \newline
\end{align*}
$$
where,
$$
\beta = \frac{\log(100 / 67.3) - 0.08 \cdot 3}{0.3} \approx 0.520.
$$
Thus, we can compute this as 67.3 * exp(3 * (0.08 + 0.3^2 / 2)) * (1 - pnorm((0.520 - 0.3 * 3) / sqrt(3), mean = 0, sd = 1)) - 100 * (1 - pnorm(0.520 / sqrt(3), mean = 0, sd = 1)) in R.
Example
Let $Y_t \coloneqq B_t^2 - t$ for $t \geq 0$. Then, $\{Y_t\}_{t \geq 0}$ is a martingale with respect to the Brownian motion $\{B_t\}_{t \geq 0}$.
Proof
We have, $$ \begin{align*} \mathbb{E}[Y_t \mid B_r, 0 \leq r \leq s] & = \mathbb{E}[B_t^2 - t \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[(B_t - B_s + B_s)^2 - t \mid B_r, 0 \leq r \leq s] \newline & = \mathbb{E}[(B_t - B_s)^2 \mid B_r, 0 \leq r \leq s] + 2 B_s \mathbb{E}[B_t - B_s \mid B_r, 0 \leq r \leq s] + B_s^2 - t \newline & = (t - s) + 0 + B_s^2 - t \newline & = B_s^2 - s \newline & = Y_s. \end{align*} $$ Further, $$ \begin{align*} \mathbb{E}[|Y_t|] & = \mathbb{E}[|B_t^2 - t|] \newline & \leq \mathbb{E}[B_t^2] + t \newline & = \mathrm{Var}(B_t) + (\mathbb{E}[B_t])^2 + t \newline & = t + 0 + t \newline & = 2 t < \infty. \newline \end{align*} $$ Thus, $\{Y_t\}_{t \geq 0}$ is a martingale with respect to $\{B_t\}_{t \geq 0}$. $_\blacksquare$
Example: Discounting Future Values of Stocks & The Black-Scholes Formula for Option Pricing
When making investments, there is always a range of choices, some of which are sometimes called “risk-free”. Such investments may pay a fixed interest.
When interests are compounded frequently, a reasonable model is that an investment of $G_0$ has a value $G_0 e^{rt}$ after time $t$, where $r$ is the “risk-free” investment rate of return.
A common way to take this alternative into account is to instead “discount” all other investments with the factor $e^{-rt}$.
For example, the discounted value of a stock can be modeled as, $$ e^{-rt} G_t = e^{-rt} G_0 e^{\mu t + \sigma B_t} = G_0 e^{(\mu - r) t + \sigma B_t}. $$ A possible assumption about the trend $\mu$ of a stock price is that the discounted value behaves as a martingale with respect to the Brownian motion $B_t$.
Thus, we get, $$ \mu - r + \frac{\sigma^2}{2} = 0, \quad \text{i.e.,} \quad \mu = r - \frac{\sigma^2}{2}. $$
The Black-Scholes formula for option pricing is based on,
- Assuming the discounted stock price is a martingale with respect to the Brownian motion.
- Using discounting when computing the value of the option.
From this, we get, $$ e^{-rt} \mathbb{E}[\max(G_t - K, 0)] = G_0 P\left(B_1 > \frac{\beta - \sigma t}{\sqrt{t}}\right) - K e^{-rt} P\left(B_1 > \frac{\beta}{\sqrt{t}}\right), $$ where $\beta = \frac{\log(K / G_0) - (r - \frac{\sigma^2}{2}) t}{\sigma}$.
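As a small R sketch (the function name and the test values below are illustrative), this discounted expected payoff, i.e., the Black-Scholes call price, can be computed directly from the formula above:

```r
black_scholes_call <- function(G0, K, t, r, sigma) {
  beta <- (log(K / G0) - (r - sigma^2 / 2) * t) / sigma
  G0 * (1 - pnorm((beta - sigma * t) / sqrt(t))) -
    K * exp(-r * t) * (1 - pnorm(beta / sqrt(t)))
}
black_scholes_call(G0 = 67.3, K = 100, t = 3, r = 0.02, sigma = 0.3)
```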