Overview

Part 10 - Parameter Inference for Stochastic Processes

MVE550
Date: December 3, 2025
Last modified: December 7, 2025
10 min read

Introduction

In this part, we will look at parameter inference for discrete-time finite state space Markov chains. We will start by looking at the multinomial-Dirichlet conjugacy, which can be seen as a generalization of the binomial-Beta conjugacy we saw earlier.

Further, we will look at Hidden Markov Models and inference for these. Then we will look at inference for branching processes, Poisson processes, and finally continuous-time finite state space Markov chains.

Multinomial-Dirichlet Conjugacy

Definition: Multinomial Distribution

A vector $x = (x_1, x_2, \ldots, x_k)$ of non-negative integers has a Multinomial distribution with parameters $n$ and $p$, where $n > 0$ is an integer and $p$ is a probability vector of length $k$, if $\sum_{i = 1}^k x_i = n$ and the probability mass function is given by, $$ \pi(x \mid n, p) = \frac{n!}{x_1! x_2! \ldots x_k!} p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k}. $$ We write $x \sim \mathrm{Multinomial}(n, p)$.
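As a quick numerical sanity check of this formula (a minimal sketch with made-up numbers; `scipy.stats.multinomial` is used only for comparison):

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

# Hypothetical example: n = 10 trials spread over k = 3 categories.
n = 10
p = np.array([0.2, 0.5, 0.3])
x = np.array([2, 5, 3])  # non-negative counts with sum(x) == n

# The pmf written out exactly as in the definition above.
pmf_by_hand = factorial(n) / np.prod([factorial(int(xi)) for xi in x]) * np.prod(p ** x)

# The same value via scipy, as a sanity check.
print(pmf_by_hand, multinomial.pmf(x, n=n, p=p))
```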

Definition: Dirichlet Distribution

A vector $p = (p_1, p_2, \ldots, p_k)$ of non-negative real numbers satisfying $\sum_{i = 1}^k p_i = 1$ has a Dirichlet distribution with parameter vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k)$, if it has probability density function, $$ \pi(p \mid \alpha) = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1}, $$ where $\alpha_i > 0$ for all $i = 1, 2, \ldots, k$. We write $p \sim \mathrm{Dirichlet}(\alpha)$.

Theorem: Multinomial-Dirichlet Conjugacy

Let $x = (x_1, x_2, \ldots, x_k)$ be a vector of counts with $x \sim \mathrm{Multinomial}(n, p)$ and prior $p \sim \mathrm{Dirichlet}(\alpha)$. Then the posterior distribution of $p$ given $x$ is, $$ p \mid x \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k). $$ Further, if $p \sim \mathrm{Dirichlet}(\alpha)$, then $\mathbb{E}[p_i] = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j}$ for $i = 1, 2, \ldots, k$.
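A minimal sketch of this update in Python, with made-up counts and pseudocounts: the posterior parameters are simply the prior pseudocounts plus the observed counts, and the posterior mean follows from the expectation formula above.

```python
import numpy as np

# Hypothetical data: counts over k = 3 categories and a Dirichlet prior.
alpha = np.array([1.0, 1.0, 1.0])   # prior pseudocounts
x = np.array([12, 30, 8])           # observed Multinomial counts

# Conjugate update: Dirichlet(alpha) prior + Multinomial counts -> Dirichlet(alpha + x).
alpha_post = alpha + x

# Posterior mean E[p_i | x] = (alpha_i + x_i) / sum_j (alpha_j + x_j).
posterior_mean = alpha_post / alpha_post.sum()
print(posterior_mean)

# Posterior samples, e.g. for credible intervals on each p_i.
samples = np.random.default_rng(1).dirichlet(alpha_post, size=10_000)
print(np.quantile(samples, [0.025, 0.975], axis=0))
```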

Proof: Multinomial-Dirichlet Conjugacy

Let $\pi(x \mid p) = \mathrm{Multinomial}(x; n, p)$ and $\pi(p) = \mathrm{Dirichlet}(p; \alpha)$ denote the likelihood and the prior, respectively. Then, $$ \begin{align*} \pi(p \mid x) & \propto \pi(x \mid p) \pi(p) \newline & = \mathrm{Multinomial}(x; n, p) \mathrm{Dirichlet}(p; \alpha) \newline & \propto_p p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k} \cdot p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \newline & \propto_p p_1^{\alpha_1 + x_1 - 1} p_2^{\alpha_2 + x_2 - 1} \ldots p_k^{\alpha_k + x_k - 1} \newline & \propto_p \mathrm{Dirichlet}(p; \alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k) \ _\blacksquare \end{align*} $$ Since the posterior is proportional to the $\mathrm{Dirichlet}(\alpha_1 + x_1, \ldots, \alpha_k + x_k)$ density and both integrate to one, they are equal.

Proof: Expectation of Dirichlet Distribution

Let $p \sim \mathrm{Dirichlet}(\alpha)$. Then, $$ \begin{align*} \mathbb{E}[p_i] & = \int \cdots \int p_i \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \int \cdots \int p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_i^{\alpha_i} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \cdot \frac{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_i + 1) \ldots \Gamma(\alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \frac{\Gamma(\alpha_i + 1)}{\Gamma(\alpha_i)} \cdot \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \alpha_i \cdot \frac{1}{\alpha_1 + \alpha_2 + \ldots + \alpha_k} \newline & = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j} \ _\blacksquare \end{align*} $$
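The expectation formula is also easy to verify by simulation (a small sanity check with made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])  # hypothetical Dirichlet parameters

samples = rng.dirichlet(alpha, size=100_000)

# Monte Carlo estimate of E[p_i] versus the closed form alpha_i / sum(alpha).
print(samples.mean(axis=0))
print(alpha / alpha.sum())
```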

Intuition: Predictions for The Multinomial-Dirichlet Model

If $p \sim \mathrm{Dirichlet}(\alpha)$ and $x \mid p \sim \mathrm{Multinomial}(n, p)$, then the predictive distribution is given by, $$ \pi(x) = \frac{n!}{x_1! x_2! \ldots x_k!} \cdot \frac{\Gamma(\alpha_1 + x_1)}{\Gamma(\alpha_1)} \cdot \frac{\Gamma(\alpha_2 + x_2)}{\Gamma(\alpha_2)} \cdots \frac{\Gamma(\alpha_k + x_k)}{\Gamma(\alpha_k)} \cdot \frac{\Gamma(\sum_{i = 1}^k \alpha_i)}{\Gamma(n + \sum_{i = 1}^k \alpha_i)}. $$ For example, if $n = 1$ and $e_i$ is the vector with 1 at position $i$ and zeros elsewhere, then $\pi(x = e_i) = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j}$. Further, if $x_{\text{new}}$ is a single new count vector, then, as $p \mid x \sim \mathrm{Dirichlet}(\alpha + x)$, we get, $$ \pi(x_{\text{new}} = e_i \mid x) = \frac{\alpha_i + x_i}{\sum_{j = 1}^k \alpha_j + n}. $$ The $\alpha_i$ in the prior are often called pseudocounts, as they can be interpreted as prior counts added to the observed counts.
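The predictive probability can be evaluated numerically directly from the formula above; a sketch (with made-up pseudocounts) that works on the log scale via `gammaln` to avoid overflow:

```python
import numpy as np
from scipy.special import gammaln

def log_predictive(x, alpha):
    """log pi(x) for the Multinomial-Dirichlet predictive, with n = sum(x)."""
    x, alpha = np.asarray(x, dtype=float), np.asarray(alpha, dtype=float)
    n = x.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()
            + (gammaln(alpha + x) - gammaln(alpha)).sum()
            + gammaln(alpha.sum()) - gammaln(n + alpha.sum()))

# Hypothetical prior pseudocounts and a single new observation x = e_2 (k = 3).
alpha = np.array([1.0, 2.0, 3.0])
print(np.exp(log_predictive([0, 1, 0], alpha)))  # equals alpha_2 / sum(alpha) = 2/6
```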

Hidden Markov Models (HMMs)

Definition: Hidden Markov Model

A Hidden Markov Model (HMM) consists of:

  • A Markov chain $X_0, \ldots, X_n, \ldots$ and,
  • another sequence $Y_0, \ldots, Y_n, \ldots$ such that, $$ P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid X_k). $$ In some models we may instead have, $$ P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid Y_{k - 1}, X_k). $$

Generally, $Y_0, \ldots, Y_n$ are called observations and $X_0, \ldots, X_n$ are called hidden states. The $X_i$’s represent the “underlying process” that the observed values $Y_i$ depend on. Further, the $X_k$ typically have a finite state space.
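As a concrete illustration of this generative structure, the following sketch (with a made-up transition matrix `P` and emission matrix `E`) simulates a two-state hidden chain together with observations that depend only on the current hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HMM: 2 hidden states, 3 possible observation symbols.
P = np.array([[0.9, 0.1],            # transition matrix of the hidden chain X
              [0.2, 0.8]])
E = np.array([[0.7, 0.2, 0.1],       # emission probabilities P(Y_k = j | X_k = i)
              [0.1, 0.3, 0.6]])
x0 = 0                               # initial hidden state

def simulate_hmm(n_steps):
    xs, ys = [x0], [rng.choice(3, p=E[x0])]
    for _ in range(n_steps):
        xs.append(rng.choice(2, p=P[xs[-1]]))   # hidden chain: X_{k+1} | X_k
        ys.append(rng.choice(3, p=E[xs[-1]]))   # observation: Y_{k+1} | X_{k+1}
    return np.array(xs), np.array(ys)

xs, ys = simulate_hmm(10)
print(xs)
print(ys)
```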
Intuition: Inference in HMMs

When the parameters of the HMM are known, we want to know about the values of the hidden variables $X_i$, for example,

  • What is the most likely sequence $X_0, X_1, \ldots, X_n$ given the data?
  • What is the probability distribution for a single $X_i$ given the data?

However, when the parameters of the HMM are unknown, we need to infer these from some data.

If data with all $X_i$ and $Y_i$ known is available, inference for parameters is based on counts of transitions. The inference for the Markov chain is exactly as for the Markov chains we have looked at before. The inference for the emission probabilities, i.e., the parameters of $P(Y_k \mid X_k)$, can be done independently of the inference for the Markov chain.
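A minimal sketch of this fully observed case, with made-up data and discrete observations: each row of the transition matrix and each row of the emission matrix gets its own independent Dirichlet posterior, updated with the corresponding counts.

```python
import numpy as np

def dirichlet_posteriors(xs, ys, n_states, n_symbols, prior=1.0):
    """Posterior Dirichlet parameters for transition and emission rows,
    given a fully observed hidden path xs and observations ys."""
    trans_counts = np.zeros((n_states, n_states))
    emit_counts = np.zeros((n_states, n_symbols))
    for a, b in zip(xs[:-1], xs[1:]):
        trans_counts[a, b] += 1          # transitions X_k -> X_{k+1}
    for a, y in zip(xs, ys):
        emit_counts[a, y] += 1           # emissions X_k -> Y_k
    # Each row: Dirichlet(prior, ..., prior) plus the observed counts.
    return prior + trans_counts, prior + emit_counts

# Hypothetical fully observed data.
xs = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 1])
ys = np.array([0, 1, 2, 2, 1, 0, 2, 0, 0, 1])

alpha_trans, alpha_emit = dirichlet_posteriors(xs, ys, n_states=2, n_symbols=3)

# Posterior means, row by row.
print(alpha_trans / alpha_trans.sum(axis=1, keepdims=True))
print(alpha_emit / alpha_emit.sum(axis=1, keepdims=True))
```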

Bayesian Inference for Branching Processes

Intuition: Bayesian Inference for Galton-Watson Branching Processes

Say you have observed some data and want to find a Galton-Watson branching process that models it appropriately, so that you can make predictions about future observations.

Recall that a branching process is characterized by the probability vector $a = (a_0, a_1, a_2, \ldots)$, where $a_i$ is the probability of $i$ offspring in the offspring distribution.

Let $y_1, y_2, \ldots, y_n$ be the counts of offspring in $n$ observations of the offspring process. If $a$ is given, we have the likelihood, $$ \pi(y_1, y_2, \ldots, y_n \mid a) \coloneqq \prod_{i = 1}^n a_{y_i}. $$ Thus, to complete the model, we need a prior on $a$.

Since $a$ has infinite length and we only have a finite number of observations, we need to put information from the context into the prior in order to get a sensible posterior.

Example: Using a Binomial Likelihood

Assume the offspring process is $\mathrm{Binomial}(N, p)$ for some parameter $p$ and a fixed (known) $N$. By definition, we get the likelihood, $$ \pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \prod_{i = 1}^n \mathrm{Binomial}(y_i; N, p). $$ A possibility is to use a prior $p \sim \mathrm{Beta}(\alpha, \beta)$. Writing $S = \sum_{i = 1}^n y_i$, we get the posterior, $$ p \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + S, \beta + nN - S), $$ where $\mathcal{D} = \{y_1, y_2, \ldots, y_n\}$ is the data.
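A minimal numerical sketch of this update, with made-up offspring counts and $N$ assumed known:

```python
import numpy as np
from scipy.stats import beta

# Hypothetical data: n observations of a Binomial(N, p) offspring distribution.
N = 4
y = np.array([1, 0, 2, 1, 3, 1, 0, 2])
n, S = len(y), y.sum()

# Beta(alpha, beta) prior -> Beta(alpha + S, beta + n*N - S) posterior.
a0, b0 = 1.0, 1.0
a_post, b_post = a0 + S, b0 + n * N - S

post = beta(a_post, b_post)
print(post.mean())           # posterior mean of p
print(post.interval(0.95))   # 95% credible interval for p
```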

More generally, if $\pi(p) = f(p)$ for any positive function integrating to 1 on $[0, 1]$, we get the posterior, $$ \pi(p \mid \mathcal{D}) \propto_p \mathrm{Beta}(p; S + 1, nN - S + 1) f(p). $$ We can then, for example, numerically compute the posterior probability that the branching process is supercritical, i.e., $P(p > \frac{1}{N} \mid \mathcal{D})$, as, $$ \int_{\frac{1}{N}}^1 \pi(p \mid \mathcal{D}) \ dp = \frac{\int_{\frac{1}{N}}^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}{\int_0^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}. $$
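A sketch of that numerical computation, using a made-up non-conjugate prior density `f` on $[0, 1]$ and numerical integration:

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad

# Hypothetical setup: N known, data summarized by n and S = sum(y_i).
N, n, S = 4, 8, 10

def f(p):
    # Made-up non-conjugate prior density on [0, 1] (integrates to 1).
    return 2 * p

def unnormalized_posterior(p):
    # Likelihood (up to a constant) times the prior: Beta(p; S+1, nN-S+1) * f(p).
    return beta.pdf(p, S + 1, n * N - S + 1) * f(p)

numerator, _ = quad(unnormalized_posterior, 1 / N, 1)
denominator, _ = quad(unnormalized_posterior, 0, 1)
print(numerator / denominator)   # P(p > 1/N | D): posterior prob. of supercriticality
```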

Example: Using a Multinomial Likelihood

Assume there is a maximum of $N$ offspring and that now $p = (p_0, p_1, \ldots, p_N)$ is an (unknown) probability vector such that $p_i$ is the probability of $i$ offspring. By definition, we get the likelihood, $$ \pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \mathrm{Multinomial}(c; n, p), $$ where $c = (c_0, c_1, \ldots, c_N)$ is the vector of counts in the data of cases with $0, \ldots, N$ offspring, respectively.

If we use a prior $p \sim \mathrm{Dirichlet}(\alpha)$, where $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_N)$ is a vector of pseudocounts, we get the posterior, $$ p \mid \mathcal{D} \sim \mathrm{Dirichlet}(\alpha + c), $$ with expectation, $$ \mathbb{E}[p_i \mid \mathcal{D}] = \frac{\alpha_i + c_i}{\sum_{j = 0}^N (\alpha_j + c_j)}. $$
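A sketch with made-up counts: the posterior is a Dirichlet over the offspring probabilities, and quantities such as the posterior probability of supercriticality (mean number of offspring greater than 1) can be estimated by sampling from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: counts of 0, 1, ..., N offspring over the observed cases.
N = 4
c = np.array([5, 9, 6, 2, 1])        # c_0, ..., c_N
alpha = np.ones(N + 1)               # Dirichlet pseudocounts

# Posterior Dirichlet(alpha + c) and its mean.
alpha_post = alpha + c
print(alpha_post / alpha_post.sum())

# Posterior probability that the process is supercritical (mean offspring > 1).
samples = rng.dirichlet(alpha_post, size=10_000)
mean_offspring = samples @ np.arange(N + 1)
print((mean_offspring > 1).mean())
```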

Bayesian Inference for Poisson Processes

Intuition: Bayesian Inference for Poisson Processes

For a homogeneous Poisson process, we set up a prior $\pi(\lambda)$ for its parameter $\lambda$, and find the posterior given the observations.

The likelihood for observing $y$ events in an interval of length $t$ is given by, $$ \mathrm{Poisson}(y; \lambda t) = \exp(-\lambda t) \frac{(\lambda t)^y}{y!} \propto_{\lambda} \exp(-\lambda t) \lambda^y. $$ A convenient prior to use is $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$.

In this case, the posterior becomes, $$ \lambda \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + y, \beta + t). $$ More generally, if $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$, the predictive distribution for the number of events $y_{\text{new}}$ in an interval of length $u$ is, $$ \pi(y_{\text{new}}) = \mathrm{Negative-Binomial}\left(y_{\text{new}}; \alpha, \frac{\beta}{\beta + u}\right), $$ so the posterior predictive is obtained by plugging in the posterior parameters $\alpha + y$ and $\beta + t$.
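A sketch of the conjugate update and the posterior predictive with made-up data; note that `scipy.stats.nbinom(n, p)` uses the number-of-failures parameterization, which is exactly the Gamma-Poisson marginal above.

```python
import numpy as np
from scipy.stats import gamma, nbinom

# Hypothetical data: y events observed over an interval of length t.
y, t = 17, 5.0
a0, b0 = 2.0, 1.0                    # Gamma(alpha, beta) prior on the rate lambda

# Posterior Gamma(alpha + y, beta + t); scipy's gamma uses scale = 1/rate.
a_post, b_post = a0 + y, b0 + t
print(gamma(a_post, scale=1 / b_post).mean())   # posterior mean of lambda

# Posterior predictive for the count in a new interval of length u.
u = 2.0
predictive = nbinom(a_post, b_post / (b_post + u))
print(predictive.mean(), predictive.pmf(np.arange(5)))
```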

Bayesian Inference for Continuous-Time Markov Chains

Intuition: Bayesian Inference for Continuous-Time Markov Chains

Lastly, recall that a continuous-time Markov chain with finite state space is completely characterized by its vector $q$ of holding-time rates and the transition matrix $\tilde{P}$ of its embedded Markov chain.

Parametrizing instead with the “alarm clock” parameters $q_{ij}$ gives an equivalent description of the process.

The two parts of the data that can be considered independently are:

  • We learn about $\tilde{P}$ from the counts of transitions between states, and
  • We learn about $q$ from the observed holding times in each state.

For $\tilde{P}$ the situation is analogous to the one for discrete-time Markov chains, except that the diagonal of $\tilde{P}$ must be zero, so the prior must exclude the possibility of non-zero diagonal elements.

For example, for $\tilde{P}_1$, the first row of $\tilde{P}$, we might use the prior $\mathrm{Dirichlet}(0, 1, \ldots, 1)$, i.e., a Dirichlet prior with a zero pseudocount for the first element.
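In practice, a zero pseudocount simply means that entry is fixed at zero; a sketch (with made-up transition counts) of updating one row of $\tilde{P}$ in this way:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts of observed transitions out of state 1, to states 1..4.
# The diagonal entry (state 1 -> state 1) is structurally zero.
counts = np.array([0, 7, 2, 4])
alpha = np.array([0.0, 1.0, 1.0, 1.0])   # Dirichlet(0, 1, 1, 1) prior for row 1

# Posterior over the off-diagonal entries only; the diagonal stays at zero.
alpha_post = (alpha + counts)[1:]
samples_off_diag = rng.dirichlet(alpha_post, size=10_000)
row_samples = np.hstack([np.zeros((10_000, 1)), samples_off_diag])

print(row_samples.mean(axis=0))          # posterior mean of the first row of P-tilde
```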

The holding times in state $i$ are distributed as $\mathrm{Exponential}(q_i)$. If we have observed a total holding time of $h$ over $n$ intervals, that data has likelihood proportional to $e^{-hq_i}q_i^n$. Using $q_i \sim \mathrm{Gamma}(\alpha, \beta)$ as prior results in the posterior $q_i \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + n, \beta + h)$.
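A corresponding sketch for the holding-time rate of a single state, with made-up holding times:

```python
import numpy as np
from scipy.stats import gamma

# Hypothetical observed holding times (in, say, hours) for visits to state i.
holding_times = np.array([0.8, 2.1, 0.4, 1.3, 0.9])
n, h = len(holding_times), holding_times.sum()

# Gamma(alpha, beta) prior on q_i -> Gamma(alpha + n, beta + h) posterior.
a0, b0 = 1.0, 1.0
posterior = gamma(a0 + n, scale=1 / (b0 + h))

print(posterior.mean())            # posterior mean of q_i
print(posterior.interval(0.95))    # 95% credible interval for q_i
```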