Part 10 - Parameter Inference for Stochastic Processes

Introduction

In this part, we will look at parameter inference for discrete-time finite state space Markov chains. We will start by looking at the multinomial-Dirichlet conjugacy, which can be seen as a generalization of the binomial-Beta conjugacy we saw earlier.

Further, we will look att Hidden Markov Models and inference for these. Then we will look at inference for branching processes, Poisson processes, and finally continuous-time finite state space Markov chains.

Multinomial-Dirichlet Conjugacy

Definition 1 (Multinomial Distribution)

A vector x=(x1,x2,,xk)x = (x_1, x_2, \ldots, x_k) of non-negative integers has a Multinomial distribution with parameters nn and pp, where n>0n > 0 is an integer and pp is a probability vector of length kk, if i=1kxi=n\sum_{i = 1}^k x_i = n and the probability mass function is given by,

π(xn,p)=n!x1!x2!xk!p1x1p2x2pkxk.\pi(x \mid n, p) = \frac{n!}{x_1! x_2! \ldots x_k!} p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k}.

We write xMultinomial(n,p)x \sim \mathrm{Multinomial}(n, p).

Definition 2 (Dirichlet Distribution)

A vector p=(p1,p2,,pk)p = (p_1, p_2, \ldots, p_k) of non-negative real numbers satisfying i=1kpi=1\sum_{i = 1}^k p_i = 1 has a Dirichlet distribution with parameter vector α=(α1,α2,,αk)\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k), if it has probability density function,

π(pα)=Γ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)p1α11p2α21pkαk1,\pi(p \mid \alpha) = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1},

where αi>0\alpha_i > 0 for all i=1,2,,ki = 1, 2, \ldots, k. We write pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha).

Theorem 1 (Multinomial-Dirichlet Conjugacy)

Let x=(x1,x2,,xk)x = (x_1, x_2, \ldots, x_k) be a vector of counts with xMultinomial(n,p)x \sim \mathrm{Multinomial}(n, p) and prior pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha). Then the posterior distribution of pp given xx is,

pxDirichlet(α1+x1,α2+x2,,αk+xk).p \mid x \sim \mathrm{Dirichlet}(\alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k).

Further, if pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha), then E[pi]=αij=1kαj\mathbb{E}[p_i] = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j} for i=1,2,,ki = 1, 2, \ldots, k.

Proof (Multinomial-Dirichlet Conjugacy)

Let π(xp)=Multinomial(x;n,p)\pi(x \mid p) = \mathrm{Multinomial}(x; n, p) and π(p)=Dirichlet(p;α)\pi(p) = \mathrm{Dirichlet}(p; \alpha), respectively. Then,

π(px)π(xp)π(p)=Multinomial(x;n,p)Dirichlet(p;α)pp1x1p2x2pkxkp1α11p2α21pkαk1pp1α1+x11p2α2+x21pkαk+xk1=Dirichlet(p;α1+x1,α2+x2,,αk+xk) \begin{align*} \pi(p \mid x) & \propto \pi(x \mid p) \pi(p) \newline & = \mathrm{Multinomial}(x; n, p) \mathrm{Dirichlet}(p; \alpha) \newline & \propto_p p_1^{x_1} p_2^{x_2} \ldots p_k^{x_k} \cdot p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \newline & \propto_p p_1^{\alpha_1 + x_1 - 1} p_2^{\alpha_2 + x_2 - 1} \ldots p_k^{\alpha_k + x_k - 1} \newline & = \mathrm{Dirichlet}(p; \alpha_1 + x_1, \alpha_2 + x_2, \ldots, \alpha_k + x_k) \ _\blacksquare \end{align*}
Proof (Expectation of Dirichlet Distribution)

Let pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha). Then,

E[pi]=piΓ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)p1α11p2α21pkαk1 dp1dp2dpk=Γ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)p1α11p2α21piαipkαk1 dp1dp2dpk=Γ(α1+α2++αk)Γ(α1)Γ(α2)Γ(αk)Γ(α1)Γ(α2)Γ(αi+1)Γ(αk)Γ(α1+α2++αk+1)=Γ(αi+1)Γ(αi)Γ(α1+α2++αk)Γ(α1+α2++αk+1)=αi1α1+α2++αk=αij=1kαj \begin{align*} \mathbb{E}[p_i] & = \int \cdots \int p_i \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \int \cdots \int p_1^{\alpha_1 - 1} p_2^{\alpha_2 - 1} \ldots p_i^{\alpha_i} \ldots p_k^{\alpha_k - 1} \ dp_1 dp_2 \ldots dp_k \newline & = \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_k)} \cdot \frac{\Gamma(\alpha_1) \Gamma(\alpha_2) \ldots \Gamma(\alpha_i + 1) \ldots \Gamma(\alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \frac{\Gamma(\alpha_i + 1)}{\Gamma(\alpha_i)} \cdot \frac{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \ldots + \alpha_k + 1)} \newline & = \alpha_i \cdot \frac{1}{\alpha_1 + \alpha_2 + \ldots + \alpha_k} \newline & = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j} \ _\blacksquare \end{align*}
Intuition (Predictions for The Multinomial-Dirichlet Model)

If pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha) and xpMultinomial(n,p)x \mid p \sim \mathrm{Multinomial}(n, p), then the predictive distribution is given by,

π(x)=n!x1!x2!xk!Γ(α1+x1)Γ(α1)Γ(α2+x2)Γ(α2)Γ(αk+xk)Γ(αk)Γ(i=1kαi)Γ(n+i=1kαi).\pi(x) = \frac{n!}{x_1! x_2! \ldots x_k!} \cdot \frac{\Gamma(\alpha_1 + x_1)}{\Gamma(\alpha_1)} \cdot \frac{\Gamma(\alpha_2 + x_2)}{\Gamma(\alpha_2)} \cdots \frac{\Gamma(\alpha_k + x_k)}{\Gamma(\alpha_k)} \cdot \frac{\Gamma(\sum_{i = 1}^k \alpha_i)}{\Gamma(n + \sum_{i = 1}^k \alpha_i)}.

For example, if eie_i is the vector with 1 at position ii and zeros elsewhere, then π(x=ei)=αij=1kαj\pi(x = e_i) = \frac{\alpha_i}{\sum_{j = 1}^k \alpha_j}. Further, if xnewx_{\text{new}} is a vector of new counts, then, as pxMultinomial(x+α)p \mid x \sim \mathrm{Multinomial}(x + \alpha), we get,

π(xnew=eix)=αi+xij=1kαj+n.\pi(x_{\text{new}} = e_i \mid x) = \frac{\alpha_i + x_i}{\sum_{j = 1}^k \alpha_j + n}.

The αi\alpha_i in the prior are often called pseudocounts, as they can be interpreted as prior counts added to the observed counts.

Hidden Markov Models (HMMs)

Definition 3 (Hidden Markov Model)

A Hidden Markov Model (HMMs) consists of:

  • A Markov chain X0,,Xn,X_0, \ldots, X_n, \ldots and,
  • another sequence Y0,,Yn,Y_0, \ldots, Y_n, \ldots such that,
P(YkY0,,Yk1,X0,,Xn)=P(YkXk).P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid X_k).

In some models we may instead have,

P(YkY0,,Yk1,X0,,Xn)=P(YkYk1,Xk).P(Y_k \mid Y_0, \ldots, Y_{k - 1}, X_0, \ldots, X_n) = P(Y_k \mid Y_{k - 1}, X_k).

Generally, Y0,,YnY_0, \ldots, Y_n are called observations and X0,,XnX_0, \ldots, X_n are called hidden states. The XiX_i’s represent the “underlying process” that the observed values YiY_i’s depend on. Further, generally the XkX_k have a finite state space.

Intuition (Inference in HMMs)

When the parameters of the HMM are known, we want to know about the values of the hidden variables XiX_i, for example,

  • What is the most likely sequence X0,X1,,XnX_0, X_1, \ldots, X_n given the data?
  • What is the probability distribution for a single XiX_i given the data? However, when the parameters of the HMM are unknown, we need to infer these from some data.

If data with all XiX_i and YiY_i known is available, inference for parameters is based on counts of transitions. The inference for the Markov chain is exactly as for the Markov chains we have looked at before. The inference for the emission probabilities, i.e., the parameters of P(YkXk)P(Y_k \mid X_k), can be done independently of the inference for the Markov chain.

Bayesian Inference for Branching Processes

Intuition (Bayesian Inference for Galton-Watson Branching Processes)

Say you have observed some data, and you want to find a Galton-Watson branching process that appropriately models the data, to then make predictions about future observations.

Recall that a branching process is characterized by the probability vector a=(a0,a1,a2,)a = (a_0, a_1, a_2, \ldots), where aia_i is the probability for ii offspring in the offspring process.

Let y1,y2,,yny_1, y_2, \ldots, y_n be the counts of offspring in nn observations of the offspring process. If aa is given, we have the likelihood,

π(y1,y2,,yna)i=1nayi.\pi(y_1, y_2, \ldots, y_n \mid a) \coloneqq \prod_{i = 1}^n a_{y_i}.

Thus, to complete the model, we need a prior on aa.

Since aa has an infinite length, and we have a finite number of observations, we need to put information from the context into the prior, to get a sensible posterior.

Example 1 (Using a Binomial Likelihood)

Assume the offspring process is Binomial(N,p)\mathrm{Binomial}(N, p) for some parameter pp and a fixed (known) NN. By definition, we get the likelihood,

π(y1,y2,,ynp)i=1nBinomial(yi;N,p)\pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \prod_{i = 1}^n \mathrm{Binomial}(y_i; N, p)

A possibility is to use a prior pBeta(α,β)p \sim \mathrm{Beta}(\alpha, \beta). Writing S=i=1nyiS = \sum_{i = 1}^n y_i, we get the posterior,

pDBeta(α+S,β+nNS).p \mid \mathcal{D} \sim \mathrm{Beta}(\alpha + S, \beta + nN - S).

where D={y1,y2,,yn}\mathcal{D} = \{y_1, y_2, \ldots, y_n\} is the data.

More generally, if π(p)=f(p)\pi(p) = f(p) for any positive function integrating to 1 on [0,1][0, 1], we get the posterior,

π(pD)pBeta(p;S+1,nNS+1)f(p).\pi(p \mid \mathcal{D}) \propto_p \mathrm{Beta}(p; S + 1, nN - S + 1) f(p).

We can then for example compute numerically the posterior probability that the branching process is supercritical, i.e., that P(p>1ND)P(p > \frac{1}{N} \mid \mathcal{D}), with,

1N1π(pD) dp=1N1Beta(p;S+1,nNS+1)f(p) dp01Beta(p;S+1,nNS+1)f(p) dp.\int_{\frac{1}{N}}^1 \pi(p \mid \mathcal{D}) \ dp = \frac{\int_{\frac{1}{N}}^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}{\int_0^1 \mathrm{Beta}(p; S + 1, nN - S + 1) f(p) \ dp}.
Example 2 (Using a Multinomial Likelihood)

Assume there is a maximum of NN offspring and that now p=(p0,p1,,pN)p = (p_0, p_1, \ldots, p_N) is an (unknown) probability vector such that pip_i is the probability of ii offspring. By definition, we get the likelihood,

π(y1,y2,,ynp)Multinomial(c;p),\pi(y_1, y_2, \ldots, y_n \mid p) \coloneqq \mathrm{Multinomial}(c; p),

where c=(c0,c1,,cN)c = (c_0, c_1, \ldots, c_N) is the vector of counts in the data of cases with 0,,N0, \ldots, N offspring, respectively.

If we use a prior pDirichlet(α)p \sim \mathrm{Dirichlet}(\alpha), where α=(α0,α1,,αN)\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_N) is a vector of pseudocounts, we get the posterior,

pDDirichlet(α+c),p \mid \mathcal{D} \sim \mathrm{Dirichlet}(\alpha + c),

with expecation,

E[piD]=αi+cij=0N(αj+cj).\mathbb{E}[p_i \mid \mathcal{D}] = \frac{\alpha_i + c_i}{\sum_{j = 0}^N (\alpha_j + c_j)}.

Bayesian Inference for Poisson Processes

Intuition (Bayesian Inference for Poisson Processes)

For a homogeneous Poisson process, we setup up a prior π(λ)\pi(\lambda) for its parameter λ\lambda, and find the posterior given the observations.

The likelihood for observing yy events in an interval of length tt is given by,

Poisson(y;λt)=exp(λt)(λt)yy!λexp(λt)λy.\mathrm{Poisson}(y; \lambda t) = \exp(-\lambda t) \frac{(\lambda t)^y}{y!} \propto_{\lambda} \exp(-\lambda t) \lambda^y.

A convenient prior to use is λGamma(α,β)\lambda \sim \mathrm{Gamma}(\alpha, \beta).

In this case, the posterior becomes,

λDGamma(α+y,β+t),\lambda \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + y, \beta + t),

and the predictive distribution for the number of observations yny_n in a different interval of length uu becomes,

π(yn)=NegativeBinomial(yn;α,ββ+u).\pi(y_n) = \mathrm{Negative-Binomial}\left(y_n; \alpha, \frac{\beta}{\beta + u}\right).

Bayesian Inference for Continuous-Time Markov Chains

Intuition (Bayesian Inference for Continuous-Time Markov Chains)

Lastyl, recall that a continuous-time Markov chain with finite state space is completely characterized by its holding times parameter vector qq and the transition matrix P~\tilde{P} of its embedded Markov chain.

Parametrizing instead with the “alarm clock” parameters qijq_{ij} gives an equivalent description of the process.

The two parts of the data can be considered independently are,

  • We learn about P~\tilde{P} from the counts of transitions between states, and
  • We learn about qq from the observed lengths of stays the process has in each state.

For P~\tilde{P} the situation is analogous to the one for discrete-time Markov chains, except that the diagonal of P~\tilde{P} must be zero, so the prior must exclude the possibility of non-zero diagonal elements.

For example, for P~1\tilde{P}_1, the first row of P~\tilde{P}, we might use the prior Dirichlet(0,1,,1)\mathrm{Dirichlet}(0, 1, \ldots, 1), i.e., a Dirichlet prior with a zero pseudocount for the first element.

The holding times in state ii are distributed as Exponential(qi)\mathrm{Exponential}(q_i). If we have observed a total holding time of hh over nn intervals, that data has likelihood proportional to ehqiqine^{-hq_i}q_i^n. Using qiGamma(α,β)q_i \sim \mathrm{Gamma}(\alpha, \beta) as prior results in the posterior qiDGamma(α+n,β+h)q_i \mid \mathcal{D} \sim \mathrm{Gamma}(\alpha + n, \beta + h).