In this part, we will look at parameter inference for discrete-time finite state space Markov chains.
We will start by looking at the multinomial-Dirichlet conjugacy, which can be seen as a generalization of the binomial-Beta conjugacy we saw earlier.
Further, we will look att Hidden Markov Models and inference for these.
Then we will look at inference for branching processes, Poisson processes, and finally continuous-time finite state space Markov chains.
Multinomial-Dirichlet Conjugacy
Definition 1 (Multinomial Distribution)
A vector x=(x1,x2,…,xk) of non-negative integers has a Multinomial distribution with parameters n and p, where n>0 is an integer and p is a probability vector of length k, if ∑i=1kxi=n and the probability mass function is given by,
π(x∣n,p)=x1!x2!…xk!n!p1x1p2x2…pkxk.
We write x∼Multinomial(n,p).
Definition 2 (Dirichlet Distribution)
A vector p=(p1,p2,…,pk) of non-negative real numbers satisfying ∑i=1kpi=1 has a Dirichlet distribution with parameter vector α=(α1,α2,…,αk), if it has probability density function,
where αi>0 for all i=1,2,…,k.
We write p∼Dirichlet(α).
Theorem 1 (Multinomial-Dirichlet Conjugacy)
Let x=(x1,x2,…,xk) be a vector of counts with x∼Multinomial(n,p) and prior p∼Dirichlet(α).
Then the posterior distribution of p given x is,
p∣x∼Dirichlet(α1+x1,α2+x2,…,αk+xk).
Further, if p∼Dirichlet(α), then E[pi]=∑j=1kαjαi for i=1,2,…,k.
Proof (Multinomial-Dirichlet Conjugacy)
Let π(x∣p)=Multinomial(x;n,p) and π(p)=Dirichlet(p;α), respectively.
Then,
π(p∣x)∝π(x∣p)π(p)=Multinomial(x;n,p)Dirichlet(p;α)∝pp1x1p2x2…pkxk⋅p1α1−1p2α2−1…pkαk−1∝pp1α1+x1−1p2α2+x2−1…pkαk+xk−1=Dirichlet(p;α1+x1,α2+x2,…,αk+xk)■Proof (Expectation of Dirichlet Distribution)
Let p∼Dirichlet(α).
Then,
E[pi]=∫⋯∫piΓ(α1)Γ(α2)…Γ(αk)Γ(α1+α2+…+αk)p1α1−1p2α2−1…pkαk−1dp1dp2…dpk=Γ(α1)Γ(α2)…Γ(αk)Γ(α1+α2+…+αk)∫⋯∫p1α1−1p2α2−1…piαi…pkαk−1dp1dp2…dpk=Γ(α1)Γ(α2)…Γ(αk)Γ(α1+α2+…+αk)⋅Γ(α1+α2+…+αk+1)Γ(α1)Γ(α2)…Γ(αi+1)…Γ(αk)=Γ(αi)Γ(αi+1)⋅Γ(α1+α2+…+αk+1)Γ(α1+α2+…+αk)=αi⋅α1+α2+…+αk1=∑j=1kαjαi■Intuition (Predictions for The Multinomial-Dirichlet Model)
If p∼Dirichlet(α) and x∣p∼Multinomial(n,p), then the predictive distribution is given by,
For example, if ei is the vector with 1 at position i and zeros elsewhere, then π(x=ei)=∑j=1kαjαi.
Further, if xnew is a vector of new counts, then, as p∣x∼Multinomial(x+α), we get,
π(xnew=ei∣x)=∑j=1kαj+nαi+xi.
The αi in the prior are often called pseudocounts, as they can be interpreted as prior counts added to the observed counts.
Hidden Markov Models (HMMs)
Definition 3 (Hidden Markov Model)
A Hidden Markov Model (HMMs) consists of:
A Markov chain X0,…,Xn,… and,
another sequence Y0,…,Yn,… such that,
P(Yk∣Y0,…,Yk−1,X0,…,Xn)=P(Yk∣Xk).
In some models we may instead have,
P(Yk∣Y0,…,Yk−1,X0,…,Xn)=P(Yk∣Yk−1,Xk).
Generally, Y0,…,Yn are called observations and X0,…,Xn are called hidden states.
The Xi’s represent the “underlying process” that the observed values Yi’s depend on.
Further, generally the Xk have a finite state space.
Intuition (Inference in HMMs)
When the parameters of the HMM are known, we want to know about the values of the hidden variables Xi, for example,
What is the most likely sequence X0,X1,…,Xn given the data?
What is the probability distribution for a single Xi given the data?
However, when the parameters of the HMM are unknown, we need to infer these from some data.
If data with all Xi and Yi known is available, inference for parameters is based on counts of transitions.
The inference for the Markov chain is exactly as for the Markov chains we have looked at before.
The inference for the emission probabilities, i.e., the parameters of P(Yk∣Xk), can be done independently of the inference for the Markov chain.
Bayesian Inference for Branching Processes
Intuition (Bayesian Inference for Galton-Watson Branching Processes)
Say you have observed some data, and you want to find a Galton-Watson branching process that appropriately models the data, to then make predictions about future observations.
Recall that a branching process is characterized by the probability vector a=(a0,a1,a2,…), where ai is the probability for i offspring in the offspring process.
Let y1,y2,…,yn be the counts of offspring in n observations of the offspring process. If a is given, we have the likelihood,
π(y1,y2,…,yn∣a):=i=1∏nayi.
Thus, to complete the model, we need a prior on a.
Since a has an infinite length, and we have a finite number of observations, we need to put information from the context into the prior, to get a sensible posterior.
Example 1 (Using a Binomial Likelihood)
Assume the offspring process is Binomial(N,p) for some parameter p and a fixed (known) N.
By definition, we get the likelihood,
π(y1,y2,…,yn∣p):=i=1∏nBinomial(yi;N,p)
A possibility is to use a prior p∼Beta(α,β). Writing S=∑i=1nyi, we get the posterior,
p∣D∼Beta(α+S,β+nN−S).
where D={y1,y2,…,yn} is the data.
More generally, if π(p)=f(p) for any positive function integrating to 1 on [0,1], we get the posterior,
π(p∣D)∝pBeta(p;S+1,nN−S+1)f(p).
We can then for example compute numerically the posterior probability that the branching process is supercritical, i.e., that P(p>N1∣D), with,
∫N11π(p∣D)dp=∫01Beta(p;S+1,nN−S+1)f(p)dp∫N11Beta(p;S+1,nN−S+1)f(p)dp.Example 2 (Using a Multinomial Likelihood)
Assume there is a maximum of N offspring and that now p=(p0,p1,…,pN) is an (unknown) probability vector such that pi is the probability of i offspring.
By definition, we get the likelihood,
π(y1,y2,…,yn∣p):=Multinomial(c;p),
where c=(c0,c1,…,cN) is the vector of counts in the data of cases with 0,…,N offspring, respectively.
If we use a prior p∼Dirichlet(α), where α=(α0,α1,…,αN) is a vector of pseudocounts, we get the posterior,
p∣D∼Dirichlet(α+c),
with expecation,
E[pi∣D]=∑j=0N(αj+cj)αi+ci.
Bayesian Inference for Poisson Processes
Intuition (Bayesian Inference for Poisson Processes)
For a homogeneous Poisson process, we setup up a prior π(λ) for its parameter λ, and find the posterior given the observations.
The likelihood for observing y events in an interval of length t is given by,
Poisson(y;λt)=exp(−λt)y!(λt)y∝λexp(−λt)λy.
A convenient prior to use is λ∼Gamma(α,β).
In this case, the posterior becomes,
λ∣D∼Gamma(α+y,β+t),
and the predictive distribution for the number of observations yn in a different interval of length u becomes,
π(yn)=Negative−Binomial(yn;α,β+uβ).
Bayesian Inference for Continuous-Time Markov Chains
Intuition (Bayesian Inference for Continuous-Time Markov Chains)
Lastyl, recall that a continuous-time Markov chain with finite state space is completely characterized by its holding times parameter vector q and the transition matrix P~ of its embedded Markov chain.
Parametrizing instead with the “alarm clock” parameters qij gives an equivalent description of the process.
The two parts of the data can be considered independently are,
We learn about P~ from the counts of transitions between states, and
We learn about q from the observed lengths of stays the process has in each state.
For P~ the situation is analogous to the one for discrete-time Markov chains, except that the diagonal of P~ must be zero, so the prior must exclude the possibility of non-zero diagonal elements.
For example, for P~1, the first row of P~, we might use the prior Dirichlet(0,1,…,1), i.e., a Dirichlet prior with a zero pseudocount for the first element.
The holding times in state i are distributed as Exponential(qi).
If we have observed a total holding time of h over n intervals, that data has likelihood proportional to e−hqiqin.
Using qi∼Gamma(α,β) as prior results in the posterior qi∣D∼Gamma(α+n,β+h).