In this part, we will introduce and discuss the idea of Bayesian inference, which is predicting from conditional stochastic models.
Further, we will introduce and discuss a practical example of conjugacy in Bayesian inference.
Lastly, we will briefly discuss the computations of predictive distributions and Bayesian inference in a discrete setting, or with numerical integration.
Bayesian Inference
Example 1 (Throwing a Die)
If you are throwing a fair six-sided die, your stochastic model would be that each outcome has a probability of 1/6.
New observations would be independent of old observations, i.e., to make predictions, you do not need (new) data.
Assume instead the die may be biased in some unknown way.
A way to make predictions would be to first acquire data, i.e., record how often each outcome occurs, and use that information when predicting. Thus, outcomes would be dependent.
Further, you would use a more complex stochastic model that reasonably models the dependency.
Given a sequence D:={1,5,6,1,3,1,1,2,1,5}, the probability for 1 in the next throw is then computed as,
$$P(1 \mid D) = \frac{P(D \mid 1)\,P(1)}{P(D)}$$
Now imagine a scenario with a biased coin. The prior is that θ, the probability of heads, is either 0.7 or 0.3, with equal probability.
Intuition (Reformulation using the underlying parameter θ)
A more common approach is to define the model in terms of a parameter θ, so that all observations are independent given the parameter.
In our coin example, θ is a discrete random variable with possible values 0.7 and 0.3,
$$\pi(\theta = 0.7) = \pi(\theta = 0.3) = 0.5$$
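Let y be the count of heads in the first n throws, and let $y_{\text{new}}$ be the count of heads in the next throw (so it is either 0 or 1),
$$y \mid \theta \sim \text{Binomial}(n, \theta), \qquad y_{\text{new}} \mid \theta \sim \text{Bernoulli}(\theta)$$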
We then have a complete model expressed with,
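$$\pi(y_{\text{new}}, y, \theta) = \pi(y_{\text{new}} \mid y, \theta)\,\pi(y \mid \theta)\,\pi(\theta)$$
satisfying,
$$\pi(y_{\text{new}} \mid y, \theta) = \pi(y_{\text{new}} \mid \theta)$$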
and there is a standard way we can formulate Bayesian inference.
Definition 1 (Bayesian Inference in Models with a Parameter)
Let y denote our observed data, and let ynew denote what we want to predict, and let θ denote the parameter of our model.
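Assume the stochastic model can be written as,
$$\pi(y_{\text{new}}, y, \theta) = \pi(y_{\text{new}} \mid \theta)\,\pi(y \mid \theta)\,\pi(\theta)$$
Then, for a discrete parameter, we can always use the fact that,
$$\pi(y_{\text{new}} \mid y) = \sum_{\theta} \pi(y_{\text{new}} \mid \theta)\,\pi(\theta \mid y)$$
or, for a continuous parameter,
$$\pi(y_{\text{new}} \mid y) = \int \pi(y_{\text{new}} \mid \theta)\,\pi(\theta \mid y)\,d\theta$$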
where we can use Bayes' formula,
$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y)}$$
Example 2 (Continuing the Biased Coin Example)
In our example above, we get,
$$\pi(\theta \mid y) = \frac{\theta^{y}(1-\theta)^{n-y}}{0.3^{y}\,0.7^{n-y} + 0.7^{y}\,0.3^{n-y}}$$
and,
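$$\pi(y_{\text{new}} = H \mid \theta) = \theta$$
so we get,
$$\pi(y_{\text{new}} = H \mid y) = \frac{0.3^{y+1}\,0.7^{n-y} + 0.7^{y+1}\,0.3^{n-y}}{0.3^{y}\,0.7^{n-y} + 0.7^{y}\,0.3^{n-y}}$$
Notation (General Terminology in Bayesian Inference)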
The probability distribution for the parameter θ,
π(θ)
is called the prior distribution.
The probability distribution for the data y given the parameter θ,
π(y∣θ)
is called the likelihood.
Lastly, the probability distribution for the parameter θ given the data y,
π(θ∣y)
is called the posterior distribution.
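To make the terminology concrete, the prior, likelihood, and posterior of the biased coin example can be evaluated numerically. The following is a minimal sketch, assuming the two possible values 0.3 and 0.7 for θ with equal prior probability; the function names and the data (7 heads in 10 throws) are hypothetical and chosen only for illustration.

```python
from math import comb

THETAS = (0.3, 0.7)   # possible values of theta in the coin example
PRIOR = (0.5, 0.5)    # prior pi(theta): equal probability for each value


def likelihood(y, n, theta):
    """Binomial likelihood pi(y | theta): y heads in n throws."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)


def posterior(y, n):
    """Posterior pi(theta | y) as a tuple aligned with THETAS."""
    unnormalized = [likelihood(y, n, t) * p for t, p in zip(THETAS, PRIOR)]
    evidence = sum(unnormalized)  # pi(y), the normalizing constant
    return tuple(u / evidence for u in unnormalized)


def predict_heads(y, n):
    """Predictive pi(y_new = H | y): sum over theta of theta * pi(theta | y)."""
    return sum(t * p for t, p in zip(THETAS, posterior(y, n)))


# Hypothetical data: 7 heads in 10 throws.
post = posterior(7, 10)
p_heads = predict_heads(7, 10)
print(post, p_heads)
```

The binomial coefficient and the prior weight 0.5 appear in every term, so they cancel in the normalization.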
Conjugate Priors
Example 3 (Finding the posterior for θ using a uniform prior)
Assume now that our prior for θ is the uniform distribution on [0,1], i.e., π(θ)=1 for θ in [0,1].
The posterior π(θ∣y) can be computed with Bayes' formula,
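$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y)} = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\int_0^1 \pi(y \mid \theta)\,\pi(\theta)\,d\theta} = \frac{\text{Binomial}(y; n, \theta)}{\int_0^1 \text{Binomial}(y; n, \theta)\,d\theta} = \frac{\theta^{y}(1-\theta)^{n-y}}{\int_0^1 \theta^{y}(1-\theta)^{n-y}\,d\theta}$$
Recall (The Beta Distribution)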
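θ has a Beta distribution on [0,1], with parameters α and β, if its density has the form,
$$\pi(\theta \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
where B(α,β) is the Beta function defined as,
$$B(\alpha, \beta) := \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)},$$
where Γ(t) is the Gamma function defined as,
$$\Gamma(t) := \int_0^\infty x^{t-1} e^{-x}\,dx$$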
Recall that for positive integers, Γ(n)=(n−1)!=1⋅2⋅3⋯(n−1).
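Comparing with the Beta density, our posterior is Beta(y+1, n−y+1).
As $\pi(y_{\text{new}} = H \mid y) = \int_0^1 \theta\,\pi(\theta \mid y)\,d\theta$, the prediction is the expectation of this Beta distribution,
$$\pi(y_{\text{new}} = H \mid y) = \frac{y+1}{n+2}$$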
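The Beta posterior and its mean can be checked with a few lines of code. This is a minimal sketch, assuming a uniform prior and hypothetical data (7 heads in 10 throws); the function names are illustrative, and the only facts used are the posterior Beta(y+1, n−y+1) and the standard result that Beta(a, b) has mean a/(a+b).

```python
from math import gamma


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    beta_fn = gamma(a) * gamma(b) / gamma(a + b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn


def uniform_prior_posterior(y, n):
    """Parameters (a, b) of the Beta posterior after y heads in n throws."""
    return y + 1, n - y + 1


def predict_heads(y, n):
    """Predictive probability of heads: the mean a / (a + b) of the Beta posterior."""
    a, b = uniform_prior_posterior(y, n)
    return a / (a + b)


# Hypothetical data: 7 heads in 10 throws.
a, b = uniform_prior_posterior(7, 10)
print(a, b, predict_heads(7, 10))
```

Note that the prediction (y+1)/(n+2) differs from the raw frequency y/n: the uniform prior pulls the estimate toward 1/2.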
Definition 2 (Conjugate Priors)
Given a likelihood model π(x∣θ), a conjugate family of priors to this likelihood is a parametric family of distributions for θ such that, if the prior is in this family, the posterior θ∣x is also in the family.
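For example, for the Poisson likelihood π(x∣θ)=Poisson(x;θ), the family π(θ∣α,β)=Gamma(θ;α,β), where α,β are positive parameters, is a conjugate family. Recall that,
$$\text{Gamma}(\theta; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\theta^{\alpha-1}\exp(-\beta\theta).$$
Moreover, we have the posterior,
$$\pi(\theta \mid x) = \text{Gamma}(\theta; \alpha + x, \beta + 1).$$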
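The posterior parameters can be read off by multiplying likelihood and prior and dropping every factor that does not depend on θ. The short derivation below is a sketch, assuming a single observation x with Poisson likelihood θ^x e^{−θ}/x! and a Gamma prior with shape α and rate β.

```latex
\begin{aligned}
\pi(\theta \mid x)
  &\propto \pi(x \mid \theta)\,\pi(\theta \mid \alpha, \beta) \\
  &\propto \theta^{x} e^{-\theta} \cdot \theta^{\alpha-1} e^{-\beta\theta} \\
  &= \theta^{(\alpha + x) - 1}\, e^{-(\beta + 1)\theta}
\end{aligned}
```

The last line is the unnormalized density of Gamma(α+x, β+1), so each observation adds its value to the shape and adds 1 to the rate.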
We make repeated observations of a Poisson(θ) distributed variable for some θ>0.
The observed values are {x1=20, x2=24, x3=23}. What is the posterior distribution for θ given this data?
Firstly, we need to choose a prior for θ. We will use $\pi(\theta) \propto \frac{1}{\theta}$. Note that this is an improper prior; it is a “density” that does not integrate to 1. However, using such improper priors is possible in Bayesian statistics.
We get the following posterior after observing $x_1$,
$$\theta \mid x_1 \sim \text{Gamma}(20, 1).$$
Using this as our new prior, we get the following posterior after observing $x_2$,
$$\theta \mid x_1, x_2 \sim \text{Gamma}(20 + 24, 1 + 1) = \text{Gamma}(44, 2).$$
Lastly, using this as our new prior, we get the following posterior after observing $x_3$,
$$\theta \mid x_1, x_2, x_3 \sim \text{Gamma}(44 + 23, 2 + 1) = \text{Gamma}(67, 3).$$
[Figure: The posteriors after one, two, and three observations.]
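The sequential updating above amounts to a two-line rule: add the observation to the shape and add 1 to the rate. The sketch below assumes the improper prior is treated formally as Gamma(0, 0); the function name `update_gamma` is illustrative.

```python
def update_gamma(alpha, beta, x):
    """One conjugate update for a Poisson observation x:
    Gamma(alpha, beta) -> Gamma(alpha + x, beta + 1)."""
    return alpha + x, beta + 1


observations = [20, 24, 23]

# Improper prior pi(theta) proportional to 1/theta, treated formally as Gamma(0, 0).
alpha, beta = 0, 0
history = []
for x in observations:
    alpha, beta = update_gamma(alpha, beta, x)
    history.append((alpha, beta))

print(history)  # [(20, 1), (44, 2), (67, 3)]

# The final posterior depends only on the sum and the count of the observations.
assert (alpha, beta) == (sum(observations), len(observations))
posterior_mean = alpha / beta  # mean of Gamma(alpha, beta) with rate beta
```

Because the result depends only on the sum and the number of observations, processing the data one at a time or all at once gives the same posterior.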
Predictive Distribution
Example 5 (Poisson-Gamma Predictive Distribution)
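We have seen that, if k∣θ∼Poisson(θ) and θ∼Gamma(α,β), then the posterior is θ∣k∼Gamma(α+k,β+1).
Direct computation gives the prior predictive distribution as,
$$\pi(k) = \frac{\pi(k \mid \theta)\,\pi(\theta)}{\pi(\theta \mid k)} = \frac{\beta^{\alpha}\,\Gamma(\alpha + k)}{(\beta + 1)^{\alpha + k}\,\Gamma(\alpha)\,k!}$$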
Note that the non-negative integer k has a negative binomial distribution with parameters r and p if its probability mass function is,
$$\pi(k; r, p) = \binom{k + r - 1}{k}(1-p)^{r} p^{k} = \frac{\Gamma(k + r)}{k!\,\Gamma(r)}(1-p)^{r} p^{k}.$$
We get that the prior predictive is negative binomial with parameters r=α and p=1/(β+1).
Further, note that we can get the posterior predictive by simply replacing the α and β of the prior with their corresponding posterior values.
[Figure: Distributions for predicting the next observation.]
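The correspondence between the Poisson-Gamma predictive and the negative binomial can be verified numerically. The sketch below evaluates both formulas in log space with `math.lgamma`; the function names are illustrative, and the parameters (67, 3) are the posterior from the Poisson example, so the result is the posterior predictive for the next observation.

```python
from math import exp, lgamma, log


def gamma_poisson_predictive(k, alpha, beta):
    """Predictive pmf for k when k | theta ~ Poisson(theta), theta ~ Gamma(alpha, beta)."""
    log_p = (alpha * log(beta) + lgamma(alpha + k)
             - (alpha + k) * log(beta + 1) - lgamma(alpha) - lgamma(k + 1))
    return exp(log_p)


def neg_binomial_pmf(k, r, p):
    """Negative binomial pmf: Gamma(k + r) / (k! Gamma(r)) * (1 - p)^r * p^k."""
    log_p = (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
             + r * log(1 - p) + k * log(p))
    return exp(log_p)


# Posterior parameters from the Poisson example: Gamma(67, 3).
alpha_post, beta_post = 67, 3
predictive = [gamma_poisson_predictive(k, alpha_post, beta_post) for k in range(200)]

# The same values via the negative binomial with r = alpha and p = 1 / (beta + 1).
same = [neg_binomial_pmf(k, alpha_post, 1 / (beta_post + 1)) for k in range(200)]
print(max(abs(a - b) for a, b in zip(predictive, same)))
```

Working with logarithms avoids overflow in Γ(α+k) and k! for larger counts.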
Bayesian Inference in Discrete Settings
Finally, a word on Bayesian inference in discrete settings.
If the parameter space, i.e., the set of possible values of θ, is finite, Bayesian inference is quite easy.
The prior distribution π(θ) is represented by a vector.
The posterior distribution π(θ∣y) is obtained by termwise multiplication of the vectors π(y∣θ) and π(θ), followed by normalization.
The prediction $\pi(y_{\text{new}} \mid y) = \int \pi(y_{\text{new}} \mid \theta)\,\pi(\theta \mid y)\,d\theta$ simplifies to the sum of the termwise product of the vectors π(ynew∣θ) and π(θ∣y).
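The three vector operations just described can be sketched with plain Python lists. The grid of five θ values, the uniform prior vector, and the data (7 heads in 10 throws) are hypothetical choices for illustration, not part of the examples above.

```python
def normalize(vector):
    """Scale a vector of non-negative weights so that it sums to 1."""
    total = sum(vector)
    return [v / total for v in vector]


def posterior_vector(likelihood, prior):
    """Termwise product of likelihood and prior, followed by normalization."""
    return normalize([l * p for l, p in zip(likelihood, prior)])


def predictive_probability(pred_likelihood, posterior):
    """Sum of the termwise product of pi(y_new | theta) and pi(theta | y)."""
    return sum(q * p for q, p in zip(pred_likelihood, posterior))


# Hypothetical finite parameter space for the probability of heads.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * len(thetas)

# Hypothetical data: y heads in n throws. The binomial coefficient is omitted
# because it is the same for every theta and cancels in the normalization.
y, n = 7, 10
likelihood = [t**y * (1 - t) ** (n - y) for t in thetas]

posterior = posterior_vector(likelihood, prior)
# For a coin, pi(y_new = H | theta) = theta, so the vector is thetas itself.
p_heads = predictive_probability(thetas, posterior)
print(posterior, p_heads)
```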
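In the continuous setting, the prediction we want to make can be expressed as a quotient of integrals,
$$\pi(y_{\text{new}} \mid y) = \frac{\int \pi(y_{\text{new}} \mid \theta)\,\pi(y \mid \theta)\,\pi(\theta)\,d\theta}{\int \pi(y \mid \theta)\,\pi(\theta)\,d\theta}$$
We can compute these integrals using numerical integration, which works well as long as the dimension of θ is not too high and our functions are well-behaved.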
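As an illustration of the numerical route, the sketch below approximates the predictive probability of heads for the coin with a uniform prior on [0,1] as a ratio of two one-dimensional integrals, each computed with a simple midpoint rule. The midpoint rule, the step count, and the data (7 heads in 10 throws) are assumptions made for this example; any standard quadrature method could be used instead.

```python
def midpoint_integral(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def predict_heads_numeric(y, n, prior=lambda theta: 1.0):
    """Predictive probability of heads as a quotient of two integrals over theta."""
    def joint(theta):
        # pi(y | theta) * pi(theta), up to the binomial coefficient, which cancels.
        return theta**y * (1 - theta) ** (n - y) * prior(theta)

    numerator = midpoint_integral(lambda t: t * joint(t), 0.0, 1.0)
    denominator = midpoint_integral(joint, 0.0, 1.0)
    return numerator / denominator


# Hypothetical data: 7 heads in 10 throws, uniform prior.
approx = predict_heads_numeric(7, 10)
exact = (7 + 1) / (10 + 2)  # mean of the Beta(y + 1, n - y + 1) posterior
print(approx, exact)
```

Passing a different `prior` function gives the prediction for a non-conjugate prior, where no closed form may be available.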