Part 2 - Basics of Bayesian inference

Introduction to Bayesian Inference

In this part, we will introduce and discuss the idea of Bayesian inference, which is predicting from conditional stochastic models. Further, we will introduce and discuss a practical example of conjugacy in Bayesian inference. Lastly, we will briefly discuss the computations of predictive distributions and Bayesian inference in a discrete setting, or with numerical integration.

Bayesian Inference

Example 1 (Throwing a Die)

If you are throwing a fair six-sided die, your stochastic model would be that each outcome has a probability of $\frac{1}{6}$. New observations would be independent of old observations, i.e., you do not need any data to make predictions.

Assume instead that the die may be biased in some unknown way. A way to make predictions would be to first acquire data, i.e., record how often each outcome occurs, and use that information when predicting. Thus, outcomes would be dependent. Further, you would use a more complex stochastic model that reasonably models the dependency.

Given a sequence $\mathcal{D} \coloneqq \{1, 5, 6, 1, 3, 1, 1, 2, 1, 5\}$, the probability for $1$ in the next throw is then computed as,

$$P(1 \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid 1) P(1)}{P(\mathcal{D})}$$

Now imagine a scenario where we are dealing with a biased coin. The prior used is that $\theta$, the probability of heads, is either $0.7$ or $0.3$, with equal probability.

Intuition (Reformulation using the underlying parameter $\theta$)

A more common approach is to define the model in terms of a parameter $\theta$, so that all observations are independent given the parameter.

In our coin example, $\theta$ is a discrete random variable whose possible values are $0.7$ and $0.3$,

$$\pi(\theta = 0.7) = \pi(\theta = 0.3) = 0.5$$

Let $y$ be the count of heads in the first $n$ throws, and let $y_{\text{new}}$ be the count of heads in the next throw,

$$y \mid \theta \sim \mathrm{Binomial}(n, \theta), \quad y_{\text{new}} \mid \theta \sim \mathrm{Bernoulli}(\theta)$$

We then have a complete model expressed with,

$$\pi(y_{\text{new}}, y, \theta) = \pi(y_{\text{new}} \mid y, \theta) \pi(y \mid \theta) \pi(\theta)$$

satisfying,

$$\pi(y_{\text{new}} \mid y, \theta) = \pi(y_{\text{new}} \mid \theta)$$

and there is a standard way we can formulate Bayesian inference.

Definition 1 (Bayesian Inference in Models with a Parameter)

Let $y$ denote our observed data, let $y_{\text{new}}$ denote what we want to predict, and let $\theta$ denote the parameter of our model. Assume the stochastic model can be written as,

$$\pi(y_{\text{new}}, y, \theta) = \pi(y_{\text{new}} \mid y, \theta) \pi(y \mid \theta) \pi(\theta) = \pi(y_{\text{new}} \mid \theta) \pi(y \mid \theta) \pi(\theta)$$

Then, we can always use the fact that, for a discrete parameter, $\pi(y_{\text{new}} \mid y) = \sum_{\theta} \pi(y_{\text{new}} \mid \theta) \pi(\theta \mid y)$, or, for a continuous parameter,

$$\pi(y_{\text{new}} \mid y) = \int \pi(y_{\text{new}} \mid \theta) \pi(\theta \mid y) \ d\theta$$

where the posterior $\pi(\theta \mid y)$ is given by Bayes' formula,

$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta) \pi(\theta)}{\pi(y)}$$

Example 2 (Continuing the Biased Coin Example)

In our example above, we get,

$$\pi(\theta \mid y) = \frac{\theta^{y} (1 - \theta)^{n - y}}{0.3^{y} 0.7^{n - y} + 0.7^{y} 0.3^{n - y}}$$

and,

$$\pi(y_{\text{new}} = \text{H} \mid \theta) = \theta$$

so we get,

$$\pi(y_{\text{new}} = \text{H} \mid y) = \frac{0.3^{y + 1} 0.7^{n - y} + 0.7^{y + 1} 0.3^{n - y}}{0.3^{y} 0.7^{n - y} + 0.7^{y} 0.3^{n - y}}$$
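
The two formulas above can be checked numerically. The following is a minimal Python sketch, assuming the two candidate values $0.3$ and $0.7$ with equal prior weight; the function names and the data (7 heads in 10 throws) are hypothetical and chosen only for illustration.

```python
def coin_posterior(y, n, thetas=(0.3, 0.7), prior=(0.5, 0.5)):
    """Posterior over the candidate values of theta after y heads in n throws."""
    # Unnormalized posterior: prior times likelihood. The binomial coefficient
    # is the same for every theta, so it cancels in the normalization.
    weights = [p * t**y * (1 - t)**(n - y) for t, p in zip(thetas, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def predict_heads(y, n, thetas=(0.3, 0.7), prior=(0.5, 0.5)):
    """Posterior predictive probability of heads on the next throw."""
    posterior = coin_posterior(y, n, thetas, prior)
    # Average pi(y_new = H | theta) = theta over the posterior.
    return sum(t * q for t, q in zip(thetas, posterior))


# Hypothetical data: 7 heads in 10 throws.
print(coin_posterior(7, 10))
print(predict_heads(7, 10))
```

With no data ($y = n = 0$) the prediction reduces to the prior mean of $\theta$, which is $0.5$ here.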
Notation (General Terminology in Bayesian Inference)

The probability distribution for the parameter $\theta$,

$$\pi(\theta)$$

is called the prior distribution.

The probability distribution for the data $y$ given the parameter $\theta$,

$$\pi(y \mid \theta)$$

is called the likelihood.

Lastly, the probability distribution for the parameter $\theta$ given the data $y$,

$$\pi(\theta \mid y)$$

is called the posterior distribution.

Conjugate Priors

Example 3 (Finding the posterior for $\theta$ using a uniform prior)

Assume now that our prior for $\theta$ is the uniform distribution on $[0, 1]$. The conditional distribution $\pi(\theta \mid y)$ (the posterior of $\theta$) can be computed with Bayes' formula,

$$\begin{align*} \pi(\theta \mid y) & = \frac{\pi(y \mid \theta)\pi(\theta)}{\pi(y)} \\ & = \frac{\pi(y \mid \theta) \pi(\theta)}{\int_{0}^{1} \pi(y \mid \theta) \pi(\theta) \ d \theta} \\ & = \frac{\mathrm{Binomial}(y; n, \theta)}{\int_{0}^{1} \mathrm{Binomial}(y; n, \theta) \ d \theta} \\ & = \frac{\theta^y (1 - \theta)^{n -y}}{\int_{0}^{1} \theta^y (1 - \theta)^{n - y} \ d \theta} \end{align*}$$

Recall (The Beta Distribution)

$\theta$ has a Beta distribution on $[0, 1]$, with parameters $\alpha$ and $\beta$, if its density has the form,

$$\pi(\theta \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

where $B(\alpha, \beta)$ is the Beta function defined as,

$$B(\alpha, \beta) \coloneqq \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)},$$

where $\Gamma(t)$ is the Gamma function defined as,

$$\Gamma(t) \coloneqq \int_{0}^{\infty} x^{t - 1} e^{-x} \ dx$$

Recall that for positive integers, $\Gamma(n) = (n - 1)! = 1 \cdot 2 \cdot 3 \cdots (n - 1)$.

Thus, our posterior becomes $\mathrm{Beta}(y + 1, n - y + 1)$. As $\pi(y_{\text{new}} = \text{H} \mid y) = \int_{\theta} \theta \ \pi(\theta \mid y) \ d\theta$, the prediction is the expectation of this Beta distribution,

$$\pi(y_{\text{new}} = \text{H} \mid y) = \frac{y + 1}{n + 2}$$
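
As a small illustration of this update, here is a Python sketch. It relies on two standard facts that are assumptions relative to the text above: the uniform prior on $[0, 1]$ is the $\mathrm{Beta}(1, 1)$ distribution, and a Beta distribution with parameters $a$ and $b$ has mean $a / (a + b)$. The function names and the data are hypothetical.

```python
from fractions import Fraction


def beta_binomial_update(alpha, beta, y, n):
    """Beta posterior parameters after observing y heads in n throws."""
    # Heads are added to the first parameter, tails to the second.
    return alpha + y, beta + (n - y)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, as an exact fraction."""
    return Fraction(alpha, alpha + beta)


# Uniform prior = Beta(1, 1); hypothetical data: 7 heads in 10 throws.
a_post, b_post = beta_binomial_update(1, 1, y=7, n=10)
print(a_post, b_post)             # posterior is Beta(8, 4)
print(beta_mean(a_post, b_post))  # 2/3
```

Starting from $\mathrm{Beta}(1, 1)$, the posterior mean is $(y + 1) / (n + 2)$, which is the predicted probability of heads.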

Definition 2 (Conjugate Priors)

Given a likelihood model $\pi(x \mid \theta)$, a conjugate family of priors for this likelihood is a parametric family of distributions for $\theta$ such that, if the prior is in this family, the posterior $\theta \mid x$ is also in the family.

Figure: Conjugate Prior Illustration

Example 4 (Poisson-Gamma Conjugacy)

Assume $\pi(x \mid \theta) = \mathrm{Poisson}(x; \theta)$, i.e.,

$$\pi(x \mid \theta) = e^{-\theta} \frac{\theta^x}{x!}.$$

Then, $\pi(\theta \mid \alpha, \beta) = \mathrm{Gamma}(\theta; \alpha, \beta)$, where $\alpha, \beta$ are positive parameters, is a conjugate family. Recall that,

$$\mathrm{Gamma}(\theta; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta^{\alpha - 1} \exp(-\beta \theta).$$

Moreover, we have the posterior,

$$\pi(\theta \mid x) = \mathrm{Gamma}(\theta; \alpha + x, \beta + 1).$$
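
The form of this posterior follows from a short computation that is not spelled out above. Here is a sketch, writing $\beta$ for the rate parameter of the Gamma prior and dropping every factor that does not depend on $\theta$:

$$\begin{align*} \pi(\theta \mid x) & \propto_{\theta} \pi(x \mid \theta) \, \pi(\theta \mid \alpha, \beta) \\ & \propto_{\theta} e^{-\theta} \theta^{x} \cdot \theta^{\alpha - 1} e^{-\beta \theta} \\ & = \theta^{(\alpha + x) - 1} e^{-(\beta + 1) \theta} \end{align*}$$

The last line is the $\theta$-dependent part of a Gamma density with parameters $\alpha + x$ and $\beta + 1$, so the normalized posterior must be that Gamma distribution.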

We make repeated observations of a $\mathrm{Poisson}(\theta)$ distributed variable for some $\theta > 0$. The observed values are $\{x_1 = 20, x_2 = 24, x_3 = 23\}$. What is the posterior distribution for $\theta$ given this data?

Firstly, we need to choose a prior for $\theta$. We will use $\pi(\theta) \propto_{\theta} \frac{1}{\theta}$. Note that this is an improper prior; it is a “density” that does not integrate to 1. However, using such improper priors is possible in Bayesian statistics. We get the following posterior after observing $x_1$,

$$\theta \mid x_1 \sim \mathrm{Gamma}(20, 1).$$

Using this as our new prior, we get the following posterior after observing $x_2$,

$$\theta \mid x_1, x_2 \sim \mathrm{Gamma}(20+24, 1+1) = \mathrm{Gamma}(44, 2).$$

Lastly, using this as our new prior, we get the following posterior after observing $x_3$,

$$\theta \mid x_1, x_2, x_3 \sim \mathrm{Gamma}(44+23, 2+1) = \mathrm{Gamma}(67, 3).$$

Figure: The posteriors after one, two, and three observations.
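
The sequential updates in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and it treats the improper prior $\pi(\theta) \propto 1/\theta$ as the formal starting point $\alpha = 0$, $\beta = 0$ of the Gamma update, which is an assumption made here to match the first posterior $\mathrm{Gamma}(20, 1)$.

```python
def gamma_poisson_update(alpha, beta, x):
    """One conjugate update: Gamma(alpha, beta) prior, one Poisson count x."""
    return alpha + x, beta + 1


def posterior_sequence(observations, alpha=0, beta=0):
    """Posterior parameters after each observation, updating one at a time."""
    history = []
    for x in observations:
        # The posterior after one observation is the prior for the next.
        alpha, beta = gamma_poisson_update(alpha, beta, x)
        history.append((alpha, beta))
    return history


for a, b in posterior_sequence([20, 24, 23]):
    # alpha / beta is the mean of a Gamma distribution with rate beta.
    print(f"Gamma({a}, {b}), posterior mean {a / b:.2f}")
```

Because each update only adds the count to $\alpha$ and $1$ to $\beta$, processing the observations one at a time or all at once gives the same final posterior.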

Predictive Distribution

Example 5 (Poisson-Gamma Predictive Distribution)

We have seen that, if $k \mid \theta \sim \mathrm{Poisson}(\theta)$ and $\theta \sim \mathrm{Gamma}(\alpha, \beta)$, then the posterior is $\theta \mid k \sim \mathrm{Gamma}(\alpha + k, \beta + 1)$. Direct computation gives the prior predictive distribution as,

$$\begin{align*} \pi(k) & = \frac{\pi(k \mid \theta)\pi(\theta)}{\pi(\theta \mid k)} \\ & = \frac{\beta^{\alpha}\Gamma(\alpha + k)}{(\beta + 1)^{\alpha + k} \Gamma(\alpha) k!} \end{align*}$$

Note that the non-negative integer $k$ has a negative binomial distribution with parameters $r$ and $p$ if its probability mass function is,

$$\pi(k; r, p) = \binom{k + r - 1}{k} (1 - p)^r p^k = \frac{\Gamma(k + r)}{k! \Gamma(r)} (1 - p)^r p^k.$$

We get that the prior predictive is negative binomial with parameters $\alpha$ and $\frac{1}{\beta + 1}$. Further, note that we can get the posterior predictive by simply replacing the $\alpha$ and $\beta$ of the prior with their corresponding posterior values.

Figure: Distributions for predicting the next observation.
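
The identification with the negative binomial distribution can be checked numerically. The sketch below, with hypothetical function names, evaluates both expressions on the log scale using `math.lgamma` and compares them for the posterior $\mathrm{Gamma}(67, 3)$ from Example 4. It illustrates the formulas above and is not a replacement for a statistics library.

```python
from math import exp, lgamma, log


def gamma_poisson_predictive(k, alpha, beta):
    """beta^alpha Gamma(alpha+k) / ((beta+1)^(alpha+k) Gamma(alpha) k!)."""
    log_p = (alpha * log(beta) + lgamma(alpha + k)
             - (alpha + k) * log(beta + 1) - lgamma(alpha) - lgamma(k + 1))
    return exp(log_p)


def negative_binomial_pmf(k, r, p):
    """Negative binomial pmf in the (1 - p)^r p^k parametrization."""
    log_p = (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
             + r * log(1 - p) + k * log(p))
    return exp(log_p)


# Posterior after the three observations in Example 4: Gamma(67, 3).
alpha, beta = 67, 3
for k in (15, 22, 30):
    direct = gamma_poisson_predictive(k, alpha, beta)
    via_nb = negative_binomial_pmf(k, alpha, 1 / (beta + 1))
    print(k, direct, via_nb)  # the two columns should agree
```

Working with logarithms avoids overflow in $\Gamma(\alpha + k)$ and $k!$ for larger counts.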

Bayesian Inference in Discrete Settings

Finally, a word on Bayesian inference in discrete settings.

If the set of possible values of $\theta$ is finite, Bayesian inference is quite easy.

  • The prior distribution $\pi(\theta)$ is represented by a vector.
  • The posterior distribution $\pi(\theta \mid y)$ is obtained by termwise multiplication of the vectors $\pi(y \mid \theta)$ and $\pi(\theta)$, followed by normalization.
  • The prediction $\pi(y_{\text{new}} \mid y) = \int_{\theta} \pi(y_{\text{new}} \mid \theta) \pi(\theta \mid y) \ d\theta$ simplifies to taking the sum of the termwise product of the vectors $\pi(y_{\text{new}} \mid \theta)$ and $\pi(\theta \mid y)$.
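
The three steps above can be sketched with plain Python lists. The grid of five values for the coin bias, the flat prior, the data, and the function names are hypothetical and serve only to illustrate the vector operations.

```python
def discrete_posterior(prior, likelihood):
    """Termwise product of the likelihood and prior vectors, then normalization."""
    weights = [lik * p for lik, p in zip(likelihood, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def discrete_prediction(pred_likelihood, posterior):
    """Sum of the termwise product of pi(y_new | theta) and pi(theta | y)."""
    return sum(lik * q for lik, q in zip(pred_likelihood, posterior))


# Hypothetical setup: five candidate values for a coin's probability of heads.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * len(thetas)

# Hypothetical data: 7 heads in 10 throws (binomial coefficient omitted,
# since it is constant in theta and cancels in the normalization).
y, n = 7, 10
likelihood = [t**y * (1 - t)**(n - y) for t in thetas]

posterior = discrete_posterior(prior, likelihood)
p_heads = discrete_prediction(thetas, posterior)  # pi(y_new = H | theta) = theta
print(posterior)
print(p_heads)
```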

When $\theta$ is continuous, the prediction we want to make can be expressed as a quotient of integrals,

$$\begin{align*} \pi(y_{\text{new}} \mid y) & = \int_{\theta} \pi(y_{\text{new}} \mid \theta) \pi(\theta \mid y) \ d\theta \\ & = \int_{\theta} \pi(y_{\text{new}} \mid \theta) \frac{\pi(y \mid \theta) \pi(\theta)}{\int_{\theta} \pi(y \mid \theta) \pi(\theta) \ d\theta} \ d\theta \\ & = \frac{\int_{\theta} \pi(y_{\text{new}} \mid \theta) \pi(y \mid \theta) \pi(\theta) \ d\theta}{\int_{\theta} \pi(y \mid \theta) \pi(\theta) \ d\theta} \end{align*}$$

We can compute these integrals using numerical integration, which works well as long as the dimension of $\theta$ is not too high and our functions are well-behaved.
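
As an illustration of the quotient of integrals, here is a minimal Python sketch for a one-dimensional $\theta$. The midpoint rule is used purely as an example of a numerical integration scheme, not as a recommended method, and the function names, step count, and data are hypothetical. The test case is the coin with a uniform prior, where the exact answer is the mean of $\mathrm{Beta}(y + 1, n - y + 1)$.

```python
def midpoint_integral(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def predictive_by_quadrature(pred_lik, lik, prior, a=0.0, b=1.0):
    """Quotient of integrals for pi(y_new | y), with theta ranging over [a, b]."""
    numerator = midpoint_integral(lambda t: pred_lik(t) * lik(t) * prior(t), a, b)
    denominator = midpoint_integral(lambda t: lik(t) * prior(t), a, b)
    return numerator / denominator


# Hypothetical data: 7 heads in 10 throws, uniform prior on [0, 1].
y, n = 7, 10
p_heads = predictive_by_quadrature(
    pred_lik=lambda t: t,                   # pi(y_new = H | theta) = theta
    lik=lambda t: t**y * (1 - t)**(n - y),  # binomial likelihood up to a constant
    prior=lambda t: 1.0,
)
print(p_heads)  # should be close to (y + 1) / (n + 2)
```

Constant factors such as the binomial coefficient appear in both the numerator and the denominator, so they can be left out of the integrands.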