Introduction
In this part, we will introduce Markov Random Fields (MRFs), probabilistic graphical models that represent a set of variables and their conditional dependencies via an undirected graph. MRFs are powerful tools for modeling complex systems and reasoning under uncertainty. We will discuss some fundamental theory and applications.
Markov Random Fields
Intuition: Markov Random Fields
In many real-world phenomena, it is hard to determine the exact directionality of the interactions between random variables, so Bayesian networks may not be the best choice for modeling such systems.
Instead, Markov Random Fields (MRFs) are described by undirected graphs and they also encode a factorization (a set of conditional independence relationships).
MRFs are often useful for modeling mutual relationships of compatibility among variables and in imaging and spatial applications.
Recall: Conditional Independence
MRFs also allow us to assess conditional independence properties graphically, by simple separation.
Definition: Separation
Let $\mathcal{A}, \mathcal{B}$, and $\mathcal{C}$ be three disjoint sets. Then, $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$ if every path from any node in set $\mathcal{A}$ to any node in set $\mathcal{B}$ passes through at least one node in $\mathcal{C}$ (i.e., all paths are blocked by $\mathcal{C}$) (See Figure 1 for an example).
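The separation criterion above can be checked mechanically: $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$ holds exactly when no node of $\mathcal{B}$ is reachable from $\mathcal{A}$ once the nodes of $\mathcal{C}$ are removed from the graph. A minimal sketch (the adjacency-list graph is a hypothetical example, not one from the text):

```python
from collections import deque

def separated(adj, A, B, C):
    """Check A ⊥ B | C by graph separation: every path from A to B must
    pass through C. Equivalently, no node of B is reachable from A after
    deleting the nodes of C."""
    C = set(C)
    seen = set(A) - C
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in B:
            return False  # found a path into B that avoids C
        for v in adj[u]:
            if v not in seen and v not in C:
                seen.add(v)
                queue.append(v)
    return True

# Chain graph 1 - 2 - 3: node 2 separates node 1 from node 3.
adj = {1: [2], 2: [1, 3], 3: [2]}
print(separated(adj, {1}, {3}, {2}))   # True: the only path is blocked
print(separated(adj, {1}, {3}, set()))  # False: the path 1-2-3 is open
```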
Factorization Properties
Definition: MRF Factorization
The goal in MRFs is to express $p(\mathbf{x})$ as a product of functions defined over local sets of variables.
Definition: Clique
A clique is a fully connected subset of nodes in an undirected graph.
Definition: Maximal Clique
A maximal clique is a clique to which no additional node can be added without it ceasing to be a clique.
Thus, an MRF is an undirected graph whose vertices represent random variables associated to a joint probability distribution that factorizes as product of potential functions corresponding to the maximal cliques of the graph, $$ p(\mathbf{x}) = \frac{1}{Z} \prod_{c} \psi_c(\mathbf{\mathsf{x}}_c) $$ where $c$ is the index of a maximal clique, $\mathbf{\mathsf{x}}_c$ is the set of random variables in clique $c$, $\psi_c(\cdot)$ is the potential function associated to clique $c$, and $Z$ is a normalization constant called the partition function, $$ Z = \sum_{\mathbf{\mathsf{x}}} \prod_c \psi_c(\mathbf{\mathsf{x}}_c) $$
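The factorization and the partition function can be made concrete on a tiny example. The sketch below uses a hypothetical 3-node chain $x_1 - x_2 - x_3$ with binary variables and made-up potential tables; the maximal cliques are $\{x_1, x_2\}$ and $\{x_2, x_3\}$:

```python
import itertools

# Hypothetical positive potential tables for the two maximal cliques.
psi_12 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
psi_23 = {(0, 0): 1.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 1.5}

def unnorm(x1, x2, x3):
    # Unnormalized product of clique potentials, prod_c psi_c(x_c).
    return psi_12[x1, x2] * psi_23[x2, x3]

# Partition function Z: sum the product over all 2^3 joint states.
Z = sum(unnorm(*x) for x in itertools.product([0, 1], repeat=3))

def p(x1, x2, x3):
    # Normalized joint distribution p(x) = (1/Z) prod_c psi_c(x_c).
    return unnorm(x1, x2, x3) / Z

# Dividing by Z makes p a valid distribution: it sums to one.
print(sum(p(*x) for x in itertools.product([0, 1], repeat=3)))
```

Note that $Z$ is obtained here by brute-force enumeration of all joint states, which is exactly why computing it becomes intractable for large systems.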
Example: MRF Factorization Example
Consider the MRF in Figure 2, the corresponding joint distribution is given by, $$ \begin{align*} p(\mathbf{x}) & = \frac{1}{Z} \prod_c \psi_c(\mathbf{\mathsf{x}}_c) \newline & = \frac{1}{Z} \psi_1(x_1, x_2, x_3) \psi_2(x_3, x_4, x_5) \psi_3(x_5, x_6) \end{align*} $$
Definition: MRF Factorization (Continued)
A few notes on the factorization: requiring $\psi_c(\mathbf{\mathsf{x}}_c) \geq 0$ ensures $p(\mathbf{x}) \geq 0$. In general, the $\psi_c(\mathbf{\mathsf{x}}_c)$ are not probability distributions, so they do not necessarily integrate (or sum) to one.
Note: Potential Functions
The motivation for $\psi_c(\cdot)$ is that it expresses which configurations of the local variables are preferred over others.
$\psi_c(\mathbf{\mathsf{x}}_c)$ encodes the compatibility of the values $\mathbf{\mathsf{x}}_c$ in each clique (larger $\psi_c(\mathbf{\mathsf{x}}_c) \implies$ configurations $\mathbf{\mathsf{x}}_c$ more likely to occur). Lastly, all variables in clique $c$ can play the same role in defining the value of $\psi_c(\mathbf{\mathsf{x}}_c)$.
Connection between Factorization and Conditional Independence
Intuition: Factorization and Conditional Independence
So far we have seen that MRFs provide both an easy way to assess conditional independence properties (graph separation) and a factorization of the joint distribution.
But how do we connect conditional independence and factorization?
For the graph separation properties to correspond to conditional independencies of the distribution, we require $\psi_c(\mathbf{\mathsf{x}}_c) > 0$.
Let $\mathcal{U}_I$ be the set of distributions $p(\mathbf{x})$ consistent with conditional independencies encoded by the graph structure. Let $\mathcal{U}_F$ be the set of distributions $p(\mathbf{x})$ that factorize as, $$ p(\mathbf{x}) = \frac{1}{Z} \prod_c \psi_c(\mathbf{\mathsf{x}}_c) $$
Theorem: Hammersley-Clifford Theorem
If $p(\mathbf{x}) > 0$, then $\mathcal{U}_I = \mathcal{U}_F$. This means that for positive distributions, the set of distributions that factorize over the cliques of the graph is exactly the set of distributions that satisfy the conditional independencies encoded by the graph structure.
But how do we ensure that $\psi_c(\mathbf{\mathsf{x}}_c) > 0$?
The Gibbs Distribution and Energy-Based Models
Definition: Gibbs Distribution
If we parameterize $\psi_c(\mathbf{\mathsf{x}}_c)$ using an energy-based form (Boltzmann (or in this case Gibbs) distribution), $$ \psi_c(\mathbf{\mathsf{x}}_c) = \exp(-E_c(\mathbf{\mathsf{x}}_c)) $$ Thus, $$ \begin{align*} p(\mathbf{x}) & = \frac{1}{Z} \prod_c \psi_c(\mathbf{\mathsf{x}}_c) \newline & = \frac{1}{Z} \prod_c \exp(-E_c(\mathbf{\mathsf{x}}_c)) \newline & = \frac{1}{Z} \exp\left(-\sum_c E_c(\mathbf{\mathsf{x}}_c)\right) \newline & = \frac{1}{Z} \exp(-E(\mathbf{x})) \newline \end{align*} $$ The total energy of state $\mathbf{x}$ is obtained by adding energies of each maximal clique.
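The algebraic step in the derivation above, that a product of exponentials of clique energies equals the exponential of the total (summed) energy, can be verified numerically. A minimal sketch with made-up clique energies:

```python
import math

# Hypothetical clique energies E_c(x_c) for one fixed joint state x.
clique_energies = [0.3, -1.2, 0.5]

# Product of per-clique potentials psi_c(x_c) = exp(-E_c(x_c)) ...
prod_psi = math.prod(math.exp(-E) for E in clique_energies)

# ... equals exp(-E(x)), where E(x) = sum_c E_c(x_c) is the total energy.
exp_total = math.exp(-sum(clique_energies))

print(abs(prod_psi - exp_total) < 1e-12)  # True
```

Since $\exp(\cdot) > 0$ everywhere, this parameterization automatically satisfies the positivity condition $\psi_c(\mathbf{\mathsf{x}}_c) > 0$ needed for the Hammersley-Clifford theorem.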
Note
To be explicit: low energy corresponds to high-probability states.
Applications of MRFs
Example: Image Denoising with MRFs
Given a noisy image ($\mathbf{y}$), recover the original noise-free image ($\mathbf{x}$).
We can encode an image as an array of numerical pixel values. Pixels take on binary values $x_i \in \{+1, -1\}$ (black and white), and $\{y_i\}$ is the distorted version of $\{x_i\}$.
To specify the graphical model, we make some assumptions:
- Neighboring (noiseless) pixels are strongly correlated (i.e., similar in value).
- Noise acts independently on each pixel.
- If the noise level is low, there should be a strong correlation between $x_i$ and $y_i$.
Further, there are two types of cliques, $\{x_i, x_j\}$ and $\{x_i, y_i\}$, i.e., neighboring pixels and corresponding noisy pixels. Thus, the factorization can be written as, $$ p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_{i \sim j} \psi_{i, j}(x_i, x_j) \prod_i \psi_i(x_i, y_i) $$ We will assume that $\mathsf{x}_i, \mathsf{y}_i \in \{+1, -1\}$ (Ising model).
The cliques $\{x_i, y_i\}$ and their corresponding energy $E(x_i, y_i)$ express the correlation between $\mathsf{x}_i$ and $\mathsf{y}_i$, $$ E(x_i, y_i) = -\eta x_i y_i, \quad \eta > 0 \quad \implies \psi_i(x_i, y_i) = \exp(\eta x_i y_i) $$ Similarly, for the cliques $\{x_i, x_j\}$, $$ E(x_i, x_j) = -\beta x_i x_j, \quad \beta > 0 \quad \implies \psi_{i, j}(x_i, x_j) = \exp(\beta x_i x_j) $$ Further, an energy term $hx_i$ is often added for each pixel $i$ of $\mathbf{x}$ to bias the model toward pixel values of a particular color (black or white), $$ p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_i \psi_i(x_i) \prod_{i \sim j} \psi_{i, j}(x_i, x_j) \prod_i \psi_i(x_i, y_i) $$ where $\psi_i(x_i) = \exp(-E(x_i)) = \exp(-h x_i)$. Thus, equivalently, $$ p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \exp\left(-h \sum_i x_i + \beta \sum_{i \sim j} x_i x_j + \eta \sum_i x_i y_i\right) $$ Connecting back to our original goal, denoising corresponds to maximizing, $$ p(\mathbf{x} \mid \mathbf{y}) \propto \exp\left(-h \sum_i x_i + \beta \sum_{i \sim j} x_i x_j + \eta \sum_i x_i y_i\right) = \exp(-E(\mathbf{x} \mid \mathbf{y})) $$
Note
- If $h = 0$, the prior probabilities of the two states of $x_i$ are equal, i.e., no bias toward black or white.
- If $\beta = 0$, there is no correlation between neighboring pixels → most probable solution is $x_i = y_i \forall i$ (i.e., noisy image).
- The energy can be minimized iteratively (iterated conditional modes, ICM).
Algorithm: Iterated Conditional Modes
Initialization: $x_i = y_i \forall i$.
- For $j = 1, \ldots, N$ (for each pixel)
- Evaluate total energy for states $x_j = +1$ and $x_j = -1$ keeping all other variables fixed.
- Set $x_j$ to the state with lower energy.
- Repeat until convergence or stopping criterion met.
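The steps above can be sketched for the Ising denoising model. This is a minimal illustration, not an optimized implementation; the 4x4 test image and the parameter values ($h = 0$, $\beta = 1$, $\eta = 2$) are hypothetical choices:

```python
import numpy as np

def energy(x, y, h=0.0, beta=1.0, eta=2.0):
    """Ising energy E(x | y) = h*sum_i x_i - beta*sum_{i~j} x_i x_j - eta*sum_i x_i y_i,
    with i~j ranging over horizontal and vertical neighbor pairs."""
    pair = (x[:, :-1] * x[:, 1:]).sum() + (x[:-1, :] * x[1:, :]).sum()
    return h * x.sum() - beta * pair - eta * (x * y).sum()

def icm(y, h=0.0, beta=1.0, eta=2.0, max_sweeps=10):
    x = y.copy()  # initialization: x_i = y_i for all i
    for _ in range(max_sweeps):
        changed = False
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                old = x[i, j]
                # Evaluate the total energy for both states of pixel (i, j),
                # keeping all other pixels fixed.
                e = {}
                for s in (+1, -1):
                    x[i, j] = s
                    e[s] = energy(x, y, h, beta, eta)
                x[i, j] = +1 if e[+1] <= e[-1] else -1  # keep lower energy
                changed = changed or (x[i, j] != old)
        if not changed:  # stopping criterion: a full sweep with no change
            break
    return x

# Toy example: a flat +1 image with a single flipped (noisy) pixel.
y = np.ones((4, 4), dtype=int)
y[1, 2] = -1
x_hat = icm(y)
print(x_hat)  # the flipped pixel is restored to +1
```

With $\beta = 0$ the neighbor term vanishes and ICM would simply return $x_i = y_i$, matching the note above; with $\beta > 0$ large enough relative to $\eta$, the neighborhood agreement term overrides the isolated noisy pixel.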
Summary of MRF
Summary: Markov Random Fields
- MRFs are useful for modeling mutual relationships of compatibility among variables and in imaging and spatial applications.
However, there are some pitfalls around the factorization, $$ p(\mathbf{x}) = \frac{1}{Z} \prod_c \psi_c(\mathbf{\mathsf{x}}_c) $$
- It is difficult to evaluate the joint distribution and sample from it.
- (Often) computing $Z$ is intractable for large systems.
- Ancestral sampling is not possible with MRFs. (Why? Because the graph has no directionality, there is no topological ordering of variables to sample along.)
From Bayesian Networks to Markov Random Fields
Intuition: Bayesian Networks vs MRFs
Bayesian networks and MRFs encode different types of statistical dependencies: a Bayesian network captures directed dependencies, while an MRF captures mutual compatibility among variables.
Thus, the independencies captured by a Bayesian network cannot always be captured by an MRF.
Consider a Bayesian network with factorization, $$ p(\mathbf{x}) = \prod_{k=1}^{K} p(\mathsf{x}_k \mid \mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}) $$ Defining, $$ \psi_k(\mathsf{x}_k, \mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}) = p(\mathsf{x}_k \mid \mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}) $$ we obtain, $$ p(\mathbf{x}) = \frac{1}{Z} \prod_{k=1}^{K} \psi_k(\mathsf{x}_k, \mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}) $$ Thus, to convert a Bayesian network to an MRF, we connect all pairs of parents of each node by an undirected edge and make all edges undirected (this is known as moralization).
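The conversion can be sketched as a small graph transformation. Here the Bayesian network is represented as a hypothetical child-to-parents mapping, and the v-structure $a \to c \leftarrow b$ illustrates the extra edge that moralization adds:

```python
def moralize(parents):
    """Convert a Bayesian network (node -> list of parents) into the edge
    set of an undirected MRF: drop edge directions and connect ("marry")
    all pairs of parents that share a child."""
    edges = set()
    for child, ps in parents.items():
        for par in ps:
            edges.add(frozenset((par, child)))  # undirected parent-child edge
        # connect every pair of co-parents of the same child
        for i, a in enumerate(ps):
            for b in ps[i + 1:]:
                edges.add(frozenset((a, b)))
    return edges

# Hypothetical v-structure a -> c <- b: moralization adds the edge a - b,
# so the marginal independence of a and b in the BN is lost in the MRF.
bn = {"a": [], "b": [], "c": ["a", "b"]}
print(sorted(sorted(e) for e in moralize(bn)))
# [['a', 'b'], ['a', 'c'], ['b', 'c']]
```

The added edge $a - b$ is exactly why the resulting MRF can fail to represent some independencies of the original Bayesian network: in the BN, $a \perp b$ marginally, but in the moralized graph $a$ and $b$ are adjacent and never separated.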
As Figure 3 shows, the resulting MRF may not account for all independencies encoded by the Bayesian network.