Introduction
In this part, we will introduce Bayesian networks, which are probabilistic graphical models that represent a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are powerful tools for modeling complex systems and reasoning under uncertainty. We will discuss the fundamental theory behind them and some of their applications.
Graphical Models
Intuition: Graphical Models
Graphical models are a powerful framework to represent and learn structured probability models.
The main idea behind graphical models is that they capture the way in which a joint distribution can be factorized into a product of factors, each depending only on a subset of the variables. They are generally useful for visualizing the structure of a probabilistic model, encoding structural information (i.e., dependencies) about the random variables involved, and providing structure in computations, i.e., graph-based algorithms for computation, inference, and prediction.
There are three main types of graphical models:
- Bayesian Networks (or Bayes nets), which are directed acyclic graphs (DAGs). They represent a set of random variables and their conditional dependence structure.
- Markov Random Fields, which are undirected graphs. They represent a set of random variables and their Markov structure (i.e., conditional independence structure).
- Factor Graphs, which are more convenient for the purposes of inference and learning.
Bayesian Networks
Definition: Bayesian Networks
In a Bayesian network, we have two types of elements:
- Nodes represent the random variables; more specifically, an empty node is an unobserved variable, while a shaded node is an observed variable.
- Edges represent the relationships between the random variables.
Further, the Bayesian network encodes a factorization of a joint distribution.
Assume we have $K$ random variables $\{\mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_K\}$ with joint probability distribution $p(\mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_K)$. We know by the chain rule of probability that, $$ \begin{align*} p(\mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_K) & = p(\mathsf{x}_1) p(\mathsf{x}_2 \mid \mathsf{x}_1) p(\mathsf{x}_3 \mid \mathsf{x}_1, \mathsf{x}_2) \cdots p(\mathsf{x}_K \mid \mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_{K-1}) \newline & = \prod_{i=1}^{K} p(\mathsf{x}_i \mid \mathsf{x}_1, \ldots, \mathsf{x}_{i-1}) \newline \end{align*} $$ To build a Bayesian network we:
- Introduce a node for each $\mathsf{x}$ and associate the node with $p(\mathsf{x} \mid \cdot)$.
- For each $p(\mathsf{x} \mid \cdot)$, draw a directed edge to node $\mathsf{x}$ from the nodes corresponding to the random variables on which the distribution is conditioned.
- An edge from node $\mathsf{x}$ to node $\tilde{\mathsf{x}}$ makes $\mathsf{x}$ a parent of $\tilde{\mathsf{x}}$ and $\tilde{\mathsf{x}}$ a child of $\mathsf{x}$.
Example: Bayesian Network Example
Consider the network in Figure 1; the corresponding joint distribution is given by, $$ p(x_1, x_2, x_3, x_4, x_5) = p(x_1) p(x_2) p(x_3 \mid x_1) p(x_4 \mid x_1, x_3) p(x_5 \mid x_2, x_3, x_4) $$
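To make this factorization concrete, here is a minimal sketch, assuming all five variables are binary; the CPT entries are randomly generated purely for illustration, and the helper names are our own. Summing the product of factors over all assignments confirms the factorization defines a valid distribution.

```python
import itertools
import random

random.seed(0)

def random_cpt(n_parents):
    """Random CPT for one binary variable: maps each configuration of
    its parents to p(x = 1)."""
    return {cfg: random.random()
            for cfg in itertools.product((0, 1), repeat=n_parents)}

def bernoulli(p_one, x):
    return p_one if x == 1 else 1.0 - p_one

# One CPT per node, matching the Figure 1 structure.
cpt = {"x1": random_cpt(0), "x2": random_cpt(0), "x3": random_cpt(1),
       "x4": random_cpt(2), "x5": random_cpt(3)}

def joint(x1, x2, x3, x4, x5):
    """p(x1) p(x2) p(x3|x1) p(x4|x1,x3) p(x5|x2,x3,x4)."""
    return (bernoulli(cpt["x1"][()], x1)
            * bernoulli(cpt["x2"][()], x2)
            * bernoulli(cpt["x3"][(x1,)], x3)
            * bernoulli(cpt["x4"][(x1, x3)], x4)
            * bernoulli(cpt["x5"][(x2, x3, x4)], x5))

# Each factor is normalized, so the product sums to one over all 2^5
# assignments, i.e., the factorization is a valid joint distribution.
total = sum(joint(*xs) for xs in itertools.product((0, 1), repeat=5))
print(f"sum over all assignments = {total:.6f}")  # 1.000000
```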
Definition: Bayesian Networks (Continued)
A Bayesian network is a directed acyclic graph (DAG) whose nodes represent random variables $\{\mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_K\}$ with an associated joint distribution that factorizes as, $$ p(\mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_K) = \prod_{k = 1}^{K} p(\mathsf{x}_k \mid \mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}), $$ where $\mathcal{P}(\mathsf{x}_k)$ denotes the set of parents of node $\mathsf{x}_k$ in the graph.
Note
Note that $\mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}$ accounts for the statistical dependence of $\mathsf{x}_k$ on all the preceding variables $\{\mathsf{x}_1, \ldots, \mathsf{x}_{k-1}\}$ according to the selected ordering. The corresponding Bayesian network encodes, $$ \mathsf{x}_k \perp \{\mathsf{x}_1, \ldots, \mathsf{x}_{k-1}\} \setminus \mathcal{P}(\mathsf{x}_k) \mid \mathsf{x}_{\mathcal{P}(\mathsf{x}_k)} $$
Example: Bayesian Polynomial Regression
Consider the joint distribution, conditioned on the input data and the hyperparameters $\alpha$ and $\sigma^2$, $$ p(y_{\mathcal{D}}, \mathbf{w} \mid x_{\mathcal{D}}, \alpha, \sigma^2) = p(\mathbf{w} \mid \alpha) \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2) $$
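As a sketch of how this factorization is evaluated, the log-joint below assumes the standard choices for Bayesian polynomial regression: a zero-mean isotropic Gaussian prior $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I})$ and Gaussian likelihoods $p(y_i \mid \mathbf{x}_i, \mathbf{w}, \sigma^2) = \mathcal{N}(y_i \mid \mathbf{w}^\top \boldsymbol{\phi}(x_i), \sigma^2)$. The design matrix `X` of polynomial features is assumed precomputed, and all names are illustrative.

```python
import numpy as np

def log_joint(w, X, y, alpha, sigma2):
    """log p(y_D, w | x_D, alpha, sigma^2) for Bayesian polynomial
    regression; X is the N x M design matrix of polynomial features."""
    M = w.size
    log_prior = (-0.5 * M * np.log(2 * np.pi / alpha)
                 - 0.5 * alpha * w @ w)              # log p(w | alpha)
    resid = y - X @ w
    log_lik = (-0.5 * y.size * np.log(2 * np.pi * sigma2)
               - 0.5 * resid @ resid / sigma2)       # sum_i log p(y_i | ...)
    return log_prior + log_lik

# Hypothetical usage with cubic polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
X = np.vander(x, N=4, increasing=True)   # features [1, x, x^2, x^3]
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)
w = rng.standard_normal(4)
print(log_joint(w, X, y, alpha=2.0, sigma2=0.01))
```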
Causal Relationships
Intuition: Causal Relationships
Bayesian networks are especially useful for modeling causal relationships between variables.
If we can identify causal relationships among random variables, there must be a natural ordering of the variables, i.e., random variables that appear later are caused by a subset of the preceding variables.
The random variables causing $\mathsf{x}_k$ are included in $\mathcal{P}(\mathsf{x}_k)$, which means that, conditioned on $\mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}$, $\mathsf{x}_k$ is independent of all other preceding variables.
Intuition: Ancestral Sampling
However, we usually face a problem when dealing with causal relationships: obtaining marginal distributions is not a trivial task.
But we can approximate the exact distribution by an empirical one built from samples, a task for which Bayesian networks are perfectly suited.
Algorithm: Ancestral Sampling
Assume, $$ p(\mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_K) = \prod_{k = 1}^{K} p(\mathsf{x}_k \mid \mathsf{x}_{\mathcal{P}(\mathsf{x}_k)}), $$ i.e., the variables $\{\mathsf{x}_1, \mathsf{x}_2, \ldots, \mathsf{x}_K\}$ are ordered such that there is no arrow from any node to a lower-numbered node.
- Draw a sample for $\mathsf{x}_1$ from $p(\mathsf{x}_1)$.
- Draw a sample for $\mathsf{x}_2$ from $p(\mathsf{x}_2 \mid \mathsf{x}_1)$.
- $\ldots$
- Draw a sample for $\mathsf{x}_K$ from $p(\mathsf{x}_K \mid \mathsf{x}_{\mathcal{P}(\mathsf{x}_K)})$.
Thus, we have obtained a sample from the joint distribution.
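The procedure is straightforward to implement. Below is a minimal sketch for a small hypothetical chain network $\mathsf{x}_1 \to \mathsf{x}_2 \to \mathsf{x}_3$ over binary variables; the table entries and names are made up purely for illustration.

```python
import random

random.seed(0)

# Hypothetical chain network x1 -> x2 -> x3 over binary variables.
# Each table maps a configuration of the parents to p(x = 1).
parents = {"x1": (), "x2": ("x1",), "x3": ("x2",)}
cpt = {"x1": {(): 0.3},
       "x2": {(0,): 0.8, (1,): 0.1},
       "x3": {(0,): 0.5, (1,): 0.9}}

def ancestral_sample(parents, cpt):
    """Draw one sample from the joint by visiting the variables in
    topological order (assumed to be the dict's insertion order) and
    sampling each from its conditional given its already-sampled
    parents."""
    sample = {}
    for var, pa in parents.items():
        cfg = tuple(sample[p] for p in pa)   # parents come earlier
        sample[var] = 1 if random.random() < cpt[var][cfg] else 0
    return sample

print(ancestral_sample(parents, cpt))  # e.g. {'x1': 0, 'x2': 1, 'x3': 1}
```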
To sample from a marginal distribution we:
- Take the sampled values for the required nodes and ignore the remaining ones.
- E.g., for $p(\mathsf{x}_2, \mathsf{x}_4)$: sample from $p(\mathsf{x}_1, \ldots, \mathsf{x}_K)$, retain $\hat{\mathsf{x}}_2$ and $\hat{\mathsf{x}}_4$, and discard $\{\hat{\mathsf{x}}_{j \neq 2,4}\}$.
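Reusing the `ancestral_sample` sketch above, estimating a marginal then amounts to drawing joint samples repeatedly and keeping only the variables of interest:

```python
# Estimate p(x3 = 1) on the chain network from the previous sketch by
# sampling the full joint and discarding everything except x3.
n_samples = 10_000
hits = sum(ancestral_sample(parents, cpt)["x3"] for _ in range(n_samples))
print(f"p(x3 = 1) ~= {hits / n_samples:.3f}")
```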
Definition: Conditional Independence
We know from classical probability theory that two random variables $a$ and $b$ are conditionally independent given a third random variable $c$ if, $$ p(a, b \mid c) = p(a \mid c) p(b \mid c) $$ When dealing with graphical models, we denote conditional independence as, $$ a \perp b \mid c $$ This can also be written as, $$ p(a \mid b, c) = \frac{p(a, b, c)}{p(b, c)} = \frac{p(a, b \mid c)}{p(b \mid c)} = \frac{p(a \mid c) p(b \mid c)}{p(b \mid c)} = p(a \mid c) $$
Intuition: Types of Connections
Observe Figure 3. In the case of a tail-to-tail connection, $$ p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a \mid c) p(b \mid c) p(c)}{p(c)} = p(a \mid c) p(b \mid c) \implies a \perp b \mid c $$ i.e., $a$ and $b$ are independent if the node in between is observed.
In the case of a head-to-tail connection, $$ p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a) p(c \mid a) p(b \mid c)}{p(c)} = p(a \mid c) p(b \mid c) \implies a \perp b \mid c $$ i.e., $a$ and $b$ are independent if the node in between is observed.
In the case of a head-to-head connection, $$ p(a, b) = \int p(a, b, c) \ dc = \int p(c \mid a, b) p(a) p(b) \ dc = p(a) p(b) \implies a \perp b \mid \emptyset $$ i.e., $a$ and $b$ are independent if neither the node in between nor any of its descendants is observed.
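A quick numeric check makes the head-to-head case concrete. In this made-up example, $c$ is the OR of two fair coin flips $a$ and $b$: marginally the flips are independent, but once $c$ is observed they become coupled (the "explaining away" effect).

```python
import itertools

# Binary head-to-head network a -> c <- b, with c = OR(a, b).
p_a = {0: 0.5, 1: 0.5}
p_b = {0: 0.5, 1: 0.5}
p_c = {(a, b): {1: float(a or b), 0: 1.0 - float(a or b)}
       for a, b in itertools.product((0, 1), repeat=2)}

def joint(a, b, c):
    return p_a[a] * p_b[b] * p_c[(a, b)][c]

# Marginally, a and b are independent: p(a=1, b=1) = p(a=1) p(b=1).
p_ab = sum(joint(1, 1, c) for c in (0, 1))
print(p_ab, p_a[1] * p_b[1])                       # 0.25 0.25

# Conditioned on c = 1, they are no longer independent.
p_c1 = sum(joint(a, b, 1) for a in (0, 1) for b in (0, 1))
p_a1 = sum(joint(1, b, 1) for b in (0, 1)) / p_c1  # p(a=1 | c=1)
p_b1 = sum(joint(a, 1, 1) for a in (0, 1)) / p_c1  # p(b=1 | c=1)
print(joint(1, 1, 1) / p_c1, p_a1 * p_b1)          # 0.333... vs 0.444...
```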
$d$-Separation
Intuition: d-Separation
This leads to a more general question: are $a$ and $b$ conditionally independent given $c$? (see Figure 4)
Or, more generally, is a given subset of variables $\mathcal{A}$ independent of another subset $\mathcal{B}$ conditioned on a third subset $\mathcal{C}$? ($\mathsf{x}_{\mathcal{A}} \perp \mathsf{x}_{\mathcal{B}} \mid \mathsf{x}_{\mathcal{C}}$) Thus, the goal is to determine independencies directly from the directed acyclic graph.
We can define the concept of $d$-separation informally as follows: if all paths from any node in $\mathcal{A}$ to any node in $\mathcal{B}$ are blocked given the nodes in $\mathcal{C}$, then $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$.
Or, more formally:
Let $G$ be a directed acyclic graph and $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ disjoint sets of nodes. Then, if all paths from any node in $\mathcal{A}$ to any node in $\mathcal{B}$ are blocked given the nodes in $\mathcal{C}$, $\mathcal{A}$ and $\mathcal{B}$ are said to be $d$-separated by $\mathcal{C}$, and $\mathcal{A} \perp \mathcal{B} \mid \mathcal{C}$.
A path between $\mathcal{A}$ and $\mathcal{B}$ is blocked if it includes either,
- a head-to-tail or tail-to-tail node which is in $\mathcal{C}$, or,
- a head-to-head node such that neither the node nor any of its descendants is in $\mathcal{C}$.
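These rules can be operationalized via the classic moralization criterion, which is equivalent to the path-blocking definition above: restrict the graph to the ancestral set of $\mathcal{A} \cup \mathcal{B} \cup \mathcal{C}$, moralize it (marry co-parents and drop edge directions), delete $\mathcal{C}$, and test whether $\mathcal{A}$ and $\mathcal{B}$ are disconnected. The sketch below implements this in plain Python, with the DAG given as a child-to-parents mapping; the function and variable names are our own.

```python
from collections import deque
from itertools import combinations

def ancestral_set(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        n = stack.pop()
        for p in parents.get(n, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, A, B, C):
    """Test whether node sets A and B are d-separated by C in the DAG
    given as a child -> tuple-of-parents mapping, via moralization."""
    keep = ancestral_set(parents, set(A) | set(B) | set(C))
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents.get(child, ()) if p in keep]
        for p in ps:                       # undirected parent-child edges
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(ps, 2):   # marry co-parents
            adj[p].add(q)
            adj[q].add(p)
    # Breadth-first search from A, never entering nodes of C.
    frontier = deque(n for n in A if n in keep and n not in C)
    visited = set(frontier)
    while frontier:
        n = frontier.popleft()
        if n in B:
            return False                   # active path found
        for m in adj[n]:
            if m not in visited and m not in C:
                visited.add(m)
                frontier.append(m)
    return True

# On the Figure 1 network: x1 and x2 are marginally independent, but
# observing their common descendant x5 couples them.
fig1 = {"x1": (), "x2": (), "x3": ("x1",),
        "x4": ("x1", "x3"), "x5": ("x2", "x3", "x4")}
print(d_separated(fig1, {"x1"}, {"x2"}, set()))    # True
print(d_separated(fig1, {"x1"}, {"x2"}, {"x5"}))   # False
```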
So, to answer our question about Figure 4: are $a$ and $b$ conditionally independent given $c$?
The path from $a$ to $b$ is not blocked by $f$, since it is a tail-to-tail node and $f$ is unobserved.
The path is not blocked by $e$ either, since it is a head-to-head node with an observed descendant ($c$).
Thus, $a \perp b \mid c$ does not follow from the DAG.
Are $a$ and $b$ conditionally independent given $f$?
The path from $a$ to $b$ is blocked by $f$, since it is a tail-to-tail node and $f$ is observed. The path is also blocked by $e$, since it is a head-to-head node and neither it nor any of its descendants is observed.
Hence, $a \perp b \mid f$ follows from the DAG.
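Assuming Figure 4 has the structure $a \to e \leftarrow f \to b$ with $e \to c$ (a hypothetical encoding consistent with the roles of $f$, $e$, and $c$ described above), the `d_separated` checker from the previous sketch reproduces both conclusions:

```python
# Hypothetical encoding of Figure 4: a -> e <- f -> b, e -> c.
fig4 = {"a": (), "f": (), "e": ("a", "f"), "b": ("f",), "c": ("e",)}

print(d_separated(fig4, {"a"}, {"b"}, {"c"}))  # False: not independent given c
print(d_separated(fig4, {"a"}, {"b"}, {"f"}))  # True:  independent given f
```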
Note: Markov Blanket
In a directed graphical model, a node is conditionally independent of all other nodes given its parents, children, and co-parents.
The Markov blanket of node $\mathsf{x}_i$ is the minimal set of nodes that isolates $\mathsf{x}_i$ from the rest of the graph.
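The blanket is easy to read off a child-to-parents mapping; here is a minimal sketch (names are our own), checked against the Figure 1 network:

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as a child -> parents
    mapping: its parents, its children, and its children's other
    parents (co-parents)."""
    blanket = set(parents.get(node, ()))                  # parents
    for child, ps in parents.items():
        if node in ps:
            blanket.add(child)                            # children
            blanket.update(p for p in ps if p != node)    # co-parents
    return blanket

# On the Figure 1 network, x3's blanket is its parent x1, its children
# x4 and x5, and the co-parents x2 (via x5) and x4 (via x5).
fig1 = {"x1": (), "x2": (), "x3": ("x1",),
        "x4": ("x1", "x3"), "x5": ("x2", "x3", "x4")}
print(sorted(markov_blanket(fig1, "x3")))  # ['x1', 'x2', 'x4', 'x5']
```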
Note: Structure Learning
What if the Bayesian network is not known? (i.e., we do not know the structure of the graph)
Bayesian networks can be learned from data without a pre-specified structure.
- Different algorithms can be employed to learn the network structure by analyzing the data and inferring the graph that best fits the observed dependencies; a sketch of one such approach follows below.
- Once the structure and parameters are learned, the Bayesian network can be used for prediction.
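As a hedged sketch of score-based structure learning, the example below assumes the pgmpy library; the `HillClimbSearch` and `BicScore` names follow pgmpy's estimator API, which may differ across versions. The data are synthetic, with a known chain structure the search should ideally recover.

```python
import numpy as np
import pandas as pd
from pgmpy.estimators import BicScore, HillClimbSearch

rng = np.random.default_rng(0)
n = 2000

# Synthetic data with ground-truth structure x1 -> x2 -> x3.
x1 = rng.integers(0, 2, n)
x2 = (x1 ^ (rng.random(n) < 0.2)).astype(int)   # noisy copy of x1
x3 = (x2 ^ (rng.random(n) < 0.2)).astype(int)   # noisy copy of x2
data = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# Hill-climb over candidate DAGs, scoring each with BIC.
best = HillClimbSearch(data).estimate(scoring_method=BicScore(data))
print(best.edges())  # ideally recovers the chain (up to edge direction)
```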