Bayesian Bandit Problem
So far, we have discussed the stochastic bandits model, with a set of $K$ arms $\mathcal{A}$.
In the Bayesian bandit problem, we add the Bayesian assumption.
Definition: Bayesian Bandit Problem
The Bayesian bandit problem is a stochastic bandit problem where the problem instance $\mathcal{J}$ is drawn initially from some known distribution $\mathbb{P}$.
The problem instance $\mathcal{J}$ (often) specifies the mean reward vector $\mathbf{\mu} \in [0, 1]^K$.
The distribution $\mathbb{P}$ is called the prior distribution, or the Bayesian prior.
As with standard stochastic bandits, the objective is to minimize the expected regret; in the Bayesian setting, we therefore minimize the Bayesian expected regret.
Definition: Bayesian Expected Regret
The Bayesian regret is the expected regret for a particular problem instance $\mathcal{J}$, taken in expectation over the problem instances, $$ \mathrm{BR}(T) \coloneqq \mathbb{E}_{\mathcal{J} \sim \mathbb{P}}[\mathbb{E}[R(T) \mid \mathcal{J}]] = \mathbb{E}_{\mathcal{J} \sim \mathbb{P}}\left[ \mathbb{E}\left[ \mu^{\star} \cdot T - \sum_{t \in [T]} \mu(a_t) \;\middle|\; \mathcal{J} \right] \right] $$
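To make the definition concrete, here is a minimal Monte Carlo sketch (the two-point prior, the uniformly random policy, and all numbers are hypothetical) that estimates $\mathrm{BR}(T)$ by averaging the per-instance expected regret over problem instances drawn from the prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite-support prior: two candidate mean-reward vectors for K = 2 arms.
support = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
prior = np.array([0.5, 0.5])        # P(mu = support[i])

T, n_draws = 1000, 5000
regret = 0.0
for _ in range(n_draws):
    mu = support[rng.choice(len(prior), p=prior)]  # draw a problem instance J ~ P
    arms = rng.integers(0, len(mu), size=T)        # a uniformly random policy, for illustration
    regret += mu.max() * T - mu[arms].sum()        # expected regret given the instance and actions
print("estimated BR(T):", regret / n_draws)
```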
Further, we will assume that our realized rewards are drawn from a single-parameter family of distributions (e.g., Bernoulli, unit-variance Gaussian, etc.). Thus, the problem instance is completely specified by the mean reward vector $\mathbf{\mu} \in [0, 1]^K$.
The prior $\mathbb{P}$ is then simply a distribution over $[0, 1]^K$ from which the mean reward vector $\mathbf{\mu}$ is drawn.
Furthermore, we will assume that our realized rewards can only take finitely many different values, and the prior $\mathbb{P}$ has a finite support, $\mathcal{F}$.
Lastly, for simplicity, we assume that the best arm $a^{\star}$ is unique for each mean reward vector in the support of $\mathbb{P}$.
Terminology and Notation
We will introduce some notation that we will use later for our analysis. The concepts themselves are not new, but we introduce the compact notation to make our analysis easier.
Definition: $t$-history ($H_t$)
The sequence of action-reward pairs collected by the agent after round $t$, is a random variable defined as, $$ H_t \coloneqq \left( (a_1, r_1), \ldots, (a_t, r_t) \right) \in (\mathcal{A} \times \mathbb{R})^t $$
Definition: Feasible $t$-history (H)
A fixed sequence $H = \left( (a_1, r_1), \ldots, (a_t, r_t) \right) \in (\mathcal{A} \times \mathbb{R})^t$ is called a feasible $t$-history if it has a positive probability of being observed, i.e., $P(H_t = H) > 0$.
Further, an algorithm under which $P(H_t = H) > 0$ is called $H$-consistent.
Definition: $\mathcal{H}_t$
The (finite) set of all feasible $t$-histories is denoted by $\mathcal{H}_t$.
Posterior Distribution
In the Bayesian setting, we can use Bayes’ rule to compute the posterior distribution of the problem instance $\mathcal{J}$ from the prior and the observed history.
Definition: Posterior Distribution ($\mathbb{P}_H(\tilde{\mu})$)
The Bayesian posterior distribution after round $t$ can be defined as, $$ \mathbb{P}_H(\tilde{\mu}) \coloneqq P(\mu = \tilde{\mu} \mid H_t = H), \ \forall \tilde{\mu} \in [0, 1]^K $$
Further, we can rewrite the posterior distribution using Bayes’ rule as, $$ \begin{align} \mathbb{P}_H(\tilde{\mu}) &= \frac{P(\mu = \tilde{\mu}, H_t = H)}{P(H_t = H)} \newline &= \frac{P(\mu = \tilde{\mu}) P(H_t = H \mid \mu = \tilde{\mu})}{\sum_{\mu^{\prime} \in \mathcal{F}} P(\mu = \mu^{\prime}) P(H_t = H \mid \mu = \mu^{\prime})} \newline \end{align} $$
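As a concrete illustration, below is a sketch of this brute-force Bayes-rule computation for a finite support, assuming Bernoulli rewards (the function name and the data are made up). Note that the algorithm’s arm-selection probabilities are omitted from the likelihood: they multiply every $\mu^{\prime} \in \mathcal{F}$ by the same factor and cancel in the normalization, which is exactly the content of the lemma below.

```python
import numpy as np

def posterior(prior, support, history):
    """Brute-force Bayes' rule over a finite support F, assuming Bernoulli rewards.

    prior:   shape (|F|,), prior[i] = P(mu = support[i])
    support: shape (|F|, K), the candidate mean-reward vectors
    history: list of (arm, reward) pairs with reward in {0, 1}
    """
    likelihood = np.ones(len(prior))
    for a, r in history:
        p = support[:, a]                          # P(r_t = 1 | mu = support[i], a_t = a)
        likelihood *= p if r == 1 else 1.0 - p     # the pi(a_t) factors cancel, so we omit them
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

# Example: after arm 0 succeeds twice, the candidate with mu(0) = 0.9 becomes far more likely.
support = np.array([[0.9, 0.1], [0.2, 0.8]])
prior = np.array([0.5, 0.5])
print(posterior(prior, support, [(0, 1), (0, 1)]))  # roughly [0.95, 0.05]
```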
Lemma: Posterior Distribution Lemma
$\mathbb{P}_H(\tilde{\mu})$ is the same for all $H$-consistent algorithms.
We will prove this lemma, but to prove it, we need the following definitions and concepts.
Posterior As A New Prior
The posterior $\mathbb{P}_H$ can be used as a prior for a subsequent Bayesian update.
Definition: Posterior as a New Prior
Consider a feasible $(t + t^{\prime})$-history. We represent it as a concatenation of a feasible $t$-history $H$ and a $t^{\prime}$-history $H^{\prime}$, where $H$ consists of the first $t$ action-reward pairs and $H^{\prime}$ of the last $t^{\prime}$.
We can denote the concatenation as $H \oplus H^{\prime}$.
From this, one can perform the Bayesian update and obtain the posterior $(\mathbb{P}_H)_{H^{\prime}}$ in two steps,
- Condition on $H$ and derive the posterior $\mathbb{P}_H$, then,
- Condition on $H^{\prime}$ using $\mathbb{P}_H$ as the new prior to obtain $(\mathbb{P}_H)_{H^{\prime}}$.
This “two-step” Bayesian update is equivalent to the “one step” update given $H \oplus H^{\prime}$, i.e., $$ \mathbb{P}_{H \oplus H^{\prime}} = (\mathbb{P}_H)_{H^{\prime}} $$
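Here is a quick numerical sanity check of this identity under the Bernoulli-reward assumption used above (the support, prior, and histories are hypothetical): conditioning on $H \oplus H^{\prime}$ in one shot agrees with conditioning on $H$ first and then on $H^{\prime}$ with $\mathbb{P}_H$ as the prior.

```python
import numpy as np

def post(prior, support, history):
    """Brute-force Bayes update over a finite support (Bernoulli rewards)."""
    w = prior.copy()
    for a, r in history:
        w = w * (support[:, a] if r == 1 else 1.0 - support[:, a])
    return w / w.sum()

support = np.array([[0.9, 0.1], [0.2, 0.8]])
prior = np.array([0.5, 0.5])
H = [(0, 1), (1, 0)]                     # a 2-history
Hp = [(0, 1), (0, 0), (1, 1)]            # a 3-history continuation

one_step = post(prior, support, H + Hp)                  # condition on H ⊕ H' directly
two_step = post(post(prior, support, H), support, Hp)    # posterior after H used as the new prior
assert np.allclose(one_step, two_step)
```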
Sequential Bayesian Update
This sequential update property will prove to be very useful for Bayesian bandits.
Theorem: Sequential Bayesian Update Theorem
Consider a round $t \geq 1$ and a feasible $t$-history $H$. We can write $H$ as a concatenation of some feasible $(t - 1)$-history $H^{\prime}$ and an action-reward pair $(a_t, r_t)$, i.e., $H = H^{\prime} \oplus (a_t, r_t)$.
For an $H$-consistent bandit algorithm, let $\pi(a)$ denote the probability that the agent plays arm $a$ in round $t$ given the history $H^{\prime}$, $$ \pi(a) = P(a_t = a \mid H_{t-1} = H^{\prime}) $$ Then, at round $t$, the posterior $\mathbb{P}_H$ is obtained by treating the previous posterior $\mathbb{P}_{H^{\prime}}$ as a new prior and conditioning on the single observation $(a_t, r_t)$.
We will now prove this for the formulation as stated in Equation (1).
Proof: Sequential Bayesian Update Theorem Proof
First, write the last action-reward pair of $H$ as $(a, r)$, so that $H = H^{\prime} \oplus (a, r)$, and let $\mathcal{D}_{\nu}(r)$ denote the probability of observing reward $r$ from the reward distribution with mean $\nu$. We will look at the numerator of Equation (1) (divided by $P(H_{t - 1} = H^{\prime})$), $$ \begin{align*} \frac{P(\mu = \tilde{\mu}, H_t = H)}{P(H_{t - 1} = H^{\prime})} & = \frac{P(\mu = \tilde{\mu}, H_t = \underbrace{H}_{H^{\prime} \oplus (a, r)})}{P(H_{t - 1} = H^{\prime})} \newline &= \frac{P(\mu = \tilde{\mu}, H_{t - 1} = H^{\prime}, (a_t, r_t) = (a, r))}{P(H_{t - 1} = H^{\prime})} \newline &= \frac{P(\mu = \tilde{\mu}, (a_t, r_t) = (a, r) \mid H_{t - 1} = H^{\prime}) P(H_{t - 1} = H^{\prime})}{P(H_{t - 1} = H^{\prime})} \newline &= P(\mu = \tilde{\mu}, (a_t, r_t) = (a, r) \mid H_{t - 1} = H^{\prime}) \newline &= \underbrace{P(\mu = \tilde{\mu} \mid H_{t - 1} = H^{\prime})}_{\mathbb{P}_{H^{\prime}}(\tilde{\mu})} \cdot P((a_t, r_t) = (a, r) \mid \mu = \tilde{\mu}, H_{t - 1} = H^{\prime}) \newline &= \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \underbrace{P(a_t = a \mid H_{t - 1} = H^{\prime})}_{\pi(a)} \cdot \underbrace{P(r_t = r \mid \mu = \tilde{\mu}, a_t = a, H_{t - 1} = H^{\prime})}_{\mathcal{D}_{\tilde{\mu}(a)}(r)} \quad \text{(since $a_t$ depends only on $H^{\prime}$)} \newline &= \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \pi(a) \cdot \mathcal{D}_{\tilde{\mu}(a)}(r) \end{align*} $$
Thus, $$ \boxed{P(\mu = \tilde{\mu}, H_t = H) = P(H_{t - 1} = H^{\prime}) \cdot \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \pi(a) \cdot \mathcal{D}_{\tilde{\mu}(a)}(r)} $$
Now, consider the denominator of Equation (1), $$ \begin{align*} P(H_t = H) &= \sum_{\mu^{\prime} \in \mathcal{F}} P(\mu = \mu^{\prime}, H_t = H) \newline &= \sum_{\mu^{\prime} \in \mathcal{F}} P(H_{t - 1} = H^{\prime}) \cdot \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \pi(a) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r) \newline &= P(H_{t - 1} = H^{\prime}) \cdot \pi(a) \cdot \sum_{\mu^{\prime} \in \mathcal{F}} \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r) \end{align*} $$
Thus, $$ \begin{align*} \mathbb{P}_H(\tilde{\mu}) &= \frac{P(\mu = \tilde{\mu}, H_t = H)}{P(H_t = H)} \newline &= \frac{\cancel{P(H_{t - 1} = H^{\prime})} \cdot \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \cancel{\pi(a)} \cdot \mathcal{D}_{\tilde{\mu}(a)}(r)}{\cancel{P(H_{t - 1} = H^{\prime})} \cdot \cancel{\pi(a)} \cdot \sum_{\mu^{\prime} \in \mathcal{F}} \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r)} \tag{$\star$} \newline &= \frac{\mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \mathcal{D}_{\tilde{\mu}(a)}(r)}{\sum_{\mu^{\prime} \in \mathcal{F}} \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r)} \newline \end{align*} $$ Hence, $\mathbb{P}_{H^{\prime}}$ is the new prior for the Bayesian update at round $t$.
Note
In step $(\star)$, we implicitly proved our previous Lemma, that $\mathbb{P}_H(\tilde{\mu})$ is the same for all $H$-consistent algorithms, since $\pi(a)$ cancels out.
Independent Priors
The prior $\mathbb{P}$ is called independent if $(\mu(a) : a \in \mathcal{A})$ are mutually independent random variables.
If the prior is independent, the posterior factorizes across arms as well, so we can update each arm’s mean-reward distribution separately, using only the rewards observed from that arm.
Computational Aspects
With brute-force computation of the posterior $\mathbb{P}_H$ at each round $t$, the runtime is $\mathcal{O}(t \cdot |\mathcal{F}|)$, which is impractical for large $t$.
With sequential (Bayesian) updates, the per-round running time is $\mathcal{O}(|\mathcal{F}|)$, which is much more manageable.
With independent priors, one needs to update only the played arm at each round.
Moreover, with conjugate priors, the posterior can be computed in closed form, without any summation over the support $\mathcal{F}$.
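To make the comparison concrete, here is a rough sketch of the two kinds of per-round updates, again assuming Bernoulli rewards (names and shapes are illustrative): the sequential update touches every element of the support once per round, i.e. $\mathcal{O}(|\mathcal{F}|)$ work, while a conjugate Beta-Bernoulli update only increments two counters for the played arm.

```python
import numpy as np

def sequential_update(post, support, a, r):
    """One round of the sequential Bayesian update: O(|F|) work (Bernoulli rewards)."""
    w = post * (support[:, a] if r == 1 else 1.0 - support[:, a])
    return w / w.sum()

def conjugate_update(alpha, beta, a, r):
    """Beta-Bernoulli conjugate update: O(1) work, only the played arm's parameters change."""
    alpha[a] += r
    beta[a] += 1 - r
    return alpha, beta

support = np.array([[0.9, 0.1], [0.2, 0.8]])
print(sequential_update(np.array([0.5, 0.5]), support, a=0, r=1))   # re-weights the whole support
print(conjugate_update(np.ones(2), np.ones(2), a=0, r=1))           # Beta(2, 1) for arm 0
```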
Thompson Sampling
Sampling arms according to the posterior is the idea behind the famous Thompson Sampling algorithm.
The idea is that, in each round $t$, the agent computes the posterior probability that each arm $a$ is the best arm, and plays arm $a$ with exactly this probability.
Summary: Thompson Sampling Algorithm
For each round $t \in [T]$, do,
- Observe $H_{t - 1} = H$, for some feasible $(t - 1)$-history $H$.
- Sample mean reward vector $\mu_t$ from the posterior distribution $\mathbb{P}_H$.
- Choose the best arm $\tilde{a}_t$ according to $\mu_t$, i.e., $\tilde{a}_t \in \arg\max_{a \in \mathcal{A}} \mu_t(a)$, play it, and observe the reward $r_t$.
end
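Below is a compact sketch of this loop for a finite-support prior with Bernoulli rewards (the support, the horizon, and the hidden instance are all hypothetical); the posterior is maintained with the sequential update derived earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

support = np.array([[0.9, 0.1], [0.2, 0.8]])              # hypothetical finite support F
post = np.array([0.5, 0.5])                                # prior P, maintained as the posterior P_H
true_mu = support[rng.choice(len(post), p=post)]           # hidden instance, drawn from the prior

for t in range(500):
    mu_t = support[rng.choice(len(post), p=post)]          # sample mu_t ~ P_H
    a_t = int(np.argmax(mu_t))                             # play the best arm under the sample
    r_t = int(rng.random() < true_mu[a_t])                 # observe a Bernoulli reward
    lik = support[:, a_t] if r_t == 1 else 1.0 - support[:, a_t]
    post = post * lik / (post * lik).sum()                 # sequential Bayesian update

print("posterior over F:", post)                           # typically concentrates on the hidden instance
```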
Case Study
A robot is in need of recharging, and there are three different power sockets available. When tried, each socket randomly returns some charge or gives no charge at all.
We can model the reward from each socket as a Bernoulli random variable, either 1 (charge) or 0 (no charge).
A natural choice to model this problem is the Beta distribution, which is a conjugate prior for the Bernoulli distribution.
The Beta distribution has two parameters, $\alpha$ and $\beta$. In the simplest terms, these parameters respectively correspond to the counts of successes and failures (or, more precisely, $\alpha - 1$ and $\beta - 1$).
Initially, we have no idea what the probability is of any given socket producing an output, so we can start by setting both $\alpha$ and $\beta$ to 1, which corresponds to a uniform distribution over $[0, 1]$.
This initial guess at the probability of the socket producing an output corresponds to choosing the prior distribution.
Once the robot tests a socket and obtains a reward, it can modify its belief in the success rate of that socket, i.e., update the posterior distribution.
The Beta distribution is a conjugate prior for the Bernoulli distribution, so the posterior distribution is also a Beta distribution, with updated parameters.
If the tested socket returns a reward of 1 (success), we increment $\alpha$ by 1 and leave $\beta$ unchanged. If the tested socket returns a reward of 0 (failure), we increment $\beta$ by 1 and leave $\alpha$ unchanged.
Since actions that have been tried infrequently have wide distributions, they have a larger range of possible values. In this way, a socket that currently has a low estimated mean reward, but has been tested fewer times than a socket with a higher estimated mean, can return a larger sample value and therefore become the selected socket at this time step.
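Putting the case study together, here is a sketch of Beta-Bernoulli Thompson Sampling for the three sockets (the charge probabilities are made up for illustration); notice that the update after each round is exactly the increment rule described above.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.8])    # hypothetical charge probabilities of the three sockets
alpha = np.ones(3)                     # Beta(1, 1) priors: uniform over [0, 1]
beta = np.ones(3)

for t in range(1000):
    theta = rng.beta(alpha, beta)      # one posterior sample per socket
    a = int(np.argmax(theta))          # try the socket that looks best under this sample
    r = int(rng.random() < true_p[a])  # Bernoulli reward: charge (1) or no charge (0)
    alpha[a] += r                      # success count
    beta[a] += 1 - r                   # failure count

print("posterior means:", alpha / (alpha + beta))
print("pulls per socket:", (alpha + beta - 2).astype(int))
```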
Bayesian Regret Analysis
Recall the definitions we use for analysis in a non-Bayesian setting,
Definition: Confidence Radius and Confidence Bounds
$$ \begin{align*} r_t(a) & \coloneqq \sqrt{\frac{2 \log(T)}{n_t(a)}}, & \text{(Confidence Radius)} \newline \mathrm{UCB}_t(a) & \coloneqq \bar{\mu}_t(a) + r_t(a), & \text{(Upper Confidence Bound)} \newline \mathrm{LCB}_t(a) & \coloneqq \bar{\mu}_t(a) - r_t(a), & \text{(Lower Confidence Bound)} \newline \end{align*} $$ where $n_t(a)$ is the number of times arm $a$ has been played so far, and $\bar{\mu}_t(a)$ is the average reward from this arm.
Further, we had (under the clean event $\mathcal{E}$), $$ \mu(a) \in [\mathrm{LCB}_t(a), \mathrm{UCB}_t(a)], \ \text{ with high probability.} $$
In our Bayesian setting, we instead consider generalized confidence bounds $U(a, H_t)$ and $L(a, H_t)$, which depend on the history $H_t$. Further, we want the following properties to hold for some constant $\gamma > 0$, where $x^{+} \coloneqq \max(x, 0)$ and $x^{-} \coloneqq \max(-x, 0)$,
Definition: Generalized Confidence Bounds
$$ \begin{align*} \mathbb{E}[\left( U(a, H_t) - \mu(a) \right)^-] & \leq \frac{\gamma}{TK}, \quad \forall a,t \newline \mathbb{E}[\left( \mu(a) - L(a, H_t) \right)^-] & \leq \frac{\gamma}{TK}, \quad \forall a,t \newline \end{align*} $$ Further, the confidence radius is (still) defined as, $$ r_t(a) \coloneqq \frac{U(a, H_t) - L(a, H_t)}{2} $$
Thompson Sampling Bayesian Regret Bound
We can now state the Bayesian regret bound for Thompson Sampling.
Theorem: Thompson Sampling Bayesian Regret Bound
The Bayesian Regret of Thompson Sampling is, $$ \mathrm{BR}(T) \leq \mathcal{O}\left( \sqrt{K T \log(T)} \right) $$
We first state a lemma that we will use to prove this theorem,
Lemma: Thompson Sampling Bayesian Regret Bound Lemma
Assume the lower and upper bound functions satisfy the generalized confidence bound properties for some parameter $\gamma > 0$. Then, the Bayesian regret of Thompson Sampling can be bounded as, $$ \mathrm{BR}(T) \leq 2\gamma + 2 \sum_{t \in [T]} \mathbb{E}[r_t(a_t, H_t)] $$
We will now prove this lemma.
Proof: Thompson Sampling Bayesian Regret Bound Lemma Proof
First, fix a round $t$. The key property of Thompson Sampling is that the chosen arm $a_t$ and the best arm $a^{\star}$ are identically distributed given the history, since the agent samples $a_t$ from the posterior distribution of the best arm. Thus, for each feasible history $H$, we have, $$ P(a_t = a \mid H_t = H) = P(a^{\star} = a \mid H_t = H), \ \forall a \in \mathcal{A} $$ Hence, since $U(\cdot, H)$ is a fixed function once $H$ is fixed, $$ \begin{align*} \mathbb{E}[U(a_t, H_t) \mid H_t = H] & = \mathbb{E}[U(a^{\star}, H_t) \mid H_t = H] \newline \mathbb{E}[U(a_t, H_t) \mid H_t = H] - \mathbb{E}[U(a^{\star}, H_t) \mid H_t = H] & = 0 \newline \mathbb{E}[U(a_t, H_t) - U(a^{\star}, H_t) \mid H_t = H] & = 0 \newline \end{align*} $$ We now consider the instantaneous regret at round $t$, $$ \begin{align*} \mathrm{BR}_t & \coloneqq \mathbb{E}[\mu(a^{\star}) - \mu(a_t)] \newline & = \mathbb{E}_{H \sim H_t} \left[ \mathbb{E}[\mu(a^{\star}) - \mu(a_t) \mid H_t = H] \right] \quad \text{(Law of Total Expectation)} \newline & = \mathbb{E}_{H \sim H_t} \left[ \mathbb{E}[U(a_t, H) - \mu(a_t) + \mu(a^{\star}) - U(a^{\star}, H) \mid H_t = H] \right] \quad \text{(adding the zero term above)} \newline & = \underbrace{\mathbb{E}\left[ \mu(a^{\star}) - U(a^{\star}, H_t) \right]}_{\text{(I)}} + \underbrace{\mathbb{E}\left[ U(a_t, H_t) - \mu(a_t) \right]}_{\text{(II)}} \newline \end{align*} $$
Let’s analyze the first term (I), $$ \begin{align*} \mathbb{E}\left[ \mu(a^{\star}) - U(a^{\star}, H_t) \right] & \leq \mathbb{E}\left[ \left( \mu(a^{\star}) - U(a^{\star}, H_t) \right)^+ \right] \newline & \leq \mathbb{E}\left[ \sum_{a \in \mathcal{A}} \left( \mu(a) - U(a, H_t) \right)^+ \right] \quad \text{(each summand is non-negative and $a^{\star} \in \mathcal{A}$)} \newline & = \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( \mu(a) - U(a, H_t) \right)^+ \right] \newline & = \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( U(a, H_t) - \mu(a) \right)^- \right] \newline & \leq \underbrace{\sum_{a \in \mathcal{A}}}_{K} \frac{\gamma}{TK} \quad \text{(By the generalized confidence bounds property)} \newline & = \frac{\gamma}{T} \newline \end{align*} $$ Similarly, we can analyze the second term (II), $$ \begin{align*} \mathbb{E}\left[ U(a_t, H_t) - \mu(a_t) \right] & = \mathbb{E}\left[ 2r_t(a_t, H_t) + L(a_t, H_t) - \mu(a_t) \right] \quad \text{(By definition of $r_t(a, H_t)$)} \newline & = \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \mathbb{E}\left[ L(a_t, H_t) - \mu(a_t) \right] \newline & \leq \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \mathbb{E}\left[ \left( L(a_t, H_t) - \mu(a_t) \right)^+ \right] \newline & \leq \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( L(a, H_t) - \mu(a) \right)^+ \right] \newline & = \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( \mu(a) - L(a, H_t) \right)^- \right] \newline & \leq \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \underbrace{\sum_{a \in \mathcal{A}}}_{K} \frac{\gamma}{TK} \quad \text{(By the generalized confidence bounds property)} \newline & = \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \frac{\gamma}{T} \newline \end{align*} $$ Thus, we have, $$ \mathrm{BR}_t \leq 2 \frac{\gamma}{T} + 2 \mathbb{E}[r_t(a_t, H_t)] $$ Then, summing over all rounds $t \in [T]$, we obtain, $$ \mathrm{BR}(T) = \sum_{t \in [T]} \mathrm{BR}_t \leq 2\gamma + 2 \sum_{t \in [T]} \mathbb{E}[r_t(a_t, H_t)] $$
We can now prove the Thompson Sampling Bayesian Regret Bound theorem using this lemma.
Proof: Thompson Sampling Bayesian Regret Bound Theorem Proof
We will use the confidence bounds as defined in the non-Bayesian setting, i.e., $$ \begin{align*} U(a, H_t) & \coloneqq \bar{\mu}_t(a) + r_t(a), \newline L(a, H_t) & \coloneqq \bar{\mu}_t(a) - r_t(a), \newline r_t(a) & \coloneqq \sqrt{\frac{2 \log(T)}{n_t(a)}}. \newline \end{align*} $$ The generalized confidence bound properties hold with $\gamma = 2$; here is a quick proof, $$ \begin{align*} \mathbb{E}[\left( U(a, H_t) - \mu(a) \right)^-] & \leq P\left( U(a, H_t) < \mu(a) \right) \quad \text{(since the negative part is at most 1)} \newline & \leq \frac{1}{T^2} \quad \text{(By Hoeffding’s inequality)} \newline & \leq \frac{1}{TK} \quad \text{(Since $K \leq T$)} \newline \end{align*} $$ Similarly, we can prove the second property. Thus, we can use the lemma with $\gamma = 2$, $$ \begin{align*} \mathrm{BR}(T) & \leq 4 + 2 \sum_{t \in [T]} \mathbb{E}[r_t(a_t, H_t)] \newline & = 4 + 2 \sum_{t \in [T]} \mathbb{E}\left[ \sqrt{\frac{2 \log(T)}{n_t(a_t)}} \right] \newline & \leq 4 + 2 \sqrt{2 \log(T)} \sum_{t \in [T]} \mathbb{E}\left[ \frac{1}{\sqrt{n_t(a_t)}} \right] \newline & \leq \mathcal{O}\left( \sqrt{\log(T)} \right) \sum_{t \in [T]} \mathbb{E}\left[ \frac{1}{\sqrt{n_t(a_t)}} \right] \newline \end{align*} $$ We can rewrite the (random) sum inside the expectations as, $$ \begin{align*} \sum_{t \in [T]} \frac{1}{\sqrt{n_t(a_t)}} & = \sum_{a \in \mathcal{A}} \sum_{t \in [T] \mid a_t = a} \frac{1}{\sqrt{n_t(a)}} \newline & = \sum_{a \in \mathcal{A}} \sum_{i \in [n_{T + 1}(a)]} \frac{1}{\sqrt{i}} \newline & = \sum_{a \in \mathcal{A}} \mathcal{O}\left( \sqrt{n(a)} \right) \newline \end{align*} $$ where $n(a) \coloneqq n_{T + 1}(a)$ is the total number of times arm $a$ is played. Thus, $$ \begin{align*} \mathrm{BR}(T) & \leq \mathcal{O}\left( \sqrt{\log(T)} \right) \mathbb{E}\left[ \sum_{a \in \mathcal{A}} \sqrt{n(a)} \right] \newline & \leq \mathcal{O}\left( \sqrt{\log(T)} \right) \mathbb{E}\left[ \sqrt{K \sum_{a \in \mathcal{A}} n(a)} \right] \quad \text{(by the Cauchy-Schwarz inequality, or by Jensen’s inequality since $\sqrt{\cdot}$ is concave)} \newline & = \mathcal{O}\left( \sqrt{\log(T)} \right) \sqrt{K \cdot T} \quad \text{(since $\sum_{a \in \mathcal{A}} n(a) = T$)} \newline & = \mathcal{O}\left( \sqrt{K T \log(T)} \right) \ _\blacksquare \end{align*} $$
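As a final sanity check (not part of the proof), here is a small simulation with hypothetical parameters that estimates the Bayesian regret of Beta-Bernoulli Thompson Sampling under an independent uniform prior and prints it next to the $\sqrt{K T \log(T)}$ scaling from the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def bayes_regret(T, K=10, n_draws=200):
    """Empirical Bayesian (pseudo-)regret of Beta-Bernoulli TS with mu ~ Uniform[0, 1]^K i.i.d."""
    total = 0.0
    for _ in range(n_draws):
        mu = rng.random(K)                         # instance drawn from the independent uniform prior
        alpha, beta = np.ones(K), np.ones(K)
        for _ in range(T):
            a = int(np.argmax(rng.beta(alpha, beta)))
            r = int(rng.random() < mu[a])
            alpha[a] += r
            beta[a] += 1 - r
            total += mu.max() - mu[a]              # instantaneous pseudo-regret
    return total / n_draws

for T in (100, 400, 1600):
    print(T, round(bayes_regret(T), 1), "vs", round(float(np.sqrt(10 * T * np.log(T))), 1))
```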