Part 4 - Bayesian Bandits

Bayesian Bandit Problem

So far, we have discussed the stochastic bandits model, with a set of KK arms A\mathcal{A}.

In the Bayesian bandit problem, we add the Bayesian assumption.

Definition 1 (Bayesian Bandit Problem)

The Bayesian bandit problem is a stochastic bandit problem where the problem instance J\mathcal{J} is drawn initially from sone known distribution P\mathbb{P}.

The problem instance J\mathcal{J} (often) specifies the mean reward vector μ[0,1]K\mathbf{\mu} \in [0, 1]^K.

The distribution P\mathbb{P} is called the prior distribution, or the Bayesian prior.

As we saw for normal stochastic bandits, the objective is to minimize the expected regret, thus, in the Bayesian setting, we want to minimize the Bayesian expected regret.

Definition 2 (Bayesian Expected Regret)

The Bayesian Regret is the expected regret for a particular problem instance J\mathcal{J}, in expectation over the problem instances,

BR(T)EJP[E[R(T)J]]=EJP[E[μTt[T]μ(at)]]\mathrm{BR}(T) \coloneqq \mathbb{E}_{\mathcal{J} \sim \mathbb{P}}[\mathbb{E}[R(T) \mid \mathcal{J}]] = \mathbb{E}_{\mathcal{J} \sim \mathbb{P}}\left[ \mathbb{E}\left[ \mu^{\star} \cdot T - \sum_{t \in [T]} \mu(a_t) \right] \right]

Further, we will assume that our realized rewards are drawn from a single-parameter family of distributions (e.g., Bernoulli, unit-variance Gaussian, etc.). Thus, the problem instance is completely specified by the mean reward vector μ[0,1]K\mathbf{\mu} \in [0, 1]^K.

The prior P\mathbb{P} is simply a distribution over [0,1]K[0, 1]^K that the μ()\mu(\cdot) is drawn from.

Furthermore, we will assume that our realized rewards can only take finitely many different values, and the prior P\mathbb{P} has a finite support, F\mathcal{F}.

Lastly, for simplicity, we assume that the best arm aa^{\star} is unique for each mean reward vector in the support of P\mathbb{P}.

Terminology and Notation

We will introduce some notation that we will use later for our analysis. The concepts themselves are not new, but we introduce the compact notation to make our analysis easier.

Definition 3 (tt-history (HtH_t))

The sequence of action-reward pairs collected by the agent after round tt, is a random variable defined as,

Ht((a1,r1),,(at,rt))(A×R)tH_t \coloneqq \left( (a_1, r_1), \ldots, (a_t, r_t) \right) \in (\mathcal{A} \times \mathbb{R})^t
Definition 4 (Feasible tt-history (H))

A fixed sequence H=((a1,r1),,(at,rt))(A×R)tH = \left( (a_1, r_1), \ldots, (a_t, r_t) \right) \in (\mathcal{A} \times \mathbb{R})^t is called a feasible tt-history if it has a positive probability of being observed, i.e., P(Ht=H)>0P(H_t = H) > 0.

Further, we call such an algorithm HH-consistent.

Definition 5 (Ht\mathcal{H}_t)

The (finite) set of all feasible tt-histories is denoted by Ht\mathcal{H}_t.

Posterior Distribution

In the Bayesian setting, using Bayes’ rule 1 1Recall that Bayes’ rule states that P(AB)=P(BA)P(A)P(B)P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}, it is powerful to compute the prior and posterior distributions of the problem instance J\mathcal{J}.

Definition 6 (Posterior Distribution (PH(μ~)\mathbb{P}_H(\tilde{\mu})))

The Bayesian posterior distribution after round tt can be defined as,

PH(μ~)P(μ=μ~Ht=H), μ~[0,1]K\mathbb{P}_H(\tilde{\mu}) \coloneqq P(\mu = \tilde{\mu} \mid H_t = H), \ \forall \tilde{\mu} \in [0, 1]^K

Further, we can rewrite the posterior distribution using Bayes’ rule as,

PH(μ~)=P(μ=μ~,Ht=H)P(Ht=H)=P(μ=μ~)P(Ht=Hμ=μ~)μFP(μ=μ)P(Ht=Hμ=μ)\begin{align} \mathbb{P}_H(\tilde{\mu}) &= \frac{P(\mu = \tilde{\mu}, H_t = H)}{P(H_t = H)} \newline &= \frac{P(\mu = \tilde{\mu}) P(H_t = H \mid \mu = \tilde{\mu})}{\sum_{\mu^{\prime} \in \mathcal{F}} P(\mu = \mu^{\prime}) P(H_t = H \mid \mu = \mu^{\prime})} \newline \end{align}
Lemma 1 (Posterior Distribution Lemma)

PH(μ~)\mathbb{P}_H(\tilde{\mu}) is the same for all HH-consistent algorithms.

We will prove this lemma, but to prove it, we need the following definitions and concepts.

Posterior As A New Prior

The posterior PH\mathbb{P}_H can be used as a prior for a subsequent Bayesian update.

Definition 7 (Posterior as a New Prior)

Consider a feasible (t+t)(t + t^{\prime})-history. We represent it as a concatenation of feasible tt-history HH and feasible tt^{\prime}-history HH^{\prime}, where HH comes first.

We can denote the concatenation as HHH \oplus H^{\prime}.

From this, one can perform the Bayesian update and obtain the posterior (PH)H(\mathbb{P}_H)_{H^{\prime}} in two steps,

  1. Condition on HH and derive the posterior PH\mathbb{P}_H, then,

  2. Condition on HH^{\prime} using PH\mathbb{P}_H as the new prior to obtain (PH)H(\mathbb{P}_H)_{H^{\prime}}.

This “two-step” Bayesian update is equivalent to the “one step” update given HHH \oplus H^{\prime}, i.e.,

PHH=(PH)H\mathbb{P}_{H \oplus H^{\prime}} = (\mathbb{P}_H)_{H^{\prime}}

Sequential Bayesian Update

This sequential update property will prove to be very useful for Bayesian bandits.

Theorem 1 (Sequential Bayesian Update Theorem)

Consider round t1t \leq 1 and a feasible tt-history HH. We can write HH as a concatenation of some feasible feasible (t1)(t -1)-history HH^{\prime} and an action-reward pair (at,rt)(a_t, r_t), i.e., H=H(at,rt)H = H^{\prime} \oplus (a_t, r_t).

With an HH-consistent bandit algorithm, π(a)\pi(a), is the probability that the agent assigns each arm aa in round tt given our history HH^{\prime},

π(a)=P(at=aHt1=H)\pi(a) = P(a_t = a \mid H_{t-1} = H^{\prime})

At each round tt, we can treat the previous posterior as a new prior.

We will now prove this for the formulation as stated in Equation (1).

Proof (Sequential Bayesian Update Theorem Proof)

First, we will look at the numerator of Equation (1) (divided by P(Ht1=H)P(H_{t - 1} = H^{\prime})),

P(μ=μ~,Ht=H)P(Ht1=H)=P(μ=μ~,Ht=HH(at,rt))P(Ht1=H)=P(μ=μ~,Ht1=H,(at,rt))P(Ht1=H)=P(μ=μ~,(at,rt)=(at,rt)Ht1=H)P(Ht1=H)P(Ht1=H)=P(μ=μ~,(at,rt)=(at,rt)Ht1=H)=P(μ=μ~Ht1=H)PH(μ~)P((at,rt)=(at,rt)μ=μ~,Ht1=H)=PH(μ~)P(at=atHt1=H)π(a)P(rt=rtμ=μ~,at=at,Ht1=H)Dμ~(a)(r)=PH(μ~)π(a)Dμ~(a)(r)\begin{align*} \frac{P(\mu = \tilde{\mu}, H_t = H)}{P(H_{t - 1} = H^{\prime})} & = \frac{P(\mu = \tilde{\mu}, H_t = \underbrace{H}_{H^{\prime} \oplus (a_t, r_t)})}{P(H_{t - 1} = H^{\prime})} \newline &= \frac{P(\mu = \tilde{\mu}, H_{t - 1} = H^{\prime}, (a_t, r_t))}{P(H_{t - 1} = H^{\prime})} \newline &= \frac{P(\mu = \tilde{\mu}, (a_t, r_t) = (a_t, r_t) \mid H_{t - 1} = H^{\prime}) P(H_{t - 1} = H^{\prime})}{P(H_{t - 1} = H^{\prime})} \newline &= P(\mu = \tilde{\mu}, (a_t, r_t) = (a_t, r_t) \mid H_{t - 1} = H^{\prime}) \newline &= \underbrace{P(\mu = \tilde{\mu} \mid H_{t - 1} = H^{\prime})}_{\mathbb{P}_{H^{\prime}}(\tilde{\mu})} \cdot P((a_t, r_t) = (a_t, r_t) \mid \mu = \tilde{\mu}, H_{t - 1} = H^{\prime}) \newline &= \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \underbrace{P(a_t = a_t \mid H_{t - 1} = H^{\prime})}_{\pi(a)} \cdot \underbrace{P(r_t = r_t \mid \mu = \tilde{\mu}, a_t = a_t, H_{t - 1} = H^{\prime})}_{\mathcal{D}_{\tilde{\mu}(a)}(r)} \newline &= \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \pi(a) \cdot \mathcal{D}_{\tilde{\mu}(a)}(r) \end{align*}

Thus,

P(μ=μ~,Ht=H)=P(Ht1=H)PH(μ~)π(a)Dμ~(a)(r)\boxed{P(\mu = \tilde{\mu}, H_t = H) = P(H_{t - 1} = H^{\prime}) \cdot \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \pi(a) \cdot \mathcal{D}_{\tilde{\mu}(a)}(r)}

Now, consider the denominator of Equation (1),

P(Ht=H)=μFP(μ=μ,Ht=H)=μFP(Ht1=H)PH(μ)π(a)Dμ(a)(r)=P(Ht1=H)π(a)μFPH(μ)Dμ(a)(r)\begin{align*} P(H_t = H) &= \sum_{\mu^{\prime} \in \mathcal{F}} P(\mu = \mu^{\prime}, H_t = H) \newline &= \sum_{\mu^{\prime} \in \mathcal{F}} P(H_{t - 1} = H^{\prime}) \cdot \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \pi(a) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r) \newline &= P(H_{t - 1} = H^{\prime}) \cdot \pi(a) \cdot \sum_{\mu^{\prime} \in \mathcal{F}} \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r) \end{align*}

Thus,

PH(μ~)=P(μ=μ~,Ht=H)P(Ht=H)=P(Ht1=H)PH(μ~)π(a)Dμ~(a)(r)P(Ht1=H)π(a)μFPH(μ)Dμ(a)(r)=PH(μ~)Dμ~(a)(r)μFPH(μ)Dμ(a)(r)\begin{align*} \mathbb{P}_H(\tilde{\mu}) &= \frac{P(\mu = \tilde{\mu}, H_t = H)}{P(H_t = H)} \newline &= \frac{\cancel{P(H_{t - 1} = H^{\prime})} \cdot \mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \cancel{\pi(a)} \cdot \mathcal{D}_{\tilde{\mu}(a)}(r)}{\cancel{P(H_{t - 1} = H^{\prime})} \cdot \cancel{\pi(a)} \cdot \sum_{\mu^{\prime} \in \mathcal{F}} \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r)} \tag{$\star$} \newline &= \frac{\mathbb{P}_{H^{\prime}}(\tilde{\mu}) \cdot \mathcal{D}_{\tilde{\mu}(a)}(r)}{\sum_{\mu^{\prime} \in \mathcal{F}} \mathbb{P}_{H^{\prime}}(\mu^{\prime}) \cdot \mathcal{D}_{\mu^{\prime}(a)}(r)} \newline \end{align*}

Hence, PH\mathbb{P}_{H^{\prime}} is the new prior for the Bayesian update at round tt.

Note

In step ()(\star), we implicitly proved our previous Lemma, that PH(μ~)\mathbb{P}_H(\tilde{\mu}) is the same for all HH-consistent algorithms, since π(a)\pi(a) cancels out.

Independent Priors

The prior P\mathbb{P} is called independent if (μ(a):aA)(\mu(a) : a \in \mathcal{A}) are mutually independent random variables.

This essentially means that our update is even simpler, since we can update each arm’s mean reward distribution separately.

Computational Aspects

With brute-force computation of the posterior PH\mathbb{P}_H at each round tt, the runtime is O(tF)\mathcal{O}(t \cdot |\mathcal{F}|), which is impractical for large tt.

With sequential (Bayesian) updates, the per-round running time is O(F)\mathcal{O}(|\mathcal{F}|), which is much more manageable.

With independent priors, one needs to update only the played arm at each round.

However, with conjugate priors 2, the posterior can be computed directly without summation over all elements in the support F\mathcal{F}.

Thompson Sampling

This is the idea behind the famous Thompson Sampling algorithm.

The idea is that, for each round tt and for each arm aa, our agent computes the posterier probability that aa is the best arm, and samples aa with this probability.

Summary (Thompson Sampling Algorithm)

For each round t[T]t \in [T], do,

  • Observe Ht1=HH_{t - 1} = H, for some feasible (t1)(t - 1)-history HH.
  • Sample mean reward vector μt\mu_t from the posterior distribution PH\mathbb{P}_H.
  • Choose the best arm a~t\tilde{a}_t according to μt\mu_t, i.e., a~targmaxaAμt(a)\tilde{a}_t \in \arg\max_{a \in \mathcal{A}} \mu_t(a).

end

Case Study

A robot is in need of recharging, there are three different power sockets. When tried, each socket randomly returns some charge or gives no charge at all.

We can model the reward from each socket as a Bernoulli random variable, either 1 (charge) or 0 (no charge).

The ideal distribution to model this problem is the Beta distribution, which is a conjugate prior for the Bernoulli distribution.

The Beta distribution has two parameters, α\alpha and β\beta. In the simplest terms, these parameters respectively corresponds to the count of successes and failures (or, more precisely, α1\alpha - 1 and β1\beta - 1).

Initially, we have no idea what the probability is of any given socket producing an output, so we can start by setting both α\alpha and β\beta to 1, which corresponds to a uniform distribution over [0,1][0, 1].

This initial guess at the probability of the socket producing an output corresponds to choosing the prior distribution.

Once the robot tests a socket and obtains a reward, it can modify its belief in the success rate of that socket, i.e., updates the posterior distribution.

The Beta distribution is a conjugate prior for the Bernoulli distribution, so the posterior distribution is also a Beta distribution, with updated parameters.

If the tested socket returns a reward of 1 (success), we increment α\alpha by 1 and leave β\beta unchanged. If the tested socket returns a reward of 0 (failure), we increment β\beta by 1 and leave α\alpha unchanged.

Since actions that have been tried infrequently have wide distributions, they have a larger range of possible values. In this way, a socket that currently has a low estimated mean reward, but has been tested fewer times than a socket with a higher estimated mean, can return a larger sample value and therefore become the selected socket at this time step.

Bayesian Regret Analysis

Recall the definitions we use for analysis in a non-Bayesian setting,

Definition 8 (Confidence Radius and Confidence Bounds)rt(a)2log(T)nt(a),(Confidence Radius)UCBt(a)μˉt(a)+rt(a),(Upper Confidence Bound)LCBt(a)μˉt(a)rt(a),(Lower Confidence Bound)\begin{align*} r_t(a) & \coloneqq \sqrt{\frac{2 \log(T)}{n_t(a)}}, & \text{(Confidence Radius)} \newline \mathrm{UCB}_t(a) & \coloneqq \bar{\mu}_t(a) + r_t(a), & \text{(Upper Confidence Bound)} \newline \mathrm{LCB}_t(a) & \coloneqq \bar{\mu}_t(a) - r_t(a), & \text{(Lower Confidence Bound)} \newline \end{align*}

where nt(a)n_t(a) is the number of times arm aa has been played so far, and μˉt(a)\bar{\mu}_t(a) is the average reward from this arm.

Further, we had (under the clean event E\mathcal{E}),

μ(a)[LCBt(a),UCBt(a)],  with high probability.\mu(a) \in [\mathrm{LCB}_t(a), \mathrm{UCB}_t(a)], \ \text{ with high probability.}

In our bayesian setting, we need to considerer the generalized confidence bounds U(a,Ht)U(a, H_t) and L(a,Ht)L(a, H_t), which depend on the history HtH_t. Further, we want these properties to hold (for some constant γ>0\gamma > 0) 2Here xx^- is the negative portion of the number xx, i.e., x=0x^- = 0 if x0x \geq 0 and x=xx^- = |x| otherwise.,

Definition 9 (Generalized Confidence Bounds)E[(U(a,Ht)μ(a))]γTK,a,tE[(μ(a)L(a,Ht))]γTK,a,t\begin{align*} \mathbb{E}[\left( U(a, H_t) - \mu(a) \right)^-] & \leq \frac{\gamma}{TK}, \quad \forall a,t \newline \mathbb{E}[\left( \mu(a) - L(a, H_t) \right)^-] & \leq \frac{\gamma}{TK}, \quad \forall a,t \newline \end{align*}

Further, the confidence radius is (still) defined as,

rt(a)U(a,Ht)L(a,Ht)2r_t(a) \coloneqq \frac{U(a, H_t) - L(a, H_t)}{2}

Thomson Sampling Bayesian Regret Bound

We can now state the Bayesian regret bound for Thompson Sampling.

Theorem 2 (Thompson Sampling Bayesian Regret Bound)

The Bayesian Regret of Thompson Sampling is,

BR(T)O(KTlog(T))\mathrm{BR}(T) \leq \mathcal{O}\left( \sqrt{K T \log(T)} \right)

We will also state a lemma about the regret bound,

Lemma 2 (Thompson Sampling Bayesian Regret Bound Lemma)

Assume we have lower and upper bound functions that satisfy the generalized confidence bounds properties. For some paramter γ>0\gamma > 0, then, the bayesian regret of Thompson Sampling can be bounded as,

BR(T)2γ+2t[T]E[rt(at,Ht)]\mathrm{BR}(T) \leq 2\gamma + 2 \sum_{t \in [T]} \mathbb{E}[r_t(a_t, H_t)]

We will now prove this lemma.

Proof (Thompson Sampling Bayesian Regret Bound Lemma Proof)

Firstly, we fix a round tt, further, the chosen arm ata_t and the best arm aa^{\star} are identically distributed given our tt-history HtH_t. Thus, for each feasible tt-history HH, we have,

P(at=aHt=H)=P(a=aHt=H), aAP(a_t = a \mid H_t = H) = P(a^{\star} = a \mid H_t = H), \ \forall a \in \mathcal{A}

Hence,

E[U(at,Ht)Ht=H]=E[U(a,Ht)Ht=H]E[U(at,Ht)Ht=H]E[U(a,Ht)Ht=H]=0E[U(at,Ht)U(a,Ht)Ht=H]=0\begin{align*} \mathbb{E}[U(a_t, H_t) \mid H_t = H] & = \mathbb{E}[U(a^{\star}, H_t) \mid H_t = H] \newline \mathbb{E}[U(a_t, H_t) \mid H_t = H] - \mathbb{E}[U(a^{\star}, H_t) \mid H_t = H] & = 0 \newline \mathbb{E}[U(a_t, H_t) - U(a^{\star}, H_t) \mid H_t = H] & = 0 \newline \end{align*}

We now consider the instantaneous regret at round tt,

BRtE[μ(a)μ(at)]=EHHt[E[μ(a)μ(at)Ht=H]](Law of Total Expectation)=EHHt[E[U(at,H)μ(at)+μ(a)U(a,H)Ht=H]]=E[μ(aU(a,Ht)](I)+E[U(at,Ht)μ(at)](II)\begin{align*} \mathrm{BR}_t & \coloneqq \mathbb{E}[\mu(a^{\star}) - \mu(a_t)] \newline & = \mathbb{E}_{H \sim H_t} \left[ \mathbb{E}[\mu(a^{\star}) - \mu(a_t) \mid H_t = H] \right] \quad \text{(Law of Total Expectation)} \newline & = \mathbb{E}_{H \sim H_t} \left[ \mathbb{E}[U(a_t, H) - \mu(a_t) + \mu(a^{\star}) - U(a^{\star}, H) \mid H_t = H] \right] \newline & = \underbrace{\mathbb{E}\left[ \mu(a^{\star} - U(a^{\star}, H_t) \right]}_{\text{(I)}} + \underbrace{\mathbb{E}\left[ U(a_t, H_t) - \mu(a_t) \right]}_{\text{(II)}} \newline \end{align*}

Let’s analyze the first term (I) 3Again, the notation x+x^+ is the positive portion of the number xx, i.e., x+=xx^+ = x if x0x \geq 0 and x+=0x^+ = 0 otherwise.,

E[μ(a)U(a,Ht)]E[(μ(a)U(a,Ht))+]E[aA(μ(a)U(a,Ht))+]=aAE[(μ(a)U(a,Ht))+]=aAE[(U(a,Ht)μ(a))]aAKγTK(By the generalized confidence bounds property)=γT\begin{align*} \mathbb{E}\left[ \mu(a^{\star}) - U(a^{\star}, H_t) \right] & \leq \mathbb{E}\left[ \left( \mu(a^{\star}) - U(a^{\star}, H_t) \right)^+ \right] \newline & \leq \mathbb{E}\left[ \sum_{a \in \mathcal{A}} \left( \mu(a) - U(a, H_t) \right)^+ \right] \newline & = \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( \mu(a) - U(a, H_t) \right)^+ \right] \newline & = \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( U(a, H_t) - \mu(a) \right)^- \right] \newline & \leq \underbrace{\sum_{a \in \mathcal{A}}}_{K} \frac{\gamma}{TK} \quad \text{(By the generalized confidence bounds property)} \newline & = \frac{\gamma}{T} \newline \end{align*}

Similarly, we can analyze the second term (II),

E[U(at,Ht)μ(at)]=E[2rt(at,Ht)+L(at,Ht)μ(at)](By definition of rt(a,Ht))=E[2rt(at,Ht)]+E[L(at,Ht)μ(at)]E[2rt(at,Ht)]+E[(L(at,Ht)μ(at))+]E[2rt(at,Ht)]+aAE[(L(a,Ht)μ(a))+]=E[2rt(at,Ht)]+aAE[(μ(a)L(a,Ht))]E[2rt(at,Ht)]+aAKγTK(By the generalized confidence bounds property)=E[2rt(at,Ht)]+γT\begin{align*} \mathbb{E}\left[ U(a_t, H_t) - \mu(a_t) \right] & = \mathbb{E}\left[ 2r_t(a_t, H_t) + L(a_t, H_t) - \mu(a_t) \right] \quad \text{(By definition of $r_t(a, H_t)$)} \newline & = \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \mathbb{E}\left[ L(a_t, H_t) - \mu(a_t) \right] \newline & \leq \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \mathbb{E}\left[ \left( L(a_t, H_t) - \mu(a_t) \right)^+ \right] \newline & \leq \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( L(a, H_t) - \mu(a) \right)^+ \right] \newline & = \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \sum_{a \in \mathcal{A}} \mathbb{E}\left[ \left( \mu(a) - L(a, H_t) \right)^- \right] \newline & \leq \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \underbrace{\sum_{a \in \mathcal{A}}}_{K} \frac{\gamma}{TK} \quad \text{(By the generalized confidence bounds property)} \newline & = \mathbb{E}\left[ 2r_t(a_t, H_t) \right] + \frac{\gamma}{T} \newline \end{align*}

Thus, we have,

BRt2γT+2E[rt(at,Ht)]\mathrm{BR}_t \leq 2 \frac{\gamma}{T} + 2 \mathbb{E}[r_t(a_t, H_t)]

Then, summing over all rounds t[T]t \in [T], we obtain,

BR(T)=t[T]BRt2γ+2t[T]E[rt(at,Ht)]\mathrm{BR}(T) = \sum_{t \in [T]} \mathrm{BR}_t \leq 2\gamma + 2 \sum_{t \in [T]} \mathbb{E}[r_t(a_t, H_t)]

We can now prove the Thompson Sampling Bayesian Regret Bound theorem using this lemma.

Proof (Thompson Sampling Bayesian Regret Bound Theorem Proof)

We will use the confidence bounds as defined in the non-Bayesian setting, i.e.,

U(a,Ht)μˉt(a)+rt(a),L(a,Ht)μˉt(a)rt(a),rt(a)2log(T)nt(a),\begin{align*} U(a, H_t) & \coloneqq \bar{\mu}_t(a) + r_t(a), \newline L(a, H_t) & \coloneqq \bar{\mu}_t(a) - r_t(a), \newline r_t(a) & \coloneqq \sqrt{\frac{2 \log(T)}{n_t(a)}}, \newline \end{align*}

Our generalized confidence bounds properties hold with γ=2\gamma = 2, quick proof,

E[(U(a,Ht)μ(a))]P(U(a,Ht)<μ(a))1T2(By Hoeffding’s inequality)1TK(Since KT)\begin{align*} \mathbb{E}[\left( U(a, H_t) - \mu(a) \right)^-] & \leq P\left( U(a, H_t) < \mu(a) \right) \newline & \leq \frac{1}{T^2} \quad \text{(By Hoeffding's inequality)} \newline & \leq \frac{1}{TK} \quad \text{(Since $K \leq T$)} \newline \end{align*}

Similarly, we can prove the second property. Thus, we can use the lemma with γ=2\gamma = 2,

BR(T)4+2t[T]E[rt(at,Ht)]=4+2t[T]E[2log(T)nt(at)]4+22log(T)t[T]E[1nt(at)]O(log(T))t[T]E[1nt(at)]\begin{align*} \mathrm{BR}(T) & \leq 4 + 2 \sum_{t \in [T]} \mathbb{E}[r_t(a_t, H_t)] \newline & = 4 + 2 \sum_{t \in [T]} \mathbb{E}\left[ \sqrt{\frac{2 \log(T)}{n_t(a_t)}} \right] \newline & \leq 4 + 2 \sqrt{2 \log(T)} \sum_{t \in [T]} \mathbb{E}\left[ \frac{1}{\sqrt{n_t(a_t)}} \right] \newline & \leq \mathcal{O}\left( \sqrt{\log(T)} \right) \sum_{t \in [T]} \mathbb{E}\left[ \frac{1}{\sqrt{n_t(a_t)}} \right] \newline \end{align*}

We can rewrite out sum as,

t[T]1nt(at)=aAt[T]at=a1nt(a)=aAi[nT+1(a)]1i=aAO(n(a))\begin{align*} \sum_{t \in [T]} \frac{1}{\sqrt{n_t(a_t)}} & = \sum_{a \in \mathcal{A}} \sum_{t \in [T] \mid a_t = a} \frac{1}{\sqrt{n_t(a)}} \newline & = \sum_{a \in \mathcal{A}} \sum_{i \in [n_{T + 1}(a)]} \frac{1}{\sqrt{i}} \newline & = \sum_{a \in \mathcal{A}} \mathcal{O}\left( \sqrt{n(a)} \right) \newline \end{align*}

Thus,

BR(T)O(log(T))aAn(a)O(log(T))KaAn(a)(By Cauchy-Schwarz inequality, (or Jensen’s inequality, since  is concave))=O(log(T))KT=O(KTlog(T)) \begin{align*} \mathrm{BR}(T) & \leq \mathcal{O}\left( \sqrt{\log(T)} \right) \sum_{a \in \mathcal{A}} \sqrt{n(a)} \newline & \leq \mathcal{O}\left( \sqrt{\log(T)} \right) \sqrt{K \sum_{a \in \mathcal{A}} n(a)} \quad \text{(By Cauchy-Schwarz inequality, (or Jensen's inequality, since $\sqrt{\cdot}$ is concave))} \newline & = \mathcal{O}\left( \sqrt{\log(T)} \right) \sqrt{K \cdot T} \newline & = \mathcal{O}\left( \sqrt{K T \log(T)} \right) \ _\blacksquare \end{align*}

Footnotes

  1. Wikipedia: Bayes’ rule

  2. Wikipedia: Conjugate prior