Successive Elimination ($K > 2$)
In the last part we introduced successive elimination for the case of $K=2$ arms.
We will extend this to the case of $K > 2$ arms and introduce another adaptive exploration algorithm.
Summary: Successive Elimination Algorithm ($K > 2$)
Initially all arms are set “active”;
Each phase,
- Try all active arms (thus each phase may contain multiple rounds);
- Deactivate all arms $a$ such that $\exists a^{\prime}$ with $\mathrm{UCB}_t(a) < \mathrm{LCB}_t(a^{\prime})$;
Repeat until only one arm remains active.
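To make the phase structure concrete, here is a minimal simulation sketch of the algorithm above. The Bernoulli reward model, the confidence radius $\sqrt{2 \log(T) / n_t(a)}$, and all names in the snippet are illustrative assumptions rather than part of the formal setup.

```python
import math
import random

def successive_elimination(means, T, seed=0):
    """Minimal sketch: Bernoulli arms with the given true means, horizon T (assumes T >= len(means))."""
    rng = random.Random(seed)
    K = len(means)
    active = set(range(K))
    counts = [0] * K    # n_t(a): number of pulls of arm a so far
    sums = [0.0] * K    # cumulative reward of arm a so far
    t = 0
    while t < T:
        # One phase: try every active arm once.
        for a in list(active):
            if t >= T:
                break
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            counts[a] += 1
            t += 1
        # Confidence bounds with the assumed radius sqrt(2 log T / n_t(a)).
        ucb, lcb = {}, {}
        for a in active:
            radius = math.sqrt(2 * math.log(T) / counts[a])
            mean = sums[a] / counts[a]
            ucb[a], lcb[a] = mean + radius, mean - radius
        # Deactivate every arm whose UCB falls below some other arm's LCB.
        best_lcb = max(lcb.values())
        active = {a for a in active if ucb[a] >= best_lcb}
    return max(active, key=lambda a: sums[a] / counts[a])

print(successive_elimination([0.3, 0.5, 0.7], T=10_000))
```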
Now, let $a^{\star}$ be an optimal arm, and consider any arm $a$ such that $\mu(a) < \mu(a^{\star})$.
Let $\hat{t}$ be the last round at which arm $a$ has not yet been deactivated (or the last round $T$ if $a$ is still active at the end).
The confidence intervals of the two arms $a$ and $a^{\star}$ at round $t \leq \hat{t}$ must overlap (otherwise, arm $a$ would have been deactivated),
$$ \Delta(a) \coloneqq \mu(a^{\star}) - \mu(a) \leq \underbrace{2 r_t(a) + 2 r_t(a^{\star})}_{\leq \mathcal{O}(r_t(a))} $$
(The last estimate uses that all active arms are played once per phase, so $n_t(a)$ and $n_t(a^{\star})$ differ by at most one and hence $r_t(a^{\star}) = \mathcal{O}(r_t(a))$.)
Since arm $a$ is never played after round $\hat{t}$ (it is deactivated, or the horizon is reached), we have,
$$ \begin{align*} n_{\hat{t}}(a) & = n_T(a) \newline r_{\hat{t}}(a) & = r_T(a) \end{align*} $$
Thus, we have proved the following crucial property,
Lemma: Successive Elimination Key Property
For any suboptimal arm $a$ (i.e., $\mu(a) < \mu(a^{\star})$), whether or not it is still active at the end of round $T$, we have $$ \Delta(a) \leq \mathcal{O}(r_T(a)) = \mathcal{O}\left(\sqrt{\frac{\log(T)}{n_T(a)}}\right) $$
This result is very nice, since it relates the suboptimality gap $\Delta(a)$ to the number of times $n_T(a)$ that arm $a$ has been played. Thus, an arm played many times must be close to optimal!
The rest of our analysis will rely on this result.
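As a quick numerical illustration of the lemma, the snippet below evaluates the bound for a few pull counts, using an assumed radius of $\sqrt{2 \log(T)/n_T(a)}$ and ignoring hidden constants.

```python
import math

T = 10_000
# Assumed radius: r_T(a) = sqrt(2 log(T) / n_T(a)).
for n in [10, 100, 1_000, 5_000]:
    radius = math.sqrt(2 * math.log(T) / n)
    print(f"n_T(a) = {n:>5}: gap bound O(r_T(a)) is on the order of {radius:.3f}")
```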
Regret Analysis
As usual, we will condition on the clean event and then connect it to the regret (the contribution of the bad event is negligible).
Let $R(t; a)$ be the regret contribution of arm $a$ at an arbitrary round $t$.
Thus,
$$ \begin{align*} R(t; a) & = n_t(a) \cdot \Delta(a) \leq n_t(a) \cdot \underbrace{\mathcal{O}\left(\sqrt{\frac{\log(T)}{n_t(a)}}\right)}_{\mathcal{O}(\frac{\log(T)^{1/2}}{n_t(a)^{1/2}})} \newline & = \mathcal{O}\left(\sqrt{n_t(a) \log(T)}\right) \end{align*} $$
We now introduce $\mathcal{A}^{+} = \{a \mid \mu(a) < \mu(a^{\star})\}$, the set of all arms that contribute to the regret (i.e., all suboptimal arms).
Then, we can bound the total regret at round $t$ as,
$$ \begin{align*} R(t) & = \sum_{a \in \mathcal{A}^{+}} R(t; a) \newline & = \sum_{a \in \mathcal{A}^{+}} \mathcal{O}\left(\sqrt{n_t(a) \log(T)}\right) \newline & = \underbrace{\mathcal{O}(\sqrt{\log(T)})}_{\text{does not depend on } a} \cdot \sum_{a \in \mathcal{A}^{+}} \sqrt{n_t(a)} \newline & \leq \mathcal{O}(\sqrt{\log(T)}) \cdot \sum_{a \in \mathcal{A}} \sqrt{n_t(a)} \end{align*} $$
By Jensen’s inequality (applied to the concave function $\sqrt{\cdot}$), and the fact that $\sum_{a \in \mathcal{A}} n_t(a) = t$, we have that,
$$ \frac{1}{K} \sum_{a \in \mathcal{A}} \sqrt{n_t(a)} \leq \sqrt{\frac{1}{K} \sum_{a \in \mathcal{A}} n_t(a)} = \sqrt{\frac{t}{K}} $$
Thus,
$$ \sum_{a \in \mathcal{A}} \sqrt{n_t(a)} \leq K \sqrt{\frac{t}{K}} = \sqrt{K t} $$
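As a quick sanity check of this Jensen step: for any nonnegative counts summing to $t$, the left-hand side indeed never exceeds $\sqrt{Kt}$. The snippet below tests this on a few random allocations (the allocation scheme is purely illustrative).

```python
import math
import random

rng = random.Random(0)
K, t = 10, 1_000
for _ in range(5):
    # Split t pulls across K arms at random cut points.
    cuts = sorted(rng.randint(0, t) for _ in range(K - 1))
    counts = [hi - lo for lo, hi in zip([0] + cuts, cuts + [t])]
    lhs = sum(math.sqrt(n) for n in counts)
    rhs = math.sqrt(K * t)
    print(f"sum_a sqrt(n_t(a)) = {lhs:6.2f} <= sqrt(K t) = {rhs:.2f}")
```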
Combining this with the per-arm bound above, we obtain our final regret bound,
Theorem: Successive Elimination Regret Bound ($K > 2$, Instance-Independent)
For $K > 2$ arms, the successive elimination algorithm achieves the regret bound, $$ \mathbb{E}[R(t)] \leq \mathcal{O}(\sqrt{K t \log(T)}) \quad \text{for all rounds } t \leq T $$
But there is another bound we can derive: rearranging the key property above, one obtains,
$$ \Delta(a) \leq \mathcal{O}\left( \sqrt{\frac{\log(T)}{n_T(a)}}\right) \implies n_T(a) \leq \mathcal{O}\left(\frac{\log(T)}{(\Delta(a))^2}\right) $$
If we now consider the set $\mathcal{A}^{+}$ of all suboptimal arms again, we can bound the regret for each arm $a \in \mathcal{A}^{+}$ as,
$$ R(T; a) = \Delta(a) \cdot n_T(a) \leq \Delta(a) \cdot \mathcal{O}\left(\frac{\log(T)}{(\Delta(a))^2}\right) = \mathcal{O}\left(\frac{\log(T)}{\Delta(a)}\right) $$
Summing over all suboptimal arms, we have,
$$ R(T) \leq \mathcal{O}(\log(T)) \cdot \sum_{a \in \mathcal{A}^{+}} \frac{1}{\Delta(a)} $$
Theorem: Successive Elimination Regret Bound ($K > 2$, Instance-Dependent)
For $K > 2$ arms, the successive elimination algorithm achieves the regret bound, $$ \mathbb{E}[R(T)] \leq \mathcal{O}(\log(T)) \cdot \sum_{a \in \mathcal{A}^{+}} \frac{1}{\Delta(a)} $$
Instance-Dependent vs. Instance-Independent Bounds
We have now derived two different regret bounds for the successive elimination algorithm. The first one is an instance-independent bound, since it does not depend on the gaps $\Delta(a)$, while the second one is instance-dependent, since it does depend on the gaps.
The instance-dependent bound is logarithmic in $T$, but with a constant that can be arbitrarily large (depending on the problem instance).
Thus, we conclude that adaptive exploration can lead to much better performance than uniform exploration (logarithmic regret rather than the larger, polynomial-in-$T$ regret of uniform exploration).
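To see the difference concretely, the sketch below evaluates both bounds (with the hidden constants dropped, which is an assumption of this illustration) on a made-up instance.

```python
import math

# Toy instance (assumed): one optimal arm plus suboptimal arms with these gaps.
gaps = [0.1, 0.2, 0.4]          # Delta(a) for each suboptimal arm
K = len(gaps) + 1
T = 100_000

instance_independent = math.sqrt(K * T * math.log(T))          # sqrt(K T log T)
instance_dependent = math.log(T) * sum(1 / d for d in gaps)    # log T * sum 1/Delta(a)

print(f"instance-independent bound ~ {instance_independent:,.0f}")
print(f"instance-dependent bound   ~ {instance_dependent:,.0f}")
```

For this instance the gap-dependent bound is far smaller, but letting the gaps shrink toward zero blows it up, while the gap-free bound is unaffected.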
UCB and Optimism Under Uncertainty
Another adaptive exploration algorithm is the Upper Confidence Bound (UCB) algorithm.
The idea is to assume that each arm is as good as it can possibly be given the observations so far; the agent then chooses the best arm based on these optimistic estimates.
Summary: UCB Algorithm
Try each arm once;
In each round $t$, pick $\underset{a \in \mathcal{A}}{\arg\max} \ \mathrm{UCB}_t(a)$, where $\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + r_t(a)$.
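Here is a minimal simulation sketch of this rule, again under the assumed Bernoulli reward model and radius $\sqrt{2 \log(T)/n_t(a)}$ used above; the function and variable names are illustrative.

```python
import math
import random

def ucb_algorithm(means, T, seed=0):
    """Minimal UCB sketch: Bernoulli arms with the given true means, horizon T."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K   # n_t(a)
    sums = [0.0] * K   # cumulative reward of arm a
    for t in range(T):
        if t < K:
            a = t  # try each arm once
        else:
            # Pick the arm with the largest optimistic estimate UCB_t(a).
            a = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(T) / counts[i]),
            )
        sums[a] += 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
    return counts  # how often each arm was pulled

print(ucb_algorithm([0.3, 0.5, 0.7], T=10_000))
```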
But why does this UCB-based strategy work?
Imagine that an arm $a$ is chosen in round $t$ because it has a large $\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + r_t(a)$.
Then, either,
- The average reward $\bar{\mu}_t(a)$ is large $\implies$ arm $a$ is likely to have a high reward $\mu(a)$,
- and/or the confidence radius $r_t(a)$ is large $\implies$ arm $a$ has not been tried often, and thus we are uncertain about its true reward $\mu(a)$.
Either reason makes the arm $a$ a good candidate to try!
Note: Adaptive Exploration-Exploitation
Note that this UCB-based algorithm is not only an adaptive exploration algorithm, it is also an adaptive exploration-exploitation algorithm.
The $\bar{\mu}_t(a)$ term encourages exploitation, while the $r_t(a)$ term encourages exploration, and summing them up (i.e., taking the UCB) is a natural way to trade-off exploration and exploitation adaptively.
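A small numeric illustration of this trade-off (the empirical means and pull counts below are made up): an arm with a lower empirical mean can still have the larger UCB simply because it has been tried less often.

```python
import math

T = 10_000
# Hypothetical statistics for two arms: (empirical mean, number of pulls).
arms = {"well-explored": (0.70, 2_000), "barely-explored": (0.55, 20)}

for name, (mean, n) in arms.items():
    radius = math.sqrt(2 * math.log(T) / n)   # exploration bonus r_t(a)
    print(f"{name:>15}: mu_bar = {mean:.2f}, r = {radius:.2f}, UCB = {mean + radius:.2f}")
```

Here the barely-explored arm would be chosen next, purely because of its large confidence radius.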
Analysis
Recall the clean event (from the reward-tape idea),
$$ \mathcal{E} \coloneqq \{\forall a\ \forall t : |\bar{\mu}_t(a) - \mu(a)| \leq r_t(a) \}. $$
Let $a^{\star}$ be an optimal arm and let $a_t$ be the arm chosen by the agent at round $t$. Since the agent picks the arm with the largest upper confidence bound,
$$ \mathrm{UCB}_t(a_t) \geq \mathrm{UCB}_t(a^{\star}) $$
We know that, under the clean event,
$$ \mu(a_t) + r_t(a_t) \geq \bar{\mu}_t(a_t) $$
Similarly, applying the clean event to $a^{\star}$ together with the definition of the UCB, we have,
$$ \mathrm{UCB}_t(a^{\star}) \geq \mu(a^{\star}) $$
Chaining these facts together,
$$ \begin{align*} \mu(a_t) + 2 r_t(a_t) & \underbrace{\geq \bar{\mu}_t(a_t) + r_t(a_t)}_{\text{clean event}} = \mathrm{UCB}_t(a_t) \newline & \underbrace{\geq \mathrm{UCB}_t(a^{\star})}_{\text{by our assumption}} \newline & \underbrace{\geq \mu(a^{\star})}_{\text{shown above}} \end{align*} $$
Using this chain of inequalities, we find that,
$$ \mu(a_t) + 2 r_t(a_t) \geq \mu(a^{\star}), $$
Rearranging, we obtain a bound on the gap $\Delta(a_t)$,
$$ \Delta(a_t) \coloneqq \mu(a^{\star}) - \mu(a_t) \leq 2 r_t(a_t) = 2 \sqrt{\frac{2 \log(T)}{n_t(a_t)}} $$
Now (again), for each arm $a$, consider the last round $\hat{t}$ at which arm $a$ was chosen by the agent. Since $a$ is not played after round $\hat{t}$, we have $n_{\hat{t}}(a) = n_T(a)$, and therefore,
$$ \Delta(a) \leq \mathcal{O}\left(\sqrt{\frac{\log(T)}{n_{\hat{t}}(a)}}\right) = \mathcal{O}\left(\sqrt{\frac{\log(T)}{n_T(a)}}\right) $$
Thus, we have proved the same key property as for successive elimination!
One can use the exact same analysis as for successive elimination to obtain the same two regret bounds.
Theorem: UCB Regret Bound ($K > 2$)
For $K > 2$ arms, the UCB algorithm achieves the regret bounds, $$ \mathbb{E}[R(t)] \leq \mathcal{O}(\sqrt{K t \log(T)}) \quad \text{for all rounds } t \leq T $$ $$ \mathbb{E}[R(T)] \leq \mathcal{O}(\log(T)) \cdot \sum_{a \in \mathcal{A}^{+}} \frac{1}{\Delta(a)} $$
Lower Bound
We have so far only discussed upper bounds on the regret. But what about lower bounds?
The performance of any algorithm is governed by how well separated the optimal arm is from the other arms; the hard problem instances are precisely those in which the arms have similar reward distributions.
These separations are formally described by the gaps $\Delta(a)$ and by the KL divergences $\mathrm{KL}(\mathcal{D}_a, \mathcal{D}_{a^{\star}})$ between the reward distributions.
We will end this part of the course with the Lai and Robbins lower bound theorem,
Theorem: Lai and Robbins Lower Bound
The (asymptotic) expected total regret grows at least logarithmically in the number of rounds, $$ \liminf_{T \to \infty} \frac{\mathbb{E}[R(T)]}{\log(T)} \geq \sum_{a \mid \Delta(a) > 0} \frac{\Delta(a)}{\mathrm{KL}(\mathcal{D}_a, \mathcal{D}_{a^{\star}})} $$
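For Bernoulli reward distributions the KL divergence has a simple closed form, so the constant in this lower bound is easy to evaluate. The sketch below does so for an assumed toy instance (the means are arbitrary).

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Assumed toy instance: optimal arm with mean 0.7, suboptimal arms below it.
mu_star = 0.7
suboptimal_means = [0.3, 0.5, 0.6]

constant = sum((mu_star - mu) / kl_bernoulli(mu, mu_star) for mu in suboptimal_means)
print(f"Lai-Robbins constant (sum of Delta(a) / KL): {constant:.2f}")
```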