Part 3 - Stochastic Multi-Armed Bandits III

Successive Elimination (K>2K > 2)

In the last part we introduced successive elimination for the case of K=2K=2 arms.

We will extend this to the case of K>2K > 2 arms and introduce another adaptive exploration algorithm.

Summary (Successive Elimination Algorithm (K>2K > 2))

Initially all arms are set “active”;

Each phase,

  • Try all active arms (thus each phase may contain multiple rounds);
  • Deactivate all arms aa such that a\exists a^{\prime} with UCBt(a)<LCBt(a)\mathrm{UCB}_t(a) < \mathrm{LCB}_t(a^{\prime});

Repeat until only one arm remains active.

Further, let aa^{\star} be an optimal arm, and consider any arm aa such that μ(a)<μ(a)\mu(a) < \mu(a^{\star}).

If we analyze the last round t^\hat{t} when we have not yet deactivated arm aa (or, the last round TT if aa is still active at the end).

The confidence intervals of the two arms aa and aa^{\star} at round tt^t \leq \hat{t} must overlap (otherwise, arm aa would have been deactivated),

Δ(a):μ(a)μ(a)2rt(a)+2rt(a)O(rt(a))\Delta(a) \coloneq \mu(a^{\star}) - \mu(a) \leq \underbrace{2 r_t(a) + 2 r_t(a^{\star})}_{\leq \mathcal{O}(r_t(a))}

As we established, arm aa is never played after round t^\hat{t}, then,

nt^(a)=nT(a)rt^(a)=rT(a)\begin{align*} n_{\hat{t}}(a) & = n_T(a) \newline r_{\hat{t}}(a) & = r_T(a) \end{align*}

Thus, we have proved the following crucial property,

Lemma 1 (Successive Elimination Key Property)

For any suboptimal arm aa (i.e., μ(a)<μ(a)\mu(a) < \mu(a^{\star})), if it is still active at the end of round TT, then,

Δ(a)O(rT(a))=O(log(T)nT(a))\Delta(a) \leq \mathcal{O}(r_T(a)) = \mathcal{O}\left(\sqrt{\frac{\log(T)}{n_T(a)}}\right)

This is result is very nice, since it relates the suboptimality gap Δ(a)\Delta(a) to the number of times nT(a)n_T(a) that arm aa has been played. Thus, an arm played many times must be close to optimal!

The rest of our analysis will rely on this result.

Regret Analysis

As per usual, when doing analysis we will construct the clean event and then connect it to the regret (the bad event will be negligible).

Let R(t;a)R(t; a) be the regret contribution of arm aa at an arbitrary round tt.

Thus,

R(t;a)=nt(a)Δ(a)nt(a)O(log(T)nt(a))O(log(T)1/2nt(a)1/2)=O(nt(a)log(T))\begin{align*} R(t; a) & = n_t(a) \cdot \Delta(a) \leq n_t(a) \cdot \underbrace{\mathcal{O}\left(\sqrt{\frac{\log(T)}{n_t(a)}}\right)}_{\mathcal{O}(\frac{\log(T)^{1/2}}{n_t(a)^{1/2}})} \newline & = \mathcal{O}\left(\sqrt{n_t(a) \log(T)}\right) \end{align*}

We now introduce the set A+={aμ(a)<μ(a)}\mathcal{A}^{+} = \{a \mid \mu(a) < \mu(a^{\star})\} the set of all arms that contribute to the regret (i.e., all suboptimal arms).

Then, we can bound the total regret at round tt as,

R(t)=aA+R(t;a)=aA+O(nt(a)log(T))=O(log(T))does not depend on sumaA+nt(a)O(log(T))aAnt(a)\begin{align*} R(t) & = \sum_{a \in \mathcal{A}^{+}} R(t; a) \newline & = \sum_{a \in \mathcal{A}^{+}} \mathcal{O}\left(\sqrt{n_t(a) \log(T)}\right) \newline & = \underbrace{\mathcal{O}(\sqrt{\log(T)})}_{\text{does not depend on sum}} \cdot \sum_{a \in \mathcal{A}^{+}} \sqrt{n_t(a)} \newline & \leq \mathcal{O}(\sqrt{\log(T)}) \cdot \sum_{a \in \mathcal{A}} \sqrt{n_t(a)} \end{align*}

By Jensen’s inequality 1 1Here we will use the concave version which states, 1ni=1nf(xi)f(1ni=1nxi)\frac{1}{n} \sum_{i=1}^{n} f(x_i) \leq f\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) for a concave function ff, and the fact that aAnt(a)=t\sum_{a \in \mathcal{A}} n_t(a) = t, we have that,

1KaAnt(a)1KaAnt(a)=tK\frac{1}{K} \sum_{a \in \mathcal{A}} \sqrt{n_t(a)} \leq \sqrt{\frac{1}{K} \sum_{a \in \mathcal{A}} n_t(a)} = \sqrt{\frac{t}{K}}

Thus,

aAnt(a)KtK=Kt\sum_{a \in \mathcal{A}} \sqrt{n_t(a)} \leq K \sqrt{\frac{t}{K}} = \sqrt{K t}

Thus, we have our final regret bound,

Theorem 1 (Successive Elimination Regret Bound (K>2K > 2))

For K>2K > 2 arms, the successive elimination algorithm achieves the regret bound,

E[R(T)]O(Ktlog(T))for all rounds tT\mathbb{E}[R(T)] \leq \mathcal{O}(\sqrt{K t \log(T)}) \quad \text{for all rounds } t \leq T

But there is another bound we can derive, from our previous property, one may rearrange to obtain,

Δ(a)O(log(T)nT(a))    nT(a)O(log(T)(Δ(a))2)\Delta(a) \leq \mathcal{O}\left( \sqrt{\frac{\log(T)}{n_T(a)}}\right) \implies n_T(a) \leq \mathcal{O}\left(\frac{\log(T)}{(\Delta(a))^2}\right)

If we now consider the set A+\mathcal{A}^{+} of all suboptimal arms again, we can bound the regret for each arm aA+a \in \mathcal{A}^{+} as,

R(T;a)=Δ(a)nT(a)Δ(a)O(log(T)(Δ(a))2)=O(log(T)Δ(a))R(T; a) = \Delta(a) \cdot n_T(a) \leq \Delta(a) \cdot \mathcal{O}\left(\frac{\log(T)}{(\Delta(a))^2}\right) = \mathcal{O}\left(\frac{\log(T)}{\Delta(a)}\right)

Summing over all suboptimal arms, we have,

R(T)O(log(T))aA+1Δ(a)R(T) \leq \mathcal{O}(\log(T)) \cdot \sum_{a \in \mathcal{A}^{+}} \frac{1}{\Delta(a)}
Theorem 2 (Successive Elimination Regret Bound (K>2K > 2))

For K>2K > 2 arms, the successive elimination algorithm achieves the regret bound,

E[R(T)]O(log(T))aA+1Δ(a)\mathbb{E}[R(T)] \leq \mathcal{O}(\log(T)) \cdot \sum_{a \in \mathcal{A}^{+}} \frac{1}{\Delta(a)}

Instance-Dependent vs. Instance-Independent Bounds

We have now derived two different regret bounds for the successive elimination algorithm. The first one is an instance-independent bound, since it does not depend on the gaps Δ(a)\Delta(a), while the second one is instance-dependent, since it does depend on the gaps.

These bounds are lograithmic in TT, with a constant that can be arbitrarily large (depending on the problem instance).

Thus, we conclude that adaptive exploration can lead to much better performance than uniform exploration (logarithmic vs. sublinear regret).

UCB and Optimism Under Uncertainty

Another adaptive exploration algorithm is the Upper Confidence Bound (UCB) algorithm.

The idea is to assume that each arm is as good as it can possiblyt be given the observations so far, and the agent chooses the best arm based on these optimistic estimates.

Summary (UCB Algorithm)

Try each arm once;

In each round tt, pick argmaxaA UCBt(a)\underset{a \in \mathcal{A}}{\arg\max} \ \mathrm{UCB}_t(a), where UCBt(a)=μˉt(a)+rt(a)\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + r_t(a).

But why does this UCB-based strategy work?

Imagine that an arm aa is chosen in round tt because it has a large UCBt(a)=μˉt(a)+rt(a)\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + r_t(a).

Then, either,

  • The average reward μˉt(a)\bar{\mu}_t(a) is large     \implies arm aa is likely to have a high reward μ(a)\mu(a),
  • and/or the confidence radius rt(a)r_t(a) is large     \implies arm aa has not been tried often, and thus we are uncertain about its true reward μ(a)\mu(a).

Either reason makes the arm aa a good candidate to try!

Note (Adaptive Exploration-Exploitation)

Note that, not only is this UCB-based algorithm an adaptive exploration algorithm, it is also an adaptive exploration-exploitation algorithm.

The μˉt(a)\bar{\mu}_t(a) term encourages exploitation, while the rt(a)r_t(a) term encourages exploration, and summing them up (i.e., taking the UCB) is a natural way to trade-off exploration and exploitation adaptively.

Analysis

Recall the clean event (from reward tape idea),

E{atμˉt(a)μ(a)rt(a)}.\mathcal{E} \coloneqq \{\forall a \forall t |\bar{\mu}_t(a) - \mu(a) | \leq r_t(a) \}.

Assume aa^{\star} be an optimal arm and ata_t is the arm chosen by the agent at round tt, i.e.,

UCBt(at)UCBt(a)\mathrm{UCB}_t(a_t) \geq \mathrm{UCB}_t(a^{\star})

We know that, under the clean event,

μ(at)+rt(at)μˉt(at)\mu(a_t) + r_t(a_t) \geq \bar{\mu}_t(a_t)

By definition of UCB and combining the two inequalities, we have,

UCBt(a)μ(a)\mathrm{UCB}_t(a^{\star}) \geq \mu(a^{\star})

Then,

μ(at)+2rt(at)μˉt(at)+rt(at)μ(at)+2rt(at)rt(at)μˉt(at)+rt(at)rt(at)μ(at)+rt(at)UCBt(at)μˉt(a)UCBt(at)UCBt(a)by our assumptionμˉt(a)\begin{align*} \mu(a_t) + 2 r_t(a_t) & \geq \bar{\mu}_t(a_t) + r_t(a_t) \newline \mu(a_t) + 2 r_t(a_t) - r_t(a_t) & \geq \bar{\mu}_t(a_t) + r_t(a_t) - r_t(a_t) \newline \underbrace{\mu(a_t) + r_t(a_t)}_{\mathrm{UCB}_t(a_t)} & \geq \bar{\mu}_t(a^{\star}) \newline \mathrm{UCB}_t(a_t) & \underbrace{\geq \mathrm{UCB}_t(a^{\star})}_{\text{by our assumption}} \geq \bar{\mu}_t(a^{\star}) \newline \end{align*}

Using this chain of inequalities, we find that,

μ(at)+2rt(at)μ(a),\mu(a_t) + 2 r_t(a_t) \geq \mu(a^{\star}),

Rearranging this we have found an expression of the gap Δ(at)\Delta(a_t),

Δ(at):μ(a)μ(at)2rt(at)=22log(T)nt(at)\Delta(a_t) \coloneq \mu(a^{\star}) - \mu(a_t) \leq 2 r_t(a_t) = 2 \sqrt{\frac{2 \log(T)}{n_t(a_t)}}

Now (again), for each arm aa, consider the last round t^\hat{t} when arm aa was chosen (or round TT if it was chosen in the last round) by the agent,

Δ(a)O(log(T)nt^(a))=O(log(T)nT(a))\Delta(a) \leq \mathcal{O}\left(\sqrt{\frac{\log(T)}{n_{\hat{t}}(a)}}\right) = \mathcal{O}\left(\sqrt{\frac{\log(T)}{n_T(a)}}\right)

Thus, we have proved the same key property as for successive elimination!

One can use the exact same analysis as for successive elimination to obtain the same two regret bounds.

Theorem 3 (UCB Regret Bound (K>2K > 2))

For K>2K > 2 arms, the UCB algorithm achieves the regret bounds,

E[R(T)]O(Ktlog(T))for all rounds tT\mathbb{E}[R(T)] \leq \mathcal{O}(\sqrt{K t \log(T)}) \quad \text{for all rounds } t \leq TE[R(T)]O(log(T))aA+1Δ(a)\mathbb{E}[R(T)] \leq \mathcal{O}(\log(T)) \cdot \sum_{a \in \mathcal{A}^{+}} \frac{1}{\Delta(a)}

Lower Bound

We have so far only discussed upper bounds on the regret. But what about lower bounds?

The performance of any algorithm is determined by the distance between the optimal arm and other arms. But, hard problems have arms with similar distributions.

These distances are formally described by the gaps Δ(a)\Delta(a) and the distribution distances KL(Da,Da)\mathrm{KL}(\mathcal{D}_a, \mathcal{D}_{a^{\star}}) 2Here, KL(,)\mathrm{KL}(\cdot, \cdot) is the Kullback-Leibler divergence between two distributions, xXP(x)log(P(x)Q(x))\sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right) .

We will end this part of the course wit the Lai and Robbins lower bound theorem,

Theorem 4 (Lai and Robbins Lower Bound)

The (asymptotic) total regret is at least logarithmic in number of steps,

limTR(T)log(T)aΔ(a)>0Δ(a)KL(Da,Da)\lim_{T \to \infty} R(T) \geq \log(T) \cdot \sum_{a \mid \Delta(a) > 0} \frac{\Delta(a)}{\mathrm{KL}(\mathcal{D}_a, \mathcal{D}_{a^{\star}})}

Footnotes

  1. Wikipedia: Jensen’s Inequality