Part 2 - Stochastic Multi-Armed Bandits II

ϵ\epsilon-Greedy Algorithm

In the last part, we understood the background and the related tools for formulating and solving the multi-armed bandit problem.

We formulated and analyzed the Explore-First algorithm and found the upper bound on the regret to be E[R(T)]=O(T2/3(Klog(T))1/3).\mathbb{E}[R(T)] = \mathcal{O}(T^{2/3} (K \log(T))^{1/3}).

However, the Explore-First algorithm has poor performance during the exploration phase.

So an idea is to spread exploration more uniformly over time.

Summary (ϵ\epsilon-Greedy Algorithm)
  • for each round t=1,2,,Tt = 1, 2, \ldots, T do
    • Toss a coin with success probability ϵt\epsilon_t;
    • if success then
      • explore: choose an arm uniformly at random from [K][K];
    • else
      • exploit: choose the arm with the highest average reward so far (i.e., argmaxi[K]μ^i(t)\arg\max_{i \in [K]} \hat{\mu}_i(t)).
  • end
Example of -Greedy Algorithm
Example of -Greedy Algorithm

Greedy VS. ϵ\epsilon-Greedy

The advantage of the ϵ\epsilon-Greedy over a pure greedy approach depends on the task.

Suppose we are dealing with a large reward variance, say 10 instead of 1 (what we had assumed so far). With noisier rewards, it will take more time to explore and find the optimal arm, thus, ϵ\epsilon-greedy becomes even more effective.

However, if the reward variance is very small or zero, then the greed method would be more effective, and in the case of zero, know the true value of each action after trying it once!

Theorem 1 (ϵ\epsilon-Greedy Regret Bound)

ϵ\epsilon-greedy algorithm with exploration probability ϵt=t1/3(Klog(K))1/3\epsilon_t = t^{-1/3} (K \log(K))^{1/3} achieves the regret bound,

E[R(T)]t2/3O((Klog(K))1/3)\mathbb{E}[R(T)] \leq t^{2/3} \mathcal{O}((K \log(K))^{1/3})

for each round tt.

We will not prove this, but this becomes consistent with Explore-First for t=Tt = T.

Adaptive Exploration

So far, both the exploration-first and ϵ\epsilon-greedy exploration schedules, do not depend on the history of the observed rewards.

Why would this be a good idea?

Suppose we have K=3K=3, we could have these two scenarios,

μ={0.8,0.75,0.3}μ={0.8,0.1,0.05}\begin{align*} \mathbf{\mu} &= \{0.8, 0.75, 0.3\} \newline \mathbf{\mu} &= \{0.8, 0.1, 0.05\} \end{align*}

In the first case, we need to explore more to distinguish between the first two arms, while in the second case, we can quickly identify the best arm (after a few trials). But this is not captured by the previous algorithms.

Formulating Adaptive Exploration

Let’s formulate the setup for adaptive exploration.

For a fixed round tt, let nt(a)n_t(a) be the number of samples drawn from arm aa in round 1,2,,t1, 2, \ldots, t and μˉt(a)\bar{\mu}_t(a) be the average reward of arm aa so far.

The first thing we would like to try is to see if we can use the Hoeffding’s inequality to construct confidence intervals around the empirical means,

P(μˉt(a)μ(a)rt(a))12T4wherert(a)=2log(T)nt(a),P(|\bar{\mu}_t(a) - \mu(a)| \leq r_t(a)) \geq 1 - 2 T^{-4} \quad \text{where} \quad r_t(a) = \sqrt{\frac{2 \log(T)}{n_t(a)}},

However, nt(a)n_t(a) is a random variable, we will need a more careful argument.

Reward Tape

We define the reward tape for arm aa as an 1×T1 \times T table with each cell independently sampled from Da\mathcal{D}_a.

The reward tape encodes rewards the jj-th time a given arm aa is chosen by our agent and its reward is taken from the jj-th cell in this arm’s reward type.

Thus, we can define the average reward of arm aa from the first jj times that arm aa is chosen. We denote this by vˉj(a)\bar{v}_j(a).

Let’s now try to apply Hoeffding’s inequality.

Assume that nt(a)=jn_t(a) = j is fixed, then we can apply Hoeffding’s inequality to get,

jP(vˉj(a)μ(a)rt(a))12T4\forall j \quad P(|\bar{v}_j(a) - \mu(a)| \leq r_t(a)) \geq 1 - 2 T^{-4}

We can now take a union bound over all j=1,2,,tj = 1, 2, \ldots, t to get,

P(jt,vˉj(a)μ(a)rt(a))12tT4tT12T3.P(\forall j \leq t, |\bar{v}_j(a) - \mu(a)| \leq r_t(a)) \geq \underbrace{1 - 2 t T^{-4}}_{t \leq T} \geq 1 - 2 T^{-3}.

Further, we can apply a second union bound over all arms a[K]a \in [K] to get,

P(a[K],jt,vˉj(a)μ(a)rt(a))12KT3KT12T2.P(\forall a \in [K], \forall j \leq t, |\bar{v}_j(a) - \mu(a)| \leq r_t(a)) \geq \underbrace{1 - 2 K T^{-3}}_{K \leq T} \geq 1 - 2 T^{-2}.

Thus, the above equation implies the clean event,

E{a[K],jt,vˉj(a)μ(a)rt(a)}\mathcal{E} \coloneqq \{\forall a \in [K], \forall j \leq t, |\bar{v}_j(a) - \mu(a)| \leq r_t(a) \}
Lemma 1 (Clean Event for Adaptive Exploration)

The clean event E\mathcal{E} holds with probability at least 12T21 - 2 T^{-2}.

Upper and Lower Confidence Bounds

With adaptive exploration, we can now define the upper and lower confidence bounds for each arm aa at round tt as,

UCBt(a)=μˉt(a)+rt(a)LCBt(a)=μˉt(a)rt(a)\begin{align*} \mathrm{UCB}_t(a) & = \bar{\mu}_t(a) + r_t(a) \newline \mathrm{LCB}_t(a) & = \bar{\mu}_t(a) - r_t(a) \end{align*}

We call the interval [LCBt(a),UCBt(a)][\mathrm{LCB}_t(a), \mathrm{UCB}_t(a)] the confidence interval for arm aa at round tt.

With this, we will now define a new algorithm, the successive elimination algorithm.

Successive Elimination Algorithm (K=2K=2)

We will first define the algorithm for K=2K=2 arms, and then we will generalize it to K>2K > 2 arms.

The idea is to alternate between the two arms until we find that one arm is much better than the other. It might feel that the condition is a ill-posed question, but we will see that it is not.

Summary (Successive Elimination Algorithm (K=2K=2))

Alternate two arms until UCBt(a)<LCBt(a)\mathrm{UCB}_t(a) < \mathrm{LCB}_t(a^{\prime}) after some even round t; Abandon arm aa, and use arm aa^{\prime} for the rest of the rounds.

For analysis, we will only assume the clean event (since E[R(T)bad]P(bad)\mathbb{E}[R(T) \mid \text{bad}] P(\text{bad}) is negligible).

With the clean event, the abandonded arm cannot be the best arm (by definition), i.e., no regret after the abandonment.

But how much regret do we accumulate before the abandonment?

Let t^\hat{t} be the last round before abandonment. Then, for any round tt^t \leq \hat{t}, we have that the confidence intervals of the two arms still overlap,

Δμ(a)μ(a)2rt(a)+2rt(a),\Delta \coloneqq |\mu(a) - \mu(a^{\prime})| \leq 2 r_t(a) + 2 r_t(a^{\prime}),

We have that nt()t2n_t(\cdot) \approx \frac{t}{2}. Thus 1This is by the Hoeffding Inequality under the clean event,

Δ2(rt(a)+rt(a))42log(T)t2=O(log(T)t),\Delta \leq 2(r_t(a) + r_t(a^{\prime})) \leq 4 \sqrt{\frac{2 \log(T)}{\lfloor \frac{t}{2} \rfloor}} = \mathcal{O}\left(\sqrt{\frac{\log(T)}{t}}\right),

Thus, the total regret accumulated till round tt is,

R(t)=ΔtO(log(T)tt)=O(tlog(T)),R(t) = \Delta \cdot t \leq \mathcal{O}\left(\sqrt{\frac{\log(T)}{t}} \cdot t\right) = \mathcal{O}(\sqrt{t \log(T)}),

Let’s now also consider the bad event,

E[R(T)]=E[R(T)clean]P(clean)12T21+E[R(T)bad]P(bad)2T2E[R(T)clean]O(Tlog(T))+tO(T2)O(tlog(T))\begin{align*} \mathbb{E}[R(T)] & = \mathbb{E}[R(T) \mid \text{clean}] \underbrace{P(\text{clean})}_{1 - 2 T^{-2} \leq \cdot \leq 1} + \mathbb{E}[R(T) \mid \text{bad}] \underbrace{P(\text{bad})}_{\leq 2 T^{-2}} \newline & \leq \underbrace{\mathbb{E}[R(T) \mid \text{clean}]}_{\mathcal{O}(\sqrt{T \log(T)})} + t \cdot \mathcal{O}(T^{-2}) \newline & \leq \mathcal{O}(\sqrt{t \log(T)}) \end{align*}
Lemma 2 (Successive Elimination Regret Bound (K=2K=2))

For K=2K=2 arms, the successive elimination algorithm achieves the regret bound,

E[R(T)]O(tlog(T))\mathbb{E}[R(T)] \leq \mathcal{O}(\sqrt{t \log(T)})

for each round tTt \leq T.

Note

For tt^t \leq \hat{t} this upper bound holds, as shown.

For t>t^t > \hat{t}, the regret is zero, since we always pull the optimal arm (by the clean event), and thus, E[R(t)]O(t^log(T))\mathbb{E}[R(t)] \leq \mathcal{O}(\sqrt{\hat{t} \log(T)}).

However, to unify the notation, we can still write E[R(t)]O(tlog(T))\mathbb{E}[R(t)] \leq \mathcal{O}(\sqrt{t \log(T)}) since O(t^log(T))O(tlog(T))\mathcal{O}(\sqrt{\hat{t} \log(T)}) \leq \mathcal{O}(\sqrt{t \log(T)}) for t>t^t > \hat{t}.