In the last part we introduced successive elimination for the case of K=2 arms.
We will extend this to the case of K>2 arms and introduce another adaptive exploration algorithm.
Summary (Successive Elimination Algorithm (K>2))
Initially all arms are set “active”;
Each phase,
Try all active arms (thus each phase may contain multiple rounds);
Deactivate all arms a such that ∃a′ with UCBt(a)<LCBt(a′);
Repeat until only one arm remains active.
Further, let a⋆ be an optimal arm, and consider any arm a such that μ(a)<μ(a⋆).
If we analyze the last round t^ when we have not yet deactivated arm a (or, the last round T if a is still active at the end).
The confidence intervals of the two arms a and a⋆ at round t≤t^ must overlap (otherwise, arm a would have been deactivated),
Δ(a):−μ(a⋆)−μ(a)≤≤O(rt(a))2rt(a)+2rt(a⋆)
As we established, arm a is never played after round t^, then,
nt^(a)rt^(a)=nT(a)=rT(a)
Thus, we have proved the following crucial property,
Lemma 1 (Successive Elimination Key Property)
For any suboptimal arm a (i.e., μ(a)<μ(a⋆)), if it is still active at the end of round T, then,
Δ(a)≤O(rT(a))=O(nT(a)log(T))
This is result is very nice, since it relates the suboptimality gap Δ(a) to the number of times nT(a) that arm a has been played.
Thus, an arm played many times must be close to optimal!
The rest of our analysis will rely on this result.
Regret Analysis
As per usual, when doing analysis we will construct the clean event and then connect it to the regret (the bad event will be negligible).
Let R(t;a) be the regret contribution of arm a at an arbitrary round t.
We now introduce the set A+={a∣μ(a)<μ(a⋆)} the set of all arms that contribute to the regret (i.e., all suboptimal arms).
Then, we can bound the total regret at round t as,
R(t)=a∈A+∑R(t;a)=a∈A+∑O(nt(a)log(T))=does not depend on sumO(log(T))⋅a∈A+∑nt(a)≤O(log(T))⋅a∈A∑nt(a)
By Jensen’s inequality 11Here we will use the concave version which states, n1∑i=1nf(xi)≤f(n1∑i=1nxi) for a concave function f, and the fact that ∑a∈Ant(a)=t, we have that,
For K>2 arms, the successive elimination algorithm achieves the regret bound,
E[R(T)]≤O(log(T))⋅a∈A+∑Δ(a)1
Instance-Dependent vs. Instance-Independent Bounds
We have now derived two different regret bounds for the successive elimination algorithm.
The first one is an instance-independent bound, since it does not depend on the gaps Δ(a), while the second one is instance-dependent, since it does depend on the gaps.
These bounds are lograithmic in T, with a constant that can be arbitrarily large (depending on the problem instance).
Thus, we conclude that adaptive exploration can lead to much better performance than uniform exploration (logarithmic vs. sublinear regret).
UCB and Optimism Under Uncertainty
Another adaptive exploration algorithm is the Upper Confidence Bound (UCB) algorithm.
The idea is to assume that each arm is as good as it can possiblyt be given the observations so far, and the agent chooses the best arm based on these optimistic estimates.
Summary (UCB Algorithm)
Try each arm once;
In each round t, pick a∈AargmaxUCBt(a), where UCBt(a)=μˉt(a)+rt(a).
But why does this UCB-based strategy work?
Imagine that an arm a is chosen in round t because it has a large UCBt(a)=μˉt(a)+rt(a).
Then, either,
The average reward μˉt(a) is large ⟹ arm a is likely to have a high reward μ(a),
and/or the confidence radius rt(a) is large ⟹ arm a has not been tried often, and thus we are uncertain about its true reward μ(a).
Either reason makes the arm a a good candidate to try!
Note (Adaptive Exploration-Exploitation)
Note that, not only is this UCB-based algorithm an adaptive exploration algorithm, it is also an adaptive exploration-exploitation algorithm.
The μˉt(a) term encourages exploitation, while the rt(a) term encourages exploration, and summing them up (i.e., taking the UCB) is a natural way to trade-off exploration and exploitation adaptively.
Analysis
Recall the clean event (from reward tape idea),
E:={∀a∀t∣μˉt(a)−μ(a)∣≤rt(a)}.
Assume a⋆ be an optimal arm and at is the arm chosen by the agent at round t, i.e.,
UCBt(at)≥UCBt(a⋆)
We know that, under the clean event,
μ(at)+rt(at)≥μˉt(at)
By definition of UCB and combining the two inequalities, we have,
Rearranging this we have found an expression of the gap Δ(at),
Δ(at):−μ(a⋆)−μ(at)≤2rt(at)=2nt(at)2log(T)
Now (again), for each arm a, consider the last round t^ when arm a was chosen (or round T if it was chosen in the last round) by the agent,
Δ(a)≤O(nt^(a)log(T))=O(nT(a)log(T))
Thus, we have proved the same key property as for successive elimination!
One can use the exact same analysis as for successive elimination to obtain the same two regret bounds.
Theorem 3 (UCB Regret Bound (K>2))
For K>2 arms, the UCB algorithm achieves the regret bounds,
E[R(T)]≤O(Ktlog(T))for all rounds t≤TE[R(T)]≤O(log(T))⋅a∈A+∑Δ(a)1
Lower Bound
We have so far only discussed upper bounds on the regret. But what about lower bounds?
The performance of any algorithm is determined by the distance between the optimal arm and other arms.
But, hard problems have arms with similar distributions.
These distances are formally described by the gaps Δ(a) and the distribution distances KL(Da,Da⋆)2Here, KL(⋅,⋅) is the Kullback-Leibler divergence between two distributions, ∑x∈XP(x)log(Q(x)P(x)).
We will end this part of the course wit the Lai and Robbins lower bound theorem,
Theorem 4 (Lai and Robbins Lower Bound)
The (asymptotic) total regret is at least logarithmic in number of steps,