Introduction
In this part we will (again) cover convex optimization problems, subgradients, and subdifferentials.
We will see how we can use subgradients to solve convex optimization problems that are not necessarily differentiable.
Furthermore, we will connect these methods to the Lagrange dual problem.
Convex Optimization Problems
Recall (Convex Optimization Problem) Let’s firstly recall what we need to build a convex optimization problem.
Recall (Convex Set) A set S ⊆ R n S \subseteq \mathbb{R}^n S ⊆ R n is convex if,
x 1 , x 2 ∈ S λ ∈ ( 0 , 1 ) } ⟹ λ x 1 + ( 1 − λ ) x 2 ∈ S \begin{align*}
\left.
\begin{array}{l}
\mathbf{x}_1, \mathbf{x}_2 \in S \newline
\lambda \in (0, 1)
\end{array}
\right\}
\Longrightarrow \lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2 \in S
\end{align*} x 1 , x 2 ∈ S λ ∈ ( 0 , 1 ) } ⟹ λ x 1 + ( 1 − λ ) x 2 ∈ S Recall (Convex Function) A function f : R n → R f: \mathbb{R}^n \to \mathbb{R} f : R n → R is convex on the convex set S S S if,
x 1 , x 2 ∈ S λ ∈ ( 0 , 1 ) } ⟹ f ( λ x 1 + ( 1 − λ ) x 2 ) ≤ λ f ( x 1 ) + ( 1 − λ ) f ( x 2 ) \begin{align*}
\left.
\begin{array}{l}
\mathbf{x}_1, \mathbf{x}_2 \in S \newline
\lambda \in (0, 1)
\end{array}
\right\}
\Longrightarrow f(\lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1 - \lambda) f(\mathbf{x}_2)
\end{align*} x 1 , x 2 ∈ S λ ∈ ( 0 , 1 ) } ⟹ f ( λ x 1 + ( 1 − λ ) x 2 ) ≤ λ f ( x 1 ) + ( 1 − λ ) f ( x 2 ) Recall (Convex Optimization Problem) A optimization problem,
min f ( x ) subject to x ∈ S \begin{align*}
\min \ & f(\mathbf{x}) \newline
\text{subject to} \ & \mathbf{x} \in S \newline
\end{align*} min subject to f ( x ) x ∈ S is called convex if S S S is a convex set and f f f is a convex function on S S S .
Typically,
S = { x ∈ R n ∣ g i ( x ) ≤ 0 , i = 1 , … , m , h j ( x ) = 0 , j = 1 , … , l } S = \{\mathbf{x} \in \mathbb{R}^n \mid g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m, \ h_j(\mathbf{x}) = 0, \ j = 1, \ldots, l \} S = { x ∈ R n ∣ g i ( x ) ≤ 0 , i = 1 , … , m , h j ( x ) = 0 , j = 1 , … , l } where g i g_i g i are convex functions and h j h_j h j are affine functions.
Note ((Re)formulating Problems to be Convex) Sometimes, we can reformulate problems to be convex.
min 1 2 x 1 2 + x 2 subject to x 1 2 − 2 x 2 = 0 x 1 , x 2 ∈ R " ⟺ " min x 1 2 subject to x ∈ R 2 , x 2 = 1 2 x 1 2 " ⟺ " min 1 2 x 1 2 + x 2 subject to x 1 2 − 2 x 2 ≤ 0 \begin{align*}
\min \ & \frac{1}{2} x_1^2 + x_2 \newline
\text{subject to} \ & x_1^2 - 2 x_2 = 0 \newline
& x_1, x_2 \in \mathbb{R}
\end{align*}
\ \ \text{"} \iff \text{"}
\begin{align*}
\min \ & x_1^2 \newline
\text{subject to} \ & \mathbf{x} \in \mathbb{R}^2, \ x_2 = \frac{1}{2} x_1^2
\end{align*}
\ \ \text{"} \iff \text{"}
\begin{align*}
\min \ & \frac{1}{2} x_1^2 + x_2 \newline
\text{subject to} \ & x_1^2 - 2 x_2 \leq 0
\end{align*} min subject to 2 1 x 1 2 + x 2 x 1 2 − 2 x 2 = 0 x 1 , x 2 ∈ R " ⟺ " min subject to x 1 2 x ∈ R 2 , x 2 = 2 1 x 1 2 " ⟺ " min subject to 2 1 x 1 2 + x 2 x 1 2 − 2 x 2 ≤ 0
Subgradients and Subdifferentials
Recall (Subgradient) Let S ⊆ R n S \subseteq \mathbb{R}^n S ⊆ R n be a non-empty and convex set and f : S → R f: S \to \mathbb{R} f : S → R be a convex function on S S S .
Then, p ∈ R n \mathbf{p} \in \mathbb{R}^n p ∈ R n is a subgradient of f f f at x ˉ \bar{\mathbf{x}} x ˉ if,
f ( x ) ≥ f ( x ˉ ) + p T ( x − x ˉ ) , ∀ x ∈ S f(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S f ( x ) ≥ f ( x ˉ ) + p T ( x − x ˉ ) , ∀ x ∈ S
Recall (Subdifferential) Let S S S and f f f be defined as above.
The subdifferential of f f f at x ˉ \bar{\mathbf{x}} x ˉ is,
∂ f ( x ˉ ) ≔ { p ∈ R n ∣ f ( x ) ≥ f ( x ˉ ) + p T ( x − x ˉ ) , ∀ x ∈ S } \partial f(\bar{\mathbf{x}}) \coloneqq \{\mathbf{p} \in \mathbb{R}^n \mid f(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S \} ∂ f ( x ˉ ) : = { p ∈ R n ∣ f ( x ) ≥ f ( x ˉ ) + p T ( x − x ˉ ) , ∀ x ∈ S } i.e., the set of all subgradients of f f f at x ˉ \bar{\mathbf{x}} x ˉ .
Note (C 1 C^1 C 1 functions and global optimality) If f ∈ C 1 f \in C^1 f ∈ C 1 and x ˉ ∈ i n t ( S ) \bar{\mathbf{x}} \in \mathrm{int}(S) x ˉ ∈ int ( S ) 1 1 i n t ( S ) \mathrm{int}(S) int ( S ) is the interior of S S S , i.e., the set of all points in S S S that are not on the boundary of S S S . , then,
∂ f ( x ˉ ) = { ∇ f ( x ˉ ) } \partial f(\bar{\mathbf{x}}) = \{\nabla f(\bar{\mathbf{x}})\} ∂ f ( x ˉ ) = { ∇ f ( x ˉ )} Further, if f : R n → R f : \mathbb{R}^n \to \mathbb{R} f : R n → R and f ∈ C 1 f \in C^1 f ∈ C 1 , then,
∇ f ( x ˉ ) = 0 ⟺ x ˉ is a global minimum of f \nabla f(\bar{\mathbf{x}}) = 0 \iff \bar{\mathbf{x}} \text{ is a global minimum of } f ∇ f ( x ˉ ) = 0 ⟺ x ˉ is a global minimum of f
Intuition (Proposition about Subdifferentials for Convex Functions) Let f : R n → R f: \mathbb{R}^n \to \mathbb{R} f : R n → R be a convex function, then,
x ⋆ is a global minimum of f ⟺ 0 ∈ ∂ f ( x ⋆ ) \mathbf{x}^{\star} \text{ is a global minimum of } f \iff \mathbf{0} \in \partial f(\mathbf{x}^{\star}) x ⋆ is a global minimum of f ⟺ 0 ∈ ∂ f ( x ⋆ ) Thus we can ask ourselves two questions,
How can we use subgradients?
How can we compute subgradients?
Subgradient Method
Intuition (Subgradient Method) Step 0: Initialize x 0 , f 0 best = f ( x 0 ) , k = 0 \mathbf{x}_0, f_0^\text{best} = f(\mathbf{x}_0), k = 0 x 0 , f 0 best = f ( x 0 ) , k = 0 .
Step 1: Find p k ∈ ∂ f ( x k ) \mathbf{p}_k \in \partial f(\mathbf{x}_k) p k ∈ ∂ f ( x k ) .
Step 2: Update x k + 1 = x k − α k p k \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{p}_k x k + 1 = x k − α k p k .
Step 3: f k + 1 best = min { f k best , f ( x k + 1 ) } f_{k+1}^\text{best} = \min\{f_k^\text{best}, f(\mathbf{x}_{k+1})\} f k + 1 best = min { f k best , f ( x k + 1 )} .
If termination criteria is fulfilled, STOP; Else, go to (1) with k ← k + 1 k \leftarrow k + 1 k ← k + 1 .
Note For an element p ∈ ∂ f ( x ) \mathbf{p} \in \partial f(\mathbf{x}) p ∈ ∂ f ( x ) , − p -\mathbf{p} − p might not be a descent direction from x ˉ \bar{\mathbf{x}} x ˉ .
Example 1 Consider,
min ∥ x 1 ∥ + 2 ∥ x 2 ∥ subject to x ∈ R 2 \begin{align*}
\min \ & \Vert x_1 \Vert + 2 \Vert x_2 \Vert \newline
\text{subject to} \ & \mathbf{x} \in \mathbb{R}^2
\end{align*} min subject to ∥ x 1 ∥ + 2∥ x 2 ∥ x ∈ R 2 Let x ˉ = [ 1 0 ] T \bar{\mathbf{x}} = \begin{bmatrix} 1 & 0 \end{bmatrix}^T x ˉ = [ 1 0 ] T , then f ( x ˉ ) = 1 f(\bar{\mathbf{x}}) = 1 f ( x ˉ ) = 1 , further, let p = [ 1 1 ] T \mathbf{p} = \begin{bmatrix} 1 & 1 \end{bmatrix}^T p = [ 1 1 ] T ,
f ( x ˉ ) + p T ( x − x ˉ ) = 1 + [ 1 1 ] [ x 1 − 1 x 2 − 0 ] = x 1 + x 2 ≤ ∥ x 1 ∥ + ∥ 2 x 2 ∥ ≤ ∥ x 1 ∥ + 2 ∥ x 2 ∥ ⏟ f ( x ) ⟹ p ∈ ∂ f ( x ˉ ) \begin{align*}
f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}) & = 1 + \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 - 1 \newline x_2 - 0 \end{bmatrix} \newline
& = x_1 + x_2 \newline
& \leq \Vert x_1 \Vert + \Vert 2 x_2 \Vert \newline
& \leq \underbrace{\Vert x_1 \Vert + 2 \Vert x_2 \Vert}_{f(\mathbf{x})} \newline
& \implies \mathbf{p} \in \partial f(\bar{\mathbf{x}}) \newline
\end{align*} f ( x ˉ ) + p T ( x − x ˉ ) = 1 + [ 1 1 ] [ x 1 − 1 x 2 − 0 ] = x 1 + x 2 ≤ ∥ x 1 ∥ + ∥2 x 2 ∥ ≤ f ( x ) ∥ x 1 ∥ + 2∥ x 2 ∥ ⟹ p ∈ ∂ f ( x ˉ ) However, notice that,
f ( x ˉ − α p ) = ∥ x ˉ 1 − α p 1 ∥ + 2 ∥ x ˉ 2 − α p 2 ∥ = ∥ 1 − α ∥ + 2 ∥ − α ∥ ≥ 1 ∀ x \begin{align*}
f(\bar{\mathbf{x}} - \alpha \mathbf{p}) & = \Vert \bar{x}_1 - \alpha p_1 \Vert + 2 \Vert \bar{x}_2 - \alpha p_2 \Vert \newline
& = \Vert 1 - \alpha \Vert + 2 \Vert - \alpha \Vert \geq 1 \ \forall \mathbf{x}
\end{align*} f ( x ˉ − α p ) = ∥ x ˉ 1 − α p 1 ∥ + 2∥ x ˉ 2 − α p 2 ∥ = ∥1 − α ∥ + 2∥ − α ∥ ≥ 1 ∀ x
Intuition (Subgradient Step) Although − p k -\mathbf{p}_k − p k might not be a descent direction, it moves us closer to the optimal solution,
p k ∈ ∂ f ( x k ) ⟹ f ( x ⋆ ) ≥ f ( x k ) + p k T ( x ⋆ − x k ) \mathbf{p}_k \in \partial f(\mathbf{x}_k) \implies f(\mathbf{x}^{\star}) \geq f(\mathbf{x}_k) + \mathbf{p}_k^T (\mathbf{x}^{\star} - \mathbf{x}_k) p k ∈ ∂ f ( x k ) ⟹ f ( x ⋆ ) ≥ f ( x k ) + p k T ( x ⋆ − x k ) where x ⋆ \mathbf{x}^{\star} x ⋆ is a global minimum of f f f ,
⟹ p k T ( x k − x ⋆ ) ⏟ points towards optimal solution ≥ f ( x k ) − f ( x ⋆ ) ≥ 0 \begin{align*}
\implies \mathbf{p}_k^T \underbrace{(\mathbf{x}_k - \mathbf{x}^{\star})}_{\text{points towards optimal solution}} & \geq f(\mathbf{x}_k) - f(\mathbf{x}^{\star}) \newline
& \geq 0
\end{align*} ⟹ p k T points towards optimal solution ( x k − x ⋆ ) ≥ f ( x k ) − f ( x ⋆ ) ≥ 0 i.e., the directions define half-spaces where the optimal solution cannot be.
Definition 1 (Step Length Rule) For the subgradient method, we set the step length sequence { α k } \{\alpha_k\} { α k } a priori, a common rule is a sequence that is square summable but not summable, i.e.,
∑ k = 0 ∞ α k 2 < ∞ ∑ k = 0 ∞ α k = ∞ \begin{align*}
\sum_{k=0}^{\infty} \alpha_k^2 & < \infty \newline
\sum_{k=0}^{\infty} \alpha_k & = \infty \newline
\end{align*} k = 0 ∑ ∞ α k 2 k = 0 ∑ ∞ α k < ∞ = ∞ e.g.,
α k = a b + c k , a , b , c > 0 \alpha_k = \frac{a}{b + ck}, \ a, b, c > 0 α k = b + c k a , a , b , c > 0
Theorem 1 (Convergence of Subgradient Method) Let f : R n → R f: \mathbb{R}^n \to \mathbb{R} f : R n → R be a convex function and let X ⋆ = arg min f ( x ) \mathbf{X}^{\star} = \arg \min \ f(\mathbf{x}) X ⋆ = arg min f ( x ) be non-empty.
Let { x k } \{\mathbf{x}_k\} { x k } be the sequence generated by the subgradient method where { α k } \{\alpha_k\} { α k } is square summable but not summable.
If { p k } \{\mathbf{p}_k\} { p k } is bounded, then,
f ( x k ) → f ⋆ and x k → x ⋆ ∈ X ⋆ f(\mathbf{x}_k) \to f^{\star} \text{ and } \mathbf{x}_k \to \mathbf{x}^{\star} \in \mathbf{X}^{\star} f ( x k ) → f ⋆ and x k → x ⋆ ∈ X ⋆
Connection to Lagrange Dual Problem
Recall (Primal, Lagrangian, and Dual Problem) Consider the primal problem,
( P ) { inf f ( x ) subject to g i ( x ) ≤ 0 , i = 1 , … , m x ∈ X \begin{align*}
(P) \quad &
\begin{cases}
\inf \ & f(\mathbf{x}) \newline
\text{subject to} \ & g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m \newline
& \mathbf{x} \in \mathbf{X}
\end{cases}
\end{align*} ( P ) ⎩ ⎨ ⎧ inf subject to f ( x ) g i ( x ) ≤ 0 , i = 1 , … , m x ∈ X We define the Lagrangian function as,
L ( x , μ ) = f ( x ) + ∑ i = 1 m μ i g i ( x ) \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}) L ( x , μ ) = f ( x ) + i = 1 ∑ m μ i g i ( x ) where μ ≥ 0 \boldsymbol{\mu} \geq 0 μ ≥ 0 are the Lagrange multipliers.
The Lagrange dual function is defined as,
q ( μ ) = inf x ∈ X L ( x , μ ) q(\boldsymbol{\mu}) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) q ( μ ) = x ∈ X inf L ( x , μ ) The dual problem is then defined as,
( D ) { sup q ( μ ) subject to μ ≥ 0 \begin{align*}
(D) \quad &
\begin{cases}
\sup \ & q(\boldsymbol{\mu}) \newline
\text{subject to} \ & \boldsymbol{\mu} \geq 0
\end{cases}
\end{align*} ( D ) { sup subject to q ( μ ) μ ≥ 0 Note (Convexity of the Dual Problem) Since q q q is concave and μ ≥ 0 \boldsymbol{\mu} \geq 0 μ ≥ 0 is a convex set, the dual problem is a convex optimization problem.
However, as we have noted before, to evaluate q ( μ ) q(\boldsymbol{\mu}) q ( μ ) , we must solve an optimization problem, which may not even be easier than the primal problem.
Theorem 2 (Subgradients of the Lagrange Dual Function) Let x ( μ ) ≔ arg min x ∈ X L ( x , μ ) \mathbf{x}(\boldsymbol{\mu}) \coloneqq \arg \min_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) x ( μ ) : = arg min x ∈ X L ( x , μ ) .
Further, let X \mathbf{X} X be non-empty and compact and let μ ≥ 0 \boldsymbol{\mu} \geq 0 μ ≥ 0 .
If x μ ∈ x ( μ ) \mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu}) x μ ∈ x ( μ ) , then,
g ( x μ ) ≔ [ g i ( x μ ) ] i = 1 m is a subgradient to q at μ g(\mathbf{x}_{\boldsymbol{\mu}}) \coloneqq \left[g_i(\mathbf{x}_{\boldsymbol{\mu}})\right]_{i=1}^m \text{ is a subgradient to } q \text{ at } \boldsymbol{\mu} g ( x μ ) : = [ g i ( x μ ) ] i = 1 m is a subgradient to q at μ Furthermore,
∂ q ( μ ) = c o n v { g ( x μ ) ∣ x μ ∈ x ( μ ) } \partial q(\boldsymbol{\mu}) = \mathrm{conv} \{g(\mathbf{x}_{\boldsymbol{\mu}}) \mid \mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu})\} ∂ q ( μ ) = conv { g ( x μ ) ∣ x μ ∈ x ( μ )} where c o n v ( S ) \mathrm{conv}(S) conv ( S ) is the convex hull of the set S S S .
Proof (Proof Idea) q ( μ ) = inf x ∈ X L ( z , μ ) = inf x ∈ X f ( z ) + ∑ i = 1 m μ i g i ( z ) = f ( x μ ) + ∑ i = 1 m μ i g i ( x μ ) = f ( x μ ) + μ T g ( x μ ) = f ( x μ ) + μ T g ( x μ ) ( μ ˉ − μ ) T g ( x μ ) ⏟ 0 = f ( x μ ) + μ ˉ T g ( x μ ) ⏟ q ( μ ˉ ) + ( μ − μ ˉ ) T g ( x μ ) ≥ q ( μ ˉ ) + ( μ − μ ˉ ) T g ( x μ ) \begin{align*}
q(\boldsymbol{\mu}) & = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{z}, \boldsymbol{\mu}) \newline
& = \inf_{\mathbf{x} \in \mathbf{X}} \ f(\mathbf{z}) + \sum_{i=1}^m \mu_i g_i(\mathbf{z}) \newline
& = f(\mathbf{x}_{\boldsymbol{\mu}}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}_{\boldsymbol{\mu}}) \newline
& = f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\mu}^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline
& = f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\mu}^T g(\mathbf{x}_{\boldsymbol{\mu}}) \underbrace{(\boldsymbol{\bar{\mu}} - \boldsymbol{\mu})^T g(\mathbf{x}_{\boldsymbol{\mu}})}_{0} \newline
& = \underbrace{f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\bar{\mu}}^T g(\mathbf{x}_{\boldsymbol{\mu}})}_{q(\boldsymbol{\bar{\mu}})} + (\boldsymbol{\mu} - \boldsymbol{\bar{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline
& \geq q(\boldsymbol{\bar{\mu}}) + (\boldsymbol{\mu} - \boldsymbol{\bar{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline
\end{align*} q ( μ ) = x ∈ X inf L ( z , μ ) = x ∈ X inf f ( z ) + i = 1 ∑ m μ i g i ( z ) = f ( x μ ) + i = 1 ∑ m μ i g i ( x μ ) = f ( x μ ) + μ T g ( x μ ) = f ( x μ ) + μ T g ( x μ ) 0 ( μ ˉ − μ ) T g ( x μ ) = q ( μ ˉ ) f ( x μ ) + μ ˉ T g ( x μ ) + ( μ − μ ˉ ) T g ( x μ ) ≥ q ( μ ˉ ) + ( μ − μ ˉ ) T g ( x μ ) Thus, g ( x μ ) g(\mathbf{x}_{\boldsymbol{\mu}}) g ( x μ ) is a subgradient to q q q at μ \boldsymbol{\mu} μ .
Algorithm (Subgradient Method for the Lagrange Dual Problem) Step 0: Initialize μ 0 ≥ 0 , q 0 best = q ( μ 0 ) , k = 0 \boldsymbol{\mu}_0 \geq 0, q_0^\text{best} = q(\boldsymbol{\mu}_0), k = 0 μ 0 ≥ 0 , q 0 best = q ( μ 0 ) , k = 0 .
Step 1: Evaluate q ( μ k ) q(\boldsymbol{\mu}_k) q ( μ k ) , i.e., solve q ( μ k ) = inf x ∈ X L ( x , μ k ) q(\boldsymbol{\mu}_k) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}_k) q ( μ k ) = inf x ∈ X L ( x , μ k ) .
Let x k \mathbf{x}_k x k be a solution.
Step 2: Update μ k + 1 = ( μ k + α k g ( x k ) ) + \boldsymbol{\mu}_{k+1} = \left(\boldsymbol{\mu}_k + \alpha_k g(\mathbf{x}_k)\right)_+ μ k + 1 = ( μ k + α k g ( x k ) ) + , where ( ⋅ ) + (\cdot)_+ ( ⋅ ) + is max { 0 , ⋅ } \max\{0, \cdot\} max { 0 , ⋅ } applied element-wise.
Step 3: Let q k + 1 best = max { q k best , q ( μ k + 1 ) } q_{k+1}^\text{best} = \max\{q_k^\text{best}, q(\boldsymbol{\mu}_{k+1})\} q k + 1 best = max { q k best , q ( μ k + 1 )} .
Step 4: If termination criteria is fulfilled, STOP; Else, go to (1) with k ← k + 1 k \leftarrow k + 1 k ← k + 1 .