Introduction
In this part we will (again) cover convex optimization problems, subgradients, and subdifferentials. We will see how we can use subgradients to solve convex optimization problems that are not necessarily differentiable. Furthermore, we will connect these methods to the Lagrange dual problem.
Convex Optimization Problems
Recall: Convex Optimization Problem
Let us first recall what we need to build a convex optimization problem.
Recall: Convex Set
A set $S \subseteq \mathbb{R}^n$ is convex if, $$ \begin{align*} \left. \begin{array}{l} \mathbf{x}_1, \mathbf{x}_2 \in S \newline \lambda \in (0, 1) \end{array} \right\} \Longrightarrow \lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2 \in S \end{align*} $$
Recall: Convex Function
A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex on the convex set $S$ if, $$ \begin{align*} \left. \begin{array}{l} \mathbf{x}_1, \mathbf{x}_2 \in S \newline \lambda \in (0, 1) \end{array} \right\} \Longrightarrow f(\lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1 - \lambda) f(\mathbf{x}_2) \end{align*} $$
Recall: Convex Optimization Problem
An optimization problem, $$ \begin{align*} \min \ & f(\mathbf{x}) \newline \text{subject to} \ & \mathbf{x} \in S \newline \end{align*} $$ is called convex if $S$ is a convex set and $f$ is a convex function on $S$. Typically, $$ S = \{\mathbf{x} \in \mathbb{R}^n \mid g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m, \ h_j(\mathbf{x}) = 0, \ j = 1, \ldots, l \} $$ where the $g_i$ are convex functions and the $h_j$ are affine functions. [!NOTE/(Re)formulating Problems to be Convex] Sometimes, we can reformulate problems to be convex. $$ \begin{align*} \min \ & \frac{1}{2} x_1^2 + x_2 \newline \text{subject to} \ & x_1^2 - 2 x_2 = 0 \newline & x_1, x_2 \in \mathbb{R} \end{align*} \text{” } \iff \text{ ”} \begin{align*} \min \ & x_1^2 \newline \text{subject to} \ & \mathbf{x} \in \mathbb{R}^2, \ x_2 = \frac{1}{2} x_1^2 \end{align*} \text{” } \iff \text{ ”} \begin{align*} \min \ & \frac{1}{2} x_1^2 + x_2 \newline \text{subject to} \ & x_1^2 - 2 x_2 \leq 0 \end{align*} $$
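As a quick numeric sanity check (an illustration, not a proof), we can verify on a grid that the equality-constrained problem and its convex inequality relaxation attain the same optimal value, at the origin:

```python
import numpy as np

x1 = np.linspace(-2.0, 2.0, 401)

# Equality version: substitute x2 = x1^2 / 2 into the objective.
eq_vals = 0.5 * x1**2 + 0.5 * x1**2

# Inequality relaxation on a grid: feasible where x1^2 - 2*x2 <= 0
# (with a tiny tolerance for floating-point grid points).
X1, X2 = np.meshgrid(x1, x1)
obj = 0.5 * X1**2 + X2
rel_vals = np.where(X1**2 - 2.0 * X2 <= 1e-9, obj, np.inf)

print(eq_vals.min(), rel_vals.min())  # both approximately 0, at x = (0, 0)
```

On the feasible set of the relaxation, $\frac{1}{2}x_1^2 + x_2 \geq x_1^2 \geq 0$, so increasing $x_2$ never helps and the constraint is active at the optimum, matching the equality-constrained problem.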
Subgradients and Subdifferentials
Recall: Subgradient
Let $S \subseteq \mathbb{R}^n$ be a non-empty and convex set and $f: S \to \mathbb{R}$ be a convex function on $S$. Then, $\mathbf{p} \in \mathbb{R}^n$ is a subgradient of $f$ at $\bar{\mathbf{x}}$ if, $$ f(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S $$
Recall: Subdifferential
Let $S$ and $f$ be defined as above. The subdifferential of $f$ at $\bar{\mathbf{x}}$ is, $$ \partial f(\bar{\mathbf{x}}) \coloneqq \{\mathbf{p} \in \mathbb{R}^n \mid f(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S \} $$ i.e., the set of all subgradients of $f$ at $\bar{\mathbf{x}}$. [!NOTE/$C^1$ functions and global optimality] If $f \in C^1$ and $\bar{\mathbf{x}} \in \mathrm{int}(S)$, then, $$ \partial f(\bar{\mathbf{x}}) = \{\nabla f(\bar{\mathbf{x}})\} $$ Further, if $f : \mathbb{R}^n \to \mathbb{R}$ is convex and $f \in C^1$, then, $$ \nabla f(\bar{\mathbf{x}}) = 0 \iff \bar{\mathbf{x}} \text{ is a global minimum of } f $$
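A concrete nonsmooth illustration: for $f(x) = \vert x \vert$ we have $\partial f(0) = [-1, 1]$. The sketch below numerically checks the subgradient inequality $\vert x \vert \geq p \cdot x$ for slopes inside and outside that interval:

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 1001)

# Every p in [-1, 1] satisfies |x| >= f(0) + p*(x - 0) for all x,
# so each is a subgradient of f(x) = |x| at 0.
for p in (-1.0, -0.5, 0.0, 0.5, 1.0):
    assert np.all(np.abs(xs) >= p * xs - 1e-12)

# A slope outside [-1, 1] violates the inequality somewhere.
violated = bool(np.any(np.abs(xs) < 1.5 * xs))
print(violated)  # True
```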
Intuition: Proposition about Subdifferentials for Convex Functions
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function, then, $$ \mathbf{x}^{\star} \text{ is a global minimum of } f \iff \mathbf{0} \in \partial f(\mathbf{x}^{\star}) $$ Thus we can ask ourselves two questions,
- How can we use subgradients?
- How can we compute subgradients?
Subgradient Method
Intuition: Subgradient Method
Step 0: Initialize $\mathbf{x}_0, f_0^\text{best} = f(\mathbf{x}_0), k = 0$.
Step 1: Find $\mathbf{p}_k \in \partial f(\mathbf{x}_k)$.
Step 2: Update $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{p}_k$.
Step 3: $f_{k+1}^\text{best} = \min\{f_k^\text{best}, f(\mathbf{x}_{k+1})\}$.
Step 4: If the termination criterion is fulfilled, STOP; else, go to Step 1 with $k \leftarrow k + 1$.
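The steps above can be sketched as follows (a minimal illustration; the function names `subgradient_method`, `subgrad`, and `steps` are my own, not from the notes):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, steps, n_iter=1000):
    """Sketch of the subgradient method; `steps(k)` returns alpha_k."""
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()           # Step 0: initialize
    for k in range(n_iter):
        p = subgrad(x)                        # Step 1: p_k in subdifferential
        x = x - steps(k) * p                  # Step 2: subgradient step
        if f(x) < f_best:                     # Step 3: track best value seen
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# e.g., minimize f(x) = |x - 3| starting away from the minimizer
x_best, f_best = subgradient_method(
    f=lambda x: abs(x[0] - 3.0),
    subgrad=lambda x: np.array([np.sign(x[0] - 3.0)]),
    x0=[0.0],
    steps=lambda k: 1.0 / (1.0 + k),
)
```

Tracking $f_k^\text{best}$ matters because, as the next note shows, individual iterates need not decrease $f$ monotonically.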
Note
For an element $\mathbf{p} \in \partial f(\bar{\mathbf{x}})$, $-\mathbf{p}$ might not be a descent direction at $\bar{\mathbf{x}}$.
Example
Consider, $$ \begin{align*} \min \ & \vert x_1 \vert + 2 \vert x_2 \vert \newline \text{subject to} \ & \mathbf{x} \in \mathbb{R}^2 \end{align*} $$ Let $\bar{\mathbf{x}} = \begin{bmatrix} 1 & 0 \end{bmatrix}^T$, then $f(\bar{\mathbf{x}}) = 1$. Further, let $\mathbf{p} = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$, $$ \begin{align*} f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}) & = 1 + \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 - 1 \newline x_2 - 0 \end{bmatrix} \newline & = x_1 + x_2 \newline & \leq \vert x_1 \vert + \vert x_2 \vert \newline & \leq \underbrace{\vert x_1 \vert + 2 \vert x_2 \vert}_{f(\mathbf{x})} \newline & \implies \mathbf{p} \in \partial f(\bar{\mathbf{x}}) \newline \end{align*} $$ However, notice that, $$ \begin{align*} f(\bar{\mathbf{x}} - \alpha \mathbf{p}) & = \vert \bar{x}_1 - \alpha p_1 \vert + 2 \vert \bar{x}_2 - \alpha p_2 \vert \newline & = \vert 1 - \alpha \vert + 2 \vert \alpha \vert \geq 1 = f(\bar{\mathbf{x}}), \ \forall \alpha \in \mathbb{R} \end{align*} $$ i.e., no step along $-\mathbf{p}$ decreases $f$.
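A quick numeric confirmation of this example: evaluating $f(\bar{\mathbf{x}} - \alpha \mathbf{p})$ over a range of step lengths shows the value never drops below $f(\bar{\mathbf{x}}) = 1$.

```python
import numpy as np

# f(x) = |x1| + 2|x2|, xbar = (1, 0), p = (1, 1) is a subgradient at xbar,
# yet no step along -p decreases f.
f = lambda x: abs(x[0]) + 2.0 * abs(x[1])
xbar = np.array([1.0, 0.0])
p = np.array([1.0, 1.0])

alphas = np.linspace(-3.0, 3.0, 601)
vals = np.array([f(xbar - a * p) for a in alphas])
print(vals.min())  # stays at or above f(xbar) = 1
```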
Intuition: Subgradient Step
Although $-\mathbf{p}_k$ might not be a descent direction, it moves us closer to the optimal solution, $$ \mathbf{p}_k \in \partial f(\mathbf{x}_k) \implies f(\mathbf{x}^{\star}) \geq f(\mathbf{x}_k) + \mathbf{p}_k^T (\mathbf{x}^{\star} - \mathbf{x}_k) $$ where $\mathbf{x}^{\star}$ is a global minimum of $f$, $$ \begin{align*} \implies \mathbf{p}_k^T \underbrace{(\mathbf{x}_k - \mathbf{x}^{\star})}_{\text{points towards optimal solution}} & \geq f(\mathbf{x}_k) - f(\mathbf{x}^{\star}) \newline & \geq 0 \end{align*} $$ i.e., the directions define half-spaces where the optimal solution cannot be.
Definition: Step Length Rule
For the subgradient method, we set the step length sequence $\{\alpha_k\}$ a priori; a common rule is a sequence that is square summable but not summable, i.e., $$ \begin{align*} \sum_{k=0}^{\infty} \alpha_k^2 & < \infty \newline \sum_{k=0}^{\infty} \alpha_k & = \infty \newline \end{align*} $$ e.g., $$ \alpha_k = \frac{a}{b + ck}, \ a, b, c > 0 $$
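We can see both properties numerically for the choice $a = b = c = 1$, i.e., $\alpha_k = 1/(1+k)$: the squared partial sums settle near $\pi^2/6$, while the plain partial sums keep growing (logarithmically).

```python
import numpy as np

k = np.arange(10**6)
alpha = 1.0 / (1.0 + k)

sum_sq = np.sum(alpha**2)   # converges: close to pi^2 / 6 ~= 1.6449
sum_plain = np.sum(alpha)   # diverges: ~14.39 here, unbounded as terms grow
print(sum_sq, sum_plain)
```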
Theorem: Convergence of Subgradient Method
Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function and let $\mathbf{X}^{\star} = \arg \min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$ be non-empty. Let $\{\mathbf{x}_k\}$ be the sequence generated by the subgradient method, where $\{\alpha_k\}$ is square summable but not summable. If $\{\mathbf{p}_k\}$ is bounded, then, $$ f(\mathbf{x}_k) \to f^{\star} \text{ and } \mathbf{x}_k \to \mathbf{x}^{\star} \in \mathbf{X}^{\star} $$
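A self-contained demonstration of the theorem on $f(\mathbf{x}) = \vert x_1 \vert + 2 \vert x_2 \vert$ (minimum $0$ at the origin), using the square-summable-but-not-summable rule $\alpha_k = 1/(1+k)$; the sign-based subgradients used here are bounded, as the theorem requires.

```python
import numpy as np

f = lambda x: abs(x[0]) + 2.0 * abs(x[1])
# A valid subgradient of f: componentwise signs (0 at kinks is also valid).
subgrad = lambda x: np.array([np.sign(x[0]), 2.0 * np.sign(x[1])])

x = np.array([2.0, -1.5])
f_best = f(x)
for k in range(5000):
    x = x - (1.0 / (1.0 + k)) * subgrad(x)  # alpha_k = 1 / (1 + k)
    f_best = min(f_best, f(x))
print(f_best)  # approaches the optimal value 0
```

Note that the iterates oscillate around the kinks of $f$; it is the best value $f_k^\text{best}$ that converges cleanly, which is why the method tracks it.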
Connection to Lagrange Dual Problem
Recall: Primal, Lagrangian, and Dual Problem
Consider the primal problem, $$ \begin{align*} (P) \quad & \begin{cases} \inf \ & f(\mathbf{x}) \newline \text{subject to} \ & g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m \newline & \mathbf{x} \in \mathbf{X} \end{cases} \end{align*} $$ We define the Lagrangian function as, $$ \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}) $$ where $\boldsymbol{\mu} \geq 0$ are the Lagrange multipliers. The Lagrange dual function is defined as, $$ q(\boldsymbol{\mu}) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) $$ The dual problem is then defined as, $$ \begin{align*} (D) \quad & \begin{cases} \sup \ & q(\boldsymbol{\mu}) \newline \text{subject to} \ & \boldsymbol{\mu} \geq 0 \end{cases} \end{align*} $$
Note: Convexity of the Dual Problem
Since $q$ is concave and $\{\boldsymbol{\mu} \geq 0\}$ is a convex set, the dual problem is a convex optimization problem. However, as we have noted before, evaluating $q(\boldsymbol{\mu})$ requires solving an optimization problem itself, which may be no easier than the primal problem.
Theorem: Subgradients of the Lagrange Dual Function
Let $\mathbf{x}(\boldsymbol{\mu}) \coloneqq \arg \min_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})$. Further, let $\mathbf{X}$ be non-empty and compact and let $\boldsymbol{\mu} \geq 0$. If $\mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu})$, then, $$ g(\mathbf{x}_{\boldsymbol{\mu}}) \coloneqq \left[g_i(\mathbf{x}_{\boldsymbol{\mu}})\right]_{i=1}^m \text{ is a subgradient to } q \text{ at } \boldsymbol{\mu} $$ Furthermore, $$ \partial q(\boldsymbol{\mu}) = \mathrm{conv} \{g(\mathbf{x}_{\boldsymbol{\mu}}) \mid \mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu})\} $$ where $\mathrm{conv}(S)$ is the convex hull of the set $S$.
Proof: Proof Idea
$$ \begin{align*} q(\boldsymbol{\mu}) & = \inf_{\mathbf{z} \in \mathbf{X}} \mathcal{L}(\mathbf{z}, \boldsymbol{\mu}) \newline & = \inf_{\mathbf{z} \in \mathbf{X}} \ f(\mathbf{z}) + \sum_{i=1}^m \mu_i g_i(\mathbf{z}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}_{\boldsymbol{\mu}}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\mu}^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \bar{\boldsymbol{\mu}}^T g(\mathbf{x}_{\boldsymbol{\mu}}) + (\boldsymbol{\mu} - \bar{\boldsymbol{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline & \geq \underbrace{\inf_{\mathbf{z} \in \mathbf{X}} \ f(\mathbf{z}) + \bar{\boldsymbol{\mu}}^T g(\mathbf{z})}_{q(\bar{\boldsymbol{\mu}})} + (\boldsymbol{\mu} - \bar{\boldsymbol{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline & = q(\bar{\boldsymbol{\mu}}) + (\boldsymbol{\mu} - \bar{\boldsymbol{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline \end{align*} $$ Since this holds for every $\bar{\boldsymbol{\mu}}$, $g(\mathbf{x}_{\boldsymbol{\mu}})$ is a subgradient to the concave function $q$ at $\boldsymbol{\mu}$.
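We can check the theorem numerically on a toy problem (my own example, not from the notes): $\min x^2$ subject to $g(x) = 1 - x \leq 0$ over the compact set $\mathbf{X} = [-2, 2]$. For $\mu \in [0, 4]$, the Lagrangian $x^2 + \mu(1 - x)$ is minimized at $x_\mu = \mu/2$, so $q(\mu) = \mu - \mu^2/4$ and $g(x_\mu) = 1 - \mu/2$.

```python
import numpy as np

q = lambda mu: mu - mu**2 / 4.0          # closed-form dual function
nus = np.linspace(0.0, 4.0, 201)

ok = True
for mu in (0.5, 1.0, 2.0, 3.5):
    g = 1.0 - mu / 2.0                   # g(x_mu), the claimed subgradient
    # For concave q, the subgradient line is a global over-estimator:
    # q(nu) <= q(mu) + g * (nu - mu) for all nu.
    ok &= bool(np.all(q(nus) <= q(mu) + g * (nus - mu) + 1e-12))
print(ok)  # True
```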
Algorithm: Subgradient Method for the Lagrange Dual Problem
Step 0: Initialize $\boldsymbol{\mu}_0 \geq 0, q_0^\text{best} = q(\boldsymbol{\mu}_0), k = 0$.
Step 1: Evaluate $q(\boldsymbol{\mu}_k)$, i.e., solve $q(\boldsymbol{\mu}_k) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}_k)$. Let $\mathbf{x}_k$ be a solution.
Step 2: Update $\boldsymbol{\mu}_{k+1} = \left(\boldsymbol{\mu}_k + \alpha_k g(\mathbf{x}_k)\right)_+$, where $(\cdot)_+$ is $\max\{0, \cdot\}$ applied element-wise.
Step 3: Let $q_{k+1}^\text{best} = \max\{q_k^\text{best}, q(\boldsymbol{\mu}_{k+1})\}$.
Step 4: If the termination criterion is fulfilled, STOP; else, go to Step 1 with $k \leftarrow k + 1$.
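The algorithm can be sketched on a toy problem (my own example, not from the notes): $\min x^2$ subject to $g(x) = 1 - x \leq 0$ over $\mathbf{X} = [-2, 2]$, whose optimum is $x^{\star} = 1$ with $f(x^{\star}) = 1$, attained by the dual at $\mu^{\star} = 2$.

```python
import numpy as np

mu = 0.0                                     # Step 0: initialize mu_0 >= 0
q_best = -np.inf
for k in range(2000):
    # Step 1: x_k minimizes L(x, mu) = x^2 + mu*(1 - x) over X = [-2, 2],
    # i.e., x = mu / 2 clipped to X; evaluate q(mu_k) there.
    x = np.clip(mu / 2.0, -2.0, 2.0)
    q_val = x**2 + mu * (1.0 - x)
    # Step 2: projected subgradient step with alpha_k = 1 / (1 + k);
    # the subgradient of q at mu_k is g(x_k) = 1 - x.
    mu = max(0.0, mu + (1.0 / (1.0 + k)) * (1.0 - x))
    # Step 3: track the best dual value found so far.
    q_best = max(q_best, q_val)
print(mu, q_best)  # mu approaches 2, q_best approaches 1
```

Since this toy problem is convex with strong duality, the best dual value approaches the primal optimal value $f(x^{\star}) = 1$.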