
Part 14 - Convex Optimization

TMA947
Date: October 13, 2025
8 min read

Introduction

In this part we will (again) cover convex optimization problems, subgradients, and subdifferentials. We will see how we can use subgradients to solve convex optimization problems that are not necessarily differentiable. Furthermore, we will connect these methods to the Lagrange dual problem.

Convex Optimization Problems

Recall: Convex Optimization Problem

Let’s first recall what we need to build a convex optimization problem.

Recall: Convex Set

A set $S \subseteq \mathbb{R}^n$ is convex if, $$ \begin{align*} \left. \begin{array}{l} \mathbf{x}_1, \mathbf{x}_2 \in S \newline \lambda \in (0, 1) \end{array} \right\} \Longrightarrow \lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2 \in S \end{align*} $$

Recall: Convex Function

A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex on the convex set $S$ if, $$ \begin{align*} \left. \begin{array}{l} \mathbf{x}_1, \mathbf{x}_2 \in S \newline \lambda \in (0, 1) \end{array} \right\} \Longrightarrow f(\lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1 - \lambda) f(\mathbf{x}_2) \end{align*} $$

Recall: Convex Optimization Problem

An optimization problem, $$ \begin{align*} \min \ & f(\mathbf{x}) \newline \text{subject to} \ & \mathbf{x} \in S \newline \end{align*} $$ is called convex if $S$ is a convex set and $f$ is a convex function on $S$. Typically, $$ S = \{\mathbf{x} \in \mathbb{R}^n \mid g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m, \ h_j(\mathbf{x}) = 0, \ j = 1, \ldots, l \} $$ where $g_i$ are convex functions and $h_j$ are affine functions. [!NOTE/(Re)formulating Problems to be Convex] Sometimes, we can reformulate non-convex problems to be convex. $$ \begin{align*} \min \ & \frac{1}{2} x_1^2 + x_2 \newline \text{subject to} \ & x_1^2 - 2 x_2 = 0 \newline & x_1, x_2 \in \mathbb{R} \end{align*} \text{” } \iff \text{ ”} \begin{align*} \min \ & x_1^2 \newline \text{subject to} \ & \mathbf{x} \in \mathbb{R}^2, \ x_2 = \frac{1}{2} x_1^2 \end{align*} \text{” } \iff \text{ ”} \begin{align*} \min \ & \frac{1}{2} x_1^2 + x_2 \newline \text{subject to} \ & x_1^2 - 2 x_2 \leq 0 \end{align*} $$ The first problem is non-convex since its equality constraint is not affine. Relaxing it to the inequality $x_1^2 - 2 x_2 \leq 0$ yields a convex problem with the same optimal solution: the objective is increasing in $x_2$, so the constraint is active at the optimum.

Subgradients and Subdifferentials

Recall: Subgradient

Let $S \subseteq \mathbb{R}^n$ be a non-empty and convex set and $f: S \to \mathbb{R}$ be a convex function on $S$. Then, $\mathbf{p} \in \mathbb{R}^n$ is a subgradient of $f$ at $\bar{\mathbf{x}}$ if, $$ f(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S $$
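To make the inequality concrete, here is a small numerical sketch (illustrative only, not from the course material) checking which $p$ satisfy the subgradient inequality for $f(x) = \vert x \vert$ at $\bar{x} = 0$; any $p \in [-1, 1]$ works, and nothing outside does.

```python
import numpy as np

# f(x) = |x| is convex but not differentiable at 0.
# Any p in [-1, 1] satisfies the subgradient inequality
# f(x) >= f(0) + p * (x - 0) for all x.
f = lambda x: abs(x)

xs = np.linspace(-5.0, 5.0, 1001)
for p in (-1.0, -0.5, 0.0, 0.5, 1.0):
    assert np.all(f(xs) >= f(0.0) + p * xs - 1e-12)

# A slope outside [-1, 1] violates the inequality somewhere,
# so it is not a subgradient at 0.
assert np.any(f(xs) < f(0.0) + 1.5 * xs)
```

This matches $\partial f(0) = [-1, 1]$ for $f(x) = \vert x \vert$.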

Recall: Subdifferential

Let $S$ and $f$ be defined as above. The subdifferential of $f$ at $\bar{\mathbf{x}}$ is, $$ \partial f(\bar{\mathbf{x}}) \coloneqq \{\mathbf{p} \in \mathbb{R}^n \mid f(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S \} $$ i.e., the set of all subgradients of $f$ at $\bar{\mathbf{x}}$. [!NOTE/$C^1$ functions and global optimality] If $f \in C^1$ and $\bar{\mathbf{x}} \in \mathrm{int}(S)$, then, $$ \partial f(\bar{\mathbf{x}}) = \{\nabla f(\bar{\mathbf{x}})\} $$ Further, if $f : \mathbb{R}^n \to \mathbb{R}$ is convex and $f \in C^1$, then, $$ \nabla f(\bar{\mathbf{x}}) = \mathbf{0} \iff \bar{\mathbf{x}} \text{ is a global minimum of } f $$

Intuition: Proposition about Subdifferentials for Convex Functions

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function, then, $$ \mathbf{x}^{\star} \text{ is a global minimum of } f \iff \mathbf{0} \in \partial f(\mathbf{x}^{\star}) $$ Thus we can ask ourselves two questions,

  1. How can we use subgradients?
  2. How can we compute subgradients?

Subgradient Method

Intuition: Subgradient Method

Step 0: Initialize $\mathbf{x}_0, f_0^\text{best} = f(\mathbf{x}_0), k = 0$.

Step 1: Find $\mathbf{p}_k \in \partial f(\mathbf{x}_k)$.

Step 2: Update $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{p}_k$.

Step 3: $f_{k+1}^\text{best} = \min\{f_k^\text{best}, f(\mathbf{x}_{k+1})\}$.

Step 4: If the termination criterion is fulfilled, STOP; else, go to Step 1 with $k \leftarrow k + 1$.
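The steps above can be sketched in Python (a minimal illustration; the objective, subgradient choice, and step rule are chosen for demonstration and are not part of the course material):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iter=2000):
    """Minimize a convex (possibly non-differentiable) f.

    Uses the diminishing step rule alpha_k = 1 / (1 + k) and tracks
    the best value seen, since f(x_k) need not decrease monotonically.
    """
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(n_iter):
        p = subgrad(x)                     # Step 1: any p_k in the subdifferential
        x = x - (1.0 / (1.0 + k)) * p      # Step 2: x_{k+1} = x_k - alpha_k p_k
        if f(x) < f_best:                  # Step 3: keep the best iterate so far
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# f(x) = |x1| + 2|x2|, minimized at the origin with value 0.
f = lambda x: abs(x[0]) + 2.0 * abs(x[1])
subgrad = lambda x: np.array([np.sign(x[0]), 2.0 * np.sign(x[1])])

x_best, f_best = subgradient_method(f, subgrad, x0=[1.0, 0.0])
print(f_best)  # 0.0
```

Note that `np.sign` returns a valid subgradient of $\vert \cdot \vert$ everywhere, including $0$ (where it returns $0 \in [-1, 1]$).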

Note

For an element $\mathbf{p} \in \partial f(\bar{\mathbf{x}})$, $-\mathbf{p}$ might not be a descent direction at $\bar{\mathbf{x}}$.

Example

Consider, $$ \begin{align*} \min \ & \vert x_1 \vert + 2 \vert x_2 \vert \newline \text{subject to} \ & \mathbf{x} \in \mathbb{R}^2 \end{align*} $$ Let $\bar{\mathbf{x}} = \begin{bmatrix} 1 & 0 \end{bmatrix}^T$, so $f(\bar{\mathbf{x}}) = 1$, and let $\mathbf{p} = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$. Then, $$ \begin{align*} f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}) & = 1 + \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 - 1 \newline x_2 - 0 \end{bmatrix} \newline & = x_1 + x_2 \newline & \leq \vert x_1 \vert + \vert x_2 \vert \newline & \leq \underbrace{\vert x_1 \vert + 2 \vert x_2 \vert}_{f(\mathbf{x})} \newline & \implies \mathbf{p} \in \partial f(\bar{\mathbf{x}}) \newline \end{align*} $$ However, notice that, $$ \begin{align*} f(\bar{\mathbf{x}} - \alpha \mathbf{p}) & = \vert \bar{x}_1 - \alpha p_1 \vert + 2 \vert \bar{x}_2 - \alpha p_2 \vert \newline & = \vert 1 - \alpha \vert + 2 \vert \alpha \vert \geq 1 = f(\bar{\mathbf{x}}), \ \forall \alpha \end{align*} $$ i.e., $-\mathbf{p}$ is not a descent direction at $\bar{\mathbf{x}}$.
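A quick numerical check of this example (an illustrative sketch; the grid and tolerance are arbitrary): $\mathbf{p} = (1, 1)^T$ satisfies the subgradient inequality everywhere, yet $f$ never drops below $f(\bar{\mathbf{x}}) = 1$ along $-\mathbf{p}$.

```python
import numpy as np

f = lambda x: abs(x[0]) + 2.0 * abs(x[1])
x_bar = np.array([1.0, 0.0])
p = np.array([1.0, 1.0])

# p is a subgradient: f(x) >= f(x_bar) + p^T (x - x_bar) on a grid.
grid = np.linspace(-3.0, 3.0, 61)
for x1 in grid:
    for x2 in grid:
        x = np.array([x1, x2])
        assert f(x) >= f(x_bar) + p @ (x - x_bar) - 1e-12

# ...yet -p is not a descent direction: f(x_bar - alpha p) >= 1 for all alpha >= 0.
alphas = np.linspace(0.0, 2.0, 201)
assert all(f(x_bar - a * p) >= 1.0 for a in alphas)
```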

Intuition: Subgradient Step

Although $-\mathbf{p}_k$ might not be a descent direction, it moves us closer to the optimal solution, $$ \mathbf{p}_k \in \partial f(\mathbf{x}_k) \implies f(\mathbf{x}^{\star}) \geq f(\mathbf{x}_k) + \mathbf{p}_k^T (\mathbf{x}^{\star} - \mathbf{x}_k) $$ where $\mathbf{x}^{\star}$ is a global minimum of $f$, $$ \begin{align*} \implies \mathbf{p}_k^T \underbrace{(\mathbf{x}_k - \mathbf{x}^{\star})}_{\text{points towards optimal solution}} & \geq f(\mathbf{x}_k) - f(\mathbf{x}^{\star}) \newline & \geq 0 \end{align*} $$ i.e., the directions define half-spaces where the optimal solution cannot be.

Definition: Step Length Rule

For the subgradient method, we set the step length sequence $\{\alpha_k\}$ a priori. A common rule is a sequence that is square summable but not summable, i.e., $$ \begin{align*} \sum_{k=0}^{\infty} \alpha_k^2 & < \infty \newline \sum_{k=0}^{\infty} \alpha_k & = \infty \newline \end{align*} $$ e.g., $$ \alpha_k = \frac{a}{b + ck}, \ a, b, c > 0 $$
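A small numerical illustration of this rule with $a = b = c = 1$, i.e., the harmonic sequence $\alpha_k = 1/(1+k)$ (the horizon of $10^5$ terms is arbitrary):

```python
import numpy as np

# alpha_k = 1 / (1 + k): the partial sums grow without bound (not
# summable), while the partial sums of squares stay bounded
# (square summable).
k = np.arange(100_000)
alpha = 1.0 / (1.0 + k)

print(alpha.sum())         # grows like log(K), approx. 12.09 here
print((alpha ** 2).sum())  # approaches pi^2 / 6, approx. 1.6449
```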

Theorem: Convergence of Subgradient Method

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a convex function and let $\mathbf{X}^{\star} = \arg \min \ f(\mathbf{x})$ be non-empty, with optimal value $f^{\star}$. Let $\{\mathbf{x}_k\}$ be the sequence generated by the subgradient method, where $\{\alpha_k\}$ is square summable but not summable. If $\{\mathbf{p}_k\}$ is bounded, then, $$ f(\mathbf{x}_k) \to f^{\star} \text{ and } \mathbf{x}_k \to \mathbf{x}^{\star} \in \mathbf{X}^{\star} $$

Connection to Lagrange Dual Problem

Recall: Primal, Lagrangian, and Dual Problem

Consider the primal problem, $$ \begin{align*} (P) \quad & \begin{cases} \inf \ & f(\mathbf{x}) \newline \text{subject to} \ & g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m \newline & \mathbf{x} \in \mathbf{X} \end{cases} \end{align*} $$ We define the Lagrangian function as, $$ \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}) $$ where $\boldsymbol{\mu} \geq 0$ are the Lagrange multipliers. The Lagrange dual function is defined as, $$ q(\boldsymbol{\mu}) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) $$ The dual problem is then defined as, $$ \begin{align*} (D) \quad & \begin{cases} \sup \ & q(\boldsymbol{\mu}) \newline \text{subject to} \ & \boldsymbol{\mu} \geq 0 \end{cases} \end{align*} $$

Note: Convexity of the Dual Problem

Since $q$ is concave and $\boldsymbol{\mu} \geq 0$ is a convex set, the dual problem is a convex optimization problem. However, as we have noted before, to evaluate $q(\boldsymbol{\mu})$, we must solve an optimization problem, which may not even be easier than the primal problem.

Theorem: Subgradients of the Lagrange Dual Function

Let $\mathbf{x}(\boldsymbol{\mu}) \coloneqq \arg \min_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})$. Further, let $\mathbf{X}$ be non-empty and compact and let $\boldsymbol{\mu} \geq 0$. If $\mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu})$, then, $$ g(\mathbf{x}_{\boldsymbol{\mu}}) \coloneqq \left[g_i(\mathbf{x}_{\boldsymbol{\mu}})\right]_{i=1}^m \text{ is a subgradient to } q \text{ at } \boldsymbol{\mu} $$ Furthermore, $$ \partial q(\boldsymbol{\mu}) = \mathrm{conv} \{g(\mathbf{x}_{\boldsymbol{\mu}}) \mid \mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu})\} $$ where $\mathrm{conv}(S)$ is the convex hull of the set $S$.

Proof: Proof Idea

$$ \begin{align*} q(\boldsymbol{\mu}) & = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) \newline & = \inf_{\mathbf{x} \in \mathbf{X}} \ f(\mathbf{x}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}_{\boldsymbol{\mu}}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\mu}^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline & = \underbrace{f(\mathbf{x}_{\boldsymbol{\mu}}) + \bar{\boldsymbol{\mu}}^T g(\mathbf{x}_{\boldsymbol{\mu}})}_{\geq \ q(\bar{\boldsymbol{\mu}})} + (\boldsymbol{\mu} - \bar{\boldsymbol{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline & \geq q(\bar{\boldsymbol{\mu}}) + (\boldsymbol{\mu} - \bar{\boldsymbol{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline \end{align*} $$ where the last step uses $q(\bar{\boldsymbol{\mu}}) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \bar{\boldsymbol{\mu}}) \leq f(\mathbf{x}_{\boldsymbol{\mu}}) + \bar{\boldsymbol{\mu}}^T g(\mathbf{x}_{\boldsymbol{\mu}})$. Rearranging gives $q(\bar{\boldsymbol{\mu}}) \leq q(\boldsymbol{\mu}) + g(\mathbf{x}_{\boldsymbol{\mu}})^T (\bar{\boldsymbol{\mu}} - \boldsymbol{\mu})$ for all $\bar{\boldsymbol{\mu}}$. Thus, $g(\mathbf{x}_{\boldsymbol{\mu}})$ is a subgradient to $q$ at $\boldsymbol{\mu}$.

Algorithm: Subgradient Method for the Lagrange Dual Problem

Step 0: Initialize $\boldsymbol{\mu}_0 \geq 0, q_0^\text{best} = q(\boldsymbol{\mu}_0), k = 0$.

Step 1: Evaluate $q(\boldsymbol{\mu}_k)$, i.e., solve $q(\boldsymbol{\mu}_k) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}_k)$. Let $\mathbf{x}_k$ be a solution.

Step 2: Update $\boldsymbol{\mu}_{k+1} = \left(\boldsymbol{\mu}_k + \alpha_k g(\mathbf{x}_k)\right)_+$, where $(\cdot)_+$ is $\max\{0, \cdot\}$ applied element-wise.

Step 3: Let $q_{k+1}^\text{best} = \max\{q_k^\text{best}, q(\boldsymbol{\mu}_{k+1})\}$.

Step 4: If the termination criterion is fulfilled, STOP; else, go to Step 1 with $k \leftarrow k + 1$.
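The algorithm can be sketched on a toy instance where the inner problem in Step 1 has a closed-form solution (the instance is illustrative, not from the course material): $\min \ x^2$ subject to $g(x) = 1 - x \leq 0$ with $\mathbf{X} = [-10, 10]$, whose optimum is $x^{\star} = 1$ with $f^{\star} = q^{\star} = 1$ and $\mu^{\star} = 2$.

```python
import numpy as np

# Inner problem (Step 1): min over X of x^2 + mu * (1 - x),
# solved in closed form by x_mu = mu / 2, clipped to X = [-10, 10].
def solve_inner(mu):
    return np.clip(mu / 2.0, -10.0, 10.0)

mu, q_best = 0.0, -np.inf
for k in range(5000):
    x_k = solve_inner(mu)                         # Step 1: evaluate q(mu_k)
    q_k = x_k ** 2 + mu * (1.0 - x_k)             # q(mu_k) = L(x_k, mu_k)
    q_best = max(q_best, q_k)                     # Step 3: best dual value
    g_k = 1.0 - x_k                               # subgradient of q at mu_k
    mu = max(0.0, mu + (1.0 / (1.0 + k)) * g_k)   # Step 2: project onto mu >= 0

print(mu, q_best)  # mu close to 2, q_best close to 1
```

Note the sign in Step 2: since the dual is a maximization problem, we move *along* the subgradient $g(\mathbf{x}_k)$ rather than against it, then project back onto the feasible set $\boldsymbol{\mu} \geq 0$.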