Part 14 - Convex Optimization

Introduction

In this part we will (again) cover convex optimization problems, subgradients, and subdifferentials. We will see how we can use subgradients to solve convex optimization problems that are not necessarily differentiable. Furthermore, we will connect these methods to the Lagrange dual problem.

Convex Optimization Problems

Recall (Convex Optimization Problem)

Let’s firstly recall what we need to build a convex optimization problem.

Recall (Convex Set)

A set SRnS \subseteq \mathbb{R}^n is convex if,

x1,x2Sλ(0,1)}λx1+(1λ)x2S\begin{align*} \left. \begin{array}{l} \mathbf{x}_1, \mathbf{x}_2 \in S \newline \lambda \in (0, 1) \end{array} \right\} \Longrightarrow \lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2 \in S \end{align*}
Recall (Convex Function)

A function f:RnRf: \mathbb{R}^n \to \mathbb{R} is convex on the convex set SS if,

x1,x2Sλ(0,1)}f(λx1+(1λ)x2)λf(x1)+(1λ)f(x2)\begin{align*} \left. \begin{array}{l} \mathbf{x}_1, \mathbf{x}_2 \in S \newline \lambda \in (0, 1) \end{array} \right\} \Longrightarrow f(\lambda \mathbf{x}_1 + (1 - \lambda) \mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1 - \lambda) f(\mathbf{x}_2) \end{align*}
Recall (Convex Optimization Problem)

A optimization problem,

min f(x)subject to xS\begin{align*} \min \ & f(\mathbf{x}) \newline \text{subject to} \ & \mathbf{x} \in S \newline \end{align*}

is called convex if SS is a convex set and ff is a convex function on SS. Typically,

S={xRngi(x)0, i=1,,m, hj(x)=0, j=1,,l}S = \{\mathbf{x} \in \mathbb{R}^n \mid g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m, \ h_j(\mathbf{x}) = 0, \ j = 1, \ldots, l \}

where gig_i are convex functions and hjh_j are affine functions.

Note ((Re)formulating Problems to be Convex)

Sometimes, we can reformulate problems to be convex.

min 12x12+x2subject to x122x2=0x1,x2R  "    "min x12subject to xR2, x2=12x12  "    "min 12x12+x2subject to x122x20\begin{align*} \min \ & \frac{1}{2} x_1^2 + x_2 \newline \text{subject to} \ & x_1^2 - 2 x_2 = 0 \newline & x_1, x_2 \in \mathbb{R} \end{align*} \ \ \text{"} \iff \text{"} \begin{align*} \min \ & x_1^2 \newline \text{subject to} \ & \mathbf{x} \in \mathbb{R}^2, \ x_2 = \frac{1}{2} x_1^2 \end{align*} \ \ \text{"} \iff \text{"} \begin{align*} \min \ & \frac{1}{2} x_1^2 + x_2 \newline \text{subject to} \ & x_1^2 - 2 x_2 \leq 0 \end{align*}

Subgradients and Subdifferentials

Recall (Subgradient)

Let SRnS \subseteq \mathbb{R}^n be a non-empty and convex set and f:SRf: S \to \mathbb{R} be a convex function on SS. Then, pRn\mathbf{p} \in \mathbb{R}^n is a subgradient of ff at xˉ\bar{\mathbf{x}} if,

f(x)f(xˉ)+pT(xxˉ), xSf(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S
Recall (Subdifferential)

Let SS and ff be defined as above. The subdifferential of ff at xˉ\bar{\mathbf{x}} is,

f(xˉ){pRnf(x)f(xˉ)+pT(xxˉ), xS}\partial f(\bar{\mathbf{x}}) \coloneqq \{\mathbf{p} \in \mathbb{R}^n \mid f(\mathbf{x}) \geq f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}), \ \forall \mathbf{x} \in S \}

i.e., the set of all subgradients of ff at xˉ\bar{\mathbf{x}}.

Note (C1C^1 functions and global optimality)

If fC1f \in C^1 and xˉint(S)\bar{\mathbf{x}} \in \mathrm{int}(S) 1int(S)\mathrm{int}(S) is the interior of SS, i.e., the set of all points in SS that are not on the boundary of SS., then,

f(xˉ)={f(xˉ)}\partial f(\bar{\mathbf{x}}) = \{\nabla f(\bar{\mathbf{x}})\}

Further, if f:RnRf : \mathbb{R}^n \to \mathbb{R} and fC1f \in C^1, then,

f(xˉ)=0    xˉ is a global minimum of f\nabla f(\bar{\mathbf{x}}) = 0 \iff \bar{\mathbf{x}} \text{ is a global minimum of } f
Intuition (Proposition about Subdifferentials for Convex Functions)

Let f:RnRf: \mathbb{R}^n \to \mathbb{R} be a convex function, then,

x is a global minimum of f    0f(x)\mathbf{x}^{\star} \text{ is a global minimum of } f \iff \mathbf{0} \in \partial f(\mathbf{x}^{\star})

Thus we can ask ourselves two questions,

  1. How can we use subgradients?
  2. How can we compute subgradients?

Subgradient Method

Intuition (Subgradient Method)

Step 0: Initialize x0,f0best=f(x0),k=0\mathbf{x}_0, f_0^\text{best} = f(\mathbf{x}_0), k = 0.

Step 1: Find pkf(xk)\mathbf{p}_k \in \partial f(\mathbf{x}_k).

Step 2: Update xk+1=xkαkpk\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{p}_k.

Step 3: fk+1best=min{fkbest,f(xk+1)}f_{k+1}^\text{best} = \min\{f_k^\text{best}, f(\mathbf{x}_{k+1})\}.

If termination criteria is fulfilled, STOP; Else, go to (1) with kk+1k \leftarrow k + 1.

Note

For an element pf(x)\mathbf{p} \in \partial f(\mathbf{x}), p-\mathbf{p} might not be a descent direction from xˉ\bar{\mathbf{x}}.

Example 1

Consider,

min x1+2x2subject to xR2\begin{align*} \min \ & \Vert x_1 \Vert + 2 \Vert x_2 \Vert \newline \text{subject to} \ & \mathbf{x} \in \mathbb{R}^2 \end{align*}

Let xˉ=[10]T\bar{\mathbf{x}} = \begin{bmatrix} 1 & 0 \end{bmatrix}^T, then f(xˉ)=1f(\bar{\mathbf{x}}) = 1, further, let p=[11]T\mathbf{p} = \begin{bmatrix} 1 & 1 \end{bmatrix}^T,

f(xˉ)+pT(xxˉ)=1+[11][x11x20]=x1+x2x1+2x2x1+2x2f(x)    pf(xˉ)\begin{align*} f(\bar{\mathbf{x}}) + \mathbf{p}^T (\mathbf{x} - \bar{\mathbf{x}}) & = 1 + \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 - 1 \newline x_2 - 0 \end{bmatrix} \newline & = x_1 + x_2 \newline & \leq \Vert x_1 \Vert + \Vert 2 x_2 \Vert \newline & \leq \underbrace{\Vert x_1 \Vert + 2 \Vert x_2 \Vert}_{f(\mathbf{x})} \newline & \implies \mathbf{p} \in \partial f(\bar{\mathbf{x}}) \newline \end{align*}

However, notice that,

f(xˉαp)=xˉ1αp1+2xˉ2αp2=1α+2α1 x\begin{align*} f(\bar{\mathbf{x}} - \alpha \mathbf{p}) & = \Vert \bar{x}_1 - \alpha p_1 \Vert + 2 \Vert \bar{x}_2 - \alpha p_2 \Vert \newline & = \Vert 1 - \alpha \Vert + 2 \Vert - \alpha \Vert \geq 1 \ \forall \mathbf{x} \end{align*}
Intuition (Subgradient Step)

Although pk-\mathbf{p}_k might not be a descent direction, it moves us closer to the optimal solution,

pkf(xk)    f(x)f(xk)+pkT(xxk)\mathbf{p}_k \in \partial f(\mathbf{x}_k) \implies f(\mathbf{x}^{\star}) \geq f(\mathbf{x}_k) + \mathbf{p}_k^T (\mathbf{x}^{\star} - \mathbf{x}_k)

where x\mathbf{x}^{\star} is a global minimum of ff,

    pkT(xkx)points towards optimal solutionf(xk)f(x)0\begin{align*} \implies \mathbf{p}_k^T \underbrace{(\mathbf{x}_k - \mathbf{x}^{\star})}_{\text{points towards optimal solution}} & \geq f(\mathbf{x}_k) - f(\mathbf{x}^{\star}) \newline & \geq 0 \end{align*}

i.e., the directions define half-spaces where the optimal solution cannot be.

Definition 1 (Step Length Rule)

For the subgradient method, we set the step length sequence {αk}\{\alpha_k\} a priori, a common rule is a sequence that is square summable but not summable, i.e.,

k=0αk2<k=0αk=\begin{align*} \sum_{k=0}^{\infty} \alpha_k^2 & < \infty \newline \sum_{k=0}^{\infty} \alpha_k & = \infty \newline \end{align*}

e.g.,

αk=ab+ck, a,b,c>0\alpha_k = \frac{a}{b + ck}, \ a, b, c > 0
Theorem 1 (Convergence of Subgradient Method)

Let f:RnRf: \mathbb{R}^n \to \mathbb{R} be a convex function and let X=argmin f(x)\mathbf{X}^{\star} = \arg \min \ f(\mathbf{x}) be non-empty. Let {xk}\{\mathbf{x}_k\} be the sequence generated by the subgradient method where {αk}\{\alpha_k\} is square summable but not summable. If {pk}\{\mathbf{p}_k\} is bounded, then,

f(xk)f and xkxXf(\mathbf{x}_k) \to f^{\star} \text{ and } \mathbf{x}_k \to \mathbf{x}^{\star} \in \mathbf{X}^{\star}

Connection to Lagrange Dual Problem

Recall (Primal, Lagrangian, and Dual Problem)

Consider the primal problem,

(P){inf f(x)subject to gi(x)0, i=1,,mxX\begin{align*} (P) \quad & \begin{cases} \inf \ & f(\mathbf{x}) \newline \text{subject to} \ & g_i(\mathbf{x}) \leq 0, \ i = 1, \ldots, m \newline & \mathbf{x} \in \mathbf{X} \end{cases} \end{align*}

We define the Lagrangian function as,

L(x,μ)=f(x)+i=1mμigi(x)\mathcal{L}(\mathbf{x}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x})

where μ0\boldsymbol{\mu} \geq 0 are the Lagrange multipliers. The Lagrange dual function is defined as,

q(μ)=infxXL(x,μ)q(\boldsymbol{\mu}) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu})

The dual problem is then defined as,

(D){sup q(μ)subject to μ0\begin{align*} (D) \quad & \begin{cases} \sup \ & q(\boldsymbol{\mu}) \newline \text{subject to} \ & \boldsymbol{\mu} \geq 0 \end{cases} \end{align*}
Note (Convexity of the Dual Problem)

Since qq is concave and μ0\boldsymbol{\mu} \geq 0 is a convex set, the dual problem is a convex optimization problem. However, as we have noted before, to evaluate q(μ)q(\boldsymbol{\mu}), we must solve an optimization problem, which may not even be easier than the primal problem.

Theorem 2 (Subgradients of the Lagrange Dual Function)

Let x(μ)argminxXL(x,μ)\mathbf{x}(\boldsymbol{\mu}) \coloneqq \arg \min_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}). Further, let X\mathbf{X} be non-empty and compact and let μ0\boldsymbol{\mu} \geq 0. If xμx(μ)\mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu}), then,

g(xμ)[gi(xμ)]i=1m is a subgradient to q at μg(\mathbf{x}_{\boldsymbol{\mu}}) \coloneqq \left[g_i(\mathbf{x}_{\boldsymbol{\mu}})\right]_{i=1}^m \text{ is a subgradient to } q \text{ at } \boldsymbol{\mu}

Furthermore,

q(μ)=conv{g(xμ)xμx(μ)}\partial q(\boldsymbol{\mu}) = \mathrm{conv} \{g(\mathbf{x}_{\boldsymbol{\mu}}) \mid \mathbf{x}_{\boldsymbol{\mu}} \in \mathbf{x}(\boldsymbol{\mu})\}

where conv(S)\mathrm{conv}(S) is the convex hull of the set SS.

Proof (Proof Idea)q(μ)=infxXL(z,μ)=infxX f(z)+i=1mμigi(z)=f(xμ)+i=1mμigi(xμ)=f(xμ)+μTg(xμ)=f(xμ)+μTg(xμ)(μˉμ)Tg(xμ)0=f(xμ)+μˉTg(xμ)q(μˉ)+(μμˉ)Tg(xμ)q(μˉ)+(μμˉ)Tg(xμ)\begin{align*} q(\boldsymbol{\mu}) & = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{z}, \boldsymbol{\mu}) \newline & = \inf_{\mathbf{x} \in \mathbf{X}} \ f(\mathbf{z}) + \sum_{i=1}^m \mu_i g_i(\mathbf{z}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \sum_{i=1}^m \mu_i g_i(\mathbf{x}_{\boldsymbol{\mu}}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\mu}^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline & = f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\mu}^T g(\mathbf{x}_{\boldsymbol{\mu}}) \underbrace{(\boldsymbol{\bar{\mu}} - \boldsymbol{\mu})^T g(\mathbf{x}_{\boldsymbol{\mu}})}_{0} \newline & = \underbrace{f(\mathbf{x}_{\boldsymbol{\mu}}) + \boldsymbol{\bar{\mu}}^T g(\mathbf{x}_{\boldsymbol{\mu}})}_{q(\boldsymbol{\bar{\mu}})} + (\boldsymbol{\mu} - \boldsymbol{\bar{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline & \geq q(\boldsymbol{\bar{\mu}}) + (\boldsymbol{\mu} - \boldsymbol{\bar{\mu}})^T g(\mathbf{x}_{\boldsymbol{\mu}}) \newline \end{align*}

Thus, g(xμ)g(\mathbf{x}_{\boldsymbol{\mu}}) is a subgradient to qq at μ\boldsymbol{\mu}.

Algorithm (Subgradient Method for the Lagrange Dual Problem)

Step 0: Initialize μ00,q0best=q(μ0),k=0\boldsymbol{\mu}_0 \geq 0, q_0^\text{best} = q(\boldsymbol{\mu}_0), k = 0.

Step 1: Evaluate q(μk)q(\boldsymbol{\mu}_k), i.e., solve q(μk)=infxXL(x,μk)q(\boldsymbol{\mu}_k) = \inf_{\mathbf{x} \in \mathbf{X}} \mathcal{L}(\mathbf{x}, \boldsymbol{\mu}_k). Let xk\mathbf{x}_k be a solution.

Step 2: Update μk+1=(μk+αkg(xk))+\boldsymbol{\mu}_{k+1} = \left(\boldsymbol{\mu}_k + \alpha_k g(\mathbf{x}_k)\right)_+, where ()+(\cdot)_+ is max{0,}\max\{0, \cdot\} applied element-wise.

Step 3: Let qk+1best=max{qkbest,q(μk+1)}q_{k+1}^\text{best} = \max\{q_k^\text{best}, q(\boldsymbol{\mu}_{k+1})\}.

Step 4: If termination criteria is fulfilled, STOP; Else, go to (1) with kk+1k \leftarrow k + 1.