Introduction
In this part, we will cover feasible descent methods for constrained optimization problems. More specifically, we will cover the general framework for feasible descent methods in a constrained setting, and then three specific algorithms that fall under this framework: the Frank-Wolfe method, the Simplicial Decomposition method, and the Gradient Projection method.
Feasible Descent Methods
Explanation: Problem Setting
Consider the constrained optimization problem, $$ \begin{align*} \min \ & f(\mathbf{x}) \newline \text{subject to} \ & \mathbf{x} \in S \newline \end{align*} $$ where $S$ is a non-empty, closed, and convex set. Further, let $f \in C^1$ on $S$.
Recall: Optimality Condition for Constrained Problems
Algorithms are essentially built on optimality conditions: they give us a signal for when to stop, and they also suggest how to construct search directions.
Therefore, we recall the optimality condition for constrained problems.
Theorem: Stationary Points in Constrained Problems
Let $f$ and $S$ be defined as above. Then, $$ \mathbf{x}^{\star} \text{ is a local minimum } \implies \nabla f(\mathbf{x}^{\star})^T (\mathbf{x} - \mathbf{x}^{\star}) \geq 0, \ \forall \mathbf{x} \in S $$
Furthermore, recall the more generalized theorem about stationary points in constrained problems.
Theorem: Equivalent Statements of Stationary Points in Constrained Problems
Let $f$ and $S$ be defined as above. Then, the following statements are equivalent,
$$ \begin{cases} \nabla f(\mathbf{x}^{\star})^T (\mathbf{x} - \mathbf{x}^{\star}) \geq 0, \ \forall \mathbf{x} \in S \newline -\nabla f(\mathbf{x}^{\star}) \in N_S(\mathbf{x}^{\star}) \coloneqq \{\mathbf{p} \in \mathbb{R}^n \mid \mathbf{p}^T (\mathbf{x} - \mathbf{x}^{\star}) \leq 0, \ \forall \mathbf{x} \in S \} \newline \min_{\mathbf{x} \in S} \nabla f(\mathbf{x}^{\star})^T (\mathbf{x} - \mathbf{x}^{\star}) = 0 \newline \mathbf{x}^{\star} = \mathrm{proj}_S(\mathbf{x}^{\star} - \nabla f(\mathbf{x}^{\star})). \newline \end{cases} $$
where $\mathrm{proj}_S(\mathbf{x}) \coloneqq \underset{\mathbf{y} \in S}{\arg\min} \ \Vert \mathbf{y} - \mathbf{x} \Vert$, i.e., the point in $S$ that is closest to $\mathbf{x}$.
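As a quick sanity check (an illustrative example, not part of the theorem), take $S = [0, \infty) \subset \mathbb{R}$ and $f(x) = (x + 1)^2$, whose unconstrained minimizer $x = -1$ lies outside $S$. The constrained minimizer is $x^{\star} = 0$, with $\nabla f(x^{\star}) = 2$, and all four statements can be checked directly, $$ \begin{align*} & \nabla f(x^{\star})(x - x^{\star}) = 2x \geq 0, \ \forall x \in S \newline & -\nabla f(x^{\star}) = -2 \in N_S(0) = (-\infty, 0] \newline & \min_{x \in S} \nabla f(x^{\star})(x - x^{\star}) = \min_{x \geq 0} 2x = 0 \newline & \mathrm{proj}_S(x^{\star} - \nabla f(x^{\star})) = \mathrm{proj}_{[0, \infty)}(-2) = 0 = x^{\star}. \newline \end{align*} $$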
Intuition: General Feasible Descent Method
With these, we can now formulate a general feasible descent method.
Algorithm: General Feasible Descent Method
Step 0: Initialize $\mathbf{x}_0 \in S$, $k = 0$
Step 1: Find a direction $\mathbf{p}_k$ for which there exists an $\bar{\alpha} > 0$ such that $$ \begin{cases} \mathbf{x}_k + \alpha \mathbf{p}_k \in S, \ \forall \alpha \in [0, \bar{\alpha}] \newline f(\mathbf{x}_k + \alpha \mathbf{p}_k) < f(\mathbf{x}_k), \ \forall \alpha \in (0, \bar{\alpha}] \newline \end{cases} $$ Step 2: Determine a step length $\alpha_k$ such that, $$ \begin{cases} \mathbf{x}_k + \alpha_k \mathbf{p}_k \in S \newline f(\mathbf{x}_k + \alpha_k \mathbf{p}_k) < f(\mathbf{x}_k) \newline \end{cases} $$
Step 3: Update our current iterate, $$ \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k $$ Step 4: If the termination criterion is fulfilled, STOP; Else, go to (1) with $k \leftarrow k + 1$.
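A minimal Python sketch of this loop is given below, assuming the user supplies the direction-finding and step-length routines; the names `find_direction` and `step_length` are placeholders, not part of the framework above.

```python
import numpy as np

def feasible_descent(x0, find_direction, step_length, tol=1e-8, max_iter=1000):
    """Generic feasible descent loop.

    find_direction(x) -> a feasible descent direction p, or None if x is
                         (approximately) stationary
    step_length(x, p) -> alpha > 0 such that x + alpha * p stays in S and
                         decreases f
    """
    x = np.asarray(x0, dtype=float)           # Step 0: feasible starting point
    for _ in range(max_iter):
        p = find_direction(x)                 # Step 1
        if p is None or np.linalg.norm(p) < tol:
            break                             # Step 4: termination criterion
        alpha = step_length(x, p)             # Step 2
        x = x + alpha * p                     # Step 3
    return x
```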
Note: Remarks on the General Feasible Descent Method
As we can see, this is quite similar to the unconstrained setting! One key difference is that search directions are often computed as $\mathbf{p}_k = \mathbf{y}_k - \mathbf{x}_k$, where $\mathbf{y}_k$ solves some other optimization problem (we will see examples of this later).
Line search is also similar to the unconstrained setting, but here we additionally need to ensure that the new iterate stays feasible. Moreover, finding feasible descent directions can be difficult for some sets $S$.
Frank-Wolfe Method
Intuition: Frank-Wolfe Method
Let $S$ be a polyhedron. From the third optimality condition, i.e., $$ \mathbf{x}^{\star} \text{ is a local minimum } \implies \min_{\mathbf{x} \in S} \nabla f(\mathbf{x}^{\star})^T (\mathbf{x} - \mathbf{x}^{\star}) = 0 $$ we get, by the contrapositive, $$ \min_{\mathbf{x} \in S} \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) < 0 \implies \mathbf{x}_k \text{ is not a local minimum.} $$ This gives us an idea of how to find feasible descent directions: solve a linear optimization problem.
At $\mathbf{x}_k \in S$, if $\min_{\mathbf{x} \in S} \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) < 0$, find, $$ \mathbf{y}_k \in \underset{\mathbf{x} \in S}{\arg\min} \ \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) $$ and set $\mathbf{p}_k = \mathbf{y}_k - \mathbf{x}_k$. Then, $\mathbf{p}_k$ is a feasible descent direction.
Note
The $\mathbf{y}_k$ problem is an LP, i.e., the objective is linear in $\mathbf{x}$ and the constraints are linear inequalities (since $S$ is a polyhedron). Thus, if one uses the simplex method, $\mathbf{y}_k$ will always be an extreme point of $S$, which means the direction $\mathbf{p}_k$ always points towards an extreme point of $S$.
Algorithm: Frank-Wolfe Method
Step 0: Initialize $\mathbf{x}_0 \in S$, $k = 0$
Step 1: Find, $$ \mathbf{y}_k \in \underset{\mathbf{x} \in S}{\arg\min} \ \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) $$ and set $\mathbf{p}_k = \mathbf{y}_k - \mathbf{x}_k$.
Step 2: Compute a step length $\alpha_k \in [0, 1]$ using, e.g., the Armijo step length rule. (Restricting $\alpha_k$ to $[0, 1]$ keeps the new iterate in $S$, since it is then a convex combination of $\mathbf{x}_k$ and $\mathbf{y}_k$.)
Step 3: Update, $$ \begin{align*} \mathbf{x}_{k+1} & = \mathbf{x}_k + \alpha_k \mathbf{p}_k \newline & = (1 - \alpha_k) \mathbf{x}_k + \alpha_k \mathbf{y}_k \newline \end{align*} $$
Step 4: If $\nabla f(\mathbf{x}_k)^T (\mathbf{y}_k - \mathbf{x}_k)$ is close to zero or $\alpha_k$ is close to zero, STOP; Else, go to (1) with $k \leftarrow k + 1$.
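Below is a minimal Python sketch of the method, assuming the polyhedron is given as $S = \{\mathbf{x} \mid A\mathbf{x} \leq \mathbf{b}\}$ and using `scipy.optimize.linprog` for the direction-finding LP. The classical diminishing step $\alpha_k = 2/(k+2)$ is used in place of the Armijo rule purely to keep the sketch short; `grad_f`, `A`, `b`, and `x0` are placeholders supplied by the user.

```python
import numpy as np
from scipy.optimize import linprog

def frank_wolfe(grad_f, A, b, x0, max_iter=100, tol=1e-8):
    """Frank-Wolfe sketch for min f(x) s.t. A x <= b (bounded polyhedron)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        # Direction-finding LP: minimizing g^T (y - x) over S is the same
        # as minimizing g^T y over S, since g^T x is a constant.
        res = linprog(c=g, A_ub=A, b_ub=b,
                      bounds=[(None, None)] * len(x), method="highs")
        y = res.x                            # an extreme point of S
        gap = g @ (x - y)                    # -(optimal LP value), always >= 0
        if gap < tol:
            break                            # x_k is (nearly) stationary
        alpha = 2.0 / (k + 2.0)              # diminishing step in [0, 1]
        x = (1.0 - alpha) * x + alpha * y    # convex combination stays in S
    return x
```

For example, with $f(\mathbf{x}) = \tfrac{1}{2}\Vert \mathbf{x} - \mathbf{c} \Vert^2$ one would pass `grad_f = lambda x: x - c`.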
Theorem: Convergence of Frank-Wolfe Method
Let $S$ be a non-empty and bounded polyhedron. Further, let $f \in C^1$ on $S$.
If the Armijo step length rule is used, then the sequence $\{\mathbf{x}_k\}$ generated by the Frank-Wolfe method has bounded limit points and all limit points are stationary points of the problem.
Simplicial Decomposition Method
Intuition: Simplicial Decomposition Method
As we have seen, the Frank-Wolfe method finds the next iterate between our current iterate and an extreme point of $S$.
The Simplicial Decomposition method is similar, but instead of only considering the current iterate and one extreme point, we consider the convex hull of multiple extreme points. Thus, we can find the next iterate in a larger set, which can potentially give us a better descent direction. Further, we can also remove extreme points that are not useful anymore.
Algorithm: (Easy Version) Simplicial Decomposition Method
Step 0: Initialize $\mathbf{x}_0 \in S$, $k = 0$.
Step 1: Find, $$ \mathbf{y}_k \in \underset{\mathbf{x} \in S}{\arg\min} \ \nabla f(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) $$
Step 2: Find, $$ \begin{align*} (\mu_k, \mathbf{v}_k) \in \underset{(\mu, \mathbf{v})}{\arg\min} \ & f\Big(\mu \mathbf{x}_k + \sum_{i=0}^{k} v_i \mathbf{y}_i\Big) \newline \text{subject to} \ & \mu + \sum_{i=0}^{k} v_i = 1 \newline & \mu \geq 0, \ v_i \geq 0, \ i = 0, \dots, k \newline \end{align*} $$ where $\mathbf{y}_0, \dots, \mathbf{y}_k$ are the extreme points generated so far.
Step 3: Update, $$ \mathbf{x}_{k+1} = \mu_k \mathbf{x}_k + \sum_{i=0}^{k} v^{(i)}_k \mathbf{y}_i $$ then go to Step 1 with $k \leftarrow k + 1$.
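A hedged Python sketch of Step 2, the restricted master problem, is given below. It uses `scipy.optimize.minimize` with a simplex constraint on the weights (SLSQP is selected automatically when bounds and an equality constraint are given); `f`, `x_k`, and the list `Y` of generated extreme points are placeholders supplied by the caller.

```python
import numpy as np
from scipy.optimize import minimize

def restricted_master(f, x_k, Y):
    """Minimize f(mu * x_k + sum_i v_i * y_i) over weights (mu, v) that are
    nonnegative and sum to one.  Y holds the extreme points y_0, ..., y_k."""
    Y = [np.asarray(y, dtype=float) for y in Y]
    x_k = np.asarray(x_k, dtype=float)
    m = len(Y) + 1                            # one weight for x_k, one per y_i

    def objective(w):
        # w[0] = mu, w[1:] = v
        point = w[0] * x_k + sum(w_i * y_i for w_i, y_i in zip(w[1:], Y))
        return f(point)

    w0 = np.full(m, 1.0 / m)                  # feasible starting weights
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bnds = [(0.0, None)] * m
    res = minimize(objective, w0, bounds=bnds, constraints=cons)
    mu, v = res.x[0], res.x[1:]
    x_next = mu * x_k + sum(v_i * y_i for v_i, y_i in zip(v, Y))
    return x_next, mu, v
```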
Note: Remarks on the Simplicial Decomposition Method
The Simplicial Decomposition method typically converges faster than the Frank-Wolfe method, since it searches over a larger set when computing the next iterate. However, the restricted master problem for $(\mu_k, \mathbf{v}_k)$ can be difficult to solve, since the number of variables increases with $k$.
Gradient Projection Method
Intuition: Gradient Projection Method
The Gradient Projection method is based on the fourth optimality condition, i.e., $$ \begin{align*} \mathbf{x}^{\star} \text{ is a local minimum } & \implies \mathbf{x}^{\star} = \mathrm{proj}_S(\mathbf{x}^{\star} - \nabla f(\mathbf{x}^{\star})) \newline & \iff \mathbf{x}^{\star} = \mathrm{proj}_S(\mathbf{x}^{\star} - \alpha \nabla f(\mathbf{x}^{\star})), \ \forall \alpha > 0. \newline \end{align*} $$ Thus, by the contrapositive, for any $\alpha > 0$, $$ \mathbf{p}_k \coloneqq \mathrm{proj}_S(\mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)) - \mathbf{x}_k \neq \mathbf{0} \implies \mathbf{x}_k \text{ is not a local minimum.} $$ This gives us a way to find feasible descent directions: project the gradient step onto the feasible set.
Algorithm: Gradient Projection Method
Step 0: Initialize $\mathbf{x}_0 \in S$, $k = 0$
Step 1: Try to update, $$ \mathbf{x}_{k+1} = \mathrm{proj}_S(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)) $$ where $\alpha_k$ is selected by an Armijo-type step length rule, i.e., it must satisfy, $$ f(\mathrm{proj}_S(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k))) \leq f(\mathbf{x}_k) + \mu \nabla f(\mathbf{x}_k)^T \mathbf{p}_k $$ where $\mathbf{p}_k = \mathrm{proj}_S(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)) - \mathbf{x}_k$ and $\mu \in (0, 1)$.
Step 2: If $\mathbf{p}_k$ is close to zero or $\alpha_k$ is close to zero, STOP; Else, go to (1) with $k \leftarrow k + 1$.
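A minimal Python sketch of the method, assuming a projection routine `proj_S` is available (projections onto some specific sets are discussed in the remarks below) and using backtracking to satisfy the Armijo-type condition in Step 1; the parameter values are illustrative.

```python
import numpy as np

def gradient_projection(f, grad_f, proj_S, x0, alpha0=1.0, mu=0.1,
                        beta=0.5, tol=1e-8, max_iter=500):
    """Gradient projection with backtracking along the projection arc."""
    x = proj_S(np.asarray(x0, dtype=float))   # start from a feasible point
    for _ in range(max_iter):
        g = grad_f(x)
        alpha = alpha0
        while True:
            x_trial = proj_S(x - alpha * g)   # projected gradient step
            p = x_trial - x
            # Armijo-type sufficient decrease along the projection arc
            if f(x_trial) <= f(x) + mu * g @ p or alpha < 1e-12:
                break
            alpha *= beta                     # backtrack
        if np.linalg.norm(p) < tol:
            break                             # p_k close to zero: stop
        x = x_trial
    return x
```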
Note: Remarks on the Gradient Projection Method
To compute $\mathrm{proj}_S(\mathbf{x})$, we must solve, $$ \underset{\mathbf{y} \in S}{\min} \ \Vert \mathbf{y} - \mathbf{x} \Vert $$ The solution always exists (and is unique) for a non-empty, closed, and convex set $S$, but it is not necessarily easy to compute.
However, projection can be (relatively) easy in some cases. Consider, $$ X = \{\mathbf{x} \in \mathbb{R}^n \mid \mathbf{x} \geq 0\} $$ then, $$ \begin{align*} \mathrm{proj}_X(\mathbf{x}) & = \left[\max\{0, x_i\}\right]_{i=1}^n \newline & = \left[(x_i)_+\right]_{i=1}^n \newline \end{align*} $$
where $(\cdot)_+$ is the positive part function, i.e., $(s)_+ = \max\{0, s\}$
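In NumPy this componentwise projection is essentially a one-liner (a sketch, assuming `x` is a NumPy array):

```python
import numpy as np

def proj_nonneg(x):
    """Componentwise projection onto the nonnegative orthant."""
    return np.maximum(x, 0.0)
```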
Further, consider, $$ X = \{\mathbf{x} \in \mathbb{R}^n \mid \sum_{i=1}^n x_i = 1, \ x_i \geq 0 \} $$ then, $$ \mathrm{proj}_X(\mathbf{x}) = \left[(x_i - \lambda)_+\right]_{i=1}^n $$
where $\lambda$ is chosen such that, $$ \sum_{i=1}^n (x_i - \lambda)_+ = 1 $$
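One common way to compute this projection is the sort-based procedure sketched below (a standard technique, not taken from the text above): sort the components, find the threshold $\lambda$, and clip.

```python
import numpy as np

def proj_simplex(x):
    """Euclidean projection of x onto {y : sum(y) = 1, y >= 0}."""
    u = np.sort(x)[::-1]                      # components in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(x) + 1)
    # largest rho (0-based) with u_rho > (css_rho - 1) / rho
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    lam = (css[rho] - 1.0) / (rho + 1.0)      # the threshold lambda
    return np.maximum(x - lam, 0.0)
```

The returned vector is nonnegative and its components sum to one, as required.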
Theorem: Convergence of Gradient Projection Method
Let $S$ be a non-empty, closed, and convex set and $f \in C^1$ on $S$.
If the Armijo step length rule is used, then the sequence $\{\mathbf{x}_k\}$ generated by the Gradient Projection method has bounded limit points and all limit points are stationary points of the problem.