Introduction
We want to solve problems of the form, $$ \underset{\mathbf{x} \in \mathbb{R}^n}{\min} f(\mathbf{x}), $$ where $f: \mathbb{R}^n \to \mathbb{R}$.
Most algorithms for solving such problems are iterative:
- Start at some initial point $\mathbf{x}_0$.
- Go from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$ by some rule.
- Terminate based on some criterion.
Line Search Methods
We will first look at line search methods.
Definition: Line Search Methods
A line search algorithm has the following form.
Step 0: Set $k=0$ and choose an initial guess $\mathbf{x}_0$.
Step 1: Find search direction, $\mathbf{p}_k \in \mathbb{R}^n$.
Step 2: Perform line search, i.e., find step length $\alpha_k > 0$, such that $f(\mathbf{x}_k + \alpha_k \mathbf{p}_k) < f(\mathbf{x}_k)$.
Step 3: Update, $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k$.
Step 4: If termination criterion is satisfied, stop. Otherwise, set $k = k + 1$ and go to Step 1.
Steps 0, 3, and 4 are trivial; therefore, we will spend most of our time on Steps 1 and 2.
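As a minimal sketch (not from the notes), the generic loop can be written in Python as follows; the routines `search_direction` and `line_search` are placeholders for the methods derived in Steps 1 and 2, and the stopping test anticipates the termination criteria discussed later:

```python
import numpy as np

def line_search_method(f, grad_f, x0, search_direction, line_search,
                       tol=1e-6, max_iter=1000):
    """Generic line search loop: x_{k+1} = x_k + alpha_k * p_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        # Step 4: terminate near a stationary point (relative criterion).
        if np.linalg.norm(g) < tol * (1 + abs(f(x))):
            break
        p = search_direction(x, g)       # Step 1: search direction
        alpha = line_search(f, x, g, p)  # Step 2: step length
        x = x + alpha * p                # Step 3: update
    return x
```

For instance, steepest descent corresponds to passing `search_direction = lambda x, g: -g`.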
Step 1 - Search Direction
To understand how we should find a search direction, let’s recap what a descent direction is.
Definition: Descent direction
Let $\mathbf{x}_k \in S$. A vector $\mathbf{p}_k \in \mathbb{R}^n$ is called a descent direction with respect to $f$ at the point $\mathbf{x}_k$ if, $$ \exists \delta > 0 : f(\mathbf{x}_k + \alpha \mathbf{p}_k) < f(\mathbf{x}_k), \quad \forall \alpha \in (0, \delta]. $$
By a first-order Taylor expansion, if $\nabla f(\mathbf{x}_k)^T \mathbf{p}_k < 0$, then $\mathbf{p}_k$ is a descent direction.
Thus, $\mathbf{p}_k = -\nabla f(\mathbf{x}_k)$ is a descent direction (if $\nabla f(\mathbf{x}_k) \neq 0$), since then $\nabla f(\mathbf{x}_k)^T \mathbf{p}_k = -\Vert \nabla f(\mathbf{x}_k) \Vert^2 < 0$. It is called the steepest descent direction (since the gradient points in the direction of steepest ascent, the negative gradient points in the direction of steepest descent).
We will mainly derive two methods for finding a search direction, but for this we need the following lemma.
Lemma: Descent direction
Let $f \in C^1$ and let $\mathbf{Q} \in \mathbb{R}^{n \times n}$ be a symmetric, positive definite matrix. If $\nabla f(\mathbf{x}_k) \neq 0$, then, $$ \mathbf{p}_k = -\mathbf{Q} \nabla f(\mathbf{x}_k) \ \text{is a descent direction.} $$
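The proof is essentially the computation behind the earlier observation: since $\mathbf{Q} \succ 0$ and $\nabla f(\mathbf{x}_k) \neq 0$, $$ \nabla f(\mathbf{x}_k)^T \mathbf{p}_k = -\nabla f(\mathbf{x}_k)^T \mathbf{Q} \nabla f(\mathbf{x}_k) < 0, $$ so $\mathbf{p}_k$ satisfies the descent condition above.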
Note
$\mathbf{Q} = \mathbf{I}$ gives the steepest descent direction.
We will now introduce Newton’s method.
Newton’s Method
Definition: Newton’s Method
Let $f \in C^2$ and $\nabla^2 f(\mathbf{x}_k)$ be positive definite, i.e., $\nabla^2 f(\mathbf{x}_k) \succ 0$. By a second-order Taylor expansion of $f$ around $\mathbf{x}_k$, we have, $$ f(\mathbf{x}_k + \mathbf{p}) \approx \underbrace{f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T \mathbf{p} + \frac{1}{2} \mathbf{p}^T \nabla^2 f(\mathbf{x}_k) \mathbf{p}}_{\eqqcolon \varphi_{\mathbf{x}_k}(\mathbf{p})} + \mathcal{O}(\ldots) $$ Now, consider the direction, $$ \mathbf{p}_k = \underset{\mathbf{p} \in \mathbb{R}^n}{\arg \min} \ \varphi_{\mathbf{x}_k}(\mathbf{p}). $$ Taking the gradient of $\varphi_{\mathbf{x}_k}(\mathbf{p})$ and setting it to zero, we get, $$ \begin{align*} \nabla \varphi_{\mathbf{x}_k}(\mathbf{p}) & = 0 \newline \nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k) \mathbf{p} & = 0 \newline \mathbf{p} & = -\underbrace{\left(\nabla^2 f(\mathbf{x}_k)\right)^{-1}}_{\text{invertible since} \ \nabla^2 f(\mathbf{x}_k) \succ 0} \nabla f(\mathbf{x}_k) \end{align*} $$ The resulting $\mathbf{p}_k$ is called the Newton direction.
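In practice, one does not form the inverse explicitly but instead solves the linear system $\nabla^2 f(\mathbf{x}_k)\,\mathbf{p}_k = -\nabla f(\mathbf{x}_k)$. A minimal sketch in Python, assuming callables `grad_f` and `hess_f` that return the gradient and Hessian (the names are illustrative, not from the notes):

```python
import numpy as np

def newton_direction(grad_f, hess_f, x):
    """Newton direction: solve H p = -g instead of forming H^{-1}."""
    g = grad_f(x)
    H = hess_f(x)
    # Assumes H is positive definite; np.linalg.solve raises an error if H is singular.
    return np.linalg.solve(H, -g)
```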
Some remarks about Newton’s method. When the Hessian $\nabla^2 f(\mathbf{x}_k)$ is positive definite, then $\mathbf{p}_k$ is a descent direction (by the lemma).
However, if $\nabla^2 f(\mathbf{x}_k)$ is negative definite, then $\mathbf{p}_k$ is an ascent direction (and if the Hessian is not invertible, $\mathbf{p}_k$ is not even defined). This means that Newton's method can be used for both minimization and maximization problems, but it may be problematic when the Hessian is indefinite or singular.
How do we solve this issue? By perturbing the Hessian matrix so that it becomes positive definite.
Definition: Levenberg-Marquardt modification
We take the search direction, $$ \mathbf{p}_k = -\left(\nabla^2 f(\mathbf{x}_k) + \gamma \mathbf{I}\right)^{-1} \nabla f(\mathbf{x}_k), $$ for $\gamma \geq 0$ such that $\nabla^2 f(\mathbf{x}_k) + \gamma \mathbf{I} \succ 0$. An interesting question is: what is the lower bound on $\gamma$ needed to make the matrix positive definite?
Let $(\lambda_i, \mathbf{v}_i)$ be the eigenpairs of $\nabla^2 f(\mathbf{x}_k)$, then the eigenpairs of $\nabla^2 f(\mathbf{x}_k) + \gamma \mathbf{I}$ are $(\lambda_i + \gamma, \mathbf{v}_i)$. Thus, we need $\gamma > -\lambda_{\min}$, where $\lambda_{\min}$ is the smallest eigenvalue of $\nabla^2 f(\mathbf{x}_k)$.
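A minimal sketch of this modification in Python, choosing $\gamma$ from the smallest eigenvalue as above (the offset `eps` and the function names are illustrative assumptions):

```python
import numpy as np

def levenberg_marquardt_direction(grad_f, hess_f, x, eps=1e-6):
    """Newton direction with the Hessian shifted until it is positive definite."""
    g = grad_f(x)
    H = hess_f(x)
    lam_min = np.linalg.eigvalsh(H).min()  # smallest eigenvalue of the (symmetric) Hessian
    gamma = max(0.0, -lam_min + eps)       # gamma > -lambda_min  =>  H + gamma*I  is PD
    return np.linalg.solve(H + gamma * np.eye(len(x)), -g)
```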
Note
- When $\gamma = 0$, we have the original Newton’s method, why?
Solution
Because $\nabla^2 f(\mathbf{x}_k) + 0 \mathbf{I} = \nabla^2 f(\mathbf{x}_k)$.
- When $\gamma \to \infty$, we have the steepest descent method, why?
Solution
Because $\left(\nabla^2 f(\mathbf{x}_k) + \gamma \mathbf{I}\right)^{-1} \approx \frac{1}{\gamma} \mathbf{I}$ when $\gamma$ is large, we get $\mathbf{p}_k \approx -\frac{1}{\gamma} \nabla f(\mathbf{x}_k)$, i.e., a (very short) step in the steepest descent direction.
We note that computing the Hessian and its inverse is expensive; therefore, quasi-Newton methods are often used instead, where one approximates the Hessian.
We will not go into them, but it is good to know that they exist.
Let’s also summarize the two algorithms’ strengths and weaknesses.
Summary: Strengths and Weaknesses of Steepest Descent and Newton’s Method
Steepest Descent:
- Strengths:
- Cheap (only need gradient).
- Weaknesses:
- “Zig-zagging” behavior ~ needs many (cheap) iterations.
Newton’s Method:
- Strengths:
- Fast convergence ~ needs few (expensive) iterations.
- Weaknesses:
- Expensive (need Hessian and its inverse).
Step 2 - Line Search
We will now look at how to perform a line search, i.e., how to find a step length $\alpha_k > 0$ such that $f(\mathbf{x}_k + \alpha_k \mathbf{p}_k) < f(\mathbf{x}_k)$.
Given a point $\mathbf{x}_k$ and a direction $\mathbf{p}_k$ (from Step 1), we would ideally like to solve,
$$ \underset{\alpha > 0}{\min} \ \underbrace{f(\mathbf{x}_k + \alpha \mathbf{p}_k)}_{\eqqcolon \varphi(\alpha)}. $$
This procedure is called exact line search (exact step length calculation). Most of the time it is not worth the effort, since the minimizer of $f$ is unlikely to lie on the line $\{\mathbf{x}_k + \alpha \mathbf{p}_k : \alpha > 0\}$ anyway, so solving this one-dimensional problem exactly gains us little.
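As an illustration (not from the notes), the exact step length does have a closed form for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x} - \mathbf{b}^T \mathbf{x}$ with $\mathbf{Q} \succ 0$: setting $\varphi'(\alpha) = 0$ gives $$ \varphi'(\alpha) = \nabla f(\mathbf{x}_k + \alpha \mathbf{p}_k)^T \mathbf{p}_k = \nabla f(\mathbf{x}_k)^T \mathbf{p}_k + \alpha \, \mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k = 0 \quad \Longrightarrow \quad \alpha_k = -\frac{\nabla f(\mathbf{x}_k)^T \mathbf{p}_k}{\mathbf{p}_k^T \mathbf{Q} \mathbf{p}_k}, $$ which is positive whenever $\mathbf{p}_k$ is a descent direction. For general $f$, no such formula is available.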
However, we still want to take a step that decreases the function value sufficiently; this is referred to as “sufficient decrease”.
We will often use the Armijo Step Length Rule.
Definition: Armijo Step Length Rule
To find $\alpha$ that gives “sufficient decrease”, note that, $$ f(\mathbf{x}_k + \alpha \mathbf{p}_k) \approx f(\mathbf{x}_k) + \alpha \nabla f(\mathbf{x}_k)^T \mathbf{p}_k, $$ where $\alpha \nabla f(\mathbf{x}_k)^T \mathbf{p}_k$ is the predicted change of $f$ when going from $\mathbf{x}_k$ to $\mathbf{x}_k + \alpha \mathbf{p}_k$. We want the actual decrease to be at least a fixed fraction of this predicted decrease.
So we accept a step length $\alpha$ if it achieves a fraction $\mu \in (0, 1)$ of the predicted decrease, i.e., if, $$ f(\mathbf{x}_k + \alpha \mathbf{p}_k) \leq f(\mathbf{x}_k) + \mu \alpha \nabla f(\mathbf{x}_k)^T \mathbf{p}_k, $$ then we accept $\alpha$ as our step length;
otherwise, we set $\alpha = \frac{\alpha}{2}$ and repeat until the condition is satisfied.
Note
Eventually, some $\alpha$ will satisfy the condition, since $\mathbf{p}_k$ is a descent direction.
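A minimal sketch of this backtracking procedure in Python; the parameter defaults `mu=0.1` and `alpha0=1.0` are illustrative choices, not prescribed by the rule:

```python
import numpy as np

def armijo_step(f, x, g, p, mu=0.1, alpha0=1.0):
    """Backtracking line search with the Armijo sufficient-decrease condition."""
    alpha = alpha0
    fx = f(x)
    slope = g @ p  # predicted rate of change; negative since p is a descent direction
    while f(x + alpha * p) > fx + mu * alpha * slope:
        alpha /= 2.0  # halve the step until sufficient decrease is achieved
    return alpha
```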
Step 4 - Termination Criteria
We will now look at some commonly used termination criteria.
Definition: Termination Criteria
Three common termination criteria are,
- $\Vert \nabla f(\mathbf{x}_k) \Vert < \epsilon_1(1 + |f(\mathbf{x}_k)|)$ for some small $\epsilon_1 > 0$.
- $f(\mathbf{x}_{k - 1}) - f(\mathbf{x}_k) < \epsilon_2(1 + |f(\mathbf{x}_k)|)$ for some small $\epsilon_2 > 0$.
- $\Vert \mathbf{x}_k - \mathbf{x}_{k - 1} \Vert < \epsilon_3(1 + \Vert \mathbf{x}_k \Vert)$ for some small $\epsilon_3 > 0$.
The first criterion checks if we are close to a stationary point, the second checks if we have a small change in function value, and the third checks if we have a small change in the solution.
The factors $(1 + |f(\mathbf{x}_k)|)$ and $(1 + \Vert \mathbf{x}_k \Vert)$ are included to make the criteria scale-invariant (i.e., independent of the scaling of $f$ and $\mathbf{x}$).
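A minimal sketch of these checks in Python (the tolerance values and the choice to combine the criteria with a logical *or* are illustrative assumptions):

```python
import numpy as np

def should_terminate(f, grad_f, x_prev, x, eps1=1e-6, eps2=1e-9, eps3=1e-9):
    """Return True if any of the three relative termination criteria is satisfied."""
    fx = f(x)
    crit1 = np.linalg.norm(grad_f(x)) < eps1 * (1 + abs(fx))             # near-stationary point
    crit2 = f(x_prev) - fx < eps2 * (1 + abs(fx))                        # small decrease in f
    crit3 = np.linalg.norm(x - x_prev) < eps3 * (1 + np.linalg.norm(x))  # small step in x
    # One may instead require all criteria simultaneously, depending on the application.
    return crit1 or crit2 or crit3
```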
Now, we have covered all the steps of a line search method, but it is also of interest to look at the convergence of such methods, so let’s do that.
Convergence
If $\{\mathbf{x}_k \}$ are the iterates of a line search method, we want, $$ \{\mathbf{x}_k \} \to \mathbf{x}^*, $$ where $\mathbf{x}^*$ is a stationary point (ideally a local minimizer).
We will formalize this, but for now, let’s go over some conditions on our search directions $\mathbf{p}_k$.
Definition: Descent direction conditions
All search directions $\mathbf{p}_k$ should satisfy the following conditions,
- $\Vert \mathbf{p}_k \Vert$ is bounded, i.e., $\exists M > 0 : \Vert \mathbf{p}_k \Vert \leq M, \ \forall k$.
- $\Vert \mathbf{p}_k \Vert \geq s_1 \Vert \nabla f(\mathbf{x}_k) \Vert$, for some $s_1 > 0$.
- $-\frac{\nabla f(\mathbf{x}_k)^T \mathbf{p}_k}{\Vert \nabla f(\mathbf{x}_k) \Vert \Vert \mathbf{p}_k \Vert} \geq s_2$, for some $s_2 > 0$.
Condition 1 ensures that the step length does not become too large.
Condition 2 ensures that the step length does not become too small (or, that we actually move and do not get stuck).
Condition 3 ensures that $\mathbf{p}_k$ is a descent direction (it forces $\nabla f(\mathbf{x}_k)^T \mathbf{p}_k < 0$) and that it stays bounded away from being orthogonal to the gradient.
We will now state a theorem regarding convergence.
Theorem: Convergence of Line Search Methods
Let $f \in C^1$ and assume for $\mathbf{x}_0$, the level set, $$ \{\mathbf{x} \in \mathbb{R}^n \mid f(\mathbf{x}) \leq f(\mathbf{x}_0) \} \ \text{is bounded.} $$ Furthermore, assume $\mathbf{p}_k$ satisfies the descent direction conditions and that $\alpha_k$ is chosen by the Armijo step length rule. Then,
- The sequence $\{\mathbf{x}_k \}$ is bounded.
- The sequence $\{f(\mathbf{x}_k) \}$ is decreasing and lower bounded.
- Every limit point of $\{\mathbf{x}_k \}$ is a stationary point, or in looser terms, $$ “\lim_{k \to \infty} \mathbf{x}_k = \mathbf{x}^* : \nabla f(\mathbf{x}^*) = 0” $$
Note
The steepest descent direction fulfills conditions 2 and 3 with $s_1 = s_2 = 1$.
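To see this, note that for $\mathbf{p}_k = -\nabla f(\mathbf{x}_k)$ we have $$ \Vert \mathbf{p}_k \Vert = \Vert \nabla f(\mathbf{x}_k) \Vert \quad \text{and} \quad -\frac{\nabla f(\mathbf{x}_k)^T \mathbf{p}_k}{\Vert \nabla f(\mathbf{x}_k) \Vert \Vert \mathbf{p}_k \Vert} = \frac{\Vert \nabla f(\mathbf{x}_k) \Vert^2}{\Vert \nabla f(\mathbf{x}_k) \Vert^2} = 1, $$ so conditions 2 and 3 hold with $s_1 = s_2 = 1$.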