Prerequisites and Background
We will use the following notation, concepts, and conventions for this course.
Notation: Notation, Concepts, and Conventions
- Vectors are written with bold face, i.e., $\mathbf{x} \in \mathbb{R}^{n}$.
- Elements in a vector are written as $x_{j}, \ j = 1, \ldots, n$.
- All vectors are column vectors unless otherwise stated.
- The inner product of $\mathbf{a}$ and $\mathbf{b}$ is written as $\mathbf{a}^{T} \mathbf{b} = \mathbf{b}^{T} \mathbf{a} = \sum_{j=1}^n a_j b_j$.
- The norm $\Vert \cdot \Vert$ denotes the Euclidean norm, i.e., $\Vert \mathbf{x} \Vert = \sqrt{\mathbf{x}^{T} \mathbf{x}} = \sqrt{\sum_{j=1}^n x^{2}_j}$.
- We utilize vector inequalities, $\mathbf{a} \leq \mathbf{b}$, meaning that $a_j \leq b_j, j = 1, \ldots, n$.
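These conventions translate directly into code. The following minimal NumPy sketch (with made-up vectors, purely for illustration) checks the inner product, Euclidean norm, and elementwise vector inequality defined above:

```python
import numpy as np

# Two example vectors in R^3 (values made up for illustration).
a = np.array([1.0, -2.0, 3.0])
b = np.array([4.0, 0.0, -1.0])

# Inner product a^T b = sum_j a_j * b_j.
inner = a @ b                  # equivalently np.dot(a, b)

# Euclidean norm ||a|| = sqrt(a^T a).
norm_a = np.sqrt(a @ a)        # equivalently np.linalg.norm(a)

# Vector inequality a <= b means a_j <= b_j for every j.
a_leq_b = bool(np.all(a <= b))

print(inner, norm_a, a_leq_b)  # 1.0  3.7416...  False
```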
Notation: Notation, Concepts, and Conventions, Cont.
- For a function $f : \mathbb{R}^n \mapsto \mathbb{R}$,
- If it is continuously differentiable at a point $\mathbf{x}$, the gradient at that point is the column vector, $$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \newline \vdots \newline \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix} $$
- If it is twice continuously differentiable at a point $\mathbf{x}$, the Hessian at that point is the matrix, $$ \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} \newline \vdots & \ddots & \vdots \newline \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix} $$
- If it is continuously differentiable at a point $\mathbf{x}_0$, we can make a Taylor series expansion around the point that takes the form, $$ f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \mathcal{O}\left(\Vert \mathbf{x} - \mathbf{x}_0 \Vert\right), $$ where $\mathcal{O} : \mathbb{R} \mapsto \mathbb{R}$ is a function such that, $$ \lim_{t \to 0^+} \frac{\mathcal{O}(t)}{t} = 0. $$
- If it is twice continuously differentiable at a point $\mathbf{x}_0$, we can make a Taylor series expansion around the point that takes the form, $$ f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\mathbf{x}_0) (\mathbf{x} - \mathbf{x}_0) + \mathcal{O}\left(\Vert \mathbf{x} - \mathbf{x}_0 \Vert^2\right). $$
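To make these definitions concrete, here is a small numerical sketch for the made-up quadratic function $f(\mathbf{x}) = x_1^2 + 3 x_1 x_2 + 2 x_2^2$, whose gradient and Hessian are easy to write down by hand; since $f$ is quadratic, the second-order expansion reproduces $f$ exactly:

```python
import numpy as np

# Example function (made up for illustration): f(x) = x1^2 + 3*x1*x2 + 2*x2^2.
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2

def grad_f(x):
    # Gradient: vector of first partial derivatives.
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0] + 4 * x[1]])

def hess_f(x):
    # Hessian: matrix of second partial derivatives (constant for a quadratic).
    return np.array([[2.0, 3.0], [3.0, 4.0]])

x0 = np.array([1.0, -1.0])
x = np.array([1.1, -0.9])
d = x - x0

# First- and second-order Taylor expansions around x0.
taylor1 = f(x0) + grad_f(x0) @ d
taylor2 = taylor1 + 0.5 * d @ hess_f(x0) @ d

print(f(x), taylor1, taylor2)  # -0.14, -0.2, -0.14 (second order is exact here)
```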
Notation: Notation, Concepts, and Conventions, Cont.
- For a matrix $A \in \mathbb{R}^{n \times n}$, $\lambda \in \mathbb{R}$ is called an eigenvalue and $\mathbf{v} \in \mathbb{R}^{n} \setminus \{\mathbf{0}\}$ is called the corresponding eigenvector if $A \mathbf{v} = \lambda \mathbf{v}$.
- $A \in \mathbb{R}^{n \times n}$ is called symmetric if $A^{T} = A$.
- A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive semi-definite if any of the following equivalent conditions hold,
- All eigenvalues of $A$ are non-negative.
- $\mathbf{x}^{T} A \mathbf{x} \geq 0$ for all $\mathbf{x} \in \mathbb{R}^{n}$.
- A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is called positive definite if any of the following equivalent conditions hold,
- All eigenvalues of $A$ are positive.
- $\mathbf{x}^{T} A \mathbf{x} > 0$ for all $\mathbf{x} \in \mathbb{R}^{n} \setminus \{\mathbf{0}\}$.
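These definitions are easy to check numerically. A minimal NumPy sketch (with a matrix made up for illustration) that verifies symmetry and applies the eigenvalue conditions above:

```python
import numpy as np

# A symmetric example matrix (values made up for illustration).
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

# Symmetry: A^T = A.
assert np.allclose(A, A.T)

# Eigenvalue conditions for positive (semi-)definiteness.
eigvals = np.linalg.eigvalsh(A)      # eigvalsh assumes a symmetric matrix
is_psd = bool(np.all(eigvals >= 0))  # all eigenvalues non-negative
is_pd = bool(np.all(eigvals > 0))    # all eigenvalues positive

# Quadratic-form condition checked at one nonzero point (illustration only;
# the definition requires it for *all* x != 0).
x = np.array([1.0, 2.0])
print(eigvals, is_psd, is_pd, x @ A @ x > 0)  # [1. 3.] True True True
```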
Optimization
Let’s start by defining what optimization is.
Definition: Definition of Optimization
Optimization is a mathematical discipline concerned with finding the maxima or minima of functions, possibly subject to constraints.
Which leads us to the general optimization problem definition.
Definition: General Optimization Problem
A general optimization problem is to, $$ \begin{align*} \min_{\mathbf{x}} & \ f(\mathbf{x}), \newline \text{subject to} & \ g_i(\mathbf{x}) \leq 0, \ i \in \mathcal{I}, \newline & \ g_i(\mathbf{x}) = 0, \ i \in \mathcal{E}, \newline & \ \mathbf{x} \in X. \end{align*} $$ where, $$ \begin{align*} & \mathbf{x} \in \mathbb{R}^{n} & \text{: vector of decision variables, } x_j, \ j = 1, \ldots, n, \newline & f : \mathbb{R}^{n} \mapsto \mathbb{R} \cup \{\pm \infty\} & \text{: objective function,} \newline & X \subseteq \mathbb{R}^{n} & \text{: ground set, } \newline & g_i : \mathbb{R}^{n} \mapsto \mathbb{R} & \text{: constraint function defining restrictions on } \mathbf{x}, \newline & g_i, i \in \mathcal{I} & \text{: inequality constraints, } \newline & g_i, i \in \mathcal{E} & \text{: equality constraints.} \end{align*} $$
Further, we can divide the (general) optimization problem into different categories.
Definition: Categories of Optimization Problems
- Linear Programming (LP),
- Linear objective function $f(\mathbf{x}) = \mathbf{c}^{T} \mathbf{x} = \sum_{j=1}^n c_j x_j$,
- Affine constraint functions $g_i(\mathbf{x}) = \mathbf{a}_i^{T} \mathbf{x} - b_i, i \in \mathcal{I} \cup \mathcal{E}$.
- Ground set $X \subseteq \mathbb{R}^{n}$ defined by affine equalities and/or inequalities.
- Nonlinear Programming (NLP),
- Some functions $f, g_i, i \in \mathcal{I} \cup \mathcal{E}$ are nonlinear.
- Unconstrained Optimization,
- $\mathcal{I} \cup \mathcal{E} = \emptyset$,
- $X = \mathbb{R}^{n}$.
- Constrained Optimization,
- $\mathcal{I} \cup \mathcal{E} \neq \emptyset$, and/or,
- $X \subset \mathbb{R}^{n}$.
- Integer Programming (IP),
- $X \subseteq \mathbb{Z}^{n}$ or $X \subseteq \{0, 1\}^{n}$.
- Convex Programming (CP),
- $f, g_i, i \in \mathcal{I}$ are convex functions,
- $g_i, i \in \mathcal{E}$ are affine functions,
- $X$ is a closed convex set.
We will usually pack all our constraints and the ground set into a single set,
$$ S = \{\mathbf{x} \in \mathbb{R}^{n} \ | \ g_i(\mathbf{x}) \leq 0, \ i \in \mathcal{I}, \ g_i(\mathbf{x}) = 0, \ i \in \mathcal{E}, \ \mathbf{x} \in X \}. $$
But, what do we mean by solving the problem to, $$ \min_{\mathbf{x} \in S} f(\mathbf{x})? $$
Let’s define this mathematical operation.
Let,
$$ f^{\star} \coloneqq \inf_{\mathbf{x} \in S} f(\mathbf{x}), $$
denote the infimum value of $f$ over the set $S$. If the value $f^{\star}$ is attained at some point $\mathbf{x}^{\star}$ in $S$, we can (are allowed to) write,
$$ f^{\star} \coloneqq \min_{\mathbf{x} \in S} f(\mathbf{x}), $$
and have $f(\mathbf{x}^{\star}) = f^{\star}$. Another well-defined operator defines the set of minimal solutions to the problem,
$$ S^{\star} \coloneqq \underset{\mathbf{x} \in S}{\arg \min} f(\mathbf{x}), $$
where $S^{\star} \subseteq S$ is nonempty if and only if the infimum value $f^{\star}$ is attained at some point $\mathbf{x}^{\star} \in S$.
Definition: Definition of $\underset{\mathbf{x} \in S}{\min} \ f(\mathbf{x})$
“to $\underset{\mathbf{x} \in S}{\min} \ f(\mathbf{x})$” means to “find $f^{\star}$ and an $\mathbf{x}^{\star} \in S^{\star}$”.
If we have an optimization problem $P$,
$$ P : \quad \min_{\mathbf{x} \in S} f(\mathbf{x}), $$
- A point $\mathbf{x}$ is feasible in problem $P$ if $\mathbf{x} \in S$. The point is infeasible in problem $P$ if $\mathbf{x} \notin S$.
- The problem $P$ is feasible if there exists an $\mathbf{x} \in S$, and the problem $P$ is infeasible if $S = \emptyset$.
- A point $\mathbf{x}^{\star}$ is an optimal solution to $P$ if $\mathbf{x}^{\star} \in \underset{\mathbf{x} \in S}{\arg \min} f(\mathbf{x})$.
- $f^{\star}$ is the optimal value of $P$ if $f^{\star} = \underset{\mathbf{x} \in S}{\min} \ f(\mathbf{x})$.
Example I
Consider the problem to,
$$ \begin{align*} \min & \ (x + 1)^2, \newline \text{subject to} & \ x \in \mathbb{R}, \end{align*} $$
We will learn how to “formally” solve and prove the optimality of the solution later in the course, but we can “intuitively” see that the optimal solution is $x^{\star} = -1$ with optimal value $f^{\star} = 0$.
It is important to note that this is how we “answer” the question: we state the optimal solution and the optimal value ($x^{\star}$ and $f^{\star}$).
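For a problem this simple, a numerical solver can also serve as a sanity check (a sketch only; it does not replace a proof of optimality):

```python
from scipy.optimize import minimize_scalar

# Example I: min (x + 1)^2 over x in R (unconstrained).
res = minimize_scalar(lambda x: (x + 1) ** 2)
print(res.x, res.fun)  # approximately x* = -1, f* = 0
```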
Example II
A more complicated problem is to,
$$ \begin{align*} \min & \ (x + 1)^2, \newline \text{subject to} & \ x \geq 0, \end{align*} $$
Again, we can “intuitively” see that the optimal solution is $x^{\star} = 0$ with optimal value $f^{\star} = 1$.
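The same kind of numerical sanity check works here, using the bounded variant of the solver; the upper bound of $100$ is an arbitrary stand-in for $+\infty$, since the method requires a finite interval:

```python
from scipy.optimize import minimize_scalar

# Example II: min (x + 1)^2 subject to x >= 0.
res = minimize_scalar(lambda x: (x + 1) ** 2, bounds=(0, 100), method="bounded")
print(res.x, res.fun)  # approximately x* = 0, f* = 1
```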
Example III
If we consider the following problem, is there a minimizer?
$$ \begin{align*} \inf & \ (x + 1)^2, \newline \text{subject to} & \ x > 0, \end{align*} $$
We can see that the infimum value is $f^{\star} = 1$, approached as $x$ gets closer and closer to $0$, but there is no feasible point $x^{\star} \in S$ such that $f(x^{\star}) = 1$. Hence, there is no minimizer: we can get arbitrarily close to the infimum value, but it is not attained at any feasible point.
An Optimization Problem: The Diet Problem
As a first example of a (real) optimization problem, we consider the diet problem 1.
Example: The Diet Problem
For a moderately active person, how much of each of a number of foods should be eaten on a daily basis so that the person’s intake of nutrients will be at least equal to the recommended dietary allowances (RDAs), with the cost of the diet being minimal?
This will be a good example to show how to model a real optimization problem and why a “realistic” model can sometimes be difficult to achieve.
We will further modify the question and only consider food items from McDonald’s. What we have at our disposal is the following table.
Table 1: Nutritional values and cost of food items at McDonald’s, along with the recommended daily allowances (RDA) for a moderately active person.
| Food | Calories (kcal) | Carb (g) | Protein (g) | Vit. A (%) | Vit. C (%) | Calc. (%) | Iron (%) | Cost (SEK) |
|---|---|---|---|---|---|---|---|---|
| Big Mac | 550 | 46 | 25 | 6 | 2 | 25 | 25 | 30 |
| Cheeseburger | 300 | 33 | 15 | 6 | 2 | 20 | 15 | 10 |
| McChicken | 360 | 40 | 14 | 0 | 2 | 10 | 15 | 35 |
| McNuggets | 280 | 18 | 13 | 0 | 2 | 2 | 4 | 40 |
| Caesar Salad | 350 | 24 | 23 | 160 | 35 | 20 | 10 | 50 |
| French Fries | 380 | 48 | 4 | 0 | 15 | 2 | 6 | 20 |
| Apple Pie | 250 | 32 | 2 | 4 | 25 | 2 | 6 | 10 |
| Coca-Cola | 210 | 58 | 0 | 0 | 0 | 0 | 0 | 15 |
| Milk | 100 | 12 | 8 | 10 | 4 | 30 | 8 | 15 |
| Orange Juice | 150 | 30 | 2 | 0 | 140 | 2 | 0 | 15 |
| RDA | 2000 | 350 | 55 | 100 | 100 | 100 | 100 | |
Let’s formulate the problem.
We will define the following sets,
$$ \begin{align*} \text{Foods} & \coloneqq \{\text{Big Mac, Cheeseburger, McChicken, McNuggets, Caesar Salad, French Fries, Apple Pie, Coca-Cola, Milk, Orange Juice}\}, \newline \text{Nutrients} & \coloneqq \{\text{Calories, Carb, Protein, Vit. A, Vit. C, Calc., Iron}\}. \end{align*} $$
We define the following parameters,
$$ \begin{align*} & a_{ij} = \text{Amount of nutrient } i \text{ in food } j, \ i \in \text{Nutrients}, j \in \text{Foods}, \newline & b_i = \text{Recommended daily amount (RDA) for nutrient } i, \ i \in \text{Nutrients}, \newline & c_j = \text{Cost of food } j, \ j \in \text{Foods}. \end{align*} $$
Thus, we will have the following decision variables,
$$ x_j = \text{Amount of food } j \text{ we should eat each day, } j \in \text{Foods}. $$
With these, we can define the optimization problem to,
$$ \begin{align*} \min & \ \sum_{j \in \text{Foods}} c_j x_j, \newline \text{subject to} & \ \sum_{j \in \text{Foods}} a_{ij} x_j \geq b_i, \ i \in \text{Nutrients}, \newline & \ x_j \geq 0, \ j \in \text{Foods}. \end{align*} $$
Here, if we translate each line of mathematical expressions into words, we get,
- Minimize the total cost such that,
- We get enough of each nutrient, and such that,
- We don’t sell anything back to McDonald’s; we are only buying.
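As an illustration, here is a minimal sketch of how this LP could be solved with `scipy.optimize.linprog`, with the data hard-coded from Table 1; since `linprog` expects $\leq$ constraints, the nutrient constraints $\sum_j a_{ij} x_j \geq b_i$ are negated:

```python
import numpy as np
from scipy.optimize import linprog

foods = ["Big Mac", "Cheeseburger", "McChicken", "McNuggets", "Caesar Salad",
         "French Fries", "Apple Pie", "Coca-Cola", "Milk", "Orange Juice"]

# One row per food; columns: Calories, Carb, Protein, Vit. A, Vit. C, Calc., Iron
# (values from Table 1).
A = np.array([
    [550, 46, 25,   6,   2, 25, 25],
    [300, 33, 15,   6,   2, 20, 15],
    [360, 40, 14,   0,   2, 10, 15],
    [280, 18, 13,   0,   2,  2,  4],
    [350, 24, 23, 160,  35, 20, 10],
    [380, 48,  4,   0,  15,  2,  6],
    [250, 32,  2,   4,  25,  2,  6],
    [210, 58,  0,   0,   0,  0,  0],
    [100, 12,  8,  10,   4, 30,  8],
    [150, 30,  2,   0, 140,  2,  0],
])
b = np.array([2000, 350, 55, 100, 100, 100, 100])        # RDA per nutrient
c = np.array([30, 10, 35, 40, 50, 20, 10, 15, 15, 15])   # cost per food (SEK)

# linprog solves min c^T x subject to A_ub x <= b_ub and variable bounds;
# A x >= b is rewritten as -A x <= -b, and x >= 0 is enforced via bounds.
res = linprog(c, A_ub=-A.T, b_ub=-b, bounds=(0, None))
print(res.fun)                                # total cost (SEK)
print(dict(zip(foods, np.round(res.x, 2))))   # amount of each food
```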
One can solve this problem using various methods; the optimal solution is,
$$ \mathbf{x}^{\star} = \begin{bmatrix} x_{\text{Big Mac}} \newline x_{\text{Cheeseburger}} \newline x_{\text{McChicken}} \newline x_{\text{McNuggets}} \newline x_{\text{Caesar Salad}} \newline x_{\text{French Fries}} \newline x_{\text{Apple Pie}} \newline x_{\text{Coca-Cola}} \newline x_{\text{Milk}} \newline x_{\text{Orange Juice}} \end{bmatrix} = \begin{bmatrix} 0 \newline 7.48 \newline 0 \newline 0 \newline 0.27 \newline 0 \newline 3.03 \newline 0 \newline 0 \newline 0 \newline \end{bmatrix}, $$
with a total cost of $118.47$ SEK and a total of $3093.51$ kcal.
But we can’t really buy $0.27$ of a Caesar Salad, so we would have to modify the model to account for this. If we add the constraint that $x_j$ should be an integer, the optimal solution becomes,
$$ \mathbf{x}^{\star} = \begin{bmatrix} x_{\text{Big Mac}} \newline x_{\text{Cheeseburger}} \newline x_{\text{McChicken}} \newline x_{\text{McNuggets}} \newline x_{\text{Caesar Salad}} \newline x_{\text{French Fries}} \newline x_{\text{Apple Pie}} \newline x_{\text{Coca-Cola}} \newline x_{\text{Milk}} \newline x_{\text{Orange Juice}} \end{bmatrix} = \begin{bmatrix} 0 \newline 7 \newline 0 \newline 0 \newline 1 \newline 0 \newline 3 \newline 0 \newline 0 \newline 0 \newline \end{bmatrix}, $$
with a total cost of $150$ SEK and a total of $3200$ kcal. It might look like we just rounded the previous solution, and that can sometimes give an okay answer for integer programming problems, but we can get very different solutions by solving the integer programming problem directly.
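If one wants to solve the integer version directly, one possibility (a sketch, assuming a SciPy version in which `linprog` accepts an `integrality` argument) is to reuse `A`, `b`, `c`, and `foods` from the sketch above and mark every variable as integer:

```python
# Reuses A, b, c, and foods from the LP sketch above.
# integrality=1 for a variable tells the solver to require an integer value.
res_int = linprog(c, A_ub=-A.T, b_ub=-b, bounds=(0, None),
                  integrality=np.ones(len(c)))
print(res_int.fun)                                        # total cost (SEK)
print(dict(zip(foods, np.round(res_int.x).astype(int))))  # integer amounts
```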
Supervised Learning
Let’s also define supervised learning 2 as an optimization problem.
Supervised learning for classification can be defined as follows.
- Given data pairs $(\mathbf{x}_{\ell}, y_{\ell}) \in \mathbb{R}^{n} \times \{1, \ldots, M \}, \ \ell = 1, \ldots, N$,
- We want to find a mapping $G : \mathbb{R}^{n} \mapsto \{1, \ldots, M\}$ such that $G(\mathbf{x}_{\ell}) = y_{\ell}$.
One way to do this is the following,
- Parameterize a neural network $G_{\mathbf{\theta}} : \mathbb{R}^{n} \mapsto \Delta$, where $\Delta = \{\mathbf{s} \in \mathbb{R}^{M} | \mathbf{s} \geq \mathbf{0}, \sum_{i=1}^M s_i = 1 \}$.
- Let $\mathbf{e}_i \in \mathbb{R}^{M}$ be the $i$-th unit vector, and $E : \{1, \ldots , M \} \mapsto \{\mathbf{e}_i \}_{i = 1}^M, E : i \mapsto \mathbf{e}_i$.
- Let $F : \Delta \times \Delta \mapsto \mathbb{R}, F : (\mathbf{t}, \mathbf{s}) \mapsto - \sum_{i=1}^M t_i \log(s_i)$ (Cross-Entropy loss function).
- Solve, $$ \mathbf{\hat{\theta}} = \underset{\mathbf{\theta} \in \Theta}{\arg \min} \ \sum_{\ell = 1}^N F(E(y_{\ell}), G_{\mathbf{\theta}}(\mathbf{x}_{\ell})). $$
So, our ML algorithm for classification is $G_{\mathbf{\hat{\theta}}}$.
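To make the recipe concrete, here is a minimal NumPy sketch in which a linear softmax model stands in for the neural network $G_{\mathbf{\theta}}$ and plain gradient descent stands in for the $\arg \min$; the data, model, and step size are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points in R^n with M classes (made up for illustration).
N, n, M = 300, 2, 3
y = rng.integers(0, M, size=N)                        # labels in {0, ..., M-1}
X = rng.normal(size=(N, n)) + 3.0 * np.eye(M)[y, :n]  # shift the classes apart

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)              # for numerical stability
    expz = np.exp(Z)
    return expz / expz.sum(axis=1, keepdims=True)     # each row lies in Delta

W = np.zeros((n, M))        # parameters theta of the (linear) model G_theta
T = np.eye(M)[y]            # E(y_l): one-hot target vectors in Delta

for _ in range(500):
    S = softmax(X @ W)                                # G_theta(x_l) for all l
    loss = -np.mean(np.sum(T * np.log(S), axis=1))    # cross-entropy F, averaged
    grad = X.T @ (S - T) / N                          # gradient of loss w.r.t. W
    W -= 0.5 * grad                                   # gradient descent step

pred = np.argmax(softmax(X @ W), axis=1)              # classify with G_theta-hat
print(loss, np.mean(pred == y))                       # final loss and accuracy
```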