Training Deep Neural Networks
As we discussed last time, training neural networks is a highly non-convex optimization problem in a high-dimensional space. The loss landscape is full of plateaus, saddle points, and local optima that can hinder the training process.
So far, we have learned about Gradient Descent (GD) and Stochastic Gradient Descent (SGD, along with its mini-batch version) as optimization techniques.
Gradient Descent (GD)
Recall the algorithm for (batch) Gradient Descent:
- Require: Learning Rate $\alpha^{(t)}$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- while stopping criterion not met do
- $\quad$ Compute gradient estimate over $M$ examples:
- $\quad$ $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \frac{1}{M} \sum_{i=1}^{M} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)})$
- $\quad$ Apply update: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \alpha^{(t)} \mathbf{g}^{(t)}$
- $\quad$ $t = t + 1$
- end while
- Pros: Gradient estimates are stable
- Cons: Need to compute gradients over the entire training set for one update.
Stochastic Gradient Descent (SGD)
Recall the algorithm for Stochastic Gradient Descent:
- Require: Learning Rate $\alpha^{(t)}$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- while stopping criterion not met do
- $\quad$ Sample example $(\mathbf{x}^{(i)}, y^{(i)})$ from the training set.
- $\quad$ Compute gradient estimate:
- $\quad$ $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)})$
- $\quad$ Apply update: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \alpha^{(t)} \mathbf{g}^{(t)}$
- $\quad$ $t = t + 1$
- end while
- Pros: Computation time per update does not depend on $M$, allowing convergence on extremely large datasets.
- Cons: Gradient estimates can be noisy; mini-batches mitigate this.
Optimization Techniques
We will discuss other optimization techniques that can help us train deep neural networks more effectively.
More Problems with SGD
What if the loss changes quickly in one direction and slowly in another (i.e., the error surface has high curvature)? SGD has a hard time converging in this case.
As we know, gradient descent makes very slow progress along the shallow direction and jitters along the steep direction.
Momentum
How do we solve this problem? One way is to use momentum.
We introduce a new variable $\mathbf{v}$, the velocity.
We think of $\mathbf{v}$ as the direction and speed by which the parameters move as the learning dynamics progress.
The velocity is an exponentially decaying moving average of the negative gradients,
$$ \mathbf{v}^{(t + 1)} = \rho \mathbf{v}^{(t)} - \alpha^{(t)} \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)}) $$
where $\rho \in [0, 1)$.
Update rule: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \mathbf{v}^{(t+1)}$
Let's take a closer look at the velocity term,
$$ \mathbf{v}^{(t + 1)} = \rho \mathbf{v}^{(t)} - \alpha^{(t)} \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)}) $$
We can see that the velocity accumulates the previous gradients.
But, what is the role of $\rho$?
If $\rho$ is larger than $\alpha^{(t)}$, the current update is more affected by the previous gradients.
We usually set $\rho$ to a high value, e.g., 0.9.
SGD with Momentum
- Require: Learning Rate $\alpha^{(t)}$
- Require: Momentum $\rho$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- Require: Initial Velocity $\mathbf{v}^{(0)}$
- while stopping criterion not met do
- $\quad$ Sample example $(\mathbf{x}^{(i)}, y^{(i)})$ from the training set.
- $\quad$ Compute gradient estimate: $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)})$
- $\quad$ Update velocity: $\mathbf{v}^{(t+1)} = \rho \mathbf{v}^{(t)} - \alpha^{(t)} \mathbf{g}^{(t)}$
- $\quad$ Apply update: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \mathbf{v}^{(t+1)}$
- $\quad$ $t = t + 1$
- end while
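To make the update concrete, here is a minimal NumPy sketch of a single momentum step. The function name, the default values of $\alpha$ and $\rho$, and the toy quadratic loss in the usage are illustrative assumptions, not part of the algorithm above.

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, alpha=0.01, rho=0.9):
    """One SGD-with-momentum update: decay the velocity, add the (negative)
    gradient step, then move the parameters along the velocity."""
    v = rho * v - alpha * grad   # velocity accumulates past gradients
    theta = theta + v            # apply update
    return theta, v

# Illustrative usage on a toy quadratic loss 0.5 * ||theta||^2 (gradient = theta).
theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, v = sgd_momentum_step(theta, v, grad=theta)
```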
Nesterov Momentum
Another approach we can take is to use Nesterov momentum.
First take a step in the direction of the accumulated gradient, then calculate the gradient and make a correction.
Let's write it out.
Recall the velocity term in the momentum update,
$$ \mathbf{v}^{(t + 1)} = \rho \mathbf{v}^{(t)} - \alpha^{(t)} \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)}) $$
Nesterov Momentum changes the update rule to,
$$ \begin{align*} \tilde{\mathbf{\theta}} = \mathbf{\theta}^{(t)} + \rho \mathbf{v}^{(t)} \newline \mathbf{v}^{(t + 1)} = \rho \mathbf{v}^{(t)} - \alpha^{(t)} \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \tilde{\mathbf{\theta}}), y^{(i)}) \newline \end{align*} $$
Update: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \mathbf{v}^{(t+1)}$
SGD with Nesterov Momentum
- Require: Learning Rate $\alpha^{(t)}$
- Require: Momentum $\rho$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- Require: Initial Velocity $\mathbf{v}^{(0)}$
- while stopping criterion not met do
- $\quad$ Sample example $(\mathbf{x}^{(i)}, y^{(i)})$ from the training set.
- $\quad$ Update: $\tilde{\mathbf{\theta}} = \mathbf{\theta}^{(t)} + \rho \mathbf{v}^{(t)}$
- $\quad$ Compute gradient estimate: $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \tilde{\mathbf{\theta}}), y^{(i)})$
- $\quad$ Update velocity: $\mathbf{v}^{(t+1)} = \rho \mathbf{v}^{(t)} - \alpha^{(t)} \mathbf{g}^{(t)}$
- $\quad$ Apply update: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \mathbf{v}^{(t+1)}$
- $\quad$ $t = t + 1$
- end while
Adaptive Learning Rate Methods
So far we have assigned the same learning rate to all parameters $\theta_j$.
But if the $\theta_j$'s vary in importance and convergence speed, is this really a good idea?
Probably not.
AdaGrad
One way to address this issue is to use AdaGrad.
Idea: Scale the gradient of each model parameter by the square root of the sum of squares of all its historical values.
In other words, progress along "steep" directions (with large partial derivatives) is damped,
while progress along "flat" directions (with small partial derivatives) is accelerated.
Here is a conceptual question that is quite important.
What will happen to the effective step size over a long time?
It will decay to zero, since the accumulated sum of squared gradients only grows.
- Require: Initial Learning Rate $\alpha$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- Initialize $\mathbf{r}^{(0)} = 0$
- while stopping criterion not met do
- $\quad$ Sample example $(\mathbf{x}^{(i)}, y^{(i)})$ from the training set.
- $\quad$ Compute gradient estimate: $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)})$
- $\quad$ Accumulate: $\mathbf{r}^{(t+1)} = \mathbf{r}^{(t)} + \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)}$
- $\quad$ Compute update: $\Delta \mathbf{\theta} = -\frac{\alpha}{\epsilon + \sqrt{\mathbf{r}^{(t+1)}}} \odot \mathbf{g}^{(t)}$
- $\quad$ Apply update: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \Delta \mathbf{\theta}$
- $\quad$ $t = t + 1$
- end while
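A minimal NumPy sketch of one AdaGrad step follows; the function name and default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha=0.01, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients and scale each
    parameter's step by the inverse square root of its accumulated sum."""
    r = r + grad * grad                                # accumulate squared gradients
    theta = theta - alpha / (eps + np.sqrt(r)) * grad  # per-parameter scaled step
    return theta, r
```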
RMSProp: "Leaky AdaGrad"
AdaGrad works well when the objective is convex; however, it may not be the best choice for non-convex problems.
AdaGrad can also shrink the learning rate too aggressively, but we can overcome these problems.
We can adapt it to perform better in non-convex settings by accumulating an exponentially decaying average of the squared gradients. This is an idea that we will see time and time again in deep learning.
- Require: Global Learning Rate $\alpha$
- Require: Decay Rate $\rho$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- Initialize $\mathbf{r}^{(0)} = 0$
- while stopping criterion not met do
- $\quad$ Sample example $(\mathbf{x}^{(i)}, y^{(i)})$ from the training set.
- $\quad$ Compute gradient estimate: $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)})$
- $\quad$ Accumulate: $\mathbf{r}^{(t+1)} = \rho \mathbf{r}^{(t)} + (1 - \rho) \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)}$
- $\quad$ Compute update: $\Delta \mathbf{\theta} = -\frac{\alpha}{\epsilon + \sqrt{\mathbf{r}^{(t+1)}}} \odot \mathbf{g}^{(t)}$
- $\quad$ Apply update: $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \Delta \mathbf{\theta}$
- $\quad$ $t = t + 1$
- end while
Adam: ADAptive Moments
Adam is currently the default optimization algorithm for training deep neural networks.
Adam is like RMSProp with momentum but with bias correction terms for the first (mean) and second (uncentered variance) moments.
- Require: Global Learning Rate $\alpha$
- Require: Decay Rates $\rho_1, \rho_2$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- Initialize moment variables $\mathbf{s}^{(0)} = 0$, $\mathbf{r}^{(0)} = 0$
- while stopping criterion not met do
- $\quad$ Sample example $(\mathbf{x}^{(i)}, y^{(i)})$ from the training set.
- $\quad$ Compute gradient estimate: $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)})$
- $\quad$ Update: $\mathbf{s}^{(t+1)} = \rho_1 \mathbf{s}^{(t)} + (1 - \rho_1) \mathbf{g}^{(t)}$
- $\quad$ Update: $\mathbf{r}^{(t+1)} = \rho_2 \mathbf{r}^{(t)} + (1 - \rho_2) \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)}$
- $\quad$ Correct biases: $\hat{\mathbf{s}} = \frac{\mathbf{s}^{(t+1)}}{1 - \rho_1^{t+1}}$, $\hat{\mathbf{r}} = \frac{\mathbf{r}^{(t+1)}}{1 - \rho_2^{t+1}}$
- $\quad$ Compute and apply update: $\Delta \mathbf{\theta} = -\alpha \frac{\hat{\mathbf{s}}}{\delta + \sqrt{\hat{\mathbf{r}}}}$, $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \Delta \mathbf{\theta}$
- $\quad$ $t = t + 1$
- end while
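A minimal NumPy sketch of one Adam step; the function name and the default values of $\alpha$, $\rho_1$, $\rho_2$, and $\delta$ are illustrative assumptions (the defaults shown are common choices, not mandated by the pseudocode above).

```python
import numpy as np

def adam_step(theta, s, r, grad, t, alpha=1e-3, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update: first/second moment estimates with bias correction,
    then a per-parameter scaled step. `t` is the 0-indexed iteration counter."""
    s = rho1 * s + (1.0 - rho1) * grad             # first moment (mean of gradients)
    r = rho2 * r + (1.0 - rho2) * grad * grad      # second moment (uncentered variance)
    s_hat = s / (1.0 - rho1 ** (t + 1))            # bias corrections
    r_hat = r / (1.0 - rho2 ** (t + 1))
    theta = theta - alpha * s_hat / (delta + np.sqrt(r_hat))
    return theta, s, r
```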
AdamW: ADAptive Moments with Weight Decay
AdamW is an extension of Adam, and has gained popularity, especially in training transformer-based deep models.
Weight decay (also known as $\ell_2$ or ridge regularization) is a regularization technique to prevent overfitting. It simply adds a $\Vert \mathbf{\theta} \Vert_2^2$ term to the loss function.
AdamW applies weight decay as a separate regularization term during the update step, decoupling it from the adaptive learning rate mechanism.
AdamW addresses the issue that, in Adam, the weight decay term is folded into the gradient and therefore rescaled by the adaptive learning rate, so weights are not penalized as intended.
It also provides improved regularization and better generalization compared to original Adam.
$\colorbox{pink}{\text{Adam}}$ vs. $\colorbox{lime}{\text{AdamW}}$
- Require: Global Learning Rate $\alpha$
- Require: Decay Rates $\rho_1, \rho_2$
- Require: Weight Decay $\lambda$
- Require: Initial Parameter $\mathbf{\theta}^{(0)}$
- Initialize moment variables $\mathbf{s}^{(0)} = 0$, $\mathbf{r}^{(0)} = 0$
- while stopping criterion not met do
- $\quad$ Sample example $(\mathbf{x}^{(i)}, y^{(i)})$ from the training set.
- $\quad$ Compute gradient estimate: $\mathbf{g}^{(t)} = \nabla_{\mathbf{\theta}} \ell(f(\mathbf{x}^{(i)}; \mathbf{\theta}), y^{(i)}) \colorbox{pink}{$+ \lambda \mathbf{\theta}^{(t)}$}$
- $\quad$ Update: $\mathbf{s}^{(t+1)} = \rho_1 \mathbf{s}^{(t)} + (1 - \rho_1) \mathbf{g}^{(t)}$
- $\quad$ Update: $\mathbf{r}^{(t+1)} = \rho_2 \mathbf{r}^{(t)} + (1 - \rho_2) \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)}$
- $\quad$ Correct biases: $\hat{\mathbf{s}} = \frac{\mathbf{s}^{(t+1)}}{1 - \rho_1^{t+1}}$, $\hat{\mathbf{r}} = \frac{\mathbf{r}^{(t+1)}}{1 - \rho_2^{t+1}}$
- $\quad$ Compute and apply update: $\Delta \mathbf{\theta} = -\alpha \frac{\hat{\mathbf{s}}}{\delta + \sqrt{\hat{\mathbf{r}}}} \colorbox{lime}{$- \lambda \mathbf{\theta}^{(t)}$}$, $\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} + \Delta \mathbf{\theta}$
- $\quad$ $t = t + 1$
- end while
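A minimal NumPy sketch of one AdamW step, mirroring the pseudocode above: the raw gradient feeds the moments, and the decay is applied directly to the parameters. The function name, the decay constant `lam`, and the other defaults are illustrative assumptions; note that many implementations also multiply the decay term by the learning rate.

```python
import numpy as np

def adamw_step(theta, s, r, grad, t, alpha=1e-3, rho1=0.9, rho2=0.999,
               delta=1e-8, lam=1e-2):
    """One AdamW update: Adam moments on the raw gradient, with weight decay
    applied directly to the parameters (decoupled from the adaptive scaling)."""
    s = rho1 * s + (1.0 - rho1) * grad
    r = rho2 * r + (1.0 - rho2) * grad * grad
    s_hat = s / (1.0 - rho1 ** (t + 1))
    r_hat = r / (1.0 - rho2 ** (t + 1))
    theta = theta - alpha * s_hat / (delta + np.sqrt(r_hat)) - lam * theta
    return theta, s, r
```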
Other Emerging Optimization Techniques
- AdaBelief: unlike Adam, AdaBelief divides by an exponentially decaying moving average of the variance of the gradients.
In Adam: $\mathbf{r}^{(t+1)} = \rho_2 \mathbf{r}^{(t)} + (1 - \rho_2) \mathbf{g}^{(t)} \odot \mathbf{g}^{(t)}$. In AdaBelief: $\mathbf{r}^{(t+1)} = \rho_2 \mathbf{r}^{(t)} + (1 - \rho_2) (\mathbf{g}^{(t)} - \mathbf{s}^{(t+1)}) \odot (\mathbf{g}^{(t)} - \mathbf{s}^{(t+1)})$, where $\mathbf{s}^{(t+1)}$ is the first-moment estimate.
- AdaHessian: A second-order optimization technique that incorporates the curvature of the loss function via adaptive estimates of the Hessian diagonal. Since calculating the Hessian is computationally expensive, it is approximated with Hessian-vector products.
Regularization
We've discussed regularization before, but mostly as a term in the loss function. However, when it comes to preventing overfitting, we can apply regularization elsewhere, for example, in the data.
Data Augmentation
If we artificially transform the data to increase the dataset size, we will have more data to train on!
Common ways to do this are,
- Horizontal flipping
- As the name suggests, we flip the image horizontally.
- Random crops and scales.
- Training: Sample random crops/scales (given that the input resolution is $512 \times 512$).
- Pick random $L$ in range $[256, 480]$.
- Resize training image short side to $L$.
- Sample random $224 \times 224$ patch.
- Testing: Average a fixed set of crops.
- Resize image at 5 scales, $\{224, 256, 384, 480, 640\}$.
- For each size, use 10 $224 \times 224$ crops: 4 corners + center, each with its horizontal flip.
- Color jittering
- Randomly change the brightness, contrast, saturation, and hue of the image.
- And so on…
- Random mix of:
- Translation
- Rotation
- Stretching
- Shearing
- Lens distortion
- Add noise to pixels
- $\ldots$
But we have to be careful about the transformations we choose; they have to be label-preserving.
For example, if we are dealing with text, we simply cannot flip the characters: a "b" will become a "d". If we are working with the MNIST dataset, we cannot perform a 180$^\circ$ rotation: a "6" will become a "9".
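As a concrete illustration of a label-preserving pipeline, here is a minimal NumPy sketch of a random horizontal flip followed by a random crop. The function name, the default crop size, and the RNG handling are assumptions for illustration.

```python
import numpy as np

def random_flip_and_crop(img, crop=224, rng=np.random.default_rng()):
    """Label-preserving augmentation sketch: random horizontal flip, then a
    random crop. `img` is assumed to be an H x W x C array with H, W >= crop."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                       # horizontal flip
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)             # random crop position
    left = rng.integers(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]
```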
Emerging Data Augmentation Techniques
There are a few emerging data augmentation techniques that are worth mentioning.
- Cutout
- Idea: Randomly remove a square region of pixels in an image during training.
- Pros: Can remove noise, e.g., irrelevant background.
- Cons: Can remove important information.
- Mixup
- Idea: Linearly mix two random images and their labels during training (a short sketch follows this list).
- Pros: Increase diversity in the training data.
- Cons: Can create blurred images (which can be a pro sometimes!), especially for images with complex textures.
- Cutmix
- Idea: Randomly select two images during training, cut a random patch of pixels from one image and paste it to the other, and then mix their labels proportionally to the area of the patch.
- Pros: Increase diversity in the training data.
- Cons: Can create unrealistic images, can also remove important features.
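Here is a minimal NumPy sketch of mixup, assuming the two inputs are same-shaped arrays and the labels are one-hot vectors; the function name and the Beta-parameter default are illustrative assumptions.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng()):
    """Mixup sketch: convex combination of two training images and their one-hot
    labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```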
Other Regularization Techniques
We can also apply regularization to the training process itself. Here are a few examples.
Early Stopping
As the name suggests, we do not want to train a network until its training error is as small as possible.
Recall that overfitting means a large and complex model can fit the training samples too easily.
To prevent this, we use the validation error to decide when to stop training.
We can think of early stopping as a type of regularization, why?
The effective complexity of the network starts out small, as the weights are initialized small.
The effective complexity grows during training, as the weights get larger and larger.
Early stopping then represents a way of limiting the effective network complexity.
In practice, while training we also monitor the validation error, to get a sense of both.
Every time the validation error improves, we want to save this checkpoint.
When the validation error plateaus for some time (or starts to increase for some time), we should stop.
The number of training steps (i.e., epochs) is also a hyperparameter that we can tune.
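A minimal sketch of an early-stopping training loop follows. It assumes hypothetical `train_one_epoch` and `evaluate` callables that return updated parameters and a validation loss, respectively; the patience-based stopping rule and its default are illustrative choices, not prescribed by the notes above.

```python
import numpy as np

def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Early-stopping sketch: keep the best checkpoint, stop when the validation
    loss has not improved for `patience` consecutive epochs."""
    best_loss, best_params, epochs_without_improvement = np.inf, None, 0
    for epoch in range(max_epochs):
        params = train_one_epoch()
        val_loss = evaluate(params)
        if val_loss < best_loss:
            best_loss, best_params = val_loss, params   # save this checkpoint
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                   # validation plateaued
    return best_params
```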
Dropout
Dropout is a regularization technique that is applied to the network itself.
In each update step, we randomly sample a different binary mask over all the input and hidden units.
Multiply the mask with the units and do the update as usual; typical keep probabilities are 0.8 for input units and 0.5 for hidden units.
This is especially useful for fully connected layers, less so for convolutional layers.
But why is this even a good idea? Don't we want more units and parameters to represent the data?
This is true, but we want representative parameters; dropout forces the network to avoid redundant or dependent representations. That is, it prevents co-adaptation of features, in which a feature is only helpful in the context/presence of several other specific features (although this can be useful in some cases).
We are also effectively training a large ensemble of models (that share parameters); each binary mask corresponds to one model.
But how do we make a final decision?
At test time, all neurons are (usually) always active. We scale the activations so that, for each neuron, the output at test time equals the expected output at training time.
We can also use mask sampling (not as common). We randomly sample some (typically 10-20) masks, apply each to the trained model to get a prediction, and then take the majority vote (i.e., average) over the predictions.
Dropout at Test Time
So, as we can see, dropout makes our output random!
$$ \mathbf{y} = f_{\mathbf{W}} (\mathbf{x}, \mathbf{z}) $$
where $\mathbf{z}$ is the random mask.
We want to "average out" the randomness at test time.
$$ \mathbf{y} = f_{\mathbf{W}} (\mathbf{x}) = \mathbb{E}_{\mathbf{z}} [f_{\mathbf{W}} (\mathbf{x}, \mathbf{z})] = \int f_{\mathbf{W}} (\mathbf{x}, \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z} $$
Consider a single neuron, $a = w_1 x_1 + w_2 x_2$.
At test time, we have $\mathbb{E}[a] = w_1 x_1 + w_2 x_2$.
At training time (with a dropout probability of 0.5), we have,
$$ \begin{align*} \mathbb{E}[a] &= \frac{1}{4}(w_1 x_1 + w_2 x_2) + \frac{1}{4}(w_1 x_1 + w_2 0) \newline &+ \frac{1}{4}(w_1 0 + w_2 x_2) + \frac{1}{4}(w_1 0 + w_2 0) \newline &= \frac{1}{2}(w_1 x_1 + w_2 x_2) \end{align*} $$
At test time, we multiply the activations by the keep probability to recover the expected value from training.
However, this is quite tedious, as we need to modify the test-time forward pass.
A more common way is "inverted dropout", where we instead scale the activations at training time.
Thus, the test-time code remains the same.
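A minimal NumPy sketch of inverted dropout, assuming `h` is an array of activations; the function name and the default keep probability are illustrative assumptions.

```python
import numpy as np

def inverted_dropout_forward(h, p_keep=0.5, train=True, rng=np.random.default_rng()):
    """Inverted dropout sketch: drop units and rescale the survivors at training
    time, so the test-time forward pass stays unchanged."""
    if not train:
        return h                                        # test time: identity
    mask = (rng.random(h.shape) < p_keep) / p_keep      # binary mask, pre-scaled by 1/p_keep
    return h * mask
```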
Stochastic Depth
Similar to Dropout, in each iteration Stochastic Depth randomly drops a subset of layers according to some survival probabilities and bypasses them with the identity function.
At test time, all layers are active and are re-calibrated by multiplying by the corresponding survival probabilities.
This reduces training time substantially and improves generalization as an implicit model ensemble.
Batch Normalization
The idea of batch normalization is to adjust activations to lie within a desired operating range, while maintaining their relative values.
Given a mini-batch $\mathcal{B} = \{(\mathbf{x}^{(i)}, y^{(i)}) \mid i = 1, \ldots, M\}$.
We compute the per-channel (i.e., feature) mean, $\mu_j = \frac{1}{M} \sum_{i = 1}^M x_j^{(i)}$ and,
compute the per-channel variance, $\sigma_j^2 = \frac{1}{M} \sum_{i = 1}^M (x_j^{(i)} - \mu_j)^2$.
Normalize $\mathbf{x}^{(i)}$ across the channel/feature, $\hat{x}_j^{(i)} = \frac{x_j^{(i)} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$.
But we have an issue: what if zero mean and unit variance are too strong a constraint?
The answer is more learnable parameters (as usual): we incorporate learnable scale and shift parameters $\gamma$ and $\beta$ such that,
$$ z_j^{(i)} = \gamma_j \hat{x}_j^{(i)} + \beta_j $$
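A minimal NumPy sketch of the training-time batch norm forward pass for a 2-D mini-batch; the function name and `eps` default are illustrative, and the running averages needed for test time are omitted here.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm sketch for an (M, D) mini-batch: normalize each
    feature with the batch statistics, then apply the learnable scale and shift."""
    mu = x.mean(axis=0)                      # per-feature mean
    var = x.var(axis=0)                      # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # learnable scale/shift
```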
Batch Normalization at Test Time
Mean and variance estimates depend on the mini-batch at training time, but we cannot do this at test time.
What's the solution? We keep exponentially decaying moving averages of the statistics seen during training and use them at test time.
During testing, batch normalization becomes a linear operator, which means it can be fused with the previous (convolutional) layer.
We usually use batch normalization after convolutional layers and before nonlinearity.
So remember: it behaves differently during training and testing, which can be a source of many pesky bugs.
Layer Normalization
Layer normalization is similar to batch normalization, but it normalizes across the feature dimension.
Since we are normalizing across features rather than across the batch, we get the same behavior at training and test time.
Layer normalization for convolutional layers normalizes along both channel and spatial dimensions.
Group Normalization
Group normalization is a compromise between batch and layer normalization.
It divides the channels into groups and computes the mean and variance within each group.
Group normalization is more stable than batch normalization when the batch size is small.
Practical Tricks & Tips
We will discuss some practical tips and tricks that can help you train deep neural networks more effectively.
Data Pre-Processing
Data pre-processing is a crucial step in training deep neural networks.
Assume $\mathbf{X} \in \mathbb{R}^{M \times N}$ is the data matrix (each example in a row).
We can make the data zero-centered by subtracting the mean of each feature,
$$ \mathbf{X} = \mathbf{X} - \mathbf{1} \mu^T $$
where $\mu = \frac{1}{M} \mathbf{X}^T \mathbf{1}$ is the vector of per-feature means.
We can also normalize the data by dividing by the standard deviation of each feature,
$$ \mathbf{X} = \mathbf{X} \oslash \sigma $$
where $\sigma$ is the vector of per-feature standard deviations, $\sigma_j = \sqrt{\frac{1}{M} \sum_{i=1}^M (x_j^{(i)} - \mu_j)^2}$, applied row-wise.
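A minimal NumPy sketch of the centering and normalization steps above, assuming $\mathbf{X}$ is an $M \times N$ array; the function name and the small `eps` added for numerical stability are illustrative assumptions.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Zero-center each feature of the (M, N) data matrix, then divide by its std."""
    X = X - X.mean(axis=0)            # zero-center per feature
    return X / (X.std(axis=0) + eps)  # normalize per feature
```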
In practice, we may also want to PCA (Principal Component Analysis) the data.
The data is first centered, then projected into the eigenbasis, followed by dividing every dimension by the square root of the corresponding eigenvalue.
The projection step is called decorrelation and the scaling step whitening.
PCA as Whitening
Recall the eigendecomposition,
$$ \mathbf{C} = \frac{1}{M} \mathbf{X}^T \mathbf{X} = \frac{1}{M} \sum_{i = 1}^M \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T = \mathbf{U} \mathbf{\Sigma}^2 \mathbf{U}^T $$
Defining $\hat{\mathbf{x}^{(i)}} = \mathbf{\Sigma}^{-1} \mathbf{U}^T \mathbf{x}^{(i)}$, we have,
$$ \begin{align*} \hat{\mathbf{C}} &= \frac{1}{M} \sum_{i = 1}^M \hat{\mathbf{x}^{(i)}} (\hat{\mathbf{x}^{(i)}})^T \newline &= \frac{1}{M} \sum_{i = 1}^M \mathbf{\Sigma}^{-1} \mathbf{U}^T \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T \mathbf{U} \mathbf{\Sigma}^{-1} \newline &= \mathbf{\Sigma}^{-1} \mathbf{U}^T \left( \frac{1}{M} \sum_{i = 1}^M \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^T \right) \mathbf{U} \mathbf{\Sigma}^{-1} \newline &= \mathbf{\Sigma}^{-1} \mathbf{U}^T \mathbf{U} \mathbf{\Sigma}^2 \mathbf{U}^T \mathbf{U} \mathbf{\Sigma}^{-1} \newline &= \mathbf{I} \end{align*} $$
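A minimal NumPy sketch of PCA whitening following the derivation above; the function name and the small `eps` added for numerical stability are illustrative assumptions.

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """PCA whitening sketch: center, rotate into the eigenbasis (decorrelate),
    then divide by the square roots of the eigenvalues (whiten)."""
    X = X - X.mean(axis=0)
    C = X.T @ X / X.shape[0]                # covariance matrix
    eigvals, U = np.linalg.eigh(C)          # C = U diag(eigvals) U^T
    X_rot = X @ U                           # decorrelate
    return X_rot / np.sqrt(eigvals + eps)   # whiten
```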
In practice, for color (RGB) images, we usually only do the centering; it is not common to do PCA whitening on entire images.
There are three common variants:
- Subtract the mean image.
- Mean image = [3, height, width] array.
- Subtract per-channel mean.
- Mean along each channel = 3 numbers.
- Subtract per-channel mean and divide by per-channel std.
- Mean and std along each channel = $2 \times 3$ numbers.
As we lightly discussed earlier, we typically train our network several times on the entire dataset; one complete pass through the data is called an epoch.
Also, never forget to shuffle your training data per epoch, since otherwise the training sequence can introduce bias.
Weight Initialization
Weight initialization is another crucial step in training deep neural networks.
Constant (including all-zero) initialization is a bad idea.
Why? If every neuron in the network computes the same output, then all of them will also compute the same gradients during backpropagation and undergo the exact same parameter updates.
Small random initialization is a good idea, but we have to be careful. It works okay for small networks, but for deeper and more complex networks this doesn't work well.
The current recommendation for initializing CNNs with ReLU activations, for example, is

```python
w = np.random.randn(n) * np.sqrt(2.0 / n)
```

where `randn` samples from a standard Gaussian and $n$ is the number of inputs to the layer (its fan-in).
Proper initialization is an active research area, and there are many other initialization techniques.
Learning Rate Decay
SGD, AdaGrad, RMSProp, and Adam all have learning rate as a hyperparameter, but which learning rate should we choose?
- If we have a very high learning rate, our loss will diverge.
- If we have a high learning rate, our loss will quickly drop then plateau.
- If we have a low learning rate, our loss will drop slowly, and we may get stuck in a local minimum.
Which one should we choose? All of them: we start with a high learning rate and decay it over time.
We can do this stepwise, i.e., we reduce the learning rate at a few fixed points, for example, multiply $\alpha$ by 0.1 after every 30 epochs.
There are also some other common schedulers for learning rate decay.
- Cosine: $\alpha^{(t)} = \frac{1}{2} \alpha^{(0)} \left(1 + \cos \left(\frac{t}{T} \pi \right) \right)$
- Linear: $\alpha^{(t)} = \alpha^{(0)} \left(1 - \frac{t}{T} \right)$
where $t$ is the epoch index and $T$ is the total number of epochs.
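A minimal sketch of the step, cosine, and linear schedules above; the function name and the 30-epoch step interval are illustrative assumptions.

```python
import numpy as np

def lr_schedule(alpha0, t, T, kind="cosine"):
    """Learning rate at epoch t (out of T total epochs) for a few common schedules."""
    if kind == "cosine":
        return 0.5 * alpha0 * (1.0 + np.cos(np.pi * t / T))
    if kind == "linear":
        return alpha0 * (1.0 - t / T)
    if kind == "step":
        return alpha0 * (0.1 ** (t // 30))  # e.g., multiply by 0.1 every 30 epochs
    raise ValueError(f"unknown schedule: {kind}")
```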
Transfer Learning
Transfer learning is a technique where we use a pre-trained model on a different task as a starting point for training a new model.
Deep features are fairly transferable, and open-source pre-trained models are everywhere, so this is common practice.
| | Very Similar Dataset | Very Different Dataset |
| --- | --- | --- |
| Very little data | Use a linear classifier on the top layer | You're in trouble… Try a linear classifier from different stages |
| Quite a lot of data | Finetune a few layers | Finetune a large number of layers |
Choosing Hyperparameters
Choosing hyperparameters is a crucial step in training deep neural networks.
- Check initial loss.
- Turn off weight decay and sanity check loss at initialization.
- E.g. $\log(C)$ for softmax with $C$ classes.
- Overfit a small sample.
- Try to train to 100% training accuracy on a small sample of training data (5-10 mini-batches).
- Fiddle with the architecture, learning rate, weight initialization, etc.
- Loss not going down?
- Learning rate too low, bad initialization, etc.
- Loss explodes to Inf or NaN?
- Learning rate too high, bad initialization, etc.
- Find a learning rate that makes the loss go down.
- Use the architecture from the previous step, use all training data, turn on a small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.
- Good learning rates to start with are $10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}$.
- Coarse grid, train for 1-5 epochs.
- Choose a few values of learning rate and weight decay around what worked from (3) and train a few models for 1-5 epochs.
- Good weight decay values to start with are $10^{-4}, 10^{-5}, 0$.
- Refine grid and train longer.
- Pick the best model(s) from (4) and train them for longer (10-20 epochs) without learning rate decay.
- Look at loss curves.
- Losses may be noisy
- Use a scatter plot
- Also plot exponentially decaying moving average to see trends better.
- Go to (5)
Summary
- Improve your training error
- Network architectures
- Initializations
- Optimizers
- Learning rate schedulers
- Improve your test error
- Regularization
- Choosing Hyperparameters