When designing CNNs, or any neural network for that matter, there is always a debate over which techniques to use, and whether methods for improving neural networks actually work. Whether a method works, and whether it works for the stated reason or a different one, is always difficult to say. In the following few posts I will go over commonly used methods and their effects on neural networks. To start off this series I will begin with optimizers in a CNN (convolutional neural network). Using a control network I will go over the different optimizers that exist and the effects they have. To keep things fair we will use the same network and change only the optimizer.

The general setup for these experiments is a rather shallow CNN that classifies the CIFAR-10 data set. The code being used can be seen below:

```
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

n_classes = 10
(x_train, y_train), (x_test, y_test) = datasets.cifar10.load_data()
y_train_one_hot = tf.keras.utils.to_categorical(y_train, n_classes)
y_test_one_hot = tf.keras.utils.to_categorical(y_test, n_classes)
x_train = x_train / 255.0
x_test = x_test / 255.0

def make_model():
    return models.Sequential([
        layers.Conv2D(32, (4, 4), activation='relu', input_shape=(32, 32, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (2, 2), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (2, 2), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])

model = make_model()
model.compile(optimizer='SGD', loss='categorical_crossentropy',
              metrics=['accuracy'])

log_dir = 'logs/sgd'  # one log directory per optimizer run
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                      histogram_freq=1)
model.fit(x=x_train,
          y=y_train_one_hot,
          epochs=25,
          validation_data=(x_test, y_test_one_hot),
          callbacks=[tensorboard_callback])
```

For the first control test we will not use any fancy optimizers, only basic Stochastic Gradient Descent. I do not expect this to work very well, as it can be quite slow to learn, but we will see from the results.

Other than using SGD we will also use the optimizers Adam, Nadam, RMSProp, Adagrad, and Adadelta.

But how do these optimizers work? Here is a quick layout of what exactly they are doing.

## RMSProp

More accurately Root Mean Squared Propagation, it works similarly to SGD with a momentum coefficient, trying to dampen oscillations in the descent. Unlike SGD, however, RMSProp keeps a separate learning rate for each parameter, which, even though it adds a lot of computation, helps find solutions where the gradients vary wildly between parameters.

It does this by updating its weights according to the following equation:

\(\begin{equation}
\begin{split}
v_t = & \rho v_{t-1} + (1-\rho) \omega'^2\\
\Delta \omega_t = & -\frac{\nu}{\sqrt{v_t}+\epsilon} \omega'\\
\omega_{t+1} = & \omega_t + \Delta \omega_t
\end{split}
\end{equation}\)

Where \(\omega\) is the parameter being updated, \(\omega'\) is the derivative of the loss with respect to it, \(\nu\) is the learning rate, and \(\rho\) is the decay rate of the moving average.
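The update above can be sketched in a few lines of NumPy. This is an illustrative toy, not the Keras implementation; the variable names follow the equation:

```python
import numpy as np

def rmsprop_update(w, grad, v, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp step: keep a decaying average of squared gradients
    and divide each parameter's step by its root."""
    v = rho * v + (1 - rho) * grad ** 2         # v_t
    w = w - lr * grad / (np.sqrt(v) + eps)      # omega_{t+1}
    return w, v

# Toy example: minimise f(w) = w^2 (gradient 2w) starting from w = 5.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, v = rmsprop_update(w, 2 * w, v, lr=0.01)
```

Because the gradient is divided by its own root-mean-square, the effective step size stays close to `lr` regardless of the raw gradient magnitude.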

## Adagrad

Adagrad, the Adaptive Gradient method, is an optimizer that is especially useful in image recognition and natural language processing, as it gives a higher learning rate to sparser variables. Like RMSProp, Adagrad has a learning rate for each of the variables. Initially Adagrad was only meant to be used for convex optimization, but it has been successfully adapted to the non-convex case.

The equation to update the parameters is:

\(\begin{equation}
\begin{split}
G_{j,j} = & \sum^t_{\tau = 1} \nabla Q_\tau(w)_j^2\\
w_j := & w_j - \frac{\nu}{\sqrt{G_{j,j}}} \nabla Q_j(w)
\end{split}
\end{equation}\)

Looking at the second equation we can see that we divide by \(\sqrt{G_{j,j}}\), the diagonal of \(G\), which is essentially the \(l_2\) norm of the past gradients for that parameter, so parameters with a history of big updates get dampened, and parameters with a history of small updates get scaled up.
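As a toy NumPy sketch of this per-parameter accumulation (illustrative only, not the Keras implementation):

```python
import numpy as np

def adagrad_update(w, grad, G, lr=0.5, eps=1e-7):
    """One Adagrad step: G accumulates the squared gradients of every
    past step, so frequently-updated parameters get ever smaller steps."""
    G = G + grad ** 2                           # diagonal of G only grows
    w = w - lr * grad / (np.sqrt(G) + eps)
    return w, G

# Toy example: minimise f(w) = w^2 (gradient 2w) starting from w = 5.
w, G = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, G = adagrad_update(w, 2 * w, G)
```

Note that `G` is never decayed, which is exactly the weakness Adadelta addresses below: the longer training runs, the smaller every update becomes.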

## Adadelta

Adadelta is an optimizer similar to Adagrad; according to its paper it fixes some shortcomings the authors felt Adagrad had. The main benefit of Adadelta over Adagrad is that it does not need a learning rate. This removes one of the hyper-parameters and makes it a lot easier to train. Also, as you may have noticed, in Adagrad the learning rate is divided by the root of the sum of all past squared gradients, which as time goes by makes every update vanishingly small.

Adadelta has a solution to the vanishing updates mentioned before: instead of summing over all past values, it uses a decaying average, effectively a sliding window over only the most recent steps. This means the accumulated term no longer grows without bound, so the updates are not forced toward 0.

The algorithm is described as pseudo code in the paper (Algorithm 1 of the Adadelta paper).
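The paper's pseudo code is not reproduced here, but one Adadelta step can be sketched in NumPy (following Algorithm 1 of the Adadelta paper; illustrative only, not the Keras implementation):

```python
import numpy as np

def adadelta_update(w, grad, Eg, Edx, rho=0.95, eps=1e-6):
    """One Adadelta step: decaying averages of squared gradients (Eg)
    and squared updates (Edx) replace the explicit learning rate."""
    Eg = rho * Eg + (1 - rho) * grad ** 2             # E[g^2]_t
    dx = -np.sqrt(Edx + eps) / np.sqrt(Eg + eps) * grad
    Edx = rho * Edx + (1 - rho) * dx ** 2             # E[dx^2]_t
    return w + dx, Eg, Edx

# One step on f(w) = w^2 at w = 5 (gradient 10): the update opposes the gradient.
w, Eg, Edx = adadelta_update(np.array([5.0]), np.array([10.0]),
                             np.zeros(1), np.zeros(1))
```

The ratio of the two root-mean-square terms plays the role of the learning rate, which is why none has to be supplied; only the decay rate `rho` and the small constant `eps` remain.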

## Adam

Adam is an optimizer that, according to its paper, combines the benefits of AdaGrad and RMSProp while being computationally efficient and requiring little memory. But how exactly does it work? The paper summarizes the algorithm as pseudo code (Algorithm 1 of the Adam paper).

The way Adam works is by taking exponentially weighted moving averages of the first and second moments of the gradient of the loss function at time-step \(t\), giving \(\hat{m}_t\) and \(\hat{v}_t\), and updating the parameters according to \(\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\).

As a note, there is a small bias correction, as otherwise the moment estimates would be biased toward zero for low values of \(t\).

The parameters this function needs are the two decay rates \(\beta_1\) and \(\beta_2\) and the step size \(\alpha\). The paper suggests default values, which we will use for this experiment, namely \(\alpha = 0.001\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\).
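The full update, including the bias correction, can be sketched in NumPy. This is an illustrative toy, not the Keras implementation; the \(\alpha\) here is larger than the paper's default purely so the toy problem converges in few steps:

```python
import numpy as np

def adam_update(w, grad, m, v, t, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: moving averages of the first and second moments,
    bias-corrected, then a per-parameter scaled update."""
    m = beta1 * m + (1 - beta1) * grad           # first moment m_t
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment v_t
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy example: minimise f(w) = w^2 (gradient 2w) starting from w = 5.
w, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_update(w, 2 * w, m, v, t)
```

Without the division by \(1 - \beta^t\), both `m` and `v` would start near zero (they are initialized to zero), so early updates would be far too small.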

## Nadam

Nadam, as the name may suggest, is closely related to Adam. In fact there is only a small difference between the two: Nadam uses Nesterov momentum when updating the weights.

Nesterov momentum is the same as regular momentum, except that the gradient is evaluated at the point the momentum step would take you, instead of where you currently are. This allows for slightly quicker convergence without changing the calculation much.
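A toy NumPy sketch makes the difference visible: the look-ahead evaluation point is the only change from classical momentum (illustrative only, not how Keras implements Nadam internally):

```python
import numpy as np

def nesterov_update(w, grad_fn, velocity, lr=0.01, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead
    point w + mu * velocity instead of at w itself."""
    lookahead = w + mu * velocity            # where the momentum step lands
    velocity = mu * velocity - lr * grad_fn(lookahead)
    return w + velocity, velocity

# Toy example: minimise f(w) = w^2 (gradient 2w) starting from w = 5.
w, vel = np.array([5.0]), np.zeros(1)
for _ in range(300):
    w, vel = nesterov_update(w, lambda x: 2 * x, vel)
```

Classical momentum would call `grad_fn(w)` instead of `grad_fn(lookahead)`; peeking ahead lets the velocity correct itself one step earlier.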

## Results

Now that we know more or less how all these optimizers work, it is time to look at the results. We use data that was logged from TensorFlow and displayed in TensorBoard. Unfortunately, to the best of my knowledge, TensorBoard does not support adding legends to the graphs, so I will add labels in the description.

Looking at the data above we can see that all methods are an improvement over standard SGD, which is not much of a surprise, as theoretically they all offer an improvement on it. Also, as we can see from the fact that the validation accuracy drops off for some, we have over-fit on the training data.

This over-fitting is most obvious in the case of RMSProp (light blue), which drops in validation accuracy after around 15 epochs, even though its training accuracy continues to improve. Other methods do not seem to over-fit as much, and Adagrad appears best at resisting over-fitting: even though it has the worst training accuracy, it has the highest validation accuracy. Adagrad is also one of the few that could have done with some more training, as its validation accuracy was still increasing at the last epoch.

As a final note, it seems that for relatively shallow networks and few training epochs, Adagrad performs best. However, from a theoretical point of view I would suggest looking at Adadelta for larger networks that are trained for longer.