Optimization of the CNN was done using the algorithm ADAM developed by Diederik P. Kingma and Jimmy Lei Ba described in their paper Adam: A Method for Stochastic Optimization. The motivation for choosing this algorithm is the low computational complexity of the method. As stated in the paper it only requires first order gradients on the input objective function which has the same computational complexity as evaluating the original function. Some other advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are bounded by the stepsize hyperparameter, and naturally performs a form of step size annealing. In any case this means the algorithm has a degree of automation and adaptivity that many other algorithms do not present.
We start with defining the pseudo code which defines how this functions. Adam uses the method of moments which is where unknown parameters are estimated using 0th, 1st, and higher order moments. Adam uses only the first two. For example in the case of a random vector, 1st order moment is the mean vector and second order is the covariance matrix, a measure of how related two variables are.
Inputs: α (step size), β1β2 (exponential decay rates for moment estimation ranging from [0,1] ), f(θ) (stochastic objective function), θ_0 (initial parameter vector). As advised by the paper the input values for α=0.001, β1=0.9, β2=.999 and ε=10^-8 (for update last step)
Initialization: m_0=0 (1st moment vector), v_0=0 (2nd moment vector), t=0 (Initialize timestep)
In general, we are trying to minimize a noisy objective function such that it matches our expected outoput (an image) with respect to the parameters. Moment estimates are used to compute a converging parameter using bias correction. Bias correction is required because moments are initialized to zero thus without the bias the estimate may or may not change at all.
Given the gradient (g) of the stochastic objective f we can estimate the moment based upon all previous moments. This is done using the following summation formula. Where the beta represents the decay rate of the function and gradient squared is the elementwise square of the vector. (second moment shown here).
The expected value of the moment can be calculated in the same manner where ζ represents an offset value to account for the moving average of the second moment.
Looking at the final line of the equation we are able to see how we can calculate the expected moment and the scalar value multiplying the expected second order gradient to obtain our biased second order moment. This in turn is what is used to scale to obtain the correct value for the newly defined timestep scalar below.
Kingma and Ba have stated that the following conditions are a good choice of how to choose alpha as a step size. |∆t| ≤ α · (1 − β1)/ √( 1 − β2) in the case (1 − β1) > √ (1 − β2), and |∆t| ≤ α. Recall α is the step size and essentially the bound on how many iterations the algorithm will have to go through before we are able to reach an estimate of our parameter. Often we are able to choose α based upon some physical limitation in the model we are building. However, although inefficient often choosing a small enough step size (thus large amount of iterations) will still be able to generate the same result as long as the algorithm is run long enough. In Adam’s case the step size will be automatically scaled as we reach our expected parameter values in the asymptotic case. Thus, larger step size values will still be relatively effective. We are able to see this in the equation for the update step where α is scaled by the moments
Kingma and Ba “define” a signal to noise ratio which describes the condition in which we have converged on the expected parameter when the SNR reaches zero. This is easy to see because we can note that a value multiplied by zero is zero.
Kingma, Diederik P., and Jimmy Lei Ba. “ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION.” ICLR (2015): 1-15. 23 July 2015. Web.