SC1

Statistical Computing 1

Gradient Descent and Stochastic Gradient Descent


Gradient descent implementation

We try to solve the previous nonlinear least squares problem using gradient descent. The difference in magnitudes of the optimal parameter values of \(b\) in this example causes the gradient descent algorithm to converge very slowly, if at all (have a try yourself). To illustrate how the gradient descent algorithm works, we scale \(b_3\) by 100 in the code below so that the optimal values of \(b\) are of roughly the same magnitude.

Below we define the function to be minimised, \(f\), as well as the residue function for a single sample, \(r_1\), for which we take advantage of R's symbolic differentiation via the deriv function (this is not a very efficient implementation, as we also differentiate \(r_1\) with respect to the variables \(x\) and \(y\), which is not actually needed).

# data
ydat <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558, 50.156, 62.948, 75.995, 91.972)
tdat <- seq_along(ydat)
n_iter <- 20000
f_val_trace <- matrix(NA, n_iter, 2) # to keep track of the function value for each algorithm

# define the function and the gradient of residue
f <- function(b, tdat, ydat) sum((ydat - b[1]/(1+b[2]*exp(-b[3]/100*tdat)))^2)
r1 <- deriv(expression((y - b1/(1+b2*exp(-b3/100*x)))^2), namevec = c('b1','b2','b3','x','y'), function.arg = TRUE)
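
As an aside, these unnecessary derivatives can be avoided by listing only the parameters in namevec and passing the data variables as plain arguments of the generated function. Below is a minimal sketch of such an alternative; the name r1_eff is introduced here for illustration and is not used in the rest of the code.

# sketch: differentiate only with respect to b1, b2 and b3;
# x and y remain ordinary arguments of the generated function
r1_eff <- deriv(expression((y - b1/(1+b2*exp(-b3/100*x)))^2),
                namevec = c('b1','b2','b3'),
                function.arg = c('b1','b2','b3','x','y'))
# attr(r1_eff(b[1], b[2], b[3], tdat[i], ydat[i]), 'gradient') then has exactly three columns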

# steepest descent algorithm with line search
b <- c(10, 10, 10) # initial values of b
gammas <- 0.1 / (2^(1:10)) # the gammas to try in the line search
f_vals <- numeric(length(gammas)) # to store values of f in the line search
ndat <- length(tdat)
temp <- matrix(NA, nrow=ndat, ncol=4) # to store partial derivatives for each data point
for (k in 1:n_iter) { # number of iterations
  # compute partial derivatives
  for (i in 1:ndat) {
    g_out <- r1(b[1], b[2], b[3], tdat[i], ydat[i])
    temp[i,4] <- as.numeric(g_out) # squared residual for this data point goes in column 4
    temp[i,1:3] <- attr(g_out, 'gradient')[1:3] # partial derivatives with respect to b1, b2, b3
  }
  # sum over data points: entries 1-3 give the gradient of f, entry 4 the current value of f
  f_grad_val <- colSums(temp)
  # line search
  for (i_gamma in seq_along(gammas)) {
    f_vals[i_gamma] <- f(b-gammas[i_gamma]*f_grad_val[1:3], tdat, ydat)
  }
  i_gamma <- which.min(f_vals)
  if (f_vals[i_gamma] < f_grad_val[4]) {
    b <- b - gammas[i_gamma] * f_grad_val[1:3]
  } else {
    # if function value cannot be decreased, stop with a warning
    warning('Unable to decrease residue')
    break
  }
  f_val_trace[k, 1] <- f_grad_val[4]
  # print every 1000 iterations
  if (k %% 1000 == 0)
    cat('k=', k, ', f_value=', f_grad_val[4], ', gamma=', gammas[i_gamma], ',  b=', b, '\n')
}
## k= 1000 , f_value= 17.42585 , gamma= 0.0125 ,  b= 143.4445 46.00969 36.08222 
## k= 2000 , f_value= 9.278284 , gamma= 0.0125 ,  b= 156.1346 45.94907 34.48427 
## k= 3000 , f_value= 6.471148 , gamma= 0.00625 ,  b= 163.5951 46.26988 33.73238 
## k= 4000 , f_value= 5.131484 , gamma= 0.00625 ,  b= 168.6628 46.58139 33.27617 
## k= 5000 , f_value= 4.16643 , gamma= 0.05 ,  b= 173.6262 46.94417 32.86367 
## k= 6000 , f_value= 3.673208 , gamma= 0.025 ,  b= 176.979 47.21571 32.60649 
## k= 7000 , f_value= 3.387468 , gamma= 0.00625 ,  b= 179.3912 47.42299 32.43019 
## k= 8000 , f_value= 3.201855 , gamma= 0.00625 ,  b= 181.2615 47.59035 32.29685 
## k= 9000 , f_value= 3.071262 , gamma= 0.00625 ,  b= 182.7896 47.73091 32.19096 
## k= 10000 , f_value= 2.973355 , gamma= 0.0125 ,  b= 184.1047 47.85463 32.10147 
## k= 11000 , f_value= 2.900118 , gamma= 0.00625 ,  b= 185.222 47.9615 32.02727 
## k= 12000 , f_value= 2.844853 , gamma= 0.00625 ,  b= 186.1695 48.0535 31.96503 
## k= 13000 , f_value= 2.79941 , gamma= 0.05 ,  b= 187.0407 48.13925 31.90797 
## k= 14000 , f_value= 2.763068 , gamma= 0.00625 ,  b= 187.8133 48.21583 31.85916 
## k= 15000 , f_value= 2.73755 , gamma= 0.00625 ,  b= 188.4119 48.27579 31.82123 
## k= 16000 , f_value= 2.715636 , gamma= 0.00625 ,  b= 188.9725 48.33233 31.78602 
## k= 17000 , f_value= 2.696745 , gamma= 0.025 ,  b= 189.5001 48.38595 31.75283 
## k= 18000 , f_value= 2.681187 , gamma= 0.00625 ,  b= 189.9728 48.43417 31.72383 
## k= 19000 , f_value= 2.667089 , gamma= 0.00625 ,  b= 190.439 48.482 31.69525 
## k= 20000 , f_value= 2.655593 , gamma= 0.05 ,  b= 190.8546 48.52496 31.66949

Upon running this steepest descent algorithm, we see that it is much slower than quasi-Newton algorithms such as BFGS. In fact, even after 20,000 iterations the algorithm still has not converged; it takes around 50,000 iterations for steepest descent with line search to converge on this problem, compared to 17 iterations (solving the harder unscaled problem) using the Levenberg-Marquardt algorithm.
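
For comparison, the same scaled problem can be handed to a quasi-Newton method directly. Below is a minimal sketch using base R's optim with method = 'BFGS', reusing f and r1 from above; the helper f_grad is our own addition here, assembling the gradient of f from the per-sample gradients.

# gradient of f, assembled from the per-sample gradients returned by r1
# (the deriv-generated function accepts vector arguments for x and y)
f_grad <- function(b, tdat, ydat) {
  g <- r1(b[1], b[2], b[3], tdat, ydat)
  colSums(attr(g, 'gradient')[, 1:3, drop = FALSE])
}
fit_bfgs <- optim(c(10, 10, 10), f, gr = f_grad, method = 'BFGS',
                  tdat = tdat, ydat = ydat)
fit_bfgs$par     # fitted parameters (b3 still on the scaled-by-100 scale)
fit_bfgs$counts  # number of function and gradient evaluations used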

Even though there is no need to use stochastic gradient descent on a dataset with so few points, we nevertheless implement it below to get a rough idea of how things work. We have picked \(\gamma = 0.0005\), roughly the largest step size for which the trace of the function value does not show multiple large spikes.

b <- c(10, 10, 10)
gamma <- 0.0005 # we use a fixed gamma without line search
ndat <- length(tdat)
temp <- matrix(NA, nrow=ndat, ncol=4)
for (k in 1:n_iter) {
  # we could also visit the data points in a random order on each pass using sample.int
  # (a sketch of this variant is given at the end of this section)
  for (i in 1:ndat) {
    r_grad_val <- r1(b[1], b[2], b[3], tdat[i], ydat[i])
    temp[i,4] <- as.numeric(r_grad_val)
    temp[i,1:3] <- attr(r_grad_val, 'gradient')[1:3]
    b <- b - gamma * temp[i,1:3]
  }
  f_val_trace[k, 2] <- f(b, tdat, ydat)
  if (k %% 1000 == 0) cat('k=', k, ', f_value=', f_val_trace[k, 2], ',  b=', b, '\n')
}
## k= 1000 , f_value= 1068.17 ,  b= 75.09771 16.51267 40.53817 
## k= 2000 , f_value= 443.8386 ,  b= 88.76626 26.90786 41.64001 
## k= 3000 , f_value= 266.8217 ,  b= 96.34154 32.36828 41.47682 
## k= 4000 , f_value= 186.3969 ,  b= 101.5696 35.91675 41.23133 
## k= 5000 , f_value= 141.9037 ,  b= 105.5421 38.4406 40.99032 
## k= 6000 , f_value= 114.3244 ,  b= 108.7358 40.32359 40.7631 
## k= 7000 , f_value= 95.8765 ,  b= 111.4021 41.76874 40.54877 
## k= 8000 , f_value= 82.82706 ,  b= 113.6892 42.89738 40.34522 
## k= 9000 , f_value= 73.18432 ,  b= 115.6917 43.78819 40.15062 
## k= 10000 , f_value= 65.80053 ,  b= 117.4734 44.49531 39.96358 
## k= 11000 , f_value= 59.97469 ,  b= 119.0795 45.05767 39.78311 
## k= 12000 , f_value= 55.25875 ,  b= 120.5428 45.50425 39.6085 
## k= 13000 , f_value= 51.35566 ,  b= 121.8881 45.85724 39.43922 
## k= 14000 , f_value= 48.06244 ,  b= 123.1343 46.13397 39.27492 
## k= 15000 , f_value= 45.2368 ,  b= 124.2962 46.34827 39.11533 
## k= 16000 , f_value= 42.77683 ,  b= 125.3857 46.51128 38.96027 
## k= 17000 , f_value= 40.60809 ,  b= 126.4121 46.63212 38.8096 
## k= 18000 , f_value= 38.67525 ,  b= 127.3833 46.71831 38.66321 
## k= 19000 , f_value= 36.93656 ,  b= 128.3056 46.7761 38.52103 
## k= 20000 , f_value= 35.36 ,  b= 129.1844 46.81069 38.38299
plot(f_val_trace[, 2], type = 'l', col = 'red', xlab = 'iteration', ylab = 'loss function',
     ylim = c(min(f_val_trace), max(f_val_trace)), log = 'xy')
lines(f_val_trace[, 1], col='blue')
legend(n_iter/50, max(f_val_trace)*0.95, legend=c('SGD', 'steepest descent'), col=c('red', 'blue'), lty=c(1,1))

We see that on this problem SGD does not compare favourably to steepest descent, whose performance is already woeful compared to Levenberg-Marquardt. In fact, SGD does not converge even after 300,000 iterations. One thing to keep in mind, however, is that the dataset here is very small and the parameter optimisation problem quite hard. With a much larger dataset, SGD performs reasonably well and is, moreover, easy to implement. Furthermore, in real-world applications, whether the algorithm converges matters less than the out-of-sample performance of the model: in practice, SGD would be stopped once out-of-sample performance begins to deteriorate, long before convergence.
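
As a final illustration of the comment in the SGD loop above, the random visiting order via sample.int can be combined with a slowly decaying step size, which is closer to how SGD is usually run in practice. The sketch below uses a hypothetical decay schedule that has not been tuned for this problem.

b <- c(10, 10, 10)
gamma0 <- 0.0005                     # initial step size, as in the fixed-gamma run above
for (k in 1:n_iter) {
  gamma_k <- gamma0 / (1 + k/5000)   # hypothetical decay schedule (untuned)
  for (i in sample.int(ndat)) {      # visit the data points in a random order on each pass
    r_grad_val <- r1(b[1], b[2], b[3], tdat[i], ydat[i])
    b <- b - gamma_k * attr(r_grad_val, 'gradient')[1:3]
  }
}
f(b, tdat, ydat)                     # residue after n_iter passes over the data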