Statistical Computing 1

R Optimisation Functions


A Two-dimensional Example

To illustrate various R optimisation functions, we define a simple two-dimensional function below and make a contour plot. We use the meshgrid function from the package pracma, which gives \(x\) and \(y\) coordinates of a square grid in \(\mathbb{R}^2\).

library(pracma) # for meshgrid function
f <- function(x1, x2) cos(x1-1) + cos(x2) + sin(3*x1*x2) + x1^2 + x2^2
meshgrid(seq(-2, 2, length=5))
## $X
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   -2   -1    0    1    2
## [2,]   -2   -1    0    1    2
## [3,]   -2   -1    0    1    2
## [4,]   -2   -1    0    1    2
## [5,]   -2   -1    0    1    2
## 
## $Y
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   -2   -2   -2   -2   -2
## [2,]   -1   -1   -1   -1   -1
## [3,]    0    0    0    0    0
## [4,]    1    1    1    1    1
## [5,]    2    2    2    2    2
x <- seq(-2, 2, length=101)
grid_XY <- meshgrid(x)
z <- matrix(mapply(f, grid_XY$X, grid_XY$Y), nrow=101)
min(z)
## [1] 0.5926044
contour(x, x, z, nlevels=20)
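
We can also read off roughly where this grid minimum sits, which is useful later for judging which local minimum the optimisers land in. A minimal sketch using base R indexing (no output shown here):

idx <- which(z == min(z), arr.ind=TRUE)   # row/column index of the smallest grid value
# z[i, j] = f(x[j], x[i]) given how mapply pairs the meshgrid matrices,
# so the column index corresponds to x1 and the row index to x2
c(x1 = x[idx[1, "col"]], x2 = x[idx[1, "row"]])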

The nlm function

The R function nlm uses a Newton-type algorithm. We have the option of supplying the gradient and the Hessian of the objective. For this, we use the deriv function in R, which takes an expression and returns a function that evaluates the expression and attaches its gradient (and Hessian, if requested) as attributes of the result.

f1 <- deriv(expression(cos(x1-1) + cos(x2) + sin(3*x1*x2) + x1^2 + x2^2),
  namevec = c('x1', 'x2'), function.arg=TRUE, hessian=TRUE)
f1(0,0)
## [1] 1.540302
## attr(,"gradient")
##            x1 x2
## [1,] 0.841471  0
## attr(,"hessian")
## , , x1
## 
##            x1 x2
## [1,] 1.459698  3
## 
## , , x2
## 
##      x1 x2
## [1,]  3  1
nlm(function(x) f1(x[1], x[2]), c(0,0))
## $minimum
## [1] 0.591872
## 
## $estimate
## [1] -0.7366981  0.5830360
## 
## $gradient
## [1] -2.415179e-12  2.897682e-12
## 
## $code
## [1] 1
## 
## $iterations
## [1] 7
nlm(function(x) f1(x[1], x[2]), c(-1.5, 1.5))
## $minimum
## [1] 3.178075
## 
## $estimate
## [1] -1.506530  1.622656
## 
## $gradient
## [1] 1.207285e-08 7.978390e-08
## 
## $code
## [1] 1
## 
## $iterations
## [1] 4

We see from the results that nlm converges rapidly to a local minimum near the initial guess, but not necessarily to the global minimum. Starting from \((0,0)\), nlm does reach the global minimum 0.591872, which is slightly smaller than the minimum value of \(f\) over the grid found previously. The downside of nlm is that it only converges to a local minimum and, just as in the one-dimensional case, it is difficult to predict which one it will find if we do not start close to it.
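
One common workaround is a multi-start strategy: run nlm from several starting points and keep the best result. A minimal sketch, assuming (arbitrarily) ten uniform random starting points in \([-2,2]^2\):

set.seed(1)
starts <- matrix(runif(20, -2, 2), ncol=2)   # 10 random starting points in [-2, 2]^2
runs <- lapply(seq_len(nrow(starts)),
               function(i) nlm(function(x) f1(x[1], x[2]), starts[i, ]))
best <- runs[[which.min(sapply(runs, `[[`, "minimum"))]]  # keep the run with the smallest minimum
best$estimate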

If we simply pass the function to nlm without gradient or Hessian attributes, nlm computes the derivatives numerically, which is not as accurate as using the symbolic derivatives.

nlm(function(x) f(x[1], x[2]), c(0,0))
## $minimum
## [1] 0.591872
## 
## $estimate
## [1] -0.7366992  0.5830349
## 
## $gradient
## [1]  3.996803e-08 -1.776357e-09
## 
## $code
## [1] 1
## 
## $iterations
## [1] 9
nlm(function(x) f(x[1], x[2]), c(-1.5, 1.5))
## $minimum
## [1] 3.178075
## 
## $estimate
## [1] -1.506530  1.622656
## 
## $gradient
## [1]  1.143747e-07 -1.180056e-07
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10

Comparing the runs that use the symbolic gradient/Hessian with those relying on numerical differentiation, we see that the numerical version reports larger gradient values at the minimum it finds and also takes more iterations.
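
Note that the $gradient reported in the purely numerical runs is itself a finite-difference approximation. As a quick check, we can evaluate the exact gradient (via the deriv-based f1) at the estimate returned by the numerical run started from \((0,0)\):

# exact gradient at the estimate from the numerical run above
attr(f1(-0.7366992, 0.5830349), "gradient")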

The optim function

The default method used by optim is Nelder-Mead.

optim(c(0,0), function(x) f(x[1], x[2]))
## $par
## [1] -0.7367403  0.5830213
## 
## $value
## [1] 0.591872
## 
## $counts
## function gradient 
##       63       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(c(-1.5,1.5), function(x) f(x[1], x[2]))
## $par
## [1] -1.506550  1.622604
## 
## $value
## [1] 3.178075
## 
## $counts
## function gradient 
##       49       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

We also try the CG, BFGS and L-BFGS-B methods, all of which seem to work well from this starting point. L-BFGS-B additionally accepts box constraints through the lower and upper arguments, as in the final call below.

optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="CG")
## $par
## [1] -1.506531  1.622654
## 
## $value
## [1] 3.178075
## 
## $counts
## function gradient 
##      134       29 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="BFGS")
## $par
## [1] -1.506552  1.622629
## 
## $value
## [1] 3.178075
## 
## $counts
## function gradient 
##       12        6 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="L-BFGS-B")
## $par
## [1] -1.506529  1.622657
## 
## $value
## [1] 3.178075
## 
## $counts
## function gradient 
##        9        9 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]),
      method="L-BFGS-B", lower=c(-2, -2), upper=c(2, 1.6))
## $par
## [1] -1.52373  1.60000
## 
## $value
## [1] 3.179766
## 
## $counts
## function gradient 
##        9        9 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

We observe that BFGS and L-BFGS-B require far fewer function and gradient evaluations than either Nelder-Mead or CG. Finally, we try simulated annealing (SANN), the only method here that claims to be able to find the global minimum.

optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="SANN", control=list(maxit=100))
## $par
## [1] -0.7244211  0.5261632
## 
## $value
## [1] 0.6032669
## 
## $counts
## function gradient 
##      100       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="SANN", control=list(maxit=1000))
## $par
## [1] -0.7621775  0.5859042
## 
## $value
## [1] 0.5937823
## 
## $counts
## function gradient 
##     1000       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="SANN", control=list(maxit=3000))
## $par
## [1] -0.7599764  0.5642328
## 
## $value
## [1] 0.593034
## 
## $counts
## function gradient 
##     3000       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Simulated annealing certainly does not get stuck in the local minimum close to its starting point, but it takes a lot of iterations to get close to the global minimum, even for such a simple function.
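
The behaviour of SANN also depends heavily on its cooling schedule, which can be tuned through the temp and tmax entries of control (starting temperature and number of function evaluations at each temperature; the defaults are 10 and 10). A sketch with a slower schedule, where the chosen values are only illustrative and whether they help is problem-dependent:

optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="SANN",
      control=list(maxit=3000, temp=2, tmax=50))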

A Nonlinear Least Squares Example

We use a dataset from the help file of the nlmrt package. A plot of the data as well as the fitted logistic model is shown below. We first use R functions that are specifically designed to solve nonlinear least squares problems: nls (part of base R), which uses the Gauss-Newton algorithm by default, and nlsLM (from the minpack.lm package), which uses the Levenberg-Marquardt algorithm.

ydat <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558, 50.156, 62.948, 75.995, 91.972)
tdat <- seq_along(ydat)
my_data <- data.frame(y=ydat, t=tdat)
start1 <- c(b1=1, b2=1, b3=1)
my_model <- y ~ b1/(1+b2*exp(-b3*t))
try(nls(my_model, start=start1, data=my_data))
## Error in nls(my_model, start = start1, data = my_data) : 
##   singular gradient
library(minpack.lm)
out_lm <- nlsLM(my_model, start=start1, data=my_data)
out_lm
## Nonlinear regression model
##   model: y ~ b1/(1 + b2 * exp(-b3 * t))
##    data: my_data
##       b1       b2       b3 
## 196.1863  49.0916   0.3136 
##  residual sum-of-squares: 2.587
## 
## Number of iterations to convergence: 17 
## Achieved convergence tolerance: 1.49e-08
plot(tdat, ydat, pch=18, xlab='t', ylab='y', col='red')
lines(tdat, out_lm$m$fitted(), pch=1, col='blue', type='b', lty=2)
legend(1, 95, legend=c('data', 'fitted model'), col=c('red', 'blue'), pch=c(18,1))

We see that the Gauss-Newton algorithm fails for this slightly tricky dataset and model because the gradient (Jacobian) becomes singular, whereas the Levenberg-Marquardt algorithm, which is designed to cope with singular gradients, has no problem finding a good fit to the data.
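
The failure is partly a matter of starting values: far from the solution the Jacobian is numerically rank-deficient, while closer to it Gauss-Newton tends to behave. As a quick check, nls may well converge if started near the Levenberg-Marquardt estimates; the starting values below are simply rounded from the nlsLM fit above, and the call is wrapped in try() as before in case it still fails:

try(nls(my_model, start=c(b1=200, b2=50, b3=0.3), data=my_data))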

Next we define the residual sum of squares explicitly as the objective function and use general-purpose optimisation functions on this nonlinear least squares problem.

# residual sum of squares (note: this redefines f from the earlier section)
f <- function(b, mydata) sum((mydata$y-b[1]/(1+b[2]*exp(-b[3]*mydata$t)))^2)

nlm(f, mydata=my_data, p=start1)
## $minimum
## [1] 2.587305
## 
## $estimate
## [1] 196.2832646  49.0967439   0.3135034
## 
## $gradient
## [1]  2.383761e-08 -6.476354e-08  3.170175e-05
## 
## $code
## [1] 2
## 
## $iterations
## [1] 45
optim(par=start1, fn=f, mydata=my_data) # default method is Nelder-Mead
## $par
##        b1        b2        b3 
##  35.52886 -28.58678  18.73764 
## 
## $value
## [1] 9205.436
## 
## $counts
## function gradient 
##      156       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(par=start1, fn=f, mydata=my_data, method="CG")
## $par
##         b1         b2         b3 
## 39.7789687  5.2889870  0.4890772 
## 
## $value
## [1] 5336.906
## 
## $counts
## function gradient 
##      447      101 
## 
## $convergence
## [1] 1
## 
## $message
## NULL
optim(par=start1, fn=f, mydata=my_data, method="BFGS")
## $par
##          b1          b2          b3 
## 196.5079544  49.1138533   0.3133611 
## 
## $value
## [1] 2.587543
## 
## $counts
## function gradient 
##      193       63 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
optim(par=start1, fn=f, mydata=my_data, method="L-BFGS-B")
## $par
##          b1          b2          b3 
## 196.5325081  49.1219542   0.3133583 
## 
## $value
## [1] 2.587556
## 
## $counts
## function gradient 
##      104      104 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
optim(par=start1, fn=f, mydata=my_data, method="SANN", control=list(maxit=30000))
## $par
##         b1         b2         b3 
## 98.3608388 33.5661121  0.4142424 
## 
## $value
## [1] 234.2907
## 
## $counts
## function gradient 
##    30000       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

We see from the results that the only algorithms that succeed in finding a reasonable solution to this problem are the Newton-type methods: nlm, BFGS and L-BFGS-B. Both Nelder-Mead and conjugate gradients fail to get anywhere close to a reasonable fit. Simulated annealing also fails, despite running for 30,000 iterations on a problem with only three parameters.
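
Part of the difficulty is scaling: the parameters differ by orders of magnitude at the solution (roughly 200, 50 and 0.3), and the derivative-free methods are particularly sensitive to this. optim's parscale control rescales the parameters internally; a sketch of how it might be used, where the chosen scales are only a guess based on the fitted values above:

# rescale parameters so each is of order one internally; scales are illustrative only
optim(par=start1, fn=f, mydata=my_data, method="Nelder-Mead",
      control=list(parscale=c(100, 10, 0.1), maxit=5000))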