R Optimisation Functions
A Two-dimensional Example
To illustrate various R optimisation functions, we define a simple two-dimensional function below and make a contour plot. We use the meshgrid function from the package pracma, which gives the \(x\) and \(y\) coordinates of a square grid in \(\mathbb{R}^2\).
library(pracma) # for meshgrid function
f <- function(x1, x2) cos(x1-1) + cos(x2) + sin(3*x1*x2) + x1^2 + x2^2
meshgrid(seq(-2, 2, length=5))
## $X
## [,1] [,2] [,3] [,4] [,5]
## [1,] -2 -1 0 1 2
## [2,] -2 -1 0 1 2
## [3,] -2 -1 0 1 2
## [4,] -2 -1 0 1 2
## [5,] -2 -1 0 1 2
##
## $Y
## [,1] [,2] [,3] [,4] [,5]
## [1,] -2 -2 -2 -2 -2
## [2,] -1 -1 -1 -1 -1
## [3,] 0 0 0 0 0
## [4,] 1 1 1 1 1
## [5,] 2 2 2 2 2
x <- seq(-2, 2, length=101)
grid_XY <- meshgrid(x)
z <- matrix(mapply(f, grid_XY$X, grid_XY$Y), nrow=101)
min(z)
## [1] 0.5926044
contour(x, x, z, nlevels=20)
The nlm function
The R function nlm uses a Newton-type algorithm. We have the option of specifying the gradient and Hessian functions. For this, we use the deriv function in R, which takes an expression and returns a function that computes the value of \(f\) together with its gradient (and Hessian, if requested) attached as attributes.
f1 <- deriv(expression(cos(x1-1) + cos(x2) + sin(3*x1*x2) + x1^2 + x2^2),
namevec = c('x1', 'x2'), function.arg=T, hessian=T)
f1(0,0)
## [1] 1.540302
## attr(,"gradient")
## x1 x2
## [1,] 0.841471 0
## attr(,"hessian")
## , , x1
##
## x1 x2
## [1,] 1.459698 3
##
## , , x2
##
## x1 x2
## [1,] 3 1
nlm(function(x) f1(x[1], x[2]), c(0,0))
## $minimum
## [1] 0.591872
##
## $estimate
## [1] -0.7366981 0.5830360
##
## $gradient
## [1] -2.415179e-12 2.897682e-12
##
## $code
## [1] 1
##
## $iterations
## [1] 7
nlm(function(x) f1(x[1], x[2]), c(-1.5, 1.5))
## $minimum
## [1] 3.178075
##
## $estimate
## [1] -1.506530 1.622656
##
## $gradient
## [1] 1.207285e-08 7.978390e-08
##
## $code
## [1] 1
##
## $iterations
## [1] 4
We see from the results that nlm converges rapidly to a local minimum near the initial guess, but not necessarily to the global minimum. If we start from the point \((0,0)\), nlm does converge to the global minimum 0.591872, which is slightly smaller than the minimum value of \(f\) on the grid we found previously (0.5926044). The downside of nlm is that it only converges to a local minimum and, just as in the one-dimensional case, it is difficult to predict what will happen if we do not start close to a local minimum.
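One common workaround, sketched below, is to restart nlm from several starting points and keep the best local minimum found; the number of starts and the sampling range are arbitrary choices for illustration, and the output is not shown here.
# Multi-start sketch: run nlm from several random starting points in [-2, 2]^2
# and keep the best local minimum found (number of starts is arbitrary)
set.seed(1)                                   # seed chosen only for reproducibility
starts <- matrix(runif(20, -2, 2), ncol = 2)  # 10 random starting points
fits <- lapply(seq_len(nrow(starts)),
               function(i) nlm(function(x) f1(x[1], x[2]), starts[i, ]))
best <- fits[[which.min(sapply(fits, function(r) r$minimum))]]
best$estimate
best$minimum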
If we simply give the function to nlm without gradient or Hessian attributes, then nlm will compute the derivatives numerically, which will not be as accurate as the symbolic expressions.
nlm(function(x) f(x[1], x[2]), c(0,0))
## $minimum
## [1] 0.591872
##
## $estimate
## [1] -0.7366992 0.5830349
##
## $gradient
## [1] 3.996803e-08 -1.776357e-09
##
## $code
## [1] 1
##
## $iterations
## [1] 9
nlm(function(x) f(x[1], x[2]), c(-1.5, 1.5))
## $minimum
## [1] 3.178075
##
## $estimate
## [1] -1.506530 1.622656
##
## $gradient
## [1] 1.143747e-07 -1.180056e-07
##
## $code
## [1] 1
##
## $iterations
## [1] 10
Comparing the results obtained with the explicit gradient/Hessian functions against those obtained with numerically computed derivatives, we see that the numerical computation leaves larger gradient values at the minimum nlm finds and also takes more iterations.
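To make the accuracy difference concrete, one can compare the analytic gradient from deriv with a simple central-difference approximation at a test point; the point and step size below are arbitrary choices for illustration, and the output is not shown.
# Compare the analytic gradient from deriv() with a central-difference
# approximation of the same gradient (test point and step size are arbitrary)
p <- c(-0.7, 0.6)
h <- 1e-6
c((f(p[1] + h, p[2]) - f(p[1] - h, p[2])) / (2 * h),
  (f(p[1], p[2] + h) - f(p[1], p[2] - h)) / (2 * h))   # numerical gradient
attr(f1(p[1], p[2]), "gradient")                        # analytic gradient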
The optim function
The default method used by optim is Nelder-Mead.
optim(c(0,0), function(x) f(x[1], x[2]))
## $par
## [1] -0.7367403 0.5830213
##
## $value
## [1] 0.591872
##
## $counts
## function gradient
## 63 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(c(-1.5,1.5), function(x) f(x[1], x[2]))
## $par
## [1] -1.506550 1.622604
##
## $value
## [1] 3.178075
##
## $counts
## function gradient
## 49 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
We also try the CG, BFGS, and L-BFGS-B methods, all of which seem to work well.
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="CG")
## $par
## [1] -1.506531 1.622654
##
## $value
## [1] 3.178075
##
## $counts
## function gradient
## 134 29
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="BFGS")
## $par
## [1] -1.506552 1.622629
##
## $value
## [1] 3.178075
##
## $counts
## function gradient
## 12 6
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="L-BFGS-B")
## $par
## [1] -1.506529 1.622657
##
## $value
## [1] 3.178075
##
## $counts
## function gradient
## 9 9
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]),
method="L-BFGS-B", lower=c(-2, -2), upper=c(2, 1.6))
## $par
## [1] -1.52373 1.60000
##
## $value
## [1] 3.179766
##
## $counts
## function gradient
## 9 9
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
We observe that BFGS and L-BFGS-B require far fewer function and gradient evaluations than either Nelder-Mead or CG. Finally, we try simulated annealing, the only method that claims to be able to find the global minimum.
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="SANN", control=list(maxit=100))
## $par
## [1] -0.7244211 0.5261632
##
## $value
## [1] 0.6032669
##
## $counts
## function gradient
## 100 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="SANN", control=list(maxit=1000))
## $par
## [1] -0.7621775 0.5859042
##
## $value
## [1] 0.5937823
##
## $counts
## function gradient
## 1000 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(c(-1.5, 1.5), function(x) f(x[1], x[2]), method="SANN", control=list(maxit=3000))
## $par
## [1] -0.7599764 0.5642328
##
## $value
## [1] 0.593034
##
## $counts
## function gradient
## 3000 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Simulated annealing certainly does not get stuck in the local minimum close to its starting point, but it takes a lot of iterations to get close to the global minimum, even for such a simple function.
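As an aside on the comparison above: optim also accepts an analytic gradient through its gr argument, which the gradient-based methods (CG, BFGS, L-BFGS-B) can use instead of finite differences. A sketch using the gradient attribute from f1 is shown below (output not shown).
# Sketch: pass the analytic gradient from deriv() to optim via the gr argument
fn <- function(x) f(x[1], x[2])
gr <- function(x) as.numeric(attr(f1(x[1], x[2]), "gradient"))
optim(c(-1.5, 1.5), fn, gr, method = "BFGS")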
A Nonlinear Least Squares Example
We use a dataset from the help file of the nlmrt package. A plot of the data, together with the fitted logistic model, is shown below. We first use R functions that are specifically designed to solve nonlinear least squares problems: nls (part of base R), which uses the Gauss-Newton algorithm by default, and nlsLM (from the minpack.lm package), which uses the Levenberg-Marquardt algorithm.
ydat <- c(5.308, 7.24, 9.638, 12.866, 17.069, 23.192, 31.443, 38.558, 50.156, 62.948, 75.995, 91.972)
tdat <- seq_along(ydat)
my_data <- data.frame(y=ydat, t=tdat)
start1 <- c(b1=1, b2=1, b3=1)
my_model <- y ~ b1/(1+b2*exp(-b3*t))
try(nls(my_model, start=start1, data=my_data))
## Error in nls(my_model, start = start1, data = my_data) :
## singular gradient
library(minpack.lm)
out_lm <- nlsLM(my_model, start=start1, data=my_data)
out_lm
## Nonlinear regression model
## model: y ~ b1/(1 + b2 * exp(-b3 * t))
## data: my_data
## b1 b2 b3
## 196.1863 49.0916 0.3136
## residual sum-of-squares: 2.587
##
## Number of iterations to convergence: 17
## Achieved convergence tolerance: 1.49e-08
plot(tdat, ydat, pch=18, xlab='t', ylab='y', col='red')
lines(tdat, out_lm$m$fitted(), pch=1, col='blue', type='b', lty=2)
legend(1, 95, legend=c('data', 'fitted model'), col=c('red', 'blue'), pch=c(18,1))
We see that the Gauss-Newton algorithm fails for this slightly tricky dataset and model because of a singular gradient, but the Levenberg-Marquardt algorithm, which can handle singular gradients, has no problem finding a good fit to the data.
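Another option worth noting: base R ships with self-starting models such as SSlogis, which compute their own starting values; its parameterisation Asym/(1 + exp((xmid - t)/scal)) is a reparameterisation of b1/(1 + b2*exp(-b3*t)) with b1 = Asym, b2 = exp(xmid/scal), and b3 = 1/scal. A sketch is given below (output not shown).
# Sketch: a self-starting logistic model avoids choosing starting values by hand
nls(y ~ SSlogis(t, Asym, xmid, scal), data = my_data)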
Next we define the residual sum of squares explicitly as the objective function and use general-purpose optimisation functions on this nonlinear least squares problem.
f <- function(b, mydata) sum((mydata$y-b[1]/(1+b[2]*exp(-b[3]*mydata$t)))^2)
nlm(f, mydata=my_data, p=start1)
## $minimum
## [1] 2.587305
##
## $estimate
## [1] 196.2832646 49.0967439 0.3135034
##
## $gradient
## [1] 2.383761e-08 -6.476354e-08 3.170175e-05
##
## $code
## [1] 2
##
## $iterations
## [1] 45
optim(par=start1, fn=f, mydata=my_data) # default method is Nelder-Mead
## $par
## b1 b2 b3
## 35.52886 -28.58678 18.73764
##
## $value
## [1] 9205.436
##
## $counts
## function gradient
## 156 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(par=start1, fn=f, mydata=my_data, method="CG")
## $par
## b1 b2 b3
## 39.7789687 5.2889870 0.4890772
##
## $value
## [1] 5336.906
##
## $counts
## function gradient
## 447 101
##
## $convergence
## [1] 1
##
## $message
## NULL
optim(par=start1, fn=f, mydata=my_data, method="BFGS")
## $par
## b1 b2 b3
## 196.5079544 49.1138533 0.3133611
##
## $value
## [1] 2.587543
##
## $counts
## function gradient
## 193 63
##
## $convergence
## [1] 0
##
## $message
## NULL
optim(par=start1, fn=f, mydata=my_data, method="L-BFGS-B")
## $par
## b1 b2 b3
## 196.5325081 49.1219542 0.3133583
##
## $value
## [1] 2.587556
##
## $counts
## function gradient
## 104 104
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
optim(par=start1, fn=f, mydata=my_data, method="SANN", control=list(maxit=30000))
## $par
## b1 b2 b3
## 98.3608388 33.5661121 0.4142424
##
## $value
## [1] 234.2907
##
## $counts
## function gradient
## 30000 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
We see from the results that the only algorithms that succeed in finding a reasonable solution to this problem are the Newton-type methods (nlm, BFGS, and L-BFGS-B). Both Nelder-Mead and conjugate gradient fail to get anywhere close to a reasonable fit. Simulated annealing also fails, despite running for 30,000 iterations on a problem that is only three-dimensional.
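One likely reason the general-purpose methods struggle is that the parameters live on very different scales (b1 near 200, b3 near 0.3). One could try rescaling the problem through optim's parscale control, as sketched below; the scale values are rough guesses rather than tuned settings, and we make no claim about the outcome.
# Sketch: rescale parameters via parscale so each is of order one internally;
# the scale values below are rough guesses, not tuned settings
optim(par = start1, fn = f, mydata = my_data, method = "CG",
      control = list(parscale = c(b1 = 100, b2 = 10, b3 = 0.1), maxit = 1000))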