Numerical Calculus II: Integration

library(tidyverse) # ggplot2, tibble and dplyr are used below

1 Numerical Integration

Integration is an important tool in statistics. It is involved in, for example:

  1. integrating random effects out of a joint distribution to get a likelihood,
  2. evaluating expectations, including posterior expectations in Bayesian inference.

Unfortunately, integration is typically intractable and accurate approximations are often computationally expensive, so much statistical research is devoted to clever ways of avoiding it. When it can’t be avoided, the aim is typically to find ways of approximating an integral of a function with a suitably small number of function evaluations.

Integration w.r.t. 1 or 2 (maybe even up to 4) variables is not too problematic. A typical solution, often called quadrature, involves evaluating the function at a specified set of points and combining the resulting values to obtain an approximation of the integral.

In even more detail, in one dimension the idea is usually to

  • split the domain into small intervals
  • construct interpolating polynomials in each interval using the function evaluations
  • integrate each interpolating polynomial over each interval
  • sum the values together.

At least in principle, one can then approximate higher dimensional integrals by viewing them as iterated one-dimensional integrals. This is only practical for a small number of dimensions as the cost grows exponentially with dimension.

For higher-dimensional problems, which occur frequently in Statistics, we often resort to random algorithms that do not always scale so poorly with dimension.

2 Quadrature

To focus on the main ideas, we will restrict our attention to numerically approximating the definite integral of a function \(f\) over a finite interval \([a,b]\). Extensions to semi-infinite and infinite intervals will be discussed briefly, as will extensions to multiple integrals.

In practice, for one dimensional integrals you can use R’s integrate function. For multiple integrals, the cubature package can be used. Unlike the basics covered here, these functions will produce error estimates and should be more robust.

2.1 Polynomial interpolation

We consider approximation of a continuous function \(f\) on \([a,b]\), i.e. \(f \in C^0([a,b])\) using a polynomial function \(p\). One practical motivation is that polynomials can be integrated exactly.

One way to devise an approximation is to first consider approximating \(f\) using an interpolating polynomial with, say, \(k\) points \(\{(x_i,f(x_i))\}_{i=1}^k\). The interpolating polynomial is unique, has degree at most \(k-1\), and it is convenient to express it as a Lagrange polynomial:

\[p_{k-1}(x) := \sum_{i=1}^k \ell_i(x) f(x_i),\]

where the Lagrange basis polynomials are

\[\ell_i(x) = \prod_{j=1,j\neq i}^k \frac{x-x_j}{x_i-x_j} \qquad i \in \{1,\ldots,k\}.\] For a given 3rd degree polynomial \(f\), we plot polynomial approximations for \(k \in \{2,3,4\}\) and specific choices of \(x_1,\ldots,x_4\).

construct.interpolating.polynomial <- function(f, xs) {
  k <- length(xs)
  fxs <- f(xs)
  p <- function(x) {
    value <- 0
    for (i in 1:k) {
      fi <- fxs[i]
      zs <- xs[setdiff(1:k,i)]
      li <- prod((x-zs)/(xs[i]-zs))
      value <- value + fi*li
    }
    return(value)
  }
  return(p)
}

plot.polynomial.approximation <- function(f, xs, a, b) {
  p <- construct.interpolating.polynomial(f, xs)
  vs <- seq(a, b, length.out=500)
  plot(vs, f(vs), type='l', xlab="x", ylab="black: f(x), red: p(x)")
  points(xs, f(xs), pch=20)
  lines(vs, vapply(vs, p, 0), col="red")
}

a <- -4
b <- 4

f <- function(x) {
  return(-x^3 + 3*x^2 - 4*x + 1)
}

plot.polynomial.approximation(f, c(-2, 2), a, b)

plot.polynomial.approximation(f, c(-2, 0, 2), a, b)

plot.polynomial.approximation(f, c(-2, 0, 2, 4), a, b)

Of course, one might instead want a different representation of the polynomial, e.g. as sums of monomials with appropriate coefficients.

\[p(x) = \sum_{i=1}^k a_i x^{i-1}.\]

This can be accomplished in principle by solving a linear system. In practice, this representation is often avoided for large \(k\) as solving the linear system is numerically unstable.
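As an illustration (my own, reusing the cubic f and the interpolation points from the plots above), the monomial coefficients can be recovered by solving the Vandermonde system, which is perfectly adequate for small \(k\):

xs <- c(-2, 0, 2, 4)
V <- outer(xs, 0:(length(xs)-1), "^") # Vandermonde matrix: V[i,j] = xs[i]^(j-1)
solve(V, f(xs)) # coefficients of 1, x, x^2, x^3; here they recover 1, -4, 3, -1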

An alternative to using many interpolation points to fit a high-degree polynomial is to approximate the function with different polynomials in a number of subintervals of the domain. This results in a piecewise polynomial approximation that is not necessarily continuous.

We consider a simple scheme where \([a,b]\) is partitioned into a number of equal length subintervals, and within each subinterval \(k\) interpolation points are used to define an approximating polynomial. To keep things simple, we consider only two possible ways to specify the \(k\) interpolation points within each subinterval. In both cases the points are evenly spaced, with the first and last points at equal distances from the two endpoints of the subinterval. However, when the scheme is “closed”, the endpoints of the subinterval are included, while when the scheme is “open”, they are not. For a closed scheme with \(k=1\), we opt to include the left endpoint.

# get the endpoints of the subintervals
get.subinterval.points <- function(a, b, nintervals) {
  return(seq(a, b, length.out=nintervals+1))
}

# returns which subinterval a point x is in
get.subinterval <- function(x, a, b, nintervals) {
  h <- (b-a)/nintervals
  return(min(max(1,ceiling((x-a)/h)),nintervals))
}

# get the k interpolation points in the interval;
# this depends on whether the scheme is open or closed
get.within.subinterval.points <- function(a, b, k, closed) {
  if (closed) {
    return(seq(a, b, length.out=k))
  } else {
    h <- (b-a)/(k+1)
    return(seq(a+h, b-h, length.out=k))
  }
}

construct.piecewise.polynomial.approximation <- function(f, a, b, nintervals, k, closed) {
  ps <- vector("list", nintervals)
  subinterval.points <- get.subinterval.points(a, b, nintervals)
  for (i in 1:nintervals) {
    left <- subinterval.points[i]
    right <- subinterval.points[i+1]
    points <- get.within.subinterval.points(left, right, k, closed)
    p <- construct.interpolating.polynomial(f, points)
    ps[[i]] <- p
  }
  p <- function(x) {
    return(ps[[get.subinterval(x, a, b, nintervals)]](x))
  }
  return(p)
}

plot.piecewise.polynomial.approximation <- function(f, a, b, nintervals, k, closed) {
  p <- construct.piecewise.polynomial.approximation(f, a, b, nintervals, k, closed)
  vs <- seq(a, b, length.out=500)
  plot(vs, f(vs), type='l', xlab="x", ylab="black: f(x), red: p(x)")
  lines(vs, vapply(vs, p, 0), col="red")
  subinterval.points <- get.subinterval.points(a, b, nintervals)
  for (i in 1:nintervals) {
    left <- subinterval.points[i]
    right <- subinterval.points[i+1]
    pts <- get.within.subinterval.points(left, right, k, closed)
    points(pts, f(pts), pch=20, col="blue")
  }
  abline(v = subinterval.points)
}
plot.piecewise.polynomial.approximation(sin, 0, 10, 5, 1, TRUE)

plot.piecewise.polynomial.approximation(sin, 0, 10, 5, 2, TRUE)

plot.piecewise.polynomial.approximation(sin, 0, 10, 5, 2, FALSE)

plot.piecewise.polynomial.approximation(sin, 0, 10, 5, 3, TRUE)

plot.piecewise.polynomial.approximation(sin, 0, 10, 20, 1, TRUE)

plot.piecewise.polynomial.approximation(sin, 0, 10, 20, 1, FALSE)

2.2 Polynomial integration

We now consider approximating the integral

\[I(f) := \int_a^b f(x) {\rm d}x,\]

where \(f \in C^0([a,b])\).

All of the approximations we will consider here involve computing integrals associated with the polynomial approximations. The approximations themselves are often referred to as quadrature rules.

2.2.1 Changing the limits of integration

For constants \(a<b\) and \(c<d\), we can accommodate a change of finite interval via

\[\int_a^b f(x) {\rm d}x = \int_c^d g(y) {\rm d}y,\]

by defining

\[g(y) := \frac{b-a}{d-c} f \left (a+\frac{b-a}{d-c}(y-c) \right ).\]

Examples of common intervals are \([-1,1]\) and \([0,1]\).

change.domain <- function(f, a, b, c, d) {
  g <- function(y) {
    return((b-a)/(d-c)*f(a + (b-a)/(d-c)*(y-c)))
  }
  return(g)
}

# test out the function using R's integrate function
integrate(sin, 0, 10)
1.839072 with absolute error < 9.9e-12
g = change.domain(sin, 0, 10, -1, 1)
integrate(g, -1, 1)
1.839072 with absolute error < 9.9e-12

One can also accommodate a semi-infinite interval by a similar change of variables. One example is

\[\int_a^\infty f(x) {\rm d}x = \int_0^1 g(y) {\rm d}y, \qquad g(y) := \frac{1}{(1-y)^2} f \left ( a + \frac{y}{1-y} \right ).\]

Similarly, one can transform an integral over \(\mathbb{R}\) to one over \([-1,1]\)

\[\int_{-\infty}^\infty f(x) {\rm d}x = \int_{-1}^1 g(t) {\rm d}t, \qquad g(t) := \frac{1+t^2}{(1-t^2)^2} f \left ( \frac{t}{1-t^2} \right ).\]
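As a quick sanity check of the first substitution (a toy example of mine, using \(\int_0^\infty e^{-x} {\rm d}x = 1\) and R’s integrate):

f0 <- function(x) exp(-x)
g0 <- function(y) 1/(1-y)^2 * f0(y/(1-y)) # transformed integrand with a = 0
integrate(f0, 0, Inf) # integrate can handle the infinite limit directly
integrate(g0, 0, 1)   # the transformed version over a finite interval; both are close to 1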

We will only consider finite intervals here.

2.2.2 Integrating the interpolating polynomial approximation

Consider integrating a Lagrange polynomial \(p_{k-1}\) over \([a,b]\). We have \[\begin{align} I(p_{k-1}) &= \int_a^b p_{k-1}(x) {\rm d}x \\ &= \int_a^b \sum_{i=1}^k \ell_i(x) f(x_i) {\rm d}x \\ &= \sum_{i=1}^k f(x_i) \int_a^b \ell_i(x) {\rm d}x \\ &= \sum_{i=1}^k w_i f(x_i), \end{align}\] where for \(i \in \{1,\ldots,k\}\), \(w_i := \int_a^b \ell_i(x) {\rm d}x\) and we recall that \(\ell_i(x) = \prod_{j=1,j\neq i}^k \frac{x-x_j}{x_i-x_j}\).

The approximation of \(I(f)\) is \(\hat{I}(f) = I(p_{k-1})\), where \(p_{k-1}\) depends only on the choice of interpolating points \(x_1,\ldots,x_k\).

The functions \(\ell_i\) can be a bit complicated, but certainly can be integrated by hand quite easily for small values of \(k\). For example, with \(k=1\) we have \(\ell_1 \equiv 1\), so the integral \(\int_a^b \ell_1(x) {\rm d}x = b-a\). For \(k=2\) we have \(\ell_1(x) = (x-x_2)/(x_1-x_2)\), yielding \[\int_a^b \ell_1(x) {\rm d}x = \frac{b-a}{2(x_1-x_2)}(b+a-2x_2),\] and similarly \(\ell_2(x) = (x-x_1)/(x_2-x_1)\), so \[\int_a^b \ell_2(x) {\rm d}x = \frac{b-a}{2(x_2-x_1)}(b+a-2x_1).\]
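To make this concrete, here is a small check (my own illustration, reusing f and construct.interpolating.polynomial from above): compute the weights \(w_i\) numerically with integrate and confirm that \(\sum_{i} w_i f(x_i)\) matches integrating the interpolating polynomial directly.

# w_i is the integral over [a,b] of the i-th Lagrange basis polynomial
lagrange.weights <- function(xs, a, b) {
  k <- length(xs)
  vapply(1:k, function(i) {
    li <- function(x) vapply(x, function(xx) prod((xx - xs[-i])/(xs[i] - xs[-i])), 0)
    integrate(li, a, b)$value
  }, 0)
}

xs <- c(-2, 0, 2)
ws <- lagrange.weights(xs, -4, 4)
sum(ws*f(xs)) # the quadrature rule sum_i w_i f(x_i)
p2 <- construct.interpolating.polynomial(f, xs)
integrate(function(x) vapply(x, p2, 0), -4, 4)$value # integral of p_{k-1}; the two agree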

2.3 Newton–Cotes rules

The rectangular rule corresponds to a closed scheme with \(k=1\):

\[\hat{I}_{\rm rectangular}(f) = (b-a) f(a).\]

The midpoint rule corresponds to an open scheme with \(k=1\):

\[\hat{I}_{\rm midpoint}(f) = (b-a) f \left ( \frac{a+b}{2} \right).\]

The trapezoidal rule corresponds to a closed scheme with \(k=2\). Since \((x_1,x_2) = (a,b)\), we obtain \(\int_a^b \ell_1(x) {\rm d}x = (b-a)/2\) and

\[\hat{I}_{\rm trapezoidal}(f) = \frac{b-a}{2} \{ f(a)+f(b) \}.\]

Simpson’s rule corresponds to a closed scheme with \(k=3\). After some calculations, we obtain

\[\hat{I}_{\rm Simpson}(f) = \frac{b-a}{6} \left \{ f(a) + 4 f \left ( \frac{a+b}{2} \right) + f(b) \right \}.\]

For sufficiently smooth functions, we can obtain exact expressions for the integration error of each of these rules, each involving some (unknown) point \(\xi \in (a,b)\).

| rule | \(\hat{I}(f) - I(f)\), with \(\xi \in (a,b)\) |
|---|---|
| rectangular | \(-\frac{1}{2}(b-a)^2 f'(\xi)\) |
| midpoint | \(-\frac{1}{24}(b-a)^3 f^{(2)}(\xi)\) |
| trapezoidal | \(\frac{1}{12}(b-a)^3 f^{(2)}(\xi)\) |
| Simpson | \(\frac{1}{2880}(b-a)^5 f^{(4)}(\xi)\) |

This indicates that the midpoint rule, which uses only one point, is often better than the trapezoidal rule, which uses 2. Both are significantly worse than Simpson’s rule, which uses 3 points. One might think that using large numbers of points is beneficial, but this is not always the case since the interpolating polynomial may become quite poor when using equally spaced points, as seen before. We also see that the rectangular rule is exact for constant functions, the midpoint and trapezoidal rules are exact for linear functions, and Simpson’s rule is exact for polynomials of degree up to 3.

newton.cotes <- function(f, a, b, k, closed) {
  if (k == 1) {
    if (closed) {
      return((b-a)*f(a))
    } else {
      return((b-a)*f((a+b)/2))
    }
  }
  if (k == 2 && closed) {
    return((b-a)/2*(f(a)+f(b)))
  }
  if (k == 3 && closed) {
    return((b-a)/6*(f(a)+4*f((a+b)/2)+f(b)))
  }
  stop("not implemented")
}

nc.example <- function(f, name, value) {
  df <- data.frame(f=character(), rule=character(), error=numeric())
  df <- rbind(df, data.frame(f=name, rule="Rectangular", error=newton.cotes(f,0,1,1,TRUE)-value))
  df <- rbind(df, data.frame(f=name, rule="Midpoint", error=newton.cotes(f,0,1,1,FALSE)-value))
  df <- rbind(df, data.frame(f=name, rule="Trapezoidal", error=newton.cotes(f,0,1,2,TRUE)-value))
  df <- rbind(df, data.frame(f=name, rule="Simpson's", error=newton.cotes(f,0,1,3,TRUE)-value))
  return(df)
}

df <- nc.example(function(x) x, "x", 1/2)
df <- rbind(df, nc.example(function(x) x^2, "x^2", 1/3))
df <- rbind(df, nc.example(function(x) x^3, "x^3", 1/4))
df <- rbind(df, nc.example(function(x) x^4, "x^4", 1/5))

ggplot(df, aes(fill=rule, y=error, x=f)) + geom_bar(position="dodge", stat="identity")

2.3.1 Composite rules

When a composite polynomial interpolation approximation is used, the integral of the approximation is simply the sum of the integrals associated with each subinterval. Hence, a composite Newton–Cotes rule is obtained by splitting the interval \([a,b]\) into \(m\) subintervals and summing the approximate integrals from the Newton–Cotes rule for each subinterval. \[\hat{I}^m_{\rm rule}(f) = \sum_{i=1}^m \hat{I}_{\rm rule}(f_i),\] where \(f_i\) is \(f\) restricted to \([a+(i-1)h, a+ih]\) and \(h=(b-a)/m\).

composite.rule <- function(f, a, b, subintervals, rule) {
  subinterval.points <- get.subinterval.points(a, b, subintervals)
  s <- 0
  for (i in 1:subintervals) {
    left <- subinterval.points[i]
    right <- subinterval.points[i+1]
    s <- s + rule(f, left, right)
  }
  return(s)
}

# composite Newton--Cotes
composite.nc <- function(f, a, b, subintervals, k, closed) {
  rule <- function(f, left, right) {
    newton.cotes(f, left, right, k, closed)
  }
  return(composite.rule(f, a, b, subintervals, rule))
}

We obtain the following composite errors, in which the dependence on \(m\) is of particular interest.

| rule | \(\hat{I}^m(f) - I(f)\), with \(\xi \in (a,b)\) |
|---|---|
| rectangular | \(-\frac{1}{2m}(b-a)^2 f'(\xi)\) |
| midpoint | \(-\frac{1}{24m^2}(b-a)^3 f^{(2)}(\xi)\) |
| trapezoidal | \(\frac{1}{12m^2}(b-a)^3 f^{(2)}(\xi)\) |
| Simpson | \(\frac{1}{2880m^4}(b-a)^5 f^{(4)}(\xi)\) |

We plot the errors against their theoretical values for \(f = \sin\), \((a,b) = (0,10)\) and different numbers of subintervals, \(m\). We replace \(f^{(s)}(\xi)\) in the theoretical expression with \((b-a)^{-1} \int_a^b f^{(s)}(x) {\rm d}x\), which should be accurate for large values of \(m\).

ms <- c(10:19, seq(20, 500, 10))

composite.rectangular <- vapply(ms, function(m) composite.nc(sin, 0, 10, m, 1, TRUE), 0)

composite.midpoints <- vapply(ms, function(m) composite.nc(sin, 0, 10, m, 1, FALSE), 0)

composite.trapezoidal <- vapply(ms, function(m) composite.nc(sin, 0, 10, m, 2, TRUE), 0)

composite.simpsons <- vapply(ms, function(m) composite.nc(sin, 0, 10, m, 3, TRUE), 0)

val <- 1-cos(10)  # exact value of the integral of sin(x) over [0,10]
v1 <- -sin(10)/10 # minus the mean of sin'(x) over [0,10]; the sign comes from the error formula
v2 <- val/10      # mean of sin(x) over [0,10], used for the f'' and f'''' terms

tr <- tibble(m=ms, rule="rectangular", log.error=log(composite.rectangular - val),
             theory=log(v1*1/2*10^2/ms))
tt <- tibble(m=ms, rule="trapezoidal", log.error=log(val - composite.trapezoidal),
             theory=log(v2*1/12*10^3/ms^2))
tm <- tibble(m=ms, rule="midpoints", log.error=log(composite.midpoints - val),
            theory=log(v2*1/24*10^3/ms^2))
ts <- tibble(m=ms, rule="simpsons", log.error=log(composite.simpsons - val),
            theory=log(v2*1/2880*10^5/ms^4))
tib <- bind_rows(tr, tt, tm, ts)
ggplot(tib, aes(x=m, colour=rule)) + geom_point(aes(y=log.error)) + geom_line(aes(y=theory))

2.3.2 Gaussian quadrature

So far we have considered fairly simple choices for the interpolation points. In fact, one can use some nice mathematics to show that choosing a special set of points will improve the approximation accuracy for the polynomial integral approximations, even when the interpolating polynomial is not that close to the true function. See https://awllee.github.io/sc1/integration/quadrature/#gaussian-quadrature for details.

2.4 Worked example: midpoint rule and Gaussian quadrature

In one dimension we might choose to approximate the integral of a continuous function \(f(x)\) by the area under a set of step functions, with the midpoint of each matching \(f\): \[ \int_a^b f(x) dx = \frac{b-a}{N} \sum_{i=1}^N f(a + (i-.5)(b-a)/N) + O(h^2) \] where \(h\) in the final error term is \((b-a)/N\). Let’s see what the error bound means in practice by approximating \(\int_0^1 e^x dx\).

  I.true <- exp(1) - 1
  N <- 10
  I.mp <- sum(exp((1:N-.5)/N))/N
  c(I.mp,I.true,(I.true-I.mp)/I.true)
[1] 1.7175660865 1.7182818285 0.0004165452

A \(0.04\%\) error may be good enough for many purposes, but suppose we need higher accuracy.

  N <- 100
  I.mp <- sum(exp((1:N-.5)/N))/N
  c(I.mp,I.true,(I.true-I.mp)/I.true)
[1] 1.718275e+00 1.718282e+00 4.166655e-06
  N <- 1000
  I.mp <- sum(exp((1:N-.5)/N))/N
  c(I.mp,I.true,(I.true-I.mp)/I.true)
[1] 1.718282e+00 1.718282e+00 4.166667e-08

Exactly as the error bound indicated, each 10 fold increase in the number of function evaluations buys us a 100 fold reduction in the approximation error. Incidentally, notice that in statistical terms this error is not a random error — it is really a bias.

If \(f\) is cheap to evaluate, accuracy is not paramount and we are only interested in 1 dimensional integrals then we might stop there (or more likely with the slightly more accurate but equally straightforward Simpson’s Rule), but life is not so simple. If one is very careful about the choice of \(x\) points at which to evaluate \(f(x)\) and about the weights used in the integral approximating summation, then some very impressive results can be obtained, at least for smooth enough functions, using Gauss(ian) Quadrature. The basic idea is to find a set \(\{x_i, w_i:i=1,\ldots,n\}\) such that if \(w(x)\) is a fixed weight function and \(g(x)\) is any polynomial up to order \(2n-1\) then \[ \int_a^b g(x) w(x)dx = \sum_{i=1}^n w_i g(x_i) \] Note that the \(x_i, w_i\) set depends on \(w(x)\), \(a\) and \(b\), but not on \(g(x)\). The hope is that the r.h.s. will still approximate the l.h.s. very accurately if \(g(x)\) is not a polynomial of the specified type, and there is some impressive error analysis to back this up, for any function with \(2n\) well behaved derivatives (see e.g. Monahan, 2001, section 10.3).

The R package statmod has a nice gauss.quad function for producing the weights, \(w_i\) and nodes, \(x_i\), for some standard \(w(x)\)s. It assumes \(a=-1\), \(b=1\), but we can just linearly re-scale our actual interval to use this. Here is the previous example using Gauss Quadrature (assuming \(w(x)\) is a constant).

library(statmod)
N <- 10
gq <- gauss.quad(N) # get the x_i, w_i
gq # take a look
$nodes
 [1] -0.9739065 -0.8650634 -0.6794096 -0.4333954 -0.1488743  0.1488743
 [7]  0.4333954  0.6794096  0.8650634  0.9739065

$weights
 [1] 0.06667134 0.14945135 0.21908636 0.26926672 0.29552422 0.29552422
 [7] 0.26926672 0.21908636 0.14945135 0.06667134
sum(gq$weights) # for an interval of width 2
[1] 2
I.gq <- sum(gq$weights*exp((gq$nodes+1)/2))/2
c(I.gq, I.true, (I.true-I.gq)/I.true)
[1] 1.718282e+00 1.718282e+00 9.045735e-16

For convenience, we can pack the rescaling logic into a simple function:

gqi <- function(g, N = 10, a = -1, b = 1) {
  gq <- gauss.quad(N)
  w <- gq$weights*(b-a)/2          # rescale the weights to the interval [a,b]
  n <- a + (b-a)*(gq$nodes + 1)/2  # rescale the nodes to the interval [a,b]
  sum(w * g(n))
}

gqi(exp, 10, 0, 1)
[1] 1.718282

Clearly 10 evaluation points was a bit wasteful here, so consider \(N=5\)

I.gq <- gqi(exp, 5, 0, 1)
c(I.gq, I.true, (I.true-I.gq)/I.true)
[1] 1.718282e+00 1.718282e+00 3.803086e-13

even 5 seems slightly wasteful:

I.gq <- gqi(exp, 3, 0, 1)
c(I.gq, I.true, (I.true-I.gq)/I.true)
[1] 1.718281e+00 1.718282e+00 4.795992e-07

still more accurate than the midpoint rule with 100 evaluations. This example should be treated with caution, however. \(e^x\) is a very smooth function and such impressive results will not hold for less smooth cases.

3 Quadrature rules for multidimensional integrals

We can integrate w.r.t. several variables by recursive application of quadrature rules. For example, consider integrating \(f(x,z)\) over a square region, and assume we use basically the same node and weight sequence for both dimensions. We have \[ \int f(x_j,z) dz \approx \sum_i w_i f(x_j,z_i) \] and so \[ \int \int f(x,z) dz dx \approx \sum_j w_j \sum_i w_i f(x_j,z_i) = \sum_j \sum_i w_jw_i f(x_j,z_i) \] Clearly, in principle we can repeat this for as many dimensions as we like. In practice however, to maintain a given level of accuracy, the number of function evaluations will have to rise as \(N^d\), where \(N\) is the number of nodes for each dimension and \(d\) is the number of dimensions. Even if we can get away with a 3 point Gaussian quadrature, we will need about 1.6 million evaluations (\(3^{13}\)) by the time we reach a 13 dimensional integral.

Here is a function which takes the node sequence x and weight sequence w of some quadrature rule and uses them to produce the d dimensional set of nodes, and the corresponding weight vector resulting from recursive application of the 1D rule.

mesh <- function(x, d, w = 1/length(x) + x*0) {
  n <- length(x)
  W <- X <- matrix(0, n^d, d)
  for (i in 1:d) {
    X[,i] <- x; W[,i] <- w           # recycled to length n^d
    x <- rep(x, rep(n, length(x)))   # repeat each element n times for the next column
    w <- rep(w, rep(n, length(w)))
  }
  w <- exp(rowSums(log(W))) ## product across the columns of W gives the weights
  list(X=X, w=w) ## each row of X gives the co-ordinates of a node
}

Here’s how it works.

mesh(c(1,2),3)
$X
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    1    1
[3,]    1    2    1
[4,]    2    2    1
[5,]    1    1    2
[6,]    2    1    2
[7,]    1    2    2
[8,]    2    2    2

$w
[1] 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125

Let’s look at it in action on \(f({\bf x}) = \exp (\sum x_i)\), integrated over \([-1,1]^d\).

fd <- function(x) {exp(rowSums(x))}

First try the midpoint rule over a two dimensional domain

  N <- 10; d <- 2; N^d # number of function evaluations
[1] 100
  I.true <- (exp(1)-exp(-1))^d
  mmp <- mesh((1:N-.5)/N*2-1,d,rep(2/N,N))
  I.mp <- sum(mmp$w*fd(mmp$X))
  c(I.mp,I.true,(I.true-I.mp)/I.true)
[1] 5.506013515 5.524391382 0.003326677

Now over a 5D domain

  N <- 10; d <- 5; N^d # number of function evaluations
[1] 1e+05
  I.true <- (exp(1)-exp(-1))^d
  mmp <- mesh((1:N-.5)/N*2-1,d,rep(2/N,N))
  I.mp <- sum(mmp$w*fd(mmp$X))
  c(I.mp,I.true,(I.true-I.mp)/I.true)
[1] 71.136612879 71.731695754  0.008295954

which illustrates the enormous rate of increase in function evaluations required to achieve a rather disappointing accuracy.

Of course Gauss Quadrature is a better bet

  N <- 4; d <- 2; N^d
[1] 16
  I.true <- (exp(1)-exp(-1))^d
  gq <- gauss.quad(N)
  mgq <- mesh(gq$nodes,d,gq$weights)
  I.gq <- sum(mgq$w*fd(mgq$X))
  c(I.gq,I.true,(I.true-I.gq)/I.true)
[1] 5.524390e+00 5.524391e+00 2.511325e-07

with the increase in dimension being slightly less painful

  N <- 4; d <- 5; N^d
[1] 1024
  I.true <- (exp(1)-exp(-1))^d
  gq <- gauss.quad(N)
  mgq <- mesh(gq$nodes,d,gq$weights)
  I.gq <- sum(mgq$w*fd(mgq$X))
  c(I.gq,I.true,(I.true-I.gq)/I.true)
[1] 7.173165e+01 7.173170e+01 6.278311e-07

even so, by the time we reach 20 dimensions we would be requiring \(10^{12}\) function evaluations, and most functions we might want to integrate will not be so smooth and well behaved that 4 quadrature points give us anything like the sort of accuracy seen in this example.

4 Approximating the integrand

Beyond a few dimensions, quadrature rules really run out of steam. In statistical work, where we are often interested in integrating probability density functions, or related quantities, it is sometimes possible to approximate the integrand by a function with a known integral. Here, just one approach will be illustrated: Laplace approximation. The typical application of the method is when we have a joint density \(f({\bf y},{\bf b})\) and want to evaluate the marginal p.d.f. of \(\bf y\), \[ f_y({\bf y}) = \int f({\bf y},{\bf b}) d{\bf b}. \] For a given \(\bf y\) let \(\hat {\bf b}_y\) be the value of \(\bf b\) maximising \(f\). Then Taylor’s theorem implies that \[ \log \{f({\bf y},{\bf b}) \} \simeq \log \{f({\bf y},\hat {\bf b}_y) \} - \frac{1}{2} ({\bf b } - \hat {\bf b}_y)^{\sf T}{\bf H} ({\bf b } - \hat {\bf b}_y) \] where \(\bf H\) is the Hessian of \(-\log (f)\) w.r.t. \(\bf b\) evaluated at \({\bf y}, \hat {\bf b}_y\). i.e. \[ f({\bf y},{\bf b}) \simeq f({\bf y},\hat {\bf b}_y) \exp \left \{ - \frac{1}{2} ({\bf b } - \hat {\bf b}_y)^{\sf T}{\bf H} ({\bf b } - \hat {\bf b}_y) \right \} \] and so \[ f_y({\bf y}) \simeq f({\bf y},\hat {\bf b}_y) \int \exp \left \{ - \frac{1}{2} ({\bf b } - \hat {\bf b}_y)^{\sf T}{\bf H} ({\bf b } - \hat {\bf b}_y) \right \} d{\bf b}. \] From the properties of normal p.d.f.s we have \[ \int \frac{|{\bf H}|^{1/2}}{(2 \pi)^{d/2}} e^{ - \frac{1}{2} ({\bf b } - \hat {\bf b}_y)^{\sf T}{\bf H} ({\bf b } - \hat {\bf b}_y)} d{\bf b} =1 \Rightarrow \int e^{ - \frac{1}{2} ({\bf b } - \hat {\bf b}_y)^{\sf T}{\bf H} ({\bf b } - \hat {\bf b}_y) } d{\bf b} = \frac{(2 \pi)^{d/2}}{|{\bf H}|^{1/2}} \] so \[ f_y({\bf y}) \simeq f({\bf y},\hat {\bf b}_y) \frac{(2 \pi)^{d/2}}{|{\bf H}|^{1/2}}. \] Notice that the basic ingredients of this approximation are the Hessian of the log integrand w.r.t \({\bf b}\) and the \(\hat {\bf b}_y\) values. Calculating the latter is an optimization problem, which can usually be solved by a Newton type method, given that we can presumably obtain gradients as well as the Hessian. The highest cost here is the Hessian calculation, which scales with the square of the dimension, at worst, or for cheap integrands the determinant evaluation, which scales as the cube of the dimension. i.e. when it works, this approach is much cheaper than brute force quadrature.
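As a small one-dimensional illustration (a toy example, not from the original notes), take the integrand \(\exp(-x^2/2 - x^4/4)\): the Laplace approximation to its integral over \(\mathbb{R}\) needs only the mode and the second derivative of the negative log integrand there.

nl <- function(x) x^2/2 + x^4/4 # negative log of the integrand
opt <- optim(1, nl, method="BFGS", hessian=TRUE) # mode and (numerical) Hessian
d <- 1
I.laplace <- exp(-opt$value)*sqrt((2*pi)^d/det(opt$hessian))
I.exact <- integrate(function(x) exp(-nl(x)), -Inf, Inf)$value
c(I.laplace, I.exact)

The two values differ noticeably here, because the integrand is not close to a Gaussian away from its mode, but the cost of the approximation is just an optimisation plus a determinant.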

5 Monte Carlo integration

Consider integrating \(f({\bf x})\) over some region \(\Omega\) of volume \(V(\Omega)\). Clearly \[ I_\Omega = \int_\Omega f({\bf x}) d{\bf x} = \mathbb{E} \{ f({\bf X})\} V(\Omega) \] where \({\bf X}\) is uniformly distributed over \(\Omega\). Generating \(N\) uniform random vectors, \({\bf x}_i\), over \(\Omega\), we have the obvious estimator \[ \hat I_\Omega = \frac{V(\Omega)}{N} \sum_{i=1}^N f({\bf x}_i), \] which defines basic Monte Carlo Integration. Note that in practice, if \(\Omega\) is a complicated shape, it may be preferable to generate uniformly over a simpler enclosing region \(\tilde \Omega\), and just set the integrand to zero for points outside \(\Omega\) itself. Clearly \(\hat I_\Omega\) is unbiased, and if the \({\bf x}_i\) are independent then \[ {\rm var}(\hat I_\Omega) = \frac{V(\Omega)^2}{N^2} N {\rm var}\{ f({\bf X})\} \] i.e. the variance scales as \(N^{-1}\), and the standard deviation scales as \(N^{-1/2}\) (it’s \(O_p(N^{-1/2})\)).
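Here is a minimal sketch (mine, not from the notes) of basic Monte Carlo integration for the earlier example \(\int_0^1 e^x dx\), together with an estimated standard error:

set.seed(1)
N <- 10000
x <- runif(N)          # uniform draws over Omega = [0,1], so V(Omega) = 1
fx <- exp(x)
I.mc <- mean(fx)       # Monte Carlo estimate
se <- sd(fx)/sqrt(N)   # estimated standard error, O(N^{-1/2})
c(I.mc, se, exp(1)-1)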

This is not brilliantly fast convergence to \(I_\Omega\), but compare it to quadrature. If we take the midpoint rule then the (bias) error is \(O(h^2)\) where \(h\) is the node spacing. In \(d\) dimensions then at best we can arrange \(h \propto N^{-1/d}\), implying that the bias is \(O(N^{-2/d})\). i.e. once we get above \(d=4\) we would prefer the stochastic convergence rate of Monte-Carlo Integration to the deterministic convergence rate of the midpoint rule.

Of course using Simpson’s rule would buy us a few more dimensions, and for smooth functions the spectacular convergence rates of Gauss quadrature may keep it as the best option up to a rather higher \(d\), but the fact remains that its bias is still dependent on \(d\), while the sampling error of the Monte Carlo method is dimension independent, so Monte Carlo will win in the end.

6 Stratified Monte Carlo Integration

One tweak on basic Monte-Carlo is to divide (a region containing) \(\Omega\) into \(N\) equal volumed sub-regions, \(\Omega_i\), and to estimate the integral within each \(\Omega_i\) from a single \({\bf X}_i\) drawn from a uniform distribution over \(\Omega_i\). The estimator has the same form as before, but its variance is now \[ {\rm var}(\tilde I_\Omega) = \frac{V(\Omega)^2}{N^2} \sum_{i=1}^N {\rm var}\{f({\bf X}_i)\} \] which is usually lower than previously, because the range of variation of \(f\) within each \(\Omega_i\) is lower than within the whole of \(\Omega\).
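A small sketch (my own) comparing the spread of the plain and stratified estimators of \(\int_0^1 e^x dx\), using one uniform draw in each of \(N\) equal-width subintervals:

set.seed(1)
N <- 100
plain.est <- function() mean(exp(runif(N)))
strat.est <- function() mean(exp(((1:N) - runif(N))/N)) # one draw per subinterval of width 1/N
sd(replicate(1000, plain.est())) # spread of the plain Monte Carlo estimator
sd(replicate(1000, strat.est())) # spread of the stratified estimator: much smaller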

The underlying idea of trying to obtain the nice properties of uniform random variables, but with more even spacing over the domain of integration, is pushed to the limit by Quasi Monte Carlo integration, which uses deterministic space filling sequences of numbers as the basis for integration. For fixed \(N\), the bias of such schemes grows much less quickly with dimension than the quadrature rules.
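For a flavour of the quasi Monte Carlo idea, here is a sketch (mine) using a hand-rolled van der Corput sequence in base 2; dedicated packages (e.g. randtoolbox) provide proper multidimensional low-discrepancy sequences.

# radical-inverse (van der Corput) sequence: deterministic, evenly spread over (0,1)
van.der.corput <- function(n, base=2) {
  vapply(1:n, function(i) {
    v <- 0; f <- 1/base
    while (i > 0) {
      v <- v + f*(i %% base)
      i <- i %/% base
      f <- f/base
    }
    v
  }, 0)
}

u <- van.der.corput(1000)
mean(exp(u)) # quasi Monte Carlo estimate of the integral of e^x over [0,1]
exp(1) - 1   # exact value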

7 Importance sampling

In statistics an obvious variant on Monte-Carlo integration occurs when the integrand factors into the product of a p.d.f. \(f({\bf x})\) and another term, in which case \[ \int \phi ({\bf x}) f({\bf x}) d{\bf x} = \mathbb{E}_f[\phi({\bf X})] \approx \frac{1}{n} \sum_{i=1}^n \phi({\bf x}_i) \] where the \({\bf x}_i\) are observations on r.v.s with p.d.f. \(f({\bf x})\). A difficulty with this approach, as with simple MC integration, is that a high proportion of the \({\bf x}_i\)s often fall in places where \(\phi({\bf x}_i)\) makes no, or negligible, contribution to the integral. For example, in Bayesian modelling, it is quite possible for \(f({\bf y}|\boldsymbol{\theta})\) to have non-vanishing support over only a tiny volume of the parameter space over which \(f(\boldsymbol{\theta})\) has non-vanishing support.

One approach to this problem is importance sampling. Rather than generate the \({\bf x}_i\) from \(f({\bf x})\) we generate from a distribution with the same support as \(f\), but with a p.d.f., \(g({\bf x})\), which is designed to generate points concentrated in the places where the integrand is non-negligible. Then the estimator is \[ \int \phi ({\bf x}) f({\bf x}) d{\bf x} =\int \phi ({\bf x}) \frac{f({\bf x})}{g({\bf x})}g({\bf x}) d{\bf x} \approx \hat I_{is} = \frac{1}{n} \sum_{i=1}^n \phi({\bf x}_i) \frac{f({\bf x}_i)}{g({\bf x}_i)}. \] The expected value of the r.h.s. of the above is its l.h.s., so this estimator is unbiased.

What would make a good \(g\)? Suppose that \(g \propto \phi f\). Then \(g({\bf x}) = \phi({\bf x}) f({\bf x})/k\), where \(k = \int \phi({\bf x}) f({\bf x}) d{\bf x}\). So \(\hat I_{is} = nk/n = \int \phi({\bf x}) f({\bf x}) d{\bf x}\), and the estimator has zero variance. Clearly this is impractical: if we knew \(k\) then we would not need to estimate it. But there are useful insights here anyway. Firstly, the closer \(\phi f/g\) is to a constant, the lower the variance of \(\hat I_{is}\). Secondly, the idea that the density of evaluation points should reflect the contribution of the integrand in the local neighbourhood is quite generally applicable.
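A standard toy illustration (not from the notes): estimate the tail probability \(P(X>4)\) for \(X \sim N(0,1)\), so that \(\phi(x)\) is the indicator of \(x>4\) and \(f\) is the standard normal density. Naive sampling from \(f\) wastes almost every draw, while a proposal \(g\) centred on the region that matters does far better.

set.seed(1)
n <- 1e5
x <- rnorm(n)                    # draws from f = N(0,1)
mean(x > 4)                      # naive estimate: almost every term is zero
xg <- rnorm(n, mean=4)           # draws from the proposal g = N(4,1)
w <- dnorm(xg)/dnorm(xg, mean=4) # importance weights f(x)/g(x)
mean((xg > 4)*w)                 # importance sampling estimate
pnorm(4, lower.tail=FALSE)       # exact value, about 3.2e-5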

8 Laplace importance sampling

We are left with the question of how to choose \(g\). When the integrand is some sort of probability density function, a reasonable approach is to approximate it by the normal density implied by the Laplace approximation. That is, we take \(g\) to be the p.d.f. of the multivariate normal \(N(\hat {\bf x},{\bf H}^{-1})\), where \(\hat {\bf x}\) denotes the value of \(\bf x\) maximising \(\phi f\), while \(\bf H\) is the Hessian of \(-\log(\phi f)\) w.r.t. \(\bf x\), evaluated at \(\hat {\bf x}\).

In fact, as we have seen previously, multivariate normal random vectors are generated by transformation of standard normal random vectors \({\bf z} \sim N({\bf 0},{\bf I})\). i.e. \({\bf x} = \hat {\bf x} + {\bf R}^{-1}{\bf z}\) where \({\bf R}\) is a Cholesky factor of \(\bf H\) so that \({\bf R}^{\sf T}{\bf R} = {\bf H}\). There is then some computational saving in using a change of variable argument to re-express the importance sampling estimator in terms of the density of \(\bf z\), rather than \(g\) itself. Start with \[ \int \phi ({\bf x}) f({\bf x}) d{\bf x} = \int \phi(\hat{\bf x}+{\bf R}^{-1}{\bf z})f(\hat{\bf x}+{\bf R}^{-1}{\bf z})|{\bf R}^{-1}|d{\bf z} = \frac{1}{|{\bf R}|} \int \phi(\hat{\bf x}+{\bf R}^{-1}{\bf z})\frac{f(\hat{\bf x}+{\bf R}^{-1}{\bf z})}{h({\bf z})} h({\bf z})d{\bf z}, \] where \(h({\bf z}) = (2\pi)^{-d/2}\exp\{-{\bf z}^{\sf T}{\bf z}/2\}\) is the standard normal density for \({\bf z}\), giving \[ \int \phi ({\bf x}) f({\bf x}) d{\bf x} = \frac{1}{|{\bf R}|}\mathbb{E}_z\left[ \phi(\hat{\bf x}+{\bf R}^{-1}{\bf z})\frac{f(\hat{\bf x}+{\bf R}^{-1}{\bf z})}{h({\bf z})} \right] = \frac{(2\pi)^{d/2}}{|{\bf R}|}\mathbb{E}_z\left[ \phi(\hat{\bf x}+{\bf R}^{-1}{\bf z}) f(\hat{\bf x}+{\bf R}^{-1}{\bf z})\exp\{{\bf z}^{\sf T}{\bf z}/2\}\right], \] and leading in turn to the Laplace importance sampling estimator \[ \int \phi ({\bf x}) f({\bf x}) d{\bf x} \approx \frac{(2 \pi)^{d/2}}{n|{\bf R}|} \sum_{i=1}^n \phi (\hat {\bf x} + {\bf R}^{-1} {\bf z}_i) f (\hat {\bf x} + {\bf R}^{-1} {\bf z}_i) e^{{\bf z}_i^{\sf T}{\bf z}_i/2}. \]

Some tweaks may be required.

  1. It might be desirable to make \(g\) a little more diffuse than the Laplace approximation suggests, to avoid excessive importance weights being assigned to \({\bf z}_i\) values in the tails of the target distribution.

  2. If the integration is in order to obtain a likelihood, which is to be maximised, then it is usually best to condition on a set of \({\bf z}_i\) values, in order to avoid random variability in the approximate likelihood, which will make optimization difficult.

  3. Usually the log of the integrand is evaluated rather than the integrand itself, and it is the log of the integral that is required. To avoid numerical under/overflow problems note that \[ \log \left \{\sum_i w_i \exp(\psi_i) \right \} = \log \left [\sum_i w_i \exp\{\psi_i - \max(\psi_i)\} \right ] + \max(\psi_i) \] the r.h.s. will not overflow or underflow when computed in finite precision arithmetic (for reasonable \(w_i\)).

  4. As with simple Monte-Carlo integration, some variance reduction is achievable by stratification. The \({\bf z}_i\) can be obtained by suitable transformation of random vectors from a uniform distribution on \((0,1)^d\). Instead, \((0,1)^d\) can be partitioned into equal-volume parts, with one uniform vector generated in each part.

Tweak 3. is an example of a general technique for avoiding numerical underflow and overflow which has many applications in statistical computing. It is therefore worth examining more carefully.

9 The log–sum–exp trick

The trick used in point 3 at the end of the last section deserves a little more attention. In statistical computing, we rarely want to work with raw probabilities and likelihoods, but instead work with their logarithms. Log-likelihoods are much less likely to numerically underflow or overflow than raw likelihoods. For likelihood computations involving iid observations, this is often convenient, since a product of raw likelihood terms becomes a sum of log-likelihood terms. However, sometimes we have to evaluate the sum of raw probabilities or likelihoods. The obvious way to do this would be to exponentiate the logarithms, evaluate the sum, and then take the log of the result to get back to our log scale. However, the whole point of working on the log scale is to avoid numerical underflow and overflow, so we know that exponentiating raw likelihoods is a dangerous operation which is likely to fail in many problems. Can we do the log–sum–exp operation safely? Yes, we can.

Suppose that we have log (unnormalised) weights (typically probabilities or likelihoods), \(l_i = \log w_i\), but we want to compute the (log of the) sum of the raw weights. We can obviously exponentiate the log weights and sum them, but this carries the risk that all of the log weights will underflow (or, possibly, overflow). So, we want to compute \[ L = \log \sum_i \exp(l_i), \] without the risk of underflow. We can do this by first computing \(m = \max_i l_i\) and then noting that \[ L = \log \sum_i \exp(l_i) = \log\sum_i \exp(m)\exp(l_i-m) = \log \left[\exp(m)\sum_i \exp(l_i-m)\right] = m + \log\sum_i\exp(l_i-m). \] We are now safe, since \(l_i-m\) cannot be bigger than zero, and hence no term can overflow. Also, since we know that \(l_i-m = 0\) for some \(i\), we know that at least one term will not underflow to zero. So we are guaranteed to get a sensible finite answer in a numerically stable way. Note that exactly the same technique works for computing a sample mean (or other weighted sum) as opposed to just a simple sum (as above).
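In R this is a two-line helper (a sketch; packages such as matrixStats provide a ready-made logSumExp):

log.sum.exp <- function(l) {
  m <- max(l)
  m + log(sum(exp(l - m)))
}

l <- c(-1000, -1001, -1002) # exp() of these all underflow to zero
log(sum(exp(l)))            # naive computation gives -Inf
log.sum.exp(l)              # finite and correct, about -999.59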

10 An example

Consider a Generalised Linear Mixed Model: \[ y_i|{\bf b} \sim {\rm Poi}(\mu_i), ~~~ \log(\mu_i) = {\bf X}_i \boldsymbol{\beta}+ {\bf Z}_i {\bf b},~~~ {\bf b} \sim N({\bf 0}, \boldsymbol{\Lambda}^{-1}) \] where \(\boldsymbol{\Lambda}\) is diagonal and depends on a small number of parameters. Here, \(\boldsymbol{\beta}\) are the fixed effects and \({\bf b}\) are the random effects, with covariate matrices \({\bf X}\) and \({\bf Z}\), respectively. From the model specification it is straightforward to write down the joint probability function of \(\bf y\) and \(\bf b\), based on the factorisation \(f({\bf y}, {\bf b}) = f_{y|b}({\bf y}|{\bf b})f_b({\bf b})\). i.e. \[\begin{align*} \log\{f({\bf y}, {\bf b})\} &= \sum_i \{ y_i ({\bf X}_i \boldsymbol{\beta}+ {\bf Z}_i {\bf b}) - \exp ({\bf X}_i \boldsymbol{\beta}+ {\bf Z}_i {\bf b}) - \log(y_i!) \} - \\ & \qquad \frac{1}{2} \sum_j \lambda_j b^2_j + \frac{1}{2} \sum_j \log(\lambda_j) - \frac{1}{2} \sum_j \log(2\pi) \end{align*}\] but likelihood based inference actually requires that we can evaluate the marginal p.f. \(f_y({\bf y})\) for any values of the model parameters. Now \[ f_y({\bf y}) = \int f({\bf y}, {\bf b}) d{\bf b} \] and evaluation of this requires numerical integration.

To tie things down further, suppose that the fixed effects part of the model is simply \(\beta_0 + \beta_1 x\) where \(x\) is a continuous covariate, while the random effects part is given by two factor variables, \(z_1\) and \(z_2\), each with two levels. So a typical row of \({\bf Z}\) is a vector of 4 zeros and ones, in some order. Suppose further that the first two elements of \(\bf b\) relate to \(z_1\) and the remainder to \(z_2\) and that \({\rm diag}(\boldsymbol{\Lambda}) = [\sigma^{-2}_1,\sigma^{-2}_1,\sigma^{-2}_2,\sigma^{-2}_2]\). The following simulates 50 observations from such a model.

set.seed(1);n <- 50
x <- runif(n)
z1 <- sample(c(0,1),n,replace=TRUE)
z2 <- sample(c(0,1),n,replace=TRUE)
b <- rnorm(4) ## simulated random effects
X <- model.matrix(~x)
Z <- cbind(z1,1-z1,z2,1-z2)
eta <- X%*%c(0,3) + Z%*%b
mu <- exp(eta)
y <- rpois(n, mu)

Here is a function evaluating \(\log \{ f({\bf y},{\bf b}) \}\) (log is used here to avoid undesirable underflow to zero)

lf.yb0 <- function(y, b, theta, X, Z, lambda) {
  beta <- theta[1:ncol(X)]
  theta <- theta[-(1:ncol(X))]
  eta <- X%*%beta + Z%*%b
  mu <- exp(eta)
  lam <- lambda(theta, ncol(Z))
  sum(y*eta - mu - lfactorial(y)) - sum(lam*b^2)/2 +
      sum(log(lam))/2 - ncol(Z)*log(2*pi)/2
}

theta contains the fixed effect parameters \(\boldsymbol{\beta}\) and the parameters of \(\boldsymbol{\Lambda}\). So in this example, theta is of length 4, with the first 2 elements corresponding to \(\boldsymbol{\beta}\) and the last two being \(\sigma_1^{-2}, \sigma_{2}^{-2}\). lambda is the name of a function for turning the parameters of \(\boldsymbol{\Lambda}\) into its leading diagonal elements, for example

var.func <- function(theta, nb) c(theta[1], theta[1], theta[2], theta[2])

Actually, for integration purposes, lf.yb0 can usefully be re-written to accept a matrix b where each column is a different \(\bf b\) at which to evaluate the log p.d.f. Suppose lf.yb is such a function. Let’s try evaluating \(f_y({\bf y})\) at the simulated data, and the theta vector used for simulation, using the various integration methods covered above.
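Here is one possible vectorised version (a sketch written to agree with lf.yb0 column by column; the original notes presumably use something equivalent):

lf.yb <- function(y, b, theta, X, Z, lambda) {
  ## b is a matrix: each column is a random effect vector at which to evaluate log f(y,b)
  beta <- theta[1:ncol(X)]
  theta <- theta[-(1:ncol(X))]
  eta <- Z%*%b + as.vector(X%*%beta) # n x ncol(b) matrix of linear predictors
  lam <- lambda(theta, ncol(Z))
  colSums(y*eta - exp(eta) - lfactorial(y)) - colSums(lam*b^2)/2 +
    sum(log(lam))/2 - ncol(Z)*log(2*pi)/2
}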

10.1 Quadrature rules

First try brute force application of the midpoint rule. We immediately have the slight problem of the integration domain being infinite, but since the integrand will vanish at some distance from \({\bf b} = {\bf 0}\), we can probably safely restrict attention to the domain \([-5,5]^4\).

  theta <- c(0,3,1,1)
  nm <- 10;m.r <- 10
  bm <- mesh((1:nm-(nm+1)/2)*m.r/nm,4,rep(m.r/nm,nm))
  lf <- lf.yb(y,t(bm$X),theta,X,Z,var.func)
  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
[1] -130.7355

Of course, we don’t actually know how accurate this is, but we can get some idea by doubling nm, in which case

  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
[1] -117.9508

Clearly we are not doing all that well, even with 10000-160000 function evaluations.

On very smooth functions Gauss quadrature rules were a much better bet, so let’s try a naive application of these again.

  nm <- 10
  gq <- gauss.quad(nm)
  w <- gq$weights*m.r/2 ## rescale to domain
  x <- gq$nodes*m.r/2
  bm <- mesh(x,4,w)
  lf <- lf.yb(y,t(bm$X),theta,X,Z,var.func)
  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
[1] -129.8

— this is a long way from the midpoint estimate: what happens if we double nm to 20?

  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
[1] -130.6644

Increasing further still indicates problems. The difficulty is that the integrand is not very smooth, in the way that \(e^x\) was, and it is not well approximated by a tensor product of high order polynomials. The midpoint rule actually works better here. In fact this integral could be performed reasonably accurately by skilful use of Gauss quadrature, but the example illustrates that naive application can be much worse than doing something simpler.

10.2 Laplace approximation

To use the Laplace approximation we need to supplement lf.yb with functions glf.yb and hlf.yb which evaluate the gradient vector and Hessian of \(\log(f)\) w.r.t. \(\bf b\); possible implementations are sketched below.
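Differentiating the log joint density written down earlier gives gradient \({\bf Z}^{\sf T}({\bf y}-\boldsymbol{\mu}) - \boldsymbol{\Lambda}{\bf b}\) and Hessian \(-{\bf Z}^{\sf T}{\rm diag}(\boldsymbol{\mu}){\bf Z} - \boldsymbol{\Lambda}\), where \(\mu_i = \exp({\bf X}_i \boldsymbol{\beta} + {\bf Z}_i {\bf b})\). The following are sketches of glf.yb and hlf.yb based on those expressions (my own reconstructions; the original notes presumably define equivalent functions):

glf.yb <- function(y, b, theta, X, Z, lambda) {
  beta <- theta[1:ncol(X)]
  theta <- theta[-(1:ncol(X))]
  mu <- as.vector(exp(X%*%beta + Z%*%b))
  lam <- lambda(theta, ncol(Z))
  as.vector(t(Z)%*%(y - mu)) - lam*b # gradient of log f(y,b) w.r.t. b
}

hlf.yb <- function(y, b, theta, X, Z, lambda) {
  beta <- theta[1:ncol(X)]
  theta <- theta[-(1:ncol(X))]
  mu <- as.vector(exp(X%*%beta + Z%*%b))
  lam <- lambda(theta, ncol(Z))
  -t(Z)%*%(Z*mu) - diag(lam)         # Hessian of log f(y,b) w.r.t. b
}

Given these functions we first find \(\hat {\bf b}\) by Newton’s method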

b <- rep(0,4)
while(TRUE) {
  b0 <- b
  g <- glf.yb(y, b, theta, X, Z, var.func)
  H <- hlf.yb(y, b, theta, X, Z, var.func)
  b <- b - solve(H, g)
  b <- as.vector(b)
  if (sum(abs(b-b0)) < 1e-5*sum(abs(b)) + 1e-4) break
}

after which b contains \(\hat {\bf b}\) and H the negative of the required \(\bf H\). It remains to evaluate the approximation

lf.yb0(y,b,theta,X,Z,var.func) + 2*log(2*pi) -
    0.5*determinant(H,logarithm=TRUE)$modulus
[1] -112.9528
attr(,"logarithm")
[1] TRUE

This is reasonably close to the truth, but the method itself provides us with no way of knowing how close.

10.3 Monte Carlo

Simple brute force Monte Carlo Integration follows

  N <- 100000
  bm <- matrix((runif(N*4)-.5)*m.r,4,N)
  lf <- lf.yb(y, bm, theta, X, Z, var.func)
  log(sum(exp(lf-max(lf)))) + max(lf) - log(N) + log(m.r^4)
[1] -113.7247

The next replicate gave -119.3, and the standard deviation over 100 replicates is about 2. That degree of variability is with \(10^5\) function evaluations. Stratification improves things a bit, but let’s move on.

10.4 Slightly better Monte Carlo

The required marginal likelihood can obviously be written \[ f_y({\bf y}) = \int f({\bf y}, {\bf b}) d{\bf b} = \int f_{y|b}({\bf y}|{\bf b})f_b({\bf b}) d{\bf b} = \mathbb{E}_b\left[f_{y|b}({\bf y}|{\bf b})\right] \] – the expectation of \(f_{y|b}\) wrt the random effects distribution. Draws from the latter must be better placed than the uniform draws used above, so this should give us a somewhat improved Monte Carlo estimator.

lf.ygb <- function(y, b, theta, X, Z, lambda) {
  beta <- theta[1:ncol(X)]
  eta <- Z%*%b + as.vector(X%*%beta)
  colSums(y*eta - exp(eta) - lfactorial(y))
}
N <- 100000
lam <- var.func(theta[3:4])
## lambda holds precisions sigma^{-2}, so the random effect standard deviation is 1/sqrt(lambda)
bm <- matrix(rnorm(4*N, 0, 1/sqrt(lam)), nrow=4)
llv <- lf.ygb(y, bm, theta, X, Z, var.func)
max(llv) + log(mean(exp(llv - max(llv))))
[1] -115.2339

The standard deviation of this scheme over 100 replicates is about 0.15. This is a modest improvement, but most of the bm samples are still wasted on regions where the integrand is negligible: exactly the situation that importance sampling addresses.

10.5 Laplace Importance Sampling

Using Laplace importance sampling allows the construction of an efficient unbiased estimate of the marginal likelihood.

N <- 10000
R <- chol(-H)
z <- matrix(rnorm(N*4),4,N)
bm <- b + backsolve(R,z)
lf <- lf.yb(y,bm,theta,X,Z,var.func)
zz.2 <- colSums(z*z)/2
lfz <- lf + zz.2
log(sum(exp(lfz - max(lfz)))) + max(lfz) + 2 * log(2*pi) -
    sum(log(diag(R))) - log(N)
[1] -115.2595

Over 100 replicates this method has a standard deviation of just under 0.001, and that is with 10 times fewer samples than the previous Monte-Carlo attempts. It drops to 0.0003 if 100000 samples are used. Stratification can reduce this variability still further. Be aware that while the estimator itself is unbiased, its logarithm, as computed here, is not, of course.

10.6 Conclusions

The example emphasises that integration is computationally expensive, and that multi-dimensional integration is best approached by approximating the integrand with something analytically integrable or by constructing unbiased estimators of the integral. The insight that variance is likely to be minimised by concentrating the ‘design points’ of the estimator on where the integrand has substantial magnitude is strongly supported by this example, and leads rather naturally, beyond the scope of this course, to MCMC methods. Note that in likelihood maximisation settings you might also consider avoiding intractable integration via the EM algorithm (see e.g. Lange, 2010 or Monahan, 2001).