SC1

Statistical Computing 1

R apply family, sweep, map, reduce, filter


R apply family

The apply family of functions consists of apply(), lapply(), sapply(), and mapply(). The basic idea is to apply the same function repeatedly across the elements of a list or along a chosen dimension of an array or data frame.

apply

We start with apply(). Its syntax is apply(X, MARGIN, FUN, ...), where

  • X is an array of dimension \(\ge 2\), including a 2-d matrix.
  • MARGIN is a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows (i.e. output has length equal to number of rows), 2 indicates columns (i.e. output has length equal to number of columns), c(1, 2) indicates rows and columns (i.e. output remains a matrix of the same dimension).
  • FUN is the function to be applied, and ... holds any further arguments that are passed on to FUN.

As an example, we construct a matrix of all 1’s and sum across rows and columns:

x <- matrix(1, nrow=3, ncol=4)
apply(x, 1, sum)
## [1] 4 4 4
apply(x, 2, sum)
## [1] 3 3 3 3
apply(x, c(1,2), sum)
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    1    1    1    1
## [3,]    1    1    1    1
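
Because every entry of the all-ones matrix is identical, it can be hard to see which margin is which. A small sketch with distinct entries (the matrix m is a hypothetical example, not taken from the notes above) makes the difference visible:

```r
# Hypothetical 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # one value per row:    9 12
col_sums <- apply(m, 2, sum)  # one value per column: 3 7 11
```

For these particular summaries, rowSums(m) and colSums(m) give the same values.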

The function can be defined by the user. For example,

apply(x, 1, function(x) sum(exp(x)))
## [1] 10.87313 10.87313 10.87313
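
When the supplied function returns more than one value, the shape of the result can be surprising. The following sketch, which reuses a hypothetical 2 x 3 matrix m, suggests how apply arranges such results and how extra arguments reach FUN:

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

# range() returns two values (min, max) for each row
r <- apply(m, 1, range)
# Each *column* of r holds the result for one row of m:
# r[, 1] is c(1, 5) and r[, 2] is c(2, 6)

# Arguments given after FUN are passed on to FUN
q <- apply(m, 1, quantile, probs = 0.5)   # row medians: 3 4
```

If one result per row is wanted in each row of the output, t(r) transposes it back.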

The input can also be a higher-dimensional array:

apply(array(1, dim=c(2,3,4)), c(1,2), sum)
##      [,1] [,2] [,3]
## [1,]    4    4    4
## [2,]    4    4    4
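
As a further illustration, with distinct entries one can check by hand which dimensions are kept and which are summed over. This is only a sketch; the array arr is a hypothetical example:

```r
arr <- array(1:24, dim = c(2, 3, 4))

# Keep the third dimension: one sum per 2 x 3 slice
slice_sums <- apply(arr, 3, sum)        # 21 57 93 129

# Keep dimensions 1 and 2: sum over the third dimension
cell_sums <- apply(arr, c(1, 2), sum)   # a 2 x 3 matrix
first_cell <- cell_sums[1, 1]           # 1 + 7 + 13 + 19 = 40
```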

lapply

In contrast to apply, which operates on data with two or more dimensions, lapply applies a function to each element of a list or vector and returns a list. Its basic usage is lapply(X, FUN, ...). For example,

lapply(rep(1,5), exp)
## [[1]]
## [1] 2.718282
## 
## [[2]]
## [1] 2.718282
## 
## [[3]]
## [1] 2.718282
## 
## [[4]]
## [1] 2.718282
## 
## [[5]]
## [1] 2.718282
x <- list(a=1:10, b=exp(-3:3), c=c(TRUE,FALSE,FALSE,TRUE))
lapply(x, mean)
## $a
## [1] 5.5
## 
## $b
## [1] 4.535125
## 
## $c
## [1] 0.5
lapply(x, quantile, probs=(1:3)/4)
## $a
##  25%  50%  75% 
## 3.25 5.50 7.75 
## 
## $b
##       25%       50%       75% 
## 0.2516074 1.0000000 5.0536690 
## 
## $c
## 25% 50% 75% 
## 0.0 0.5 1.0
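
Since a data frame is a list of columns, lapply also works column by column on a data frame. A minimal sketch, assuming a small hypothetical data frame df:

```r
df <- data.frame(u = 1:3, v = c(2.5, 3.5, 4.5))

# One result per column, returned as a named list
col_means <- lapply(df, mean)               # list(u = 2, v = 3.5)

# An anonymous function can be used in place of a named one
doubled <- lapply(1:3, function(i) 2 * i)   # list(2, 4, 6)
```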

sapply

The sapply() function works like lapply(), but it tries to simplify the output to a vector or matrix where possible.

sapply(rep(1,5), exp)
## [1] 2.718282 2.718282 2.718282 2.718282 2.718282
unlist(lapply(rep(1,5), exp))
## [1] 2.718282 2.718282 2.718282 2.718282 2.718282
sapply(x, mean)
##        a        b        c 
## 5.500000 4.535125 0.500000
sapply(x, quantile, probs=(1:3)/4)
##        a         b   c
## 25% 3.25 0.2516074 0.0
## 50% 5.50 1.0000000 0.5
## 75% 7.75 5.0536690 1.0
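
The simplification depends on the results themselves, so the type of the output can change with the input. The sketch below illustrates this, and also shows vapply(), a related base R function that is not covered in these notes but lets the caller state the expected shape of each result:

```r
# All results have length 1: simplified to a numeric vector
same_len <- sapply(1:3, function(i) i^2)   # 1 4 9

# Results have different lengths: no simplification, a list is returned
diff_len <- sapply(1:3, seq_len)

# vapply() requires each result to match the template numeric(1)
# and raises an error otherwise
checked <- vapply(1:3, function(i) i^2, numeric(1))
```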

mapply

mapply is a multivariate version of sapply. It is useful when the function takes more than one input. Its basic usage is mapply(FUN, ...), where ... stands for the inputs to FUN. Notice that the argument order differs from that of apply and lapply: FUN comes first. (Why?)

mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
## 
## [[2]]
## [1] 2 2 2
## 
## [[3]]
## [1] 3 3
## 
## [[4]]
## [1] 4
mapply(rep, times=1:4, x=4:1)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 3 3
## 
## [[3]]
## [1] 2 2 2
## 
## [[4]]
## [1] 1 1 1 1
mapply(function(x, y) seq_len(x) + y, c(a=1, b=2, c=3), c(A=10, B=0, C=-10))
## $a
## [1] 11
## 
## $b
## [1] 1 2
## 
## $c
## [1] -9 -8 -7
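
Like sapply, mapply simplifies its output when it can. The following sketch shows this, together with the MoreArgs and SIMPLIFY arguments of mapply; the anonymous functions are hypothetical examples:

```r
# Every call returns one number, so the result is a plain vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)    # 5 7 9

# MoreArgs holds arguments that are the same for every call
squares <- mapply(function(x, p) x^p, 1:3,
                  MoreArgs = list(p = 2))         # 1 4 9

# SIMPLIFY = FALSE always returns a list, as lapply does
as_list <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
```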

Speed vs for loop

To compare the performance of code written with lapply against an equivalent for loop, we use the following simple function:

ftest1 <- function(x) log(0.43 * x + 1)
x <- 1:1e6 #runif(1e+06, min=-1, max=1)
ftest1_apply <- function(x) lapply(x, ftest1)
ftest1_loop <- function(x) {
  y <- rep(NA, length(x))
  for (i in seq_along(x)) y[i] <- ftest1(x[i])
  y
}
system.time(ftest1_apply(x))
##    user  system elapsed 
##   0.747   0.035   0.782
system.time(ftest1_loop(x))
##    user  system elapsed 
##   0.526   0.000   0.526
system.time(ftest1(x))
##    user  system elapsed 
##   0.008   0.000   0.007

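The results show that lapply is the slowest here: the for loop is roughly 1.5 times as fast (0.526 s versus 0.782 s elapsed). Applying the function directly to the whole vector, and so using R's vectorised arithmetic, is by far the fastest at about 0.007 s, roughly two orders of magnitude quicker than either alternative.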
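
Before comparing timings it is worth confirming that the three approaches compute the same values. A minimal sketch on a short input follows; the names f, x_small, via_lapply, via_loop and via_vector are hypothetical, and since lapply returns a list its result is flattened with unlist first:

```r
f <- function(x) log(0.43 * x + 1)
x_small <- 1:10

# lapply: one call per element, result flattened to a vector
via_lapply <- unlist(lapply(x_small, f))

# for loop: preallocate the output, then fill it element by element
via_loop <- numeric(length(x_small))
for (i in seq_along(x_small)) via_loop[i] <- f(x_small[i])

# vectorised: a single call on the whole vector
via_vector <- f(x_small)

all.equal(via_lapply, via_vector)
all.equal(via_loop, via_vector)
```

Timings themselves vary between machines and R versions, so the figures reported above should be read as indicative only.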

Map, Reduce and Filter

First, a word on the lambda notation, which is a short, typically one-line, definition of a function: {function (x) x+1}. Such a function is also known as an anonymous function, as there is no name attached to it. On its own, such an expression merely creates a function object; to be useful it has to be called, or passed to Map, Reduce, Filter, or another function that takes functions as inputs. Note also that the first letter of each of the following functions is capitalised.
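
The ways an anonymous function can be put to use might be sketched as follows; the name add_one is a hypothetical example:

```r
# Called immediately: the parentheses around the definition are required
called <- (function(x) x + 1)(2)                # 3

# Bound to a name, it behaves like any other function
add_one <- function(x) x + 1
named <- add_one(2)                             # 3

# Passed as an argument to another function
passed <- sapply(1:3, function(x) x + 1)        # 2 3 4
```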

Map

The function Map applies the same function to every element of a vector and returns the results as a list. For example,

x <- 1:6
y <- Map({function(a) a*a}, x)
y
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25
## 
## [[6]]
## [1] 36
unlist(y)
## [1]  1  4  9 16 25 36
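
Map also accepts several vectors and then works through them in parallel, much like mapply with SIMPLIFY = FALSE. A minimal sketch, where the helper add is a hypothetical example:

```r
add <- function(a, b) a + b

# Pairs up the i-th elements of both inputs: 1+4, 2+5, 3+6
pair_sums <- Map(add, 1:3, 4:6)                   # list(5, 7, 9)

# The same computation written with mapply
via_mapply <- mapply(add, 1:3, 4:6, SIMPLIFY = FALSE)
```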

Reduce

The function Reduce combines the elements of a vector using a function of two arguments: it applies the function to the first two elements, then to that result and the third element, and so on, until a single value remains. For example,

Reduce(function(a, b) a+b, x)
## [1] 21
Reduce(function(x, y) x*y, x)
## [1] 720

Note that in the last line, the x and y in the lambda definition are local to the function itself and have no meaning outside it; in particular, they are unrelated to the vector x passed as the second argument.
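
The order in which Reduce combines elements can be made visible with its accumulate argument, which keeps every intermediate result. A short sketch under the same definition of x:

```r
x <- 1:6

# Left to right: ((((1 + 2) + 3) + 4) + 5) + 6
total <- Reduce(function(a, b) a + b, x)                       # 21

# accumulate = TRUE returns all partial results
running <- Reduce(function(a, b) a + b, x, accumulate = TRUE)  # 1 3 6 10 15 21

# The result need not be a number
joined <- Reduce(function(a, b) paste0(a, b), c("a", "b", "c"))  # "abc"
```

For addition, the accumulated result agrees with cumsum(x).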

Filter

The function Filter goes through a vector and keeps only the elements that satisfy the condition given by the function. For example,

Filter(function(x) x %% 2 == 0, x)
## [1] 2 4 6
Filter(function(x) x>3, x)
## [1] 4 5 6
Filter(function(x) x-1, x) # bad style
## [1] 2 3 4 5 6

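Note that in R, when a number is used as a logical value, 0 is treated as FALSE, whereas any non-zero number is treated as TRUE. This is why the last call drops only the element 1, for which x-1 is 0.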
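
The coercion rule, and a clearer way of writing the last filter, can be sketched as follows. The variable names are hypothetical, and logical indexing is shown only for comparison:

```r
# Numbers coerced to logical: only 0 becomes FALSE
coerced <- as.logical(c(0, 1, -2, 0.5))        # FALSE TRUE TRUE TRUE

x <- 1:6

# An explicit comparison states the intent directly
not_one <- Filter(function(v) v != 1, x)       # 2 3 4 5 6

# For atomic vectors, logical indexing gives the same elements
evens <- Filter(function(v) v %% 2 == 0, x)    # 2 4 6
evens_idx <- x[x %% 2 == 0]                    # 2 4 6

# Filter also works on lists, where simple indexing is less convenient
nums <- Filter(is.numeric, list(1, "a", 2))    # list(1, 2)
```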