R apply family, sweep, map, reduce, filter
R apply family
The apply family of functions includes apply(), lapply(), sapply(), and mapply(). The basic idea of the apply family is to apply the same function repeatedly across a list or along a certain dimension of an array or dataframe.
apply
We start with apply(). Its syntax is apply(X, MARGIN, FUN, ...), where
X is an array of dimension \(\ge 2\), including a 2-d matrix.
MARGIN is a vector giving the subscripts which the function will be applied over. E.g., for a matrix and a function that returns a single value, 1 indicates rows (i.e. the output has length equal to the number of rows), 2 indicates columns (i.e. the output has length equal to the number of columns), and c(1, 2) indicates rows and columns (i.e. the output remains a matrix of the same dimension).
FUN is the function to be applied.
As an example, we construct a matrix of all 1’s and sum across rows and columns:
x <- matrix(1, nrow=3, ncol=4)
apply(x, 1, sum)
## [1] 4 4 4
apply(x, 2, sum)
## [1] 3 3 3 3
apply(x, c(1,2), sum)
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 1 1 1 1
## [3,] 1 1 1 1
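Because every entry of the all-ones matrix is the same, it can be hard to see which margin is which. The following sketch repeats the calculation on a matrix with distinct entries (the name m is introduced here for illustration) and checks the results against the built-in rowSums() and colSums():

```r
# A 3 x 4 matrix filled column by column with 1, 2, ..., 12
m <- matrix(1:12, nrow = 3, ncol = 4)

row_totals <- apply(m, 1, sum)  # one value per row
col_totals <- apply(m, 2, sum)  # one value per column

row_totals
## [1] 22 26 30
col_totals
## [1]  6 15 24 33

# For plain sums, the dedicated functions give the same values
all(row_totals == rowSums(m))
## [1] TRUE
all(col_totals == colSums(m))
## [1] TRUE
```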
The function can be defined by the user. For example,
apply(x, 1, function(x) sum(exp(x)))
## [1] 10.87313 10.87313 10.87313
The input can also be a higher-dimensional array:
apply(array(1, dim=c(2,3,4)), c(1,2), sum)
## [,1] [,2] [,3]
## [1,] 4 4 4
## [2,] 4 4 4
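To make the role of MARGIN on a 3-d array more concrete, here is a small sketch with distinct entries (the array a is illustrative). The dimensions listed in MARGIN are kept, and the function is applied over the remaining ones:

```r
a <- array(1:24, dim = c(2, 3, 4))

# Keep dimensions 1 and 2, sum over dimension 3
s12 <- apply(a, c(1, 2), sum)
dim(s12)
## [1] 2 3

# Keep dimension 3 only, sum over each 2 x 3 slice
s3 <- apply(a, 3, sum)
s3
## [1]  21  57  93 129

# Both ways of collapsing account for every entry exactly once
sum(s12) == sum(a)
## [1] TRUE
```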
lapply
In contrast to apply, which works along the dimensions of an array, lapply applies a function to each element of a list or vector and always returns a list. Its basic usage is lapply(X, FUN, ...). For example,
lapply(rep(1,5), exp)
## [[1]]
## [1] 2.718282
##
## [[2]]
## [1] 2.718282
##
## [[3]]
## [1] 2.718282
##
## [[4]]
## [1] 2.718282
##
## [[5]]
## [1] 2.718282
x <- list(a=1:10, b=exp(-3:3), c=c(TRUE,FALSE,FALSE,TRUE))
lapply(x, mean)
## $a
## [1] 5.5
##
## $b
## [1] 4.535125
##
## $c
## [1] 0.5
lapply(x, quantile, probs=(1:3)/4)
## $a
## 25% 50% 75%
## 3.25 5.50 7.75
##
## $b
## 25% 50% 75%
## 0.2516074 1.0000000 5.0536690
##
## $c
## 25% 50% 75%
## 0.0 0.5 1.0
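Since a dataframe is a list of columns, lapply can also be used to apply a function column by column. A minimal sketch, with a small made-up dataframe df:

```r
# Hypothetical data, for illustration only
df <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

col_means <- lapply(df, mean)  # a named list, one entry per column
col_means$height
## [1] 160
col_means$weight
## [1] 60

# Extra arguments after FUN are passed on to FUN, here na.rm = TRUE
with_na <- lapply(list(u = c(1, NA, 3)), mean, na.rm = TRUE)
with_na$u
## [1] 2
```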
sapply
The sapply() function works like lapply(), but it tries to simplify the output to a vector or matrix whenever possible.
sapply(rep(1,5), exp)
## [1] 2.718282 2.718282 2.718282 2.718282 2.718282
unlist(lapply(rep(1,5), exp))
## [1] 2.718282 2.718282 2.718282 2.718282 2.718282
sapply(x, mean)
## a b c
## 5.500000 4.535125 0.500000
sapply(x, quantile, probs=(1:3)/4)
## a b c
## 25% 3.25 0.2516074 0.0
## 50% 5.50 1.0000000 0.5
## 75% 7.75 5.0536690 1.0
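What sapply returns depends on the shape of the individual results, which can be surprising. The sketch below illustrates the three common cases; vapply is a stricter relative in base R that requires the expected type and length of each result to be stated up front.

```r
# Each result has length 1: simplified to a vector
r1 <- sapply(1:3, function(i) i^2)
r1
## [1] 1 4 9

# Each result has length 2: simplified to a matrix with one column per input
r2 <- sapply(1:3, function(i) c(i, i^2))
dim(r2)
## [1] 2 3

# Results of different lengths cannot be simplified: a list is returned
r3 <- sapply(1:3, seq_len)
is.list(r3)
## [1] TRUE

# vapply checks each result against a template, here a single number
r4 <- vapply(1:3, function(i) i^2, numeric(1))
r4
## [1] 1 4 9
```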
mapply
mapply is a multivariate version of sapply. It is useful if you have more than one input to the function. Its basic usage is mapply(FUN, ...), where ... stands for the inputs to FUN. Notice that the arguments come in a different order from apply or lapply: FUN comes first. (Why?)
mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
mapply(rep, times=1:4, x=4:1)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 3 3
##
## [[3]]
## [1] 2 2 2
##
## [[4]]
## [1] 1 1 1 1
mapply(function(x, y) seq_len(x) + y, c(a=1, b=2, c=3), c(A=10, B=0, C=-10))
## $a
## [1] 11
##
## $b
## [1] 1 2
##
## $c
## [1] -9 -8 -7
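In the examples above the results have different lengths, so mapply returns a list. When every call returns a single value, the result is simplified to a vector, as with sapply. A short sketch, also showing the SIMPLIFY and MoreArgs arguments:

```r
# Element-wise: f(1, 4), f(2, 5), f(3, 6)
v <- mapply(function(x, y) x + y, 1:3, 4:6)
v
## [1] 5 7 9

# SIMPLIFY = FALSE keeps the list form
l <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
is.list(l)
## [1] TRUE

# MoreArgs holds arguments that are the same for every call
p <- mapply(function(x, y, power) (x + y)^power, 1:3, 4:6,
            MoreArgs = list(power = 2))
p
## [1] 25 49 81
```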
Speed vs for loop
To compare the performance of a piece of code written using lapply vs a for loop, we use the following simple function:
ftest1 <- function(x) log(0.43 * x + 1)
x <- 1:1e6 #runif(1e+06, min=-1, max=1)
ftest1_apply <- function(x) lapply(x, ftest1)
ftest1_loop <- function(x) {
  y <- rep(NA, length(x))  # preallocate the result vector
  for (i in seq_along(x)) y[i] <- ftest1(x[i])
  y
}
system.time(ftest1_apply(x))
## user system elapsed
## 0.747 0.035 0.782
system.time(ftest1_loop(x))
## user system elapsed
## 0.526 0.000 0.526
system.time(ftest1(x))
## user system elapsed
## 0.008 0.000 0.007
The results show that lapply is the slowest here, with the for loop being about 1.5 times as fast as lapply. But applying the function directly to the whole vector (i.e. vectorised) is the fastest by far, by roughly two orders of magnitude.
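Timings like these vary between machines and R versions, so they are best read as orders of magnitude. It is also worth checking that the three versions compute the same values; note that the lapply version returns a list, so it has to be flattened before comparison. A sketch on a smaller input (x_small is a name chosen here), redefining the functions so that it runs on its own:

```r
ftest1 <- function(x) log(0.43 * x + 1)
ftest1_apply <- function(x) lapply(x, ftest1)
ftest1_loop <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) y[i] <- ftest1(x[i])
  y
}

x_small <- 1:1000  # smaller input, for a quick check

res_vec   <- ftest1(x_small)
res_loop  <- ftest1_loop(x_small)
res_apply <- unlist(ftest1_apply(x_small))

all.equal(res_vec, res_loop)
## [1] TRUE
all.equal(res_vec, res_apply)
## [1] TRUE
```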
Map, Reduce and Filter
First, a word on the lambda notation, which is a short, typically one-line, definition of a function: {function (x) x+1}. Such a function is also known as an anonymous function as there is no name attached to it. Note that evaluating this expression on its own only creates a function object and does nothing further in R. It is typically passed to Map, Reduce, Filter, or another function that takes functions as inputs. Note also that the names of the following functions are capitalised.
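To illustrate, an anonymous function can be called on the spot by wrapping it in parentheses, assigned a name, or passed straight to another function. A minimal sketch (add_one is a name chosen here for illustration):

```r
# Call the anonymous function immediately
now <- (function(x) x + 1)(2)
now
## [1] 3

# Give it a name, after which it is an ordinary function
add_one <- function(x) x + 1
add_one(2)
## [1] 3

# Pass it directly to a function that takes a function as input
res <- sapply(1:3, function(x) x + 1)
res
## [1] 2 3 4
```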
Map
The function Map applies the same function to every element of a vector and returns the results as a list. For example,
x <- 1:6
y <- Map({function(a) a*a}, x)
y
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
##
## [[6]]
## [1] 36
unlist(y)
## [1] 1 4 9 16 25 36
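Map also accepts several vectors, in which case the function is applied to the first elements of each, then the second elements, and so on. The sketch below illustrates this and, for comparison, the equivalent mapply call with SIMPLIFY = FALSE:

```r
z <- Map(function(a, b) a + b, 1:3, 4:6)
unlist(z)
## [1] 5 7 9

# Same result via mapply without simplification
z2 <- mapply(function(a, b) a + b, 1:3, 4:6, SIMPLIFY = FALSE)
identical(z, z2)
## [1] TRUE
```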
Reduce
The function Reduce
will perform the function on pairs of elements of a vector, iterate the procedure, and return a single number. For example,
Reduce(function(a, b) a+b, x)
## [1] 21
Reduce(function(x, y) x*y, x)
## [1] 720
Note that in the last line, x and y in the lambda definition of the function are internal to the function itself and have no meaning outside it. In particular, the argument x is not the vector x being reduced.
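To see how the pairwise iteration proceeds, the sketch below uses the accumulate argument to show the intermediate results, and a non-commutative operation to show the order of evaluation:

```r
x <- 1:6

# Intermediate results of ((((1 + 2) + 3) + 4) + 5) + 6
running <- Reduce(function(a, b) a + b, x, accumulate = TRUE)
running
## [1]  1  3  6 10 15 21

# Evaluation is from the left by default: ((1 - 2) - 3) - 4
left <- Reduce(function(a, b) a - b, 1:4)
left
## [1] -8

# With right = TRUE it becomes 1 - (2 - (3 - 4))
right <- Reduce(function(a, b) a - b, 1:4, right = TRUE)
right
## [1] -2
```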
Filter
The function Filter
goes down a vector and only keep elements that satisfies the condition determined by the function. For example,
Filter(function(x) x %% 2 == 0, x)
## [1] 2 4 6
Filter(function(x) x>3, x)
## [1] 4 5 6
Filter(function(x) x-1, x) # bad style
## [1] 2 3 4 5 6
Note that in R, 0 is treated as FALSE, whereas any non-zero number is treated as TRUE.
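The coercion can be checked directly with as.logical. A clearer way to write the last example is to make the comparison explicit; for a plain vector, logical indexing gives the same result without Filter. A short sketch:

```r
x <- 1:6

coerced <- as.logical(c(0, 1, -2, 0.5))
coerced
## [1] FALSE  TRUE  TRUE  TRUE

# Explicit predicate instead of relying on numeric-to-logical coercion
kept <- Filter(function(x) x != 1, x)
kept
## [1] 2 3 4 5 6

# Equivalent vectorised form using logical indexing
indexed <- x[x != 1]
indexed
## [1] 2 3 4 5 6
```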