SC1

Statistical Computing 1

Parallel programming


Note. Any timing results here could differ substantially from what you would get running the code on your own computer, because these computations are run in the cloud as part of the website build process; the comments on the results should be read with that in mind.

Why parallel?

Often, for large data processing jobs, a single CPU core is not enough. Large computational jobs can be CPU-bound (limited by processing speed), memory-bound, or I/O-bound (limited by reading and writing data).

Parallel programming aims to distribute CPU-bound computations across different cores (within a single processor) or across multiple processors (if available). Modern supercomputers are fast because they have a massive number of processors, so it is crucial to be able to write programs that run in parallel if one wants to take advantage of large clusters of processors.

However, even for easily parallelisable computations, it is not true that running on \(n\) processes will be \(n\) times as fast: the gain from each additional processor tends to diminish as \(n\) increases.
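One way to see why is Amdahl's law (a standard back-of-the-envelope model, not part of the original notes): if only a fraction \(p\) of a job can be parallelised, the best possible speedup on \(n\) processors is \(1 / ((1-p) + p/n)\), which is bounded above by \(1/(1-p)\) however large \(n\) becomes. A minimal illustration:

amdahl <- function(p, n) 1 / ((1 - p) + p / n)
# even with 90% of the work parallelisable, 16 cores give a speedup well under 16
amdahl(0.9, c(1, 2, 4, 8, 16))
## [1] 1.000000 1.818182 3.076923 4.705882 6.400000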

Parallelise using mclapply

library(parallel)
num_cores <- detectCores()
num_cores
## [1] 4

In a Wright-Fisher model, an important model in population genetics, we trace the frequency of a mutant in a population of constant size \(N\). The number of mutants \(X_k\) is a Markov process that depends only on the value of \(X_{k-1}\): \[ X_k \mid X_{k-1} \sim \text{Binomial}\left(N, \frac{X_{k-1}}{N}\right). \]

simulate_wright_fisher <- function(N, n_gen, init_freq) {
  # N: population size; n_gen: number of generations to simulate; init_freq: initial frequency of the mutant
  counts <- numeric(n_gen)
  counts[1] <- round(N * init_freq)
  for (k in 2:n_gen) {
    counts[k] <- rbinom(1, size=N, prob=counts[k-1]/N)
  }
  return(counts)
}
counts <- simulate_wright_fisher(5000, 10000, 0.2)
plot(counts, type='l', main='frequency of mutant', xlab='generation', ylab='')

system.time(simulate_wright_fisher(5000, 10000, 0.2))
##    user  system elapsed 
##   0.019   0.004   0.023

This simple “neutral” Wright-Fisher model can still be vectorised relatively easily, but more complicated models won’t be. This is where parallelisation comes in.
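As an aside, here is one way the neutral model could be vectorised (a sketch, not from the original notes; simulate_wf_vectorised is a hypothetical helper): since every replicate population follows the same update rule, a single rbinom() call can advance many independent replicates by one generation at once.

simulate_wf_vectorised <- function(N, n_gen, init_freq, n_rep) {
  # one row per replicate population, one column per generation
  counts <- matrix(0, nrow=n_rep, ncol=n_gen)
  counts[, 1] <- round(N * init_freq)
  for (k in 2:n_gen) {
    # rbinom() is vectorised over prob, so all replicates advance together
    counts[, k] <- rbinom(n_rep, size=N, prob=counts[, k-1]/N)
  }
  counts
}
# final-generation counts for 500 replicates, in one call
final_counts <- simulate_wf_vectorised(5000, 1000, 0.2, 500)[, 1000]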

mclapply() (does not work on Windows, as it relies on forking the R process)

wrapper <- function(init_freq) simulate_wright_fisher(5000, 10, init_freq)
init_freqs <- rep(0.2, 3)
res <- mclapply(init_freqs, wrapper, mc.cores=2)
res
## [[1]]
##  [1] 1000 1013 1025 1044 1064 1060 1048 1063 1054 1043
## 
## [[2]]
##  [1] 1000 1013 1004 1030 1004  991 1016 1048 1026  973
## 
## [[3]]
##  [1] 1000  970  979  959  913  924  904  904  929  885
wrapper2 <- function(init_freq) tail(simulate_wright_fisher(5000, 1000, init_freq), 1)
init_freqs2 <- rep(0.2, 500)
system.time(lapply(init_freqs2, wrapper2))
##    user  system elapsed 
##   0.987   0.016   1.004
system.time(mclapply(init_freqs2, wrapper2, mc.cores=2))
##    user  system elapsed 
##   0.002   0.004   0.555
system.time(mclapply(init_freqs2, wrapper2, mc.cores=4))
##    user  system elapsed 
##   0.000   0.013   0.534

The results show that mclapply with 2 cores is roughly twice as fast as lapply (elapsed time 0.555s versus 1.004s). Increasing to 4 cores improves the elapsed time only marginally here. In general the achievable speedup is limited by the number of physical cores actually available, and detectCores() may also count virtual (hyper-threaded) cores.
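One practical caveat when simulating in parallel (an aside, not in the original notes): each forked worker needs its own random-number stream, otherwise replicates may be correlated. The parallel package provides the L'Ecuyer-CMRG generator for this; with the default mc.set.seed=TRUE, each worker receives an independent stream derived from the master seed, making runs reproducible.

RNGkind("L'Ecuyer-CMRG")
set.seed(1)
res_a <- mclapply(init_freqs, wrapper, mc.cores=2)
set.seed(1)
res_b <- mclapply(init_freqs, wrapper, mc.cores=2)
identical(res_a, res_b)  # should be TRUE: streams are derived from the seed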

foreach and doParallel

Note that foreach is not simply a parallel for loop: it behaves like a function that returns a value, typically a list with one element per iteration, as illustrated below. The operator %do% evaluates the expression sequentially, while %dopar% evaluates it in parallel.

library(foreach)
foreach (i=1:3) %do% {
  i*i
}
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
res <- foreach (i=1:2, .combine=rbind) %do% {
  simulate_wright_fisher(5000, 10, 0.2)
}
res
##          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## result.1 1000  975  955  927  909  919  929  902  936   903
## result.2 1000  933  884  908  905  916  934  904  919   893
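Besides rbind, .combine accepts any function that merges results, e.g. c to flatten them into a vector, or `+` to accumulate a sum (a small illustration, not from the original notes):

foreach (i=1:3, .combine=c) %do% {
  i*i
}
## [1] 1 4 9
foreach (i=1:3, .combine=`+`) %do% {
  i*i
}
## [1] 14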

A useful argument to foreach is .final, a function of one argument that is applied to the combined result just before it is returned. For example, if we are only interested in the number of mutants in the most recent generation:

n_gen <- 1000
res <- foreach (i=1:1000, .combine=rbind, .final=function(x) x[,n_gen]) %do% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
hist(res, breaks=50)

So far nothing has actually run in parallel. For %dopar% to do so, we need to register a parallel backend, specifying the number of cores to use, via registerDoParallel().

library(doParallel)
## Loading required package: iterators
registerDoParallel(2) # only register 2 cores
n_runs <- 200
system.time(foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %do% {
  simulate_wright_fisher(5000, n_gen, 0.2)
})
##    user  system elapsed 
##   0.436   0.000   0.437
system.time(foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %dopar% {
  simulate_wright_fisher(5000, n_gen, 0.2)
})
##    user  system elapsed 
##   0.428   0.148   0.297
t1 <- Sys.time()
res1 <- foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %do% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
t2 <- Sys.time()
res2 <- foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %dopar% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
t3 <- Sys.time()
registerDoParallel(4) # try registering all 4 cores
res2 <- foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %dopar% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
t4 <- Sys.time()

t2-t1
## Time difference of 0.4643681 secs
t3-t2
## Time difference of 0.2935133 secs
t4-t3
## Time difference of 0.3021364 secs
# clean up the cluster
stopImplicitCluster()

Just as in the case of mclapply, evaluating the code in parallel with 2 cores is noticeably faster than evaluating it sequentially, though the speedup falls short of a factor of 2 here, and registering 4 cores brings no further improvement: the overhead of spawning workers and combining their results eats into the gains.
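Finally, an aside for completeness (not in the original notes): on Windows, where forking is unavailable, the parallel package offers socket clusters as a portable alternative to mclapply. The workers are fresh R processes, so any functions and data they need must be exported to them explicitly.

cl <- makeCluster(2)  # spawn 2 worker R processes
# workers start with empty workspaces, so ship them the functions they need
clusterExport(cl, c("simulate_wright_fisher", "wrapper2"))
res <- parLapply(cl, init_freqs2, wrapper2)
stopCluster(cl)  # always shut the workers down when finished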