Parallel programming
Note. Any timing results here could be substantially different from what you would get if you ran the code on your own computer, so the comments on the results may not match what you observe. This is because these computations are run in the cloud, as part of the website build process.
Why parallel?
Often, for large data-processing jobs, a single CPU core is not enough. Large computational jobs can be:
- CPU-bound: takes too much CPU time
- memory-bound: takes too much memory
- I/O-bound: takes too much time to read from/write to disk
- network-bound: takes too much time to transfer across the network
Parallel programming aims to distribute CPU-bound computations across different cores (within a single processor) or across multiple processors (if available). Modern supercomputers are fast because they have a massive number of processors, so it is crucial to be able to write programs that run in parallel if one wants to take advantage of large clusters of processors.
However, it is not true, even for easily parallelisable computations, that running on \(n\) processes will be \(n\) times as fast. Because of communication overhead and any inherently serial portion of the work, the achievable gain from additional processors tends to diminish as \(n\) increases.
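One standard way to quantify this diminishing return is Amdahl's law: if a fraction \(p\) of a job is parallelisable and the rest is inherently serial, the best possible speedup on \(n\) processors is \(1/((1-p) + p/n)\). A quick sketch in R (the value p = 0.95 is an arbitrary illustration, not taken from the examples below):
amdahl <- function(n, p) 1 / ((1 - p) + p / n)  # upper bound on speedup
amdahl(c(2, 4, 8, 64), p = 0.95)
# approx 1.90, 3.48, 5.93, 15.42 -- bounded above by 1/(1-p) = 20,
# no matter how many processors we add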
Parallelise using mclapply
library(parallel)
num_cores <- detectCores()  # number of (logical) cores R can see on this machine
num_cores
## [1] 4
In the Wright-Fisher model, an important model in population genetics, we trace the frequency of a mutant in a population of constant size \(N\). The number of mutants \(X_k\) is a Markov process that depends only on the value of \(X_{k-1}\): \[ X_k \mid X_{k-1} \sim \text{Binomial}\left(N, \frac{X_{k-1}}{N}\right). \]
simulate_wright_fisher <- function(N, n_gen, init_freq) {
  # N: population size; n_gen: number of generations to simulate;
  # init_freq: initial frequency of the mutant
  counts <- numeric(n_gen)
  counts[1] <- round(N * init_freq)
  for (k in 2:n_gen) {
    counts[k] <- rbinom(1, size=N, prob=counts[k-1]/N)
  }
  return(counts)
}
counts <- simulate_wright_fisher(5000, 10000, 0.2)
plot(counts, type='l', main='frequency of mutant', xlab='generation', ylab='')
system.time(simulate_wright_fisher(5000, 10000, 0.2))
## user system elapsed
## 0.019 0.004 0.023
This simple “neutral” Wright-Fisher model can still be vectorised relatively easily (across replicate simulations, as sketched below), but more complicated models often cannot be. This is where parallelisation comes in.
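For instance, since each trajectory is sequential in time but independent of the others, one can vectorise across replicate simulations: rbinom() accepts a vector of success probabilities, so every replicate advances one generation in a single call. A sketch (the name simulate_wright_fisher_vec is ours, for illustration):
simulate_wright_fisher_vec <- function(N, n_gen, init_freq, n_rep) {
  # row i is replicate i, column k is generation k
  counts <- matrix(0, nrow=n_rep, ncol=n_gen)
  counts[, 1] <- round(N * init_freq)
  for (k in 2:n_gen) {
    # one vectorised rbinom() call advances all replicates together
    counts[, k] <- rbinom(n_rep, size=N, prob=counts[, k-1]/N)
  }
  counts
}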
mclapply()
(mclapply() parallelises by forking the current R process, so it does not work on Windows, where mc.cores must be 1)
wrapper <- function(init_freq) simulate_wright_fisher(5000, 10, init_freq)
init_freqs <- rep(0.2, 3)
res <- mclapply(init_freqs, wrapper, mc.cores=2)  # like lapply(), but spread over 2 forked workers
res
## [[1]]
## [1] 1000 1013 1025 1044 1064 1060 1048 1063 1054 1043
##
## [[2]]
## [1] 1000 1013 1004 1030 1004 991 1016 1048 1026 973
##
## [[3]]
## [1] 1000 970 979 959 913 924 904 904 929 885
wrapper2 <- function(init_freq) tail(simulate_wright_fisher(5000, 1000, init_freq), 1)  # keep only the final generation
init_freqs2 <- rep(0.2, 500)
system.time(lapply(init_freqs2, wrapper2))
## user system elapsed
## 0.987 0.016 1.004
system.time(mclapply(init_freqs2, wrapper2, mc.cores=2))
## user system elapsed
## 0.002 0.004 0.555
system.time(mclapply(init_freqs2, wrapper2, mc.cores=4))
## user system elapsed
## 0.000 0.013 0.534
The results show that mclapply() with 2 cores is nearly twice as fast as lapply(). Increasing to 4 cores improves the elapsed time only marginally in this run. (The near-zero user times for mclapply() arise because the CPU time is spent in the forked child processes rather than the parent. Also recall that these timings come from a cloud build where detectCores() reports 4; the original machine was a MacBook with 4 physical and 8 virtual cores, and virtual cores add little for CPU-bound work.)
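If you need the same computation to run on Windows, the parallel package also provides socket clusters; a minimal sketch with parLapply(), where clusterExport() ships our functions to the worker processes:
cl <- makeCluster(2)  # PSOCK cluster; portable to Windows
clusterExport(cl, c("simulate_wright_fisher", "wrapper2"))
res <- parLapply(cl, init_freqs2, wrapper2)  # same result structure as mclapply()
stopCluster(cl)  # always release the workers when done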
foreach and doParallel
Note that foreach is not a parallel for loop. It behaves more like a function that returns a value, typically a list with one element per iteration, with usage illustrated below. The %do% operator evaluates the expression sequentially, while %dopar% evaluates it in parallel.
library(foreach)
foreach (i=1:3) %do% {
  i*i
}
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
res <- foreach (i=1:2, .combine=rbind) %do% {
  simulate_wright_fisher(5000, 10, 0.2)
}
res
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## result.1 1000 975 955 927 909 919 929 902 936 903
## result.2 1000 933 884 908 905 916 934 904 919 893
A useful argument to foreach is .final, a function of one argument that is applied to the combined result just before it is returned. For example, if we are only interested in the number of mutants in the most recent generation:
n_gen <- 1000
res <- foreach (i=1:1000, .combine=rbind, .final=function(x) x[,n_gen]) %do% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
hist(res, breaks=50)
So far nothing has run in parallel. In order to do so, we need to register the number of cores to use via registerDoParallel().
library(doParallel)
## Loading required package: iterators
registerDoParallel(2) # only register 2 cores
n_runs <- 200
system.time(foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %do% {
  simulate_wright_fisher(5000, n_gen, 0.2)
})
## user system elapsed
## 0.436 0.000 0.437
system.time(foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %dopar% {
  simulate_wright_fisher(5000, n_gen, 0.2)
})
## user system elapsed
## 0.428 0.148 0.297
t1 <- Sys.time()
# sequential run
res1 <- foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %do% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
t2 <- Sys.time()
# parallel run on the 2 registered cores
res2 <- foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %dopar% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
t3 <- Sys.time()
registerDoParallel(4) # try registering all 4 cores
res2 <- foreach (i=1:n_runs, .combine=rbind, .final=function(x) x[,n_gen]) %dopar% {
  simulate_wright_fisher(5000, n_gen, 0.2)
}
t4 <- Sys.time()
t2-t1 # sequential
## Time difference of 0.4643681 secs
t3-t2 # parallel, 2 cores
## Time difference of 0.2935133 secs
t4-t3 # parallel, 4 cores
## Time difference of 0.3021364 secs
# clean up the cluster
stopImplicitCluster()
Just as in the case of mclapply(), evaluating the code in parallel with 2 cores takes substantially less time than evaluating it sequentially, but registering 4 cores brings no further improvement in this run; the speedup again deteriorates once we go beyond 2 cores.
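As an aside, registerDoParallel(n) with a plain core count uses forked workers on Unix-alikes. If portability (e.g. to Windows) matters, doParallel can equally drive an explicit socket cluster; a minimal sketch, where foreach's .export argument ships simulate_wright_fisher to the workers:
library(doParallel)
cl <- makeCluster(2)  # explicit PSOCK cluster; also works on Windows
registerDoParallel(cl)
res <- foreach (i=1:200, .combine=rbind,
                .export="simulate_wright_fisher",
                .final=function(x) x[,1000]) %dopar% {
  simulate_wright_fisher(5000, 1000, 0.2)
}
stopCluster(cl)  # shut down the workers explicitly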