Implementation notes

The multi-threaded implementation of SMC is fairly straightforward. The attempt to efficiently utilize multiple cores is largely focused around using appropriate data structures and low-overhead resampling / selection, since the remaining steps are embarrassingly parallel.

There is also some attention paid to Julia's threading interface and, for both parallel and serial execution, the need to avoid dynamic memory allocations by pre-allocating reusable memory. All code was written using Julia's core and standard library, with some code written in more general purpose packages such as NonUniformRandomVariateGeneration.jl.

Generating the ancestor indices in sorted order is accomplished primarily through use of the uniform spacings method for generating a sequence of sorted uniform random variates developed by Lurie & Hartley (1972) and described by Devroye (1986, p. 214). In multi-threaded code, the multinomial variate giving the number of ancestor indices each thread should simulate using a thread-specific subvector of particle weights is simulated by using a combination of inversion sampling and an implementation of the btrd algorithm of Hörmann (1993), which simulates binomially distributed variates in expected $\mathcal{O}(1)$ time. Simulating a multinomial random variate is accomplished by simulating a sequence of appropriate Binomial random variates. The code-generating function for copying particles was adapted from a suggestion on Julia Discourse by Greg Plowman.

Thread-specific, counter-based RNGs in the RandomNumbers.jl package are used. These are only slightly slower than the MersenneTwister RNG provided by Julia's standard library Random.