Fast Random Number Generation in Go

2020-07-25

Background

Some algorithms rely on a stream of random numbers in order to meet their (average) time complexity class. One such famous algorithm is Quicksort.

I recently worked on some performance optimisations for simplefeatures' R-Tree bulk loading algorithm. This bulk loading algorithm also utilises random numbers in order to meet its (average) time complexity class. It actually uses random numbers in a very similar way compared to Quicksort: to select random pivot indices to sort around.

This blog post shows one aspect of that performance work, namely optimising the generation of random numbers. Optimising random number generation allowed a 20% to 40% performance increase for R-Tree bulk loading (depending on what is loaded).

Fast Random Numbers

We’ll start out with the most obvious solution for generating random numbers, and then see what can be done to improve performance.

Each benchmark will generate random numbers in the range [0, 100) i.e. between between 0 (inclusive) and 100 (exclusive). There isn’t anything special about this paricular range, but it is representative of the sorts of ranges used for R-Tree bulk loading. Experimentally, I found that changing the range doesn’t have an impact on performance.

Go Standard Library

First, let’s benchmark Go’s built in random number generator. This serves as a baseline, and was used in my initial R-Tree bulk loading implementation.

import (
    "math/rand"
    "testing"
)

var k = 100

func BenchmarkMathRand(b *testing.B) {
    for i := 0; i < b.N; i++ {
        rand.Intn(k)
    }
}

This is already pretty fast (~18 ns per random number). It just goes to show that even fast operations can slow you down if they’re in a tight loop.

BenchmarkMathRand-4             56426527                17.8 ns/op

Local Random Number Generator

Note that we’re using the global rand.Intn function, which internally uses a shared random number source. That source is protected by a mutex, so we’re locking and unlocking each time we get a new number. Let’s try using a local *rand.Rand instance rather than using the global rand.Intn function.

func BenchmarkMathRandLocal(b *testing.B) {
    rnd := rand.New(rand.NewSource(0))
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        rnd.Intn(k)
    }
}

That made quite a large difference, an improvement of ~40% over the original.

BenchmarkMathRandLocal-4        100000000               10.2 ns/op

But wait… We’re not measuring the cost of setting up the local *rand.Rand instance because it occurs before the call to b.ResetTimer(). This isn’t really a fair test. To find out how expensive it is to set up a local *rand.Rand instance, we can benchmark that operation in isolation.

func BenchmarkCreateLocalRand(b *testing.B) {
    for i := 0; i < b.N; i++ {
        rand.New(rand.NewSource(0))
    }
}

It turns out the cost is significant! The startup cost for a *rand.Rand is approximately 3 orders of magnitude slower than generating each random number!

BenchmarkCreateLocalRand-4        149104              7868 ns/op

We can use some back of the envelope calculations to determine how many random numbers we would have to generate from the local *rand.Rand is order to break even against using the global rand functions.

Assume each random number from rand.Intn takes ~18ns, and each random number from a local *rand.Rand instance takes ~10ns (with a setup of ~7,800ns).

18x = 10x + 7,800
 8x = 7,800
  x = 975

So we need to be generating around 1,000 random numbers per local *rand.Rand instance before this approach even begins to break even.

Hybrid Algorithms

We could attempt to use a hybrid algorithm if we know up front how many random numbers we’ll need. For the R-Tree bulk load algorithm, we would be able to work that out without too much difficulty. If the number of random numbers we need is low, we could use the global functions. If the number of random numbers is high, we could use a local instance.

This is getting finicky though… Hybrid algorithms need to be finely tuned, and the tuning would need to be updated for new hardware or compiler versions. This is way too much trouble except for the most specialised situations (and this is not one of them). I decided not to pursue this approach.

Linear Congruential Generators

Because we don’t control the random number generation (it’s in the math/rand package), we can’t modify it to make it faster. What if we generate the random numbers ourselves?

One of the most basic random number generation algorithms is the Linear Congruential Generator. It works by generating a chain of random values, each based on the previous. For each new random number, the previous number is multiplied by a constant, and then another constant is added. This value is then reduced modulo another constant (otherwise is will increase forever).

RND_next := (RND_prev * CONST_MUL + CONST_ADD) % CONST_MOD

Note that Linear Congruential Generators as described above don’t emit high quality random numbers. That may or may not matter depending on your application. For my use case (choosing random pivots), it’s fine.

Implementing this generator is easy, so lets see how it performs. We choose some “well known” constants, CONST_MUL=1664525, CONST_ADD=1013904223, and CONST_MOD=2^32. These constants are from Numerical Recipes. The choice of modulo constant is especially convenient, since it allows us to just store the state in a 32 bit unsigned integer and ignore the modulo aspect entirely.

Be wary of Go compiler optimisations when benchmarking. If you’re not careful, the compiler can optimise everything away. An easy way to avoid this is to store the results in a global variable.

var result int

func BenchmarkLCG(b *testing.B) {
    var cumulative int
    var state int32
    for i := 0; i < b.N; i++ {
        state = state*1664525 + 1013904223
        rndNum := int(state) % k
        cumulative += rndNum // prevent optimising everything out
    }
    result = cumulative
}

This performance isn’t that much better compared to what we saw previously when using a local instance of *rand.Rand. But at least we don’t have to pay the expensive setup cost.

BenchmarkLCG-4                  152581359                7.95 ns/op

Deeper Profiling

It’s hard to know where to go from here… We only have two lines of code left, and they aren’t doing much other than arithmetic operations.

We’ll need to use some profiling techniques in order to work out what part of the remaining code is the hotspot. If we generate a CPU profile while running the benchmark, the results are enlightening:

(flat)   
     .    func BenchmarkLCG(b *testing.B) { 
     .        var cumulative, state int 
     .        for i := 0; i < b.N; i++ { 
 250ms            state = (state*1664525 + 1013904223) % (1 << 32) 
 1.75s            cumulative = state % k 
     .        } 
     .        result = cumulative 
     .    }

Most of the execution time is spent taking the state and reducing it modulo k in order to get the actual output random number (1.75s total). Generating the new random state itself is relatively cheap (260ms total). This makes sense, since modulo operations (along with divisions) are well known to be much more expensive compared to additions and multiplications.

So does that mean we’re at a dead end? After all, we need a number between 0 and k…

Modulo Alternatives For Random Number Generation

We don’t necessarily need to use a modulo operation to convert our random number state into a number between 0 and k. It’s possible instead to use a multiplication and a shift, which are much cheaper operations.

The full code is:

var result int

func BenchmarkFastLCG(b *testing.B) {
    var cumulative int
    var state int32
    for i := 0; i < b.N; i++ {
        state = state*1664525 + 1013904223
        rndNum := (int64(state) * int64(k)) >> 32
        cumulative += int(rndNum) // prevent optimising everything out
    }
    result = cumulative
}

The benchmark results are very promising!

BenchmarkFastLCG-4              1000000000               1.04 ns/op

If you haven’t seen this trick before, (despite its simplicity) it can be a bit hard to digest. It’s basically just a different way to split up state into k different buckets.

An analogy may be useful:

Imagine dealing out a shuffled pack of 52 cards to k players. Exactly one of the cards in the deck is specially marked. Each player receives either floor(52/k) or ceil(52/k) cards (k may not evenly divide 52). The player who receives the marked card wins. This is analogous to choosing a random number in the range [0, k) (the player who wins) based on a random state in the range [0, 52) (the marked card in the deck).

The modulus approach is analogous to dealing one card at a time around the circle.

The multiply and shift approach is analogous to dealing floor(52/k) cards to the first player, floor(52/k) cards to the second player, and so on.

Words of Caution when Optimising

Performance optimisation are always a trade off. They increase the performance of a program at the expense of readability and maintainability (all the while attempting to maintain functional equivalence).

Keep the following tips in mind:

Always thoroughly unit test the code you’re optimising. Optimisations typically reduce readability and maintainability. Good unit testing can help to mitigate this.
Always use benchmarks to prove optimisation results. Don’t just guess that a change should improve performance. It’s okay to let intuition guide you, but always back it up with measurements.
After implementing a performance optimisation, ask yourself whether the trade off makes sense. This will depend on many things. What is the impact on readability and maintainability? What is the performance gain in percentage terms? Does the business application already perform well? Don’t fall for the sunk cost fallacy; it’s okay to abandon an optimisation if you don’t feel that the trade offs are worth it.

Happy optimising!

Github