2020-07-25
Some algorithms rely on a stream of random numbers in order to meet their (average) time complexity class. One such famous algorithm is Quicksort.
I recently worked on some performance optimisations for simplefeatures' R-Tree bulk loading algorithm. This bulk loading algorithm also utilises random numbers in order to meet its (average) time complexity class. It actually uses random numbers in a very similar way compared to Quicksort: to select random pivot indices to sort around.
This blog post shows one aspect of that performance work, namely optimising the generation of random numbers. Optimising random number generation allowed a 20% to 40% performance increase for R-Tree bulk loading (depending on what is loaded).
We’ll start out with the most obvious solution for generating random numbers, and then see what can be done to improve performance.
Each benchmark will generate random numbers in the range [0, 100)
i.e.
between between 0 (inclusive) and 100 (exclusive). There isn’t anything special
about this paricular range, but it is representative of the sorts of ranges
used for R-Tree bulk loading. Experimentally, I found that changing the range
doesn’t have an impact on performance.
First, let’s benchmark Go’s built in random number generator. This serves as a baseline, and was used in my initial R-Tree bulk loading implementation.
import (
"math/rand"
"testing"
)
var k = 100
func BenchmarkMathRand(b *testing.B) {
for i := 0; i < b.N; i++ {
rand.Intn(k)
}
}
This is already pretty fast (~18 ns per random number). It just goes to show that even fast operations can slow you down if they’re in a tight loop.
BenchmarkMathRand-4 56426527 17.8 ns/op
Note that we’re using the global rand.Intn
function, which internally uses a
shared random number source. That source is protected by a mutex, so we’re
locking and unlocking each time we get a new number. Let’s try using a local
*rand.Rand
instance rather than using the global rand.Intn
function.
func BenchmarkMathRandLocal(b *testing.B) {
rnd := rand.New(rand.NewSource(0))
b.ResetTimer()
for i := 0; i < b.N; i++ {
rnd.Intn(k)
}
}
That made quite a large difference, an improvement of ~40% over the original.
BenchmarkMathRandLocal-4 100000000 10.2 ns/op
But wait… We’re not measuring the cost of setting up the local *rand.Rand
instance because it occurs before the call to b.ResetTimer()
. This isn’t
really a fair test. To find out how expensive it is to set up a local *rand.Rand
instance, we can benchmark that operation in isolation.
func BenchmarkCreateLocalRand(b *testing.B) {
for i := 0; i < b.N; i++ {
rand.New(rand.NewSource(0))
}
}
It turns out the cost is significant! The startup cost for a *rand.Rand
is
approximately 3 orders of magnitude slower than generating each random number!
BenchmarkCreateLocalRand-4 149104 7868 ns/op
We can use some back of the envelope calculations to determine how many random
numbers we would have to generate from the local *rand.Rand
is order to break even
against using the global rand
functions.
Assume each random number from rand.Intn
takes ~18ns, and each random number
from a local *rand.Rand
instance takes ~10ns (with a setup of ~7,800ns).
18x = 10x + 7,800
8x = 7,800
x = 975
So we need to be generating around 1,000 random numbers per local *rand.Rand
instance before this approach even begins to break even.
We could attempt to use a hybrid algorithm if we know up front how many random numbers we’ll need. For the R-Tree bulk load algorithm, we would be able to work that out without too much difficulty. If the number of random numbers we need is low, we could use the global functions. If the number of random numbers is high, we could use a local instance.
This is getting finicky though… Hybrid algorithms need to be finely tuned, and the tuning would need to be updated for new hardware or compiler versions. This is way too much trouble except for the most specialised situations (and this is not one of them). I decided not to pursue this approach.
Because we don’t control the random number generation (it’s in the math/rand
package), we can’t modify it to make it faster. What if we generate the random
numbers ourselves?
One of the most basic random number generation algorithms is the Linear Congruential Generator. It works by generating a chain of random values, each based on the previous. For each new random number, the previous number is multiplied by a constant, and then another constant is added. This value is then reduced modulo another constant (otherwise is will increase forever).
RND_next := (RND_prev * CONST_MUL + CONST_ADD) % CONST_MOD
Note that Linear Congruential Generators as described above don’t emit high quality random numbers. That may or may not matter depending on your application. For my use case (choosing random pivots), it’s fine.
Implementing this generator is easy, so lets see how it performs. We choose
some “well known” constants, CONST_MUL=1664525
, CONST_ADD=1013904223
, and
CONST_MOD=2^32
. These constants are from Numerical
Recipes. The choice of modulo constant is
especially convenient, since it allows us to just store the state in a 32 bit
unsigned integer and ignore the modulo aspect entirely.
Be wary of Go compiler optimisations when benchmarking. If you’re not careful, the compiler can optimise everything away. An easy way to avoid this is to store the results in a global variable.
var result int
func BenchmarkLCG(b *testing.B) {
var cumulative int
var state int32
for i := 0; i < b.N; i++ {
state = state*1664525 + 1013904223
rndNum := int(state) % k
cumulative += rndNum // prevent optimising everything out
}
result = cumulative
}
This performance isn’t that much better compared to what we saw previously
when using a local instance of *rand.Rand
. But at least we don’t have to pay
the expensive setup cost.
BenchmarkLCG-4 152581359 7.95 ns/op
It’s hard to know where to go from here… We only have two lines of code left, and they aren’t doing much other than arithmetic operations.
We’ll need to use some profiling techniques in order to work out what part of the remaining code is the hotspot. If we generate a CPU profile while running the benchmark, the results are enlightening:
(flat)
. func BenchmarkLCG(b *testing.B) {
. var cumulative, state int
. for i := 0; i < b.N; i++ {
250ms state = (state*1664525 + 1013904223) % (1 << 32)
1.75s cumulative = state % k
. }
. result = cumulative
. }
Most of the execution time is spent taking the state and reducing it modulo k
in order to get the actual output random number (1.75s total). Generating the
new random state itself is relatively cheap (260ms total). This makes sense,
since modulo operations (along with divisions) are well known to be much more
expensive compared to additions and multiplications.
So does that mean we’re at a dead end? After all, we need a number between 0
and k
…
We don’t necessarily need to use a modulo operation to convert our random
number state into a number between 0 and k
. It’s possible instead to use a
multiplication and a shift, which are much cheaper operations.
The full code is:
var result int
func BenchmarkFastLCG(b *testing.B) {
var cumulative int
var state int32
for i := 0; i < b.N; i++ {
state = state*1664525 + 1013904223
rndNum := (int64(state) * int64(k)) >> 32
cumulative += int(rndNum) // prevent optimising everything out
}
result = cumulative
}
The benchmark results are very promising!
BenchmarkFastLCG-4 1000000000 1.04 ns/op
If you haven’t seen this trick before, (despite its simplicity) it can be a bit
hard to digest. It’s basically just a different way to split up state
into
k
different buckets.
An analogy may be useful:
Imagine dealing out a shuffled pack of 52 cards to
k
players. Exactly one of the cards in the deck is specially marked. Each player receives eitherfloor(52/k)
orceil(52/k)
cards (k
may not evenly divide 52). The player who receives the marked card wins. This is analogous to choosing a random number in the range[0, k)
(the player who wins) based on a random state in the range[0, 52)
(the marked card in the deck).The modulus approach is analogous to dealing one card at a time around the circle.
The multiply and shift approach is analogous to dealing
floor(52/k)
cards to the first player,floor(52/k)
cards to the second player, and so on.
Performance optimisation are always a trade off. They increase the performance of a program at the expense of readability and maintainability (all the while attempting to maintain functional equivalence).
Keep the following tips in mind:
Always thoroughly unit test the code you’re optimising. Optimisations typically reduce readability and maintainability. Good unit testing can help to mitigate this.
Always use benchmarks to prove optimisation results. Don’t just guess that a change should improve performance. It’s okay to let intuition guide you, but always back it up with measurements.
After implementing a performance optimisation, ask yourself whether the trade off makes sense. This will depend on many things. What is the impact on readability and maintainability? What is the performance gain in percentage terms? Does the business application already perform well? Don’t fall for the sunk cost fallacy; it’s okay to abandon an optimisation if you don’t feel that the trade offs are worth it.
Happy optimising!