Using Go’s RWMutex recursively is a bug

2022-10-14

A previous blog post made a few claims about recursive usage of Go’s RWMutex. In particular:

Recursive usage is always a bug.
This class of bug is hard to detect via testing.
The bug may lay dormant for a long time before causing problems.

These claims were used as part of a justification for preferring Mutex over RWMutex as a default choice and only using RWMutex after having explicitly calculated that RWMutex would give a performance improvement.

This post will dive into the details. It will explain the mechanism behind this bug class and how it may occur during real-world software engineering activities.

Regressions always start with a correct program

The example used throughout this post will be a small Go type that models a financial account ledger. It simplifies a real account ledger model but is sufficient to show how recursive usage of RWMutex can occur. It contains:

A list of credit and debit entries, each of which are slices of ints.
An RWMutex to control thread-safe read-only and read-write access to the entries.

The type definition is:

type AccountLedger struct {
	mu      sync.RWMutex
	debits  []int
	credits []int
}

It has a method that updates the account ledger by adding new credit and debit entries. The mutex is held in read-write mode during the execution of the method.

func (a *AccountLedger) AddEntries(debits, credits []int) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.debits = append(a.debits, debits...)
	a.credits = append(a.credits, credits...)
}

It has another method that calculates the net balance of the account ledger. The net balance is just the difference between the sum of the credits and the sum of the debits. The mutex is held in read-only mode during the execution of this method.

func (a *AccountLedger) sumDebits() int {
	var sum int
	for _, d := range a.debits {
		sum += d
	}
	return sum
}

func (a *AccountLedger) sumCredits() int {
	var sum int
	for _, c := range a.credits {
		sum += c
	}
	return sum
}

func (a *AccountLedger) Net() int {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.sumDebits() - a.sumCredits()
}

Everything in this program works well! There are no bugs here. Some unit tests can be written to show how the type behaves. First, its basic usage can be tested by adding some entries to the account ledger and then calculating the net balance and ensuring it’s the right value.

func TestNet(t *testing.T) {
	var ledger AccountLedger
	ledger.AddEntries([]int{10, 7}, []int{9})
	got := ledger.Net()
	const want = 8 // 10 + 7 - 9
	if got != want {
		t.Errorf("got %d want %d", got, want)
	}
}

Since the type is thread-safe, some tests should be written that call the methods concurrently to ensure everything works correctly. The new test adds more entries to the account ledger while concurrently calculating the net balance. Because the ordering of the method calls is non-deterministic, the test passes if the net balance is one of two values.

func TestNetConcurrent(t *testing.T) {
	var ledger AccountLedger
	ledger.AddEntries([]int{10, 7}, []int{9}) // Net is 8

	var wg sync.WaitGroup
	wg.Add(2)

	go func() {
		// Added entries net to -1.
		defer wg.Done()
		ledger.AddEntries([]int{6, 3}, []int{10})
	}()

	var got int
	go func() {
		defer wg.Done()
		got = ledger.Net()
	}()

	wg.Wait()

	// The call to Net() should occur either before or after the second call to
	// AddEntries(...), so got should either be 8 or 7, depending on the
	// ordering.
	if got != 7 && got != 8 {
		t.Errorf("got was %d but expected 7 or 8", got)
	}
}

Changing software can introduce bugs

Software requirements change over time. Imagine there is a desire to calculate the sum of the credits and the sum of the debits separately. Because the Net method was implemented using helper methods, similar methods already exist. All that needs to be done is export them and ensure the lock is held in read-only mode during execution.

The diff looks like this:

@@ -19,3 +22,5 @@

-func (a *AccountLedger) sumDebits() int {
+func (a *AccountLedger) SumDebits() int {
+       a.mu.RLock()
+       defer a.mu.RUnlock()
        var sum int
@@ -27,3 +32,5 @@

-func (a *AccountLedger) sumCredits() int {
+func (a *AccountLedger) SumCredits() int {
+       a.mu.RLock()
+       defer a.mu.RUnlock()
        var sum int

Some new unit tests for the newly exported methods can now be written (these are omitted due to not being very interesting). The existing unit tests still pass, so high fives are exchanged, and the new code is pushed out to production.

Tracking down concurrency bugs is hard

Eighteen months later, the software starts locking up. When it initially starts up, it works fine for a while. But then something happens, and all of the methods on the AccountLedger start blocking indefinitely.

Strangely, the unit tests have never failed. Perhaps the goroutines are being interleaved slightly differently in the production environment compared to the unit tests? The way goroutines are interleaved is an implementation detail of Go’s scheduler, so it shouldn’t be surprising if things act slightly differently in different scenarios.

One way to reproduce the failure is to sprinkle in some calls to runtime.Gosched(), which artificially yields control back to Go’s scheduler, allowing other goroutines to be scheduled instead. This is a general debugging technique for concurrency-related bugs because it can help change how goroutines interleave with each other.

It turns out that only one call to runtime.Gosched() is needed to reproduce the deadlock via the existing unit test suite.

 func (a *AccountLedger) Net() int {
        a.mu.RLock()
        defer a.mu.RUnlock()
+       runtime.Gosched()
        return a.SumDebits() - a.SumCredits()
 }

The tests now time out:

$ go test
panic: test timed out after 10m

What caused the bug?

The documentation for sync.RWMutex states that it shouldn’t be used recursively (emphasis mine):

If a goroutine holds a (sic) RWMutex for reading and another goroutine might call Lock, no goroutine should expect to be able to acquire a read lock until the initial read lock is released. In particular, this prohibits recursive read locking. This is to ensure that the lock eventually becomes available; a blocked Lock call excludes new readers from acquiring the lock.

That’s exactly what the code does though! The lock is acquired in read-only mode by calling RLock() at the start of the Net method. It’s then reacquired a second time in the SumDebits method. Another goroutine could simultaneously call the AddEntries method, which acquires the lock in read-write mode by calling Lock().

The following sequence diagram shows a goroutine ordering that could cause the deadlock to occur in the unit test:

sequence diagram

The bug can be fixed by restructuring the code so that the mutex is acquired in read-only mode only once per call to an exported method. There are many ways to do this, each with its own trade-offs. The details are not explored in this post.

Would `Mutex` have prevented the bug?

Unfortunately, using a regular Mutex instead of an RWMutex wouldn’t have prevented this bug. The Mutex type doesn’t support recursive locking either (i.e. cannot be locked multiple times by the same goroutine).

However, if Mutex were used, then the bug would have been trivial to catch:

Even the non-concurrent unit test would have caught the problem. Calls to the Net method would block forever, so the test would just time out and fail.
Even if there were no testing, the program would have locked up the first time the Net method was called in production. In most circumstances, this is better than having a dormant bug.