STM Deadlocks (Ruby STM, Round 2.5)
(See also Ruby STM: Round Two and Ruby STM: First Round)
Dealing with deadlocks in my lock-based STM implementation has proven to be an interesting problem in three respects: detection, resolution, and unit testing. In case they benefit anyone else, here are the things I learned last night.
Deadlock Detection
While the Ruby interpreter has some built-in deadlock detection, it only catches deadlocks involving all live threads at once. Worse (for our purposes), it responds to deadlocks by unceremoniously terminating the program. When we encounter a transaction deadlock, we only want to force one of the involved transactions to retry, permitting the rest to continue. This means we’ll need to detect (and resolve) transactional deadlocks ourselves.
Fortunately, since we know what object a particular transaction is waiting on, and which transaction owns an outstanding lock on an object, detecting deadlocks is relatively easy: recursively follow the link from one transaction to the transaction which has locked the object it’s waiting upon. If we eventually encounter a transaction which is not waiting for an object, we know that this group of transactions can still make progress. However, if we discover we’re simply going in circles (e.g. because we encounter the same transaction a second time), we know there’s a deadlock and we can take action to break that cycle.
Essentially, what we’ve got is a dependency tree: the transactions waiting for the objects locked by a transaction are its children. (It’s worth noting that there isn’t necessarily just one dependency tree—unrelated groups of transactions will constitute their own separate trees.) Every transaction in such a tree must either complete or retry before its children can make any further progress—and if a loop is introduced in the tree (so it’s no longer a tree), no progress will be possible for any transaction involved. A deadlock.
Now, if we check for cycles every time before sending the current transaction off to wait, it makes cycle detection much easier. We have only to walk up the tree starting from the transaction we’re about to wait for to discover whether we see the current transaction in that tree (in practice, if it is, it’d be the top, since that’s the only possible position for a transaction which isn’t already waiting). If we do find it there, we have to take action to prevent a deadlock.
Resolving Deadlocks
My first attempt at deadlock resolution was to immediately retry the transaction which was about to create a deadlock. I was surprised to find that my deadlock-avoidance unit tests still didn’t pass. Maybe in retrospect I shouldn’t have been—what was happening was that the transaction was simply turning around and recreating the same deadlock before any other transaction had a chance to make progress.
There appear to be three options:
- use
Thread.passbefore retrying in order to permit other transactions to run - roll back the deadlock-inducing transaction and wait for another transaction to acquire one of the locks it had been holding before retrying it
- have the deadlock-provoking transaction steal the lock from its current owner, forcing that owner to retry
Ugly as it is, the last option is the only one which guarantees progress
will be made past the particular deadlock. At the moment I’m sticking
with Thread.pass since it’s enough to make the unit tests succeed and
in practice I don’t think Ruby’s scheduler will schedule the thread
calling Thread.pass again until the others have been run.
Concurrent Unit Tests
Writing unit tests for deadlocks (and indeed many concurrency issues) has been fun. You really want to be able to force a specific worst-case thread scheduling scenario, but in practice that’s proven extremely difficult. At the moment I’m simply adding sufficiently large sleeps to ensure that in most cases the desired thread wins (or loses) the race. To do it right, I’m going to have to work out some sort of approach that involves passing a token between threads to represent a transfer of control.
I’ve also found that I often need to have a watchdog thread running during each test which kills the other test threads if they don’t complete within a given safety period. This allows the test to fail in the case of livelock or deadlock, rather than hanging (or killing, if Ruby’s deadlock prevention kicks in) the entire test suite.