Asynchronous is not "Fire and Forget": Joiners not Quitters

In all the recent enthusiasm for threaded and asynchronous programming, I’ve noticed people missing something really important: if you fire off an asynchronous task, you will eventually need to wait for it to complete — and you’ll probably need to be able to cancel it as well.

At the most basic level, this is why Ruby’s Thread has instance methods such as Thread#join and Thread#kill (even if #kill, specifically, is such a blunt instrument that you should never use it in production code). For the sake of language and library neutrality, I’ll call these two sorts of operations join and cancel.

These are very important. Let’s begin by looking at join.

join

When one task joins another, it blocks (or defers its execution, if you prefer) until the other task completes. After the join completes, the joining task can be assured that the other task has really finished, that all of its side-effects have been accomplished, and that its result is available.

Essentially every thread or process API provides some sort of join operation, whether that is waitpid() for POSIX processes, pthread_join() for POSIX threads, java.lang.Thread.join for Java threads, trap_exit in Erlang, Thread#join in Ruby, or threading.Thread.join in Python.

Many of these methods also provide a way to get a status or result value from the thread. In Ruby, Thread#value does this.
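As a minimal sketch of join and result retrieval in Ruby: the helper name sum_in_background is made up for illustration, while Thread#join and Thread#value are the standard library calls mentioned above.

```ruby
# Hypothetical helper: computes a sum on a background thread.
def sum_in_background(numbers)
  Thread.new { numbers.sum }
end

worker = sum_in_background([1, 2, 3])

# Thread#join blocks until the worker has finished, then returns the thread.
worker.join

# Thread#value also waits for completion (already done here) and
# returns the result of the thread's block.
puts worker.value # prints 6
```

Calling Thread#value on its own would be enough here, since it joins implicitly; the explicit join is only shown to make the two operations visible.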

In some cases the notion of joining is less explicit, but still present. For example, to join another Erlang process you would take advantage of the trap_exit flag and wait to receive a message indicating that the process in question has exited. In the context of a node.js program, you will usually set up a callback function to be invoked when the asynchronous operation completes. This callback frequently ends up serving as the continuation of the logical task that set it up. It’s all the same basic pattern.

Why is this important? Let’s look at what happens when API designers drop the ball.

The cake is a Lie

My first example has to do with CoffeeScript’s cake utility. cake is similar to tools like Ruby’s rake, and of course make, all of which let developers define a set of named tasks (typically refreshing automatically generated files) which can easily be invoked from the command line. In the case of rake and make, tasks which depend on other tasks are handled by explicitly indicating the dependency in a declarative way:

# make
foo:
	@echo Make ALL THE THINGS

bar: foo
	@echo Use ALL THE THINGS

# rake
# Note: desc applies to the task defined immediately after it.
desc "Make all the things"
task :foo do
  puts "Make ALL THE THINGS"
end

desc "Use all the things"
task :bar => [:foo] do
  puts "Use ALL THE THINGS"
end

In the case of cake this has to be handled a little more explicitly:

# cake
task "foo", "Make all the things", ->
  console.log("Make ALL THE THINGS")

task "bar", "Use all the things", ->
  invoke "foo"
  console.log("Use ALL THE THINGS")

This is probably fine in itself, since cake doesn’t aspire to support the sort of file dependency handling that rake and make do. However, while logging to the console is synchronous, and node provides some synchronous APIs for working with the filesystem, what happens if the "foo" task ends up needing to do something asynchronous? Most node APIs are asynchronous by design, after all.

If makeAllTheThings were an asynchronous operation which took a callback, you’d expect to be able to do something like this:

task "foo", "Make all the things", (options, cb) ->
  makeAllTheThings(cb)

task "bar", "Use all the things", ->
  invoke "foo", ->
    useAllTheThings()

But nope. Chuck Testa. cake’s invoke doesn’t even accept a callback argument. The best you could do without breaking out of the cake DSL would be something like this:

task "foo", "Make all the things", ->
  makeAllTheThings(->)

task "bar", "Use all the things", ->
  invoke "foo"
  useAllTheThings()

But (even if it happens to work on your machine most of the time) this doesn’t guarantee that all of the things will have been made before you try to start using them. What joining does for you is establish a constraint: none of the following things should begin before this task is done.

There’s an important insight to be had here: Strictly sequential code is typically overconstrained, unnecessarily waiting for one thing to complete before moving on to the next, even if the next doesn’t really care about its result. This doesn’t cause bugs, but it does result in missed opportunities for performance. Asynchronous code is underconstrained by default, letting you take full advantage of opportunities for parallelism (in I/O if not in computation), but you have to explicitly add some sequencing constraints back in if you want things to work reliably where order matters.
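To make the over/underconstrained distinction concrete, here is a rough sketch using Ruby threads rather than node callbacks. The functions fetch_users, fetch_orders and build_report are hypothetical stand-ins, and the sleep calls merely simulate slow I/O.

```ruby
# Hypothetical stand-ins for two slow, independent I/O operations.
def fetch_users
  sleep 0.1
  ["alice", "bob"]
end

def fetch_orders
  sleep 0.1
  { "alice" => 2, "bob" => 5 }
end

def build_report
  # Neither fetch depends on the other, so both are started at once.
  users_task = Thread.new { fetch_users }
  orders_task = Thread.new { fetch_orders }

  # The report needs both results, so this is where the ordering
  # constraint is added back: Thread#value joins each task.
  users = users_task.value
  orders = orders_task.value

  users.map { |name| "#{name}: #{orders.fetch(name, 0)}" }
end

puts build_report
```

A strictly sequential version would call fetch_users and then fetch_orders, paying for both waits back to back. Dropping the two value calls would remove the waits entirely, along with any guarantee that the data exists when the report is built.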

In the end, with cake’s existing API, you have to do something like this:

foo = (options, cb) ->
  makeAllTheThings(cb)

bar = (options) ->
  foo options, ->
    useAllTheThings()

task "foo", "Make all the things", foo

task "bar", "Use all the things", bar

In order to safely compose asynchronous tasks, you have to step outside the boundaries of the task machinery altogether; you can’t compose them the same way synchronous tasks can be. API fail.

(To be fair to Jeremy, this is more of a node fail than a cake fail; there isn’t a completely nice way to make this work with invoke. Ideally what you would want is for invoke to automatically join any asynchronous things spawned by a task, the same way a top-level task invocation does, but there’s no node API which would permit implementing that. invoke is pretty much a promise that can never really be kept; it was almost omitted from the cake API altogether, and the above non-DSL approach is probably the recommended one.)

In general, if you fire off an asynchronous task and never join it anywhere (note that node/cake themselves wait for tasks before exiting), then you’re doing something wrong. If you care about the result of a task’s computation, then someone eventually needs to wait for the task to complete in order to be sure that it’s finished computing everything it’s supposed to. (If you don’t care about the task’s result, then why are you sparking it in the first place?)

This is especially important with unit tests. Any test involving an asynchronous implementation needs to have a way to join every important asynchronous operation the code under test sparks off before it can check assertions. Otherwise, you can easily end up writing tests which happen to pass on your development machine (most of the time), but intermittently fail in CI after they’ve already landed, perhaps when someone else’s legitimate changes happen to arbitrarily influence the timing.
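One way to give tests something to join is for the asynchronous code to keep track of what it has started. The sketch below assumes a made-up AsyncLogger class with a flush method as its join point; it is an illustration of the idea, not a pattern taken from any particular library.

```ruby
# Hypothetical code under test: records messages on background threads.
class AsyncLogger
  def initialize
    @lines = []
    @lock = Mutex.new
    @workers = []
  end

  # Starts the write asynchronously and remembers the thread for later joining.
  def log(message)
    @workers << Thread.new do
      sleep 0.05 # simulate slow I/O
      @lock.synchronize { @lines << message }
    end
  end

  # Join point: returns only after every pending write has finished.
  def flush
    @workers.each(&:join)
    @workers.clear
  end

  # Returns a snapshot of the messages written so far.
  def lines
    @lock.synchronize { @lines.dup }
  end
end

logger = AsyncLogger.new
logger.log("hello")
logger.flush # without this join, the check below would race the worker
raise "expected the line to be written" unless logger.lines == ["hello"]
puts "ok"
```

Without the flush call, the check depends on whether the worker thread happens to finish first, which is exactly the kind of test that passes locally and fails intermittently elsewhere.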

These kinds of issues can be much worse when the fail is buried several API layers deep. In fact, this is a common architectural problem in asynchronous libraries in general. (Twisted Python seems to be a particularly bad offender. For example, the built-in name resolver will fire off requests and simply leave them out there, sometimes causing trial’s “reactor clean” check to fail in tests, even if you’re just resolving “localhost”.)

Next time: Cancel THIS!