Work on supporting async/await in Rust continues to progress rapidly. I’m hoping to write a retrospective on everything that happened in 2018 in a few weeks. Right now we’re closing in on an important milestone: stabilizing the futures API that will be used to interact programmatically with asynchronous computations. The biggest remaining area of work is the design of the waker API, an essential but somewhat opaque part of how our asynchronous programming system works. I want to take a look at this API and try to make it a bit clearer, so the design decisions regarding the API become clearer.
The poll/wait/wake cycle
A running programming using async/await correctly involves three fundamental components:
- The bulk of the program consists of futures, which are a kind of pause-able computation. In general, the end user’s code is will consist of futures, which they will write as if they were normal functions using async/await syntax.
- At the “top” of the program is the executor. The executor schedules futures by polling them when they are ready to make progress.
- At the “bottom” of the program, the futures depend on event sources. (in the case of async IO, the source of IO events is often called the reactor). The event source wakes the executor when an event occurs that will allow a future depending on it to make progress.
Once a future is spawned onto an executor, that future gets executed to completion using a three phase cycle:
- Poll: The executor polls the future, which computes until it reaches a point at which it can no longer make progress.
- Wait: The reactor or event source registers that this future is waiting
on an event to happen. The future has returned
Poll::Pendingand the event source is now tracking that it will need to wake this future when that event is ready.
- Wake: The event happens, and the future is woken up. It is now up to the executor to schedule the future to be polled again.
At a high level you can think of the executor as managing the program’s compute resources (scheduling futures that are awake and ready to be polled) and the reactior as managing the program’s IO resources (waiting on IO events and waking futures when they are ready). The executor and the reactor form the two halves of what in most asynchronous computing systems is called the “event loop.” One of the great things about Rust’s design is that you can combine different executors with different reactors, instead of being tied down to one runtime library for both pieces.
Requirements on the Waker API
The way that the executor and the event sources coordinate waiting and waking is the Waker API. When the Future is polled, the executor passes it a waker, which an event source will register and eventually wake. Each of the three phases of the cycle put additional pressure on the API, making it a very tricky API to get right.
What does the “poll” phase require?
During the poll phase, very little is done with the waker: the value passed to the future is just passed down, further and further, until it reaches an actual source of events, which will register it (beginning the “wait” phase). The only requirement that the poll phase introduces is dynamism: because it needs to be passed through arbitrary futures, the waker type cannot be generic. This means that every requirement introduced by the other two phases needs to be dynamically dispatched.
Rust has support for relatively easy dynamic dispatch using trait objects, but
because of the rules of object safety, this easy form of dynamic dispatch is
often quite limited. Indeed, we’ll find its too limited to support our use
case, which is why all of the API proposals have a “Waker” type, instead of
just using references to
dyn Wake trait objects.
What does the “wait” phase require?
The wait phase is, from the perspective of the waker, very simple: the event
source registers that waker is waiting on an event, and then does nothing until
the event occurs (beginning the “wake” phase). This introduces one additional
requirement: the waker type must implement
The reason for this is straightforward: when an event source registers that this future will be waiting on an event, it must store the waker so that it can later call wake to begin the wake phase. In order to introduce concurrency, its pretty essential to be able to wait on multiple events at the same time, so its not possible for the waker to be uniquely owned by a single event source. As a result, the Waker type needs to be cloneable.
This is where the API immediately starts getting tricky to design, because it
interacts with the previous requirement that the waker’s methods be dynamically
Clone trait itself is not object safe, because it returns the
Self type. What the waker implementations need is a function from
What does the “wake” phase require?
The final phase is the phase in which the wakers really do all of their work. Once the event we’re waiting on has happened, the event source calls the wake method. The wake method is implemented by each executor, and contains the logic for setting up this future to be polled again by the executor. It turns out there are several ways to implement this, and we’d like to be able to support all of them.
- Using an &‘static AtomicBool: In this implementation, the executor can only run one task at a time. When its time to wake that task, a global flag is tripped, and then the task will be polled again via a sidechannel. This implementation does not make sense for most use cases, but it is actually being used by some users on embedded platforms with extremely minimal resources. The waker is represented as a reference to the global flag.
- Using Task IDs: In this implementation, the executor stores a global set
of tasks that it is current polling in some sort of associative map. When it
is time to wake a task, the executor is given the ID for that task in order
to tell it which task is ready to be polled. The waker is represented as this
task ID (in effect, the waker’s data is a
- Using reference counting: In this implementation (which has become the predominant implementation), the executor is just one or more queue of tasks that it will fetch from and poll as soon as they’re ready. The waker is itself a reference counted pointer to a particular task, and when it is time to wake, it puts itself onto the executor’s queue. In this case, the waker is represented as a reference counted pointer.
All three of these implementation strategies ultimately represent the waker in the same data: an arbitrary, address-sized integer (in the first and third cases, they are addresses, whereas in the middle case, they’re just indexes). However, the third case also requires the ability to run dynamic code on clone and drop, in order to handle the reference count properly.
The requirement to dynamically support this wide set of implementation
strategies is why you see functions like
drop_raw in the
various APIs for wakers.
Waking and threading
The final tricky bit of the wake phase is how to handle threading. In particular, some implementations of executors would prefer to handle waking from the same thread the waker was created on differently from waking from a different thread. There are two variations on this idea:
- Some executor implementations can have an optimized path when they are woken from the same thread. They support waking from any thread, but if you wake from the same thread, there is a path they can take which would represent a nontrivial performance improvement.
- Some executor implementations don’t want to support multithreading at all. These executors would be bundled with their reactor and share state with it, running all the futures concurrently but on the same operating system thread. These executors don’t want to pay costs associated with multithreading at all.
A lot of the consternation in designing the Waker API has centered around how
to handle these use cases. It’s the reason that there is currently a
distinction in the API between a
LocalWaker type and a
That’s as good an overview as I can give of the requirements that the waker API needs to solve. Hopefully this context can help more people follow the discussion going forward.
Astute readers may notice that I have not actually outlined any conrete API in this post: neither the one implemented in nightly, nor the one proposed in the current RFC, nor a third alternative proposal. Even more astute readers may notice that this post is labeled as being the first in the series.
In the next post in the series, I’m going to drill into the final issue I mentioned: threading. In the second post in this series, I’ll try to lay out what I believe is the most prudent and well balanced way to handle single-thread optimizations.