Byron Wasti


Performance Pitfalls of Async Function Pointers (and Why It Might Not Matter)

I have been working on a distributed load testing framework for Rust called Balter, and one of the primary things it does is take a user-defined function and run it in a loop. In the world of non-async Rust, we could write a (greatly) simplified version of this with a normal function pointer.

// Balter code
pub fn load_test(user_func: fn()) {
    loop {
        user_func();
    }
}

// User Code
fn main() {
    balter::load_test(my_load_test_scenario);
}

fn my_load_test_scenario() {
    ...
}

What is handy about function pointers is that they're lightweight and flexible.

pub fn load_test(user_func: fn()) {
    for _ in 0..THREADS {
        thread::spawn(move || {
            loop {
                user_func();
            }
        });
    }
}

load_test(my_func_a);
load_test(my_func_b);
load_test(my_func_c);

But if we want the most bang for our buck when building a tool for load testing, we want to leverage Rust's async support. For example, in the context of load testing a web server we will be largely IO bound, and using Tokio to handle concurrency should give us far better performance.

The issue is that an async function pointer doesn't really exist. The above code can't be converted to support async by just throwing the async keyword in front of our functions. The following is not valid Rust.

pub async fn load_test(user_func: async fn()) {
    for _ in 0..TASKS {
        tokio::spawn(async move {
            loop {
                user_func().await;
            }
        });
    }
}

async fn foo() {
}

load_test(foo).await;

If we dig into the implementation details of how async works, we find that an async fn() in Rust is equivalent to a function returning a Future, fn() -> impl Future. Under the hood, the compiler generates a struct which implements the Future trait as the return value. While it might seem like our problem is solved, this is actually the root of the issue. Two async fn()s with the exact same signature are entirely different!

fn sync_foo() -> i32 {
    0
}

fn sync_bar() -> i32 {
    0
}

// No problem in the sync world
let arr: [fn() -> i32; 2] = [sync_foo, sync_bar];


async fn foo() -> i32 {
    0
}

async fn bar() -> i32 {
    0
}

// We can't do this!
// (even if we pretend we can use `impl Future` here)
let arr: [fn() -> impl Future<Output=i32>; 2] = [foo, bar];

Written this way it might be apparent why this is an issue. The impl Future<Output=i32> has to resolve to a concrete type at compile time, and the compiler generates two different opaque types for foo() and bar()'s return values. We can't treat the two function pointers for foo() and bar() the same, because they aren't the same.
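We can even get the compiler to confirm this. The following is a self-contained sketch (not Balter code) which asks for the type names of the two returned futures and checks that they differ:

```rust
use std::any::type_name;

async fn foo() -> i32 {
    0
}

async fn bar() -> i32 {
    0
}

// Recover the compile-time type name of a value.
fn type_name_of<T>(_: &T) -> &'static str {
    type_name::<T>()
}

fn main() {
    let f = foo();
    let b = bar();
    // Each async fn gets its own opaque future type, named after the
    // function that produced it; the two are not interchangeable.
    assert_ne!(type_name_of(&f), type_name_of(&b));
    println!("foo's future: {}", type_name_of(&f));
    println!("bar's future: {}", type_name_of(&b));
}
```

The printed names look something like `crate::foo::{{closure}}` and `crate::bar::{{closure}}`: two separate compiler-generated types, even though the function signatures are identical.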

The common solution for this problem is to Box::pin() the Future. A function which returns a Future trait object, while it may look ugly, does make "async function pointers" work (and we can always hide it behind a macro).

fn foo() -> Pin<Box<dyn Future<Output=i32>>> {
    Box::pin(async {
        // Our usual async code
        0
    })
}

fn bar() -> Pin<Box<dyn Future<Output=i32>>> {
    Box::pin(async {
        // Our usual async code
        0
    })
}

// This works now!
let arr = [foo, bar];

This is an unfortunate lack of symmetry between async fn and fn, but considering the implementation details it makes sense. An async fn() is a different type from every other async fn(), no matter what their signatures say!
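To see the boxed version working end to end, here is a dependency-free sketch. The hand-rolled no-op waker stands in for a real executor like Tokio, and only suffices because these toy futures resolve on their very first poll:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

type BoxedFut = Pin<Box<dyn Future<Output = i32>>>;

fn foo() -> BoxedFut {
    Box::pin(async { 1 })
}

fn bar() -> BoxedFut {
    Box::pin(async { 2 })
}

// A waker that does nothing: enough to poll futures that resolve
// immediately, which is all this demonstration needs.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Call each "async function pointer" once and sum the results.
fn run_all(fns: &[fn() -> BoxedFut]) -> i32 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut sum = 0;
    for f in fns {
        let mut fut = f();
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            sum += v;
        }
    }
    sum
}

fn main() {
    // foo and bar now share the concrete type fn() -> BoxedFut,
    // so they can live in the same slice.
    let total = run_all(&[foo, bar]);
    assert_eq!(total, 3);
    println!("{total}");
}
```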

Performance

[Update 2/16] Fixed black_box usage for normal function pointers. Thanks @c00t!

To measure the performance of our "async function pointers," I created a handful of benchmarks. For a baseline, I measured sync function pointer performance as well as a boxed function pointer, then a normal async function, and finally the fn() -> Pin<Box<dyn Future>> "async function pointer."

The exact benchmark code can be found on GitHub. For each case, I run a "load test" which calls the supplied function 250M times. The sync function pointer baseline looks like this:

use std::hint::black_box;

fn main() {
    load_test(black_box(foo));
}

fn load_test(func: fn(i32) -> i32) {
    for i in 0..250_000_000 {
        let _res = func(i);
    }
}

fn foo(arg: i32) -> i32 {
    black_box(arg * 2)
}

The benchmarks were run with release builds of the binaries, and timed using hyperfine with minimal system load (all browsers and heavy applications closed). Everything ran on a Framework laptop with an 11th Gen Intel i5, running NixOS.

hyperfine target/release/function-pointer
Benchmark 1: target/release/function-pointer
  Time (mean ± σ):     429.1 ms ±   7.0 ms    [User: 428.2 ms, System: 1.1 ms]
  Range (min … max):   418.9 ms … 436.7 ms    10 runs

hyperfine target/release/boxed-function-pointer
Benchmark 1: target/release/boxed-function-pointer
  Time (mean ± σ):     537.9 ms ±   2.5 ms    [User: 536.5 ms, System: 1.4 ms]
  Range (min … max):   536.1 ms … 544.0 ms    10 runs

hyperfine target/release/async-func
Benchmark 1: target/release/async-func
  Time (mean ± σ):     407.6 ms ±   3.6 ms    [User: 402.2 ms, System: 2.8 ms]
  Range (min … max):   403.7 ms … 411.6 ms    10 runs

hyperfine target/release/async-boxed-naive
Benchmark 1: target/release/async-boxed-naive
  Time (mean ± σ):      4.985 s ±  0.090 s    [User: 4.965 s, System: 0.004 s]
  Range (min … max):    4.922 s …  5.198 s    10 runs

source: https://github.com/byronwasti/async-fn-pointer-perf

It is unclear what explains the async-func timing; it could well be a constant startup cost from the Tokio runtime. What truly stands out is the async-boxed-naive run time: 5 seconds. That is a whole order of magnitude greater than every other case. Box::pin() is apparently really slow.

This makes some amount of sense: creating a Pin<Box<dyn Future>> requires some heavy lifting. We are putting the entire function state on the heap, where function pointers and plain async functions can keep it on the stack, and that state is larger to begin with (at least compared to a normal function pointer).

While it isn't surprising that Box::pin() is slower than a normal async call, the size of the gap is. Box::pin() is often recommended to get around various "async function pointer" issues, but rarely with the caveat "and it will be ~10x slower." Except is it? (I'll come back to this question)

Alternatives

Unfortunately, I have yet to find a single good "universal" alternative. Each option below is fairly specific to the situation at hand and involves trade-offs.

1. Push the Box::pin() to the Edge

Instead of using a fn() -> Pin<Box<dyn Future>> where we would naturally want the async function pointer, we can lift the Box::pin() as far up as possible. In practice this means using generics for any function call that might be in the hot path, and keeping Box::pin() on the cold path. The goal is for our effective "async function pointer" to encompass all of the code that runs the function it points to.

This approach is specific to only certain situations where lifting the Box::pin() is possible. In general, using generics has better performance than boxing, but comes with the trade-off of increased binary size (due to monomorphization).

For our working example of load testing, this would mean pushing all of the actual load_test functionality into the Box::pin(), such that rather than passing "async function pointers" into a load_test function, we instead have "async function pointers" of a load tested function. An example of this is below.

async fn load_test_wrapper(load_test: Pin<Box<dyn Future<Output=()>>>) {
    load_test.await;
}

async fn load_test<T, F>(func: T)
where T: Fn() -> F,
      F: Future<Output=i32>,
{
    loop {
        func().await;
    }
}

async fn foo() -> i32 {
    0
}

async fn bar() -> i32 {
    0
}

// This works; now each "async function pointer" includes
// the load test code which runs the function.
let arr = [Box::pin(load_test(foo)), Box::pin(load_test(bar))];

The performance of this alternative approach as compared to the others is below, labeled as async-boxed-invert.

hyperfine target/release/function-pointer
Benchmark 1: target/release/function-pointer
  Time (mean ± σ):     429.1 ms ±   7.0 ms    [User: 428.2 ms, System: 1.1 ms]
  Range (min … max):   418.9 ms … 436.7 ms    10 runs

hyperfine target/release/boxed-function-pointer
Benchmark 1: target/release/boxed-function-pointer
  Time (mean ± σ):     537.9 ms ±   2.5 ms    [User: 536.5 ms, System: 1.4 ms]
  Range (min … max):   536.1 ms … 544.0 ms    10 runs

hyperfine target/release/async-func
Benchmark 1: target/release/async-func
  Time (mean ± σ):     407.6 ms ±   3.6 ms    [User: 402.2 ms, System: 2.8 ms]
  Range (min … max):   403.7 ms … 411.6 ms    10 runs

hyperfine target/release/async-boxed-naive
Benchmark 1: target/release/async-boxed-naive
  Time (mean ± σ):      4.985 s ±  0.090 s    [User: 4.965 s, System: 0.004 s]
  Range (min … max):    4.922 s …  5.198 s    10 runs

hyperfine target/release/async-boxed-invert
Benchmark 1: target/release/async-boxed-invert
  Time (mean ± σ):     318.1 ms ±   1.2 ms    [User: 316.6 ms, System: 2.2 ms]
  Range (min … max):   317.1 ms … 320.9 ms    10 runs

What is confusing to me is why this ends up faster than the async-func case (which just calls async functions normally). If anything, this approach ought to be slower! I would love to hear theories as to why it gets a speedup.

2. Use an Enum

With an enum, we can fake an async function pointer to a fairly convincing degree. We enumerate each async function we would like to call as a variant, then define an impl on the enum which dispatches to those functions, and the result works much like a function pointer.

The tricky part of the enum solution is that building the enum requires knowledge of all the async functions at compile time. Unfortunately, unless stateful proc macros become a thing, this will be tricky for libraries which want to construct the enum for their users (without making the user gather all of their async functions into one spot).

async fn load_test(func: Func) {
    loop {
        func.run().await;
    }
}

async fn foo() -> i32 {
    0
}

async fn bar() -> i32 {
    0
}

enum Func {
    Foo,
    Bar,
}

impl Func {
    async fn run(&self) -> i32 {
        match self {
            Func::Foo => foo().await,
            Func::Bar => bar().await,
        }
    }
}

Results, labeled async-enum.

hyperfine target/release/function-pointer
Benchmark 1: target/release/function-pointer
  Time (mean ± σ):     429.1 ms ±   7.0 ms    [User: 428.2 ms, System: 1.1 ms]
  Range (min … max):   418.9 ms … 436.7 ms    10 runs

hyperfine target/release/boxed-function-pointer
Benchmark 1: target/release/boxed-function-pointer
  Time (mean ± σ):     537.9 ms ±   2.5 ms    [User: 536.5 ms, System: 1.4 ms]
  Range (min … max):   536.1 ms … 544.0 ms    10 runs

hyperfine target/release/async-func
Benchmark 1: target/release/async-func
  Time (mean ± σ):     407.6 ms ±   3.6 ms    [User: 402.2 ms, System: 2.8 ms]
  Range (min … max):   403.7 ms … 411.6 ms    10 runs

hyperfine target/release/async-boxed-naive
Benchmark 1: target/release/async-boxed-naive
  Time (mean ± σ):      4.985 s ±  0.090 s    [User: 4.965 s, System: 0.004 s]
  Range (min … max):    4.922 s …  5.198 s    10 runs

hyperfine target/release/async-boxed-invert
Benchmark 1: target/release/async-boxed-invert
  Time (mean ± σ):     318.1 ms ±   1.2 ms    [User: 316.6 ms, System: 2.2 ms]
  Range (min … max):   317.1 ms … 320.9 ms    10 runs

hyperfine target/release/async-enum
Benchmark 1: target/release/async-enum
  Time (mean ± σ):     526.5 ms ±   0.8 ms    [User: 524.2 ms, System: 2.6 ms]
  Range (min … max):   525.6 ms … 528.1 ms    10 runs

Only slightly slower than calling async functions directly, which is a reasonable trade-off.

3. Reset the Future

One niche and clever strategy is to reset the future. If we want to continuously rerun a given async function, and after every run we can "reset" the future back to its start, then we can reuse the Pin<Box<dyn Future>> we already have. This trick is used by Tower's rate-limiting middleware for its sleep Future, to avoid having to re-Box::pin().

pub struct RateLimit {
    ...
    sleep: Pin<Box<Sleep>>,
}

impl RateLimit {
    fn call() {
        ...

        // The service is disabled until further notice
        // Reset the sleep future in place, so that we don't have to
        // deallocate the existing box and allocate a new one.
        self.sleep.as_mut().reset(until);
    }
}

source: simplified snippet from Tower

Unfortunately this requires that you have control over the Future involved (and can implement a .reset() method on it), though I am curious whether this could be captured in a library that applies to any async fn... I have not built such a library, nor do I know how to do this for even a single async function, so, doubly unfortunate, I do not have any benchmarks for this approach. It was just too neat not to mention.
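To give a rough idea of what "control over the Future" buys you, here is a sketch using a hypothetical Countdown future (my own toy type, not Tower's actual implementation): because we wrote the poll state by hand, we can add a reset method that rewinds it in place, reusing the existing Box allocation.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// A toy future we fully control: yields Pending `remaining` times,
// then resolves.
struct Countdown {
    remaining: u32,
}

impl Future for Countdown {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.remaining == 0 {
            Poll::Ready(())
        } else {
            self.remaining -= 1;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

impl Countdown {
    // Rewind the future in place; the surrounding Box (if any) is
    // reused rather than deallocated and reallocated.
    fn reset(self: Pin<&mut Self>, remaining: u32) {
        // Countdown holds no self-references, so it is Unpin and we
        // can safely take a plain &mut to it.
        self.get_mut().remaining = remaining;
    }
}

fn main() {
    let mut fut = Box::pin(Countdown { remaining: 0 });
    // ...after the future has completed, rewind it for another run...
    fut.as_mut().reset(3);
    assert_eq!(fut.remaining, 3);
    println!("reset to {}", fut.remaining);
}
```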

Results

It's one thing to benchmark something at a small scale, but another to apply it at a larger scale. I initially used Box::pin() to get Balter off the ground, but always intended to remove it. Profiling async code can be tricky, so it wasn't clear to me what the performance implications were going to be; hence this testing. However, I still had to find out: does the seemingly poor performance of Box::pin() actually matter?

As far as I can tell, the answer is... maybe not? At least for Balter, using the first technique of lifting the Box::pin() out of the hot path has made no difference in benchmarks. This is surprising, as I was expecting at least some improvement given the small-scale results. With and without Box::pin(), the Balter framework's overhead caps out at ~750K TPS on my machine (measured by running the framework against a no-op function).

Balter isn't quite at the point where it is ready for proper profiling and benchmarking, and this all came about primarily out of curiosity. There are some clear suspects for performance improvements from the profiling I have done, and these could be drowning out the impact of Box::pin(). I was just so surprised by the small-scale benchmarks that I figured I would share!

As always with performance matters, it all comes down to profiling and measuring. Box::pin() seems slow but it might not matter for your use-case!

[Update 2/14] Thanks to @matthieum on Reddit for comments on additional strategies. I will be looking into those and updating the post accordingly with timing measurements!

[Update 2/16] Thanks to @coot on GitHub for filing the issue on the black_box usage that led to incorrect measurements for the normal function pointer cases.