Atomic Polling Intervals for Highly Concurrent Workloads

One of the problems I've run into while developing Balter is figuring out how often you can poll atomics without affecting data accuracy. For context, Balter is a distributed load testing framework; under the hood it spawns a bunch of concurrent tasks which run the user's load test scenario, as well as a Scenario Coordinator task which polls the data collected via atomics. For the purposes of this post, we'll just consider Transactions Per Second (TPS).[1]

[Figure: tps-sampling: concurrent transaction tasks incrementing a TPS atomic, polled by the Scenario Coordinator]

If you aren't familiar with atomics, they are primitive types which provide synchronization across threads at a hardware level, and are a core building block for many concurrent applications. For instance, Rust provides an atomic version of the i32 integer type, AtomicI32, which can be shared across threads and updated in a lock-free manner (avoiding the use of a Mutex).

// Both provide a shared, mutable integer, but the atomic
// version is lock-free and much faster than the Mutex.
use std::sync::atomic::AtomicI32;
use std::sync::{Arc, Mutex};

let mutex_foo = Arc::new(Mutex::new(10));
let atomic_foo = Arc::new(AtomicI32::new(10));

In Balter, after each transaction we increment an atomic using the .fetch_add() method, providing an Ordering. The Ordering specifies the kind of memory synchronization required; for this use case we can use the most lenient option, Ordering::Relaxed.

// Run the user's transaction, then record it.
some_transaction().await;
ATOMIC.fetch_add(1, Ordering::Relaxed);

On the Scenario Coordinator side we have a loop which runs on an interval and polls and resets the atomic:

let mut last = Instant::now();
loop {
    interval.tick().await;
    // Use the measured elapsed time, not the nominal interval.
    let elapsed = last.elapsed().as_secs_f64();
    last = Instant::now();
    let value = ATOMIC.swap(0, Ordering::Relaxed);
    let tps = value as f64 / elapsed;
}

The question is: how often can the Scenario Coordinator loop run?

The reason we want to poll as often as possible is to speed up how fast Balter operates. Currently, Balter's control loops for error rate matching and latency matching are slow. By polling more often, the control loops would run more often, and Balter would converge faster on a specified error rate or latency. However, we don't want to feed the control loops bad data, which could lead to instability.

Atomic Experiments

In order to understand the constraints involved in the polling intervals for atomics, I set up a repository with code which runs a simple test. It spawns 100 concurrent tasks, each of which runs a simple loop that waits for 10ms (with noise added)[2] and then increments two atomics: a baseline atomic and a test atomic. The main function then polls the baseline atomic every 1s (to establish what we should see) and polls the test atomic at varying intervals.

// Task loop
loop {
    tokio::time::sleep(delay_with_noise).await;
    TEST_ATOMIC.fetch_add(1, Ordering::Relaxed);
    BASELINE_ATOMIC.fetch_add(1, Ordering::Relaxed);
}

// Sampling loop
tokio::join! {
    async {
        loop {
            // Always 1s
            baseline_interval.tick().await;
            let baseline = BASELINE_ATOMIC.swap(0, Ordering::Relaxed);
        }
    },
    async {
        loop {
            // Varies
            test_interval.tick().await;
            let test = TEST_ATOMIC.swap(0, Ordering::Relaxed);
        }
    }
};
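
For completeness, here is a minimal sketch of how the task loop's jittered delay might be generated. The uniform ±2ms noise is an assumption on my part, as is the delay_with_noise() helper itself; the actual distribution lives in the linked repository.

use rand::Rng;
use std::time::Duration;

// Hypothetical helper: a 10ms base delay plus up to ±2ms of
// uniform noise, so that the increments don't align in time.
fn delay_with_noise() -> Duration {
    let base_us: i64 = 10_000; // 10ms in microseconds
    let noise_us: i64 = rand::thread_rng().gen_range(-2_000..=2_000);
    Duration::from_micros((base_us + noise_us) as u64)
}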

With this setup we can measure the effect of the atomic polling interval by directly comparing measurements taken at the test interval against a baseline taken in parallel (which minimizes external variability).
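
The Error column in the tables below is, as I understand it, the relative difference between the test atomic's mean TPS and its parallel baseline's mean TPS; the exact formula is my assumption:

// Relative error (in percent) of the test mean TPS
// versus the baseline mean TPS measured in parallel.
fn relative_error(test_mean: f64, baseline_mean: f64) -> f64 {
    ((baseline_mean - test_mean) / baseline_mean).abs() * 100.0
}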

Interval    Mean TPS (μ)    Std (σ)    Error
1s          6961.83         21.28       0.00%
500ms       6972.42         19.43       0.11%
200ms       6945.46         29.87       0.40%
50ms        6749.78         130.69      2.98%
10ms        5965.96         428.99     14.48%

The fact that we get more noise when polling more often is not surprising, but what is surprising is just how far off the mean TPS is for the 10ms interval. Initially I thought it had to do with how close the polling interval was to the period between atomic updates (both being 10ms). However, repeating the experiment with a 10μs delay showed a similar issue (though the error was lower):

Interval    Mean TPS (μ)    Std (σ)     Error
1s          86135.64        1052.31     0.00%
500ms       85767.45        792.22      0.12%
200ms       85747.23        1017.96     0.33%
50ms        84122.80        2059.60     2.46%
10ms        80539.52        5925.21     5.91%

It would seem that the smaller the polling interval, the higher the error in the measured TPS. I decided to plot the relationship between the polling interval and the measurement error, across a wide range of concurrent task counts, per-task latencies, and polling intervals.
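
[Figure: measurement error vs. polling interval, for varying numbers of concurrent tasks and per-task latencies]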

As we can see from the graph above, the error is closely correlated with the polling interval: as the polling interval gets smaller, the error in measurement increases exponentially. In fact, this type of curve is reminiscent of a common issue in concurrent applications: contention.

Lock-Free != Contention Free

The reason we use atomics in concurrent applications is to avoid locking, which can lead to high rates of contention when too many threads try to access the same value. Atomics would seem to avoid this issue, but there is no free lunch when it comes to memory synchronization: atomics still run into contention.

When we poll the atomic with a call to .swap(0, Ordering::Relaxed), the core performing the swap has to take exclusive ownership of the cache line holding the value, invalidating every other core's copy. Additionally, before the value is returned, every preceding fetch_add() must have been applied to it. The .swap() call is a linearization point; this has overhead and leads to contention if called too often.
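
To see this effect in isolation, here is a rough micro-benchmark sketch (this is not the code from the repository, and the thread and iteration counts are arbitrary) in which several threads hammer one atomic with fetch_add() while a reader drains it with swap() in a tight loop:

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Instant;

static COUNTER: AtomicU64 = AtomicU64::new(0);

fn main() {
    const WRITERS: u64 = 8;
    const INCREMENTS: u64 = 1_000_000;

    // Writers: increment the shared counter from many threads.
    let handles: Vec<_> = (0..WRITERS)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..INCREMENTS {
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // Reader: drain the counter with swap() in a tight loop,
    // forcing the cache line to bounce between the reader's
    // core and the writers' cores on every call.
    let start = Instant::now();
    let mut total = 0;
    while total < WRITERS * INCREMENTS {
        total += COUNTER.swap(0, Ordering::Relaxed);
    }
    println!("drained {} increments in {:?}", total, start.elapsed());

    for handle in handles {
        handle.join().unwrap();
    }
}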

I believe the .fetch_add() does not lead to the same contention (despite being called far more often and from more threads) because it is monotonic (always increasing). Unfortunately I can't find many details on how atomics are handled at the hardware level, but in general monotonic operations are easy to deal with in concurrent systems, since the ordering doesn't matter, so it is not too surprising that there is minimal contention for this call.

What isn't particularly clear to me is where the contention takes place. Since the baseline atomic and the test atomic are both incremented in the task loop, and the baseline atomic is unaffected by the contention, it would seem that the contention is entirely in the .swap() call. How this works at the hardware level, with fetch_add() unaffected but .swap() affected, would be interesting to learn about, but I haven't found a good source for this information.

Next Steps

It's clear that polling atomics faster is not a viable strategy for speeding up Balter's control loops. Moving forward, I will be investigating alternative control algorithms to make Balter faster, and I plan to do a similar deep dive for that work in the near future.

Hopefully this little deep dive into atomic polling intervals is useful for anyone else who happens to have a similar problem!


[1] This situation is complicated a bit further by the presence of latency data, which can't be collected directly via atomics, but that will be for another blog post.

[2] Running the experiment with noise mimics reality better, but it also avoids data artifacts which occurred when the atomic increments "aligned" in time without noise.