On the Effectiveness of Retries and Timeouts in Distributed Systems
One of the adages that has stuck with me from my relatively brief employment at AWS is that every single network call has to have retries and timeouts. This can even be seen in the AWS CLI and SDKs, which all have retries (3 by default) and timeouts baked in.
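For a concrete illustration, here is roughly how those knobs look when set explicitly with boto3 in Python; the specific values are my own picks for illustration, not the SDK's documented defaults:

```python
import boto3
from botocore.config import Config

# Illustrative values: cap the number of attempts and bound how long we
# wait to connect and to read a response, rather than leaning on defaults.
config = Config(
    retries={"max_attempts": 3, "mode": "standard"},
    connect_timeout=5,   # seconds to establish the connection
    read_timeout=10,     # seconds to wait for a response
)

s3 = boto3.client("s3", config=config)
s3.list_buckets()  # transient/throttling errors are retried automatically
```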
On the surface this seems like a reasonable policy: retries are handy in case of arbitrary network failures, and timeouts are a nice-to-have. However, I would argue that retries and timeouts by default, across the entire system, are unreasonably effective at mitigating a number of the challenges of developing a large-scale distributed system.
Retries by themselves have a number of nice properties that are easy to see: temporary network failures are mitigated (just retry); temporary service outages are mitigated, even nodes crashing mid-request (just retry); they act as a forcing function for idempotency everywhere (if every client retries, requests must be idempotent); and, combined with exponential back-off, they soften the effects of rate limiting (trading failed requests for increased latency).
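A minimal sketch of the client-side pattern, in generic Python (the function and parameter names are placeholders, not from any particular SDK):

```python
import random
import time

def call_with_retries(do_request, max_attempts=3, base_delay=0.1, max_delay=5.0):
    """Retry an idempotent request with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return do_request()
        except (ConnectionError, TimeoutError):  # placeholder transient errors
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure to the caller
            # Exponential backoff with full jitter spreads retries out, so a
            # burst of failures doesn't turn into a synchronized stampede.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```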
Timeouts add a system-wide backpressure mechanism. The majority of services can be approximated as a priority queue of asynchronous requests to be processed. As load increases, the number of queued requests grows and the latency of any given request creeps up. Timing out slow requests sheds load from overloaded nodes and, combined with retries, gracefully rebalances it onto other nodes. Assuming exponential back-off, this also buys time for a service to scale horizontally when under pressure. The effect propagates outwards from the point of overload: calling services get stuck holding more requests, and they may in turn time out, scale up, and so on.
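To make the rebalancing behaviour concrete, here is a hedged sketch in plain Python using the requests library; the replica endpoints and numbers are hypothetical:

```python
import requests

REPLICAS = [  # hypothetical endpoints for the same service
    "http://replica-a.internal:8080",
    "http://replica-b.internal:8080",
    "http://replica-c.internal:8080",
]

def get_with_failover(path, per_attempt_timeout=2.0):
    """Try each replica in turn. A request that exceeds the per-attempt
    timeout is abandoned and retried against the next node, shedding load
    from the overloaded replica instead of queueing behind it."""
    last_error = None
    for base in REPLICAS:
        try:
            return requests.get(base + path, timeout=per_attempt_timeout)
        except (requests.Timeout, requests.ConnectionError) as err:
            last_error = err  # this node is slow or down; move on
    raise last_error
```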
There are aspects to be aware of with this approach. If a client's timeouts are lower than a service's reasonable response time, you may see increased load as requests are unnecessarily retried. Ensuring retries and timeouts are properly tuned is important, and having clients developed by the service owners helps ensure that this is the case. Still, the benefit of trivially getting system-wide backpressure and robustness to network and node failure is well worth it, especially compared to the complexity of other approaches (looking at you, message queues).
Example: Kubernetes Readiness Checks & Keepalives
Kubernetes (k8s) has a variety of health checks; the one relevant to this example is the readiness probe. The readiness probe determines whether a Pod is reachable through the Service interface: on failure, the Pod is removed from matching Service endpoints. In theory this is useful for temporarily taking a Pod out of production traffic while it recovers from an error that takes time to resolve (e.g. restarting a connection pool).
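For reference, a readiness probe is declared on the container spec, something like this (the path, port, and thresholds are illustrative):

```yaml
# Container spec excerpt: the Pod is pulled from Service endpoints after
# failureThreshold consecutive probe failures, and re-added once it passes.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```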
However, k8s does not kill existing connections to a Pod that fails a readiness check; it only removes the Pod from the Service abstraction, which means no new connections will be made to it. The issue becomes apparent: connections to the Pod with Keep-Alive set will remain active.
This caused an outage at my current company. Latencies spiked for requests sent over keep-alive connections to Pods with failing readiness checks, and that latency propagated all the way through the system despite numerous other Pods being healthy and available to respond. Had we had timeouts and retries in place, this issue would not have caused any problems: the requests would have timed out and been retried on healthy Pods, and connection pooling would eventually have cleaned up the stale connections. A small increase in latency, but not a spike to 10s+.
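On the client side, something along these lines would have been enough; this is a sketch using Python's requests with urllib3's Retry, and the service URL and specific numbers are illustrative, not our actual configuration:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient connect/read failures and 5xx responses. By default the
# Retry policy only retries idempotent HTTP methods.
retry = Retry(
    total=3,
    backoff_factor=0.5,              # exponential backoff between attempts
    status_forcelist=[502, 503, 504],
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

# Per-request (connect, read) timeouts bound how long a stale keep-alive
# connection to an unhealthy Pod can hold a request hostage.
resp = session.get("http://some-service.internal/api/items", timeout=(1.0, 3.0))
```

The timeout caps how long any single attempt can sit on a stale keep-alive connection, and the retry opens a fresh connection, which the Service will route to a healthy Pod.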
I love this example because it shows how complicated systems are today. The confluence of two good features, readiness checks in k8s and keep-alive connection pooling, resulted in an outage. And it shows how unreasonably effective retries and timeouts are at mitigating unexpected issues in a distributed system: the problem would still have occurred, but its effect would have been contained.