No! The danger in forcing programmers to pick a timeout is that they will pick the wrong value, most often a too short timeout, because they have been testing their software on a super-fast internal network and haven't considered the poor users in the real world.
Case in point: Google's Waze. If I have a slow mobile connection (e.g. EDGE or even 3G), Waze will repeatedly fail to load a driving route. It will think for a few seconds at most, then time out and tell me there was a problem. If only it would wait a few more seconds to load, the app would be useful. Instead, due to their crappy choice of timeouts, the app becomes useless.
I strongly second this "No!" for both of the JS examples he makes.
There was no way to set a timeout in Fetch because the browser, acting as the user's agent, has a sane default (~75 seconds on average, but it varies by browser and platform).
Developers often pick TERRIBLE values for timeouts when left to their own devices.
Hell, the author of this exact blog post has picked 10 seconds in all of his examples. That's a FUCKING BAD timeout. It's far, FAR too short for many use cases.
> Hell, the author of this exact blog post has picked 10 seconds in all of his examples. That's a FUCKING BAD timeout. It's far, FAR too short for many use cases.
It isn't necessarily. It all depends on the use case. If most of the operations are finishing in 5ms then the probability of something finishing after 10s is rather low - and timing out and retrying early is probably the way to go.
Someone else in this thread recommended setting the timeout to around the P99 time that operations take. I think that's a reasonable starting point, though I might move it towards P99.9.
I worked (and am still working) on adjusting timeouts for systems doing billions of requests/s. One takeaway is that the actual timeout value is often not too important if you look at one system in isolation. The latency distribution will be rather long-tailed. Most requests might finish in e.g. 20ms. Then you get a P99 at maybe three-digit ms, and a P99.9 at 10s (example numbers). From there on it makes only a minor difference in availability whether you set your actual timeout to 5s or to 120s - it might just be noise among your other error sources.
However it makes sense to align the absolute timeout with timeouts of dependencies. E.g. if you have a chain of
client -> service A -> service B
and service A always times out first, then the client will get an error that service A is broken - but nobody can easily diagnose whether it was service A's or service B's fault. If service B times out first, the client can get an error message that indicates that. Therefore it makes sense for the timeouts of upstream services (the dependencies, here service B) to be shorter (even if only by a second).
In the same model, if the client times out first, the services only observe the client dropping the connection. They don't know whether the client timed out or cancelled the operation for other reasons, and therefore they might not record that something in the service is actually not ideal. For that reason I would recommend setting client timeouts higher than service timeouts (if you are aware of them).
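To make the ordering concrete, here is a minimal sketch of deriving layered timeouts for the client -> service A -> service B chain. The function name, the 5s starting value and the 1s margin are illustrative assumptions, not values taken from any real system.

```python
def layered_timeouts(innermost_timeout_s, hops, margin_s=1.0):
    """Return timeouts ordered from the outermost caller to the innermost dependency.

    Each caller waits a little longer than whatever it calls, so the innermost
    component times out first and its error can travel back up the chain.
    """
    timeouts = [innermost_timeout_s + margin_s * i for i in range(hops)]
    return list(reversed(timeouts))


# Hypothetical chain: client -> service A -> service B.
# service_b_t is B's own processing budget, service_a_t is how long A waits
# for B, and client_t is how long the client waits for A.
client_t, service_a_t, service_b_t = layered_timeouts(5.0, hops=3)
```

With these assumed numbers B gives up after 5s, A after 6s and the client after 7s, so a slow B shows up as a B timeout rather than as a dropped connection.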
However there is yet another exception to this, which is TCP connection timeouts. If you can configure them separately, it makes sense to keep those rather low and perform multiple retries. That can improve overall latency, since dropped SYN packets will only be retried by the OS after 1s.
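A rough sketch of that idea with Python's standard `socket` module: a short connect timeout with a few retries, then a separate and longer timeout for the actual request I/O. The function name and the default values are assumptions for illustration only.

```python
import socket


def connect_with_retries(host, port, connect_timeout_s=0.5, attempts=3,
                         io_timeout_s=30.0):
    """Connect with a short per-attempt timeout, retrying a few times.

    After the connection is established, switch the socket to a longer
    timeout that applies to later send/recv calls.
    """
    last_error = None
    for _ in range(attempts):
        try:
            sock = socket.create_connection((host, port), timeout=connect_timeout_s)
        except OSError as exc:  # includes timeouts and refused connections
            last_error = exc
            continue
        # create_connection leaves the connect timeout on the socket,
        # so replace it with the request-level timeout.
        sock.settimeout(io_timeout_s)
        return sock
    raise ConnectionError(f"could not connect after {attempts} attempts") from last_error
```

A caller might use `connect_with_retries("example.com", 443)` and then treat the returned socket like any other. Whether 0.5s is sensible depends entirely on the network between the two ends.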
TCP takes 3-4 seconds to detect a dropped packet and retransmit. That puts an absolute minimum baseline of 5 seconds on timeout just to be able to send or receive a packet.
If you're doing anything across datacenters, you have to take a minimum baseline of 10 - 15 seconds to account for extra latency on top.
If you do billions of requests/s I bet you don't care that requests fail? You probably can't even see that requests are failing because you'd have no logging, too expensive at this scale.
I do financial systems, most load typically doesn't go above 1k/s, but every request matters because a dropped request is a dropped payment, possibly tens of millions of dollars lost! There are a ton of issues caused by developers setting timeouts too low (anything below 30 seconds). I had to reconfigure a ton of systems and libraries to have higher timeouts and ignore configuration passed by developers.
> If you do billions of requests/s I bet you don't care that requests fail? You probably can't even see that requests are failing because you'd have no logging, too expensive at this scale.
That's an assumption. Our customers care a lot. And we have sufficient monitoring in place.
All the recommendations I provided above were about maximizing availability, and they are based on experience improving things for lots of users.
> I do financial systems, most load typically doesn't go above 1k/s, but every request matters because a dropped request is a dropped payment, possibly tens of millions of dollars lost!
Without knowing much more about your system: if you are losing that amount of money on failed requests (which can happen e.g. due to random network blips) you are doing something wrong. You should invest in different strategies than increasing timeouts.
It's inherent to payment systems to "lose" money on a failed request. A transaction involves two sides, either both sides agree that the transaction is completed or money is lost/duplicated.
If the client decides to drop (time out) and considers the transaction cancelled while the server is still processing it and will consider it done, that's a catastrophic issue that needs to be addressed. It is one of the most common bugs I've seen in the wild (root cause: too short timeouts).
How to make highly critical systems reliable enough in the face of hardware and software issues is a complex topic. At this level this involves a holistic approach to get every component to cooperate together (timeout is a minor example). A HUGE amount of work is to detect errors, and more importantly to propagate errors across diverse stacks (software should be aware of database errors, services should detect other services failing).
This doesn't make sense to me. If there is a real risk that a timeout can happen (and there always is) then the payment system should be implementing a two phase commit.
I don't know what the Byzantine generals problem is.
Two stage commit is important because it has:
1) Predefined transaction id prior to final submission that allows you to validate the status, so if your request to commit gets 503'ed or you get a timeout you can reliably query to know if it was processed or not
2) Unlimited resubmissions of the final commit. It doesn't matter if I perform the final commit api request 1 time or 100 times, it will never cause a duplicated transaction to occur. So if I get a timeout or a 503 I can resubmit knowing that if my original commit request went in my new submit will be a no-op, and if my last commit request didn't get processed then this time it hopefully will be processed.
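Roughly what those two properties look like in code, as an in-memory sketch. `PaymentServer`, `LostResponse` and `commit_with_retries` are all hypothetical names made up for this example; a real server would persist its state and sit behind a real network.

```python
import uuid


class PaymentServer:
    """Hypothetical in-memory server; a real one would persist this state."""

    def __init__(self):
        self._transactions = {}  # transaction id -> {"amount": ..., "state": ...}
        self.executed = 0        # how many times money actually moved

    def prepare(self, amount):
        """Stage 1: register a pending transaction, return its server-generated id."""
        tx_id = str(uuid.uuid4())
        self._transactions[tx_id] = {"amount": amount, "state": "pending"}
        return tx_id

    def commit(self, tx_id):
        """Stage 2: idempotent. Repeating it for the same id is a no-op."""
        tx = self._transactions[tx_id]
        if tx["state"] == "pending":
            tx["state"] = "committed"
            self.executed += 1
        return tx["state"]

    def status(self, tx_id):
        """Lets a client check what happened after a timeout or a 503."""
        return self._transactions[tx_id]["state"]


class LostResponse:
    """Hypothetical transport: the first commit is processed, but its reply is lost."""

    def __init__(self, server):
        self.server = server
        self._dropped = False

    def commit(self, tx_id):
        result = self.server.commit(tx_id)
        if not self._dropped:
            self._dropped = True
            raise TimeoutError("response lost")
        return result


def commit_with_retries(transport, tx_id, attempts=3):
    """Resubmit the commit on timeout; safe only because commit is idempotent."""
    for _ in range(attempts):
        try:
            return transport.commit(tx_id)
        except TimeoutError:
            continue
    raise TimeoutError("commit outcome unknown; query status later")


server = PaymentServer()
tx_id = server.prepare(amount=100)
final_state = commit_with_retries(LostResponse(server), tx_id)
```

Here the first commit goes through but the client sees a timeout, the retry is a no-op on the server, and `server.executed` stays at 1.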
This pattern isn't just a payments thing either. It is heavily used in distributed systems where failures can occur. UPS' API used to use this as well, so you could be sure that you don't pay for duplicate shipping labels or cause duplicate shipments.
The Byzantine generals problem is the field of research dealing with consensus/consistency issues like what we discuss here. The baseline is that there are two generals on a battlefield trying to coordinate an attack; they send messengers to communicate, but any message might be lost or intercepted. The problem is proven to be unsolvable, so let's not go in headfirst assuming you can be sure of any outcome ;)
Taking longer does not solve the Byzantine generals problem. The difference here is that the roles are asymmetrical: once the bank receives your order for a transaction, it does not need to check that you know whether the order was correctly received; the bank can simply perform the transaction and then, on a best-effort basis, let you know what happened.
Isn't it better to make it idempotent? The risk is that the client might accidentally make the same transaction twice if the first attempt looks like it failed.
Make the client include the id of its last known transaction and only apply the transaction if it's up to date, otherwise tell the client to refresh and try again.
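A minimal sketch of that check, assuming a single account with a simple integer transaction counter. The `Account` class and its return values are invented for illustration.

```python
class Account:
    """Hypothetical account that rejects requests based on a stale view."""

    def __init__(self):
        self.last_tx_id = 0
        self.balance = 0

    def apply(self, expected_last_tx_id, amount):
        """Apply only if the client's last known transaction id is current."""
        if expected_last_tx_id != self.last_tx_id:
            # The client missed something (possibly its own earlier attempt).
            return ("stale", self.last_tx_id)
        self.last_tx_id += 1
        self.balance += amount
        return ("applied", self.last_tx_id)


account = Account()
first = account.apply(0, 50)
retry = account.apply(0, 50)  # same request resent after a presumed timeout
```

The resent request still carries the old id, so it is rejected as stale instead of being applied twice, and the balance stays at 50.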
The second stage is idempotent (which is why it works), but the purpose of the first stage is to make sure both sides have an agreed upon idea of the uniqueness of the transaction that's about to take place.
For instance, if I want to generate a shipping label that goes from my house to your house and I make two attempts, how does the receiving service know whether I made two distinct attempts (I want to ship 2 similarly sized items) or whether a transient error occurred in between, making me attempt a resubmission?
You solve this by creating an inactive request with the criteria (shipping label from my house to your house). This step is not idempotent but that's OK, because if I resubmit I just create a 2nd inactive request that may never actually be finished.
The second step is to say "this request is good and I want to proceed with it". That step is idempotent and moves the existing request from the inactive state to an active one.
A shopping cart flow is a user managed 2 stage commit (review your cart, submit the cart order). No matter how many times I submit my order it won't cause duplicate orders because I'm submitting a specific shopping cart.
UPS, PayPal, and others just use a computer/API-managed 2 stage commit.
You can't always rely on a client generated ID, because you would have to know that the client id is unique enough. The server is the only one who can really generate a transaction id that it knows is globally unique and efficiently queryable in its backend.
It's not mutually exclusive. You can do two stage commits with the second stage being idempotent.
The practical risk is that this puts a ton of complexity on the client, to keep track of states and perform some follow up actions. The added complexity means more bugs and each additional step can fail hence compounding the problem rather than solving it.
This doesn’t address much of your comment, but I work on systems with millions of req/s, and even with billions you can still sample to do logging and monitoring. But you’re right, we don’t care that every request works, just that 99.9% do.
> recommended setting the timeout to around the P99 time that operations take. I think that's a reasonable starting point, though I might move it towards P99.9.
Why would you intentionally drop 1% or even 0.1% of requests?
What's the process of finding P99? Is it taking a bunch of samples, getting the standard deviation, and then calculating the value at which 99% of all possible samples would fall?
This also assumes the operations are O(1), I assume...
(I'm thinking of how to apply reasonable timeouts to background celery tasks)
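One common way to get a percentile needs no standard deviation or distribution assumptions at all: sort the observed latencies and pick the value at the right rank. A small sketch using the nearest-rank method; the sample data is made up to mimic the "mostly fast, long tail" shape described earlier in the thread.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98% fast, 1.5% slow, 0.5% very slow.
latencies_ms = [20] * 980 + [300] * 15 + [10_000] * 5

p99 = percentile(latencies_ms, 99)     # 300
p999 = percentile(latencies_ms, 99.9)  # 10000
```

For tasks whose duration depends on input size, it probably makes more sense to collect samples per task type (or per size bucket) rather than to pool everything into one number.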
I straight up couldn't use Terraform on DSL while visiting family over Christmas because of this. It would be chugging along at what appeared to me to be a reasonable speed, but either one of Google's services or Terraform itself would decide I was taking too long and stop. I was unable to work that week.
Maybe it is a good idea for next time to provision a vm/machine closer to where you are deploying? It would also prevent loss of progress when the connection got dropped or something.
Is the argument that a call that could get stuck for a few minutes is better than a “wrong” value of a few dozen seconds?
Even as a user I feel it’s a waste of precious resource (including my time). It’s like waiting at the register until the shop closes down because the employee had to go somewhere, instead of giving up and trying the next open register.
I’d think infinity is not a valid state.
In Waze’s case, I suppose their priority is not on salvaging the 1% longest requests (though critical to you), but on preserving server resources for the 99% faster clients. That’s not a “wrong” value on their side, and it has probably been carefully tailored to get the right tradeoff.
A too short timeout is more problematic than no timeout because it breaks the application.
Let's say 10 seconds, a typical intuitive but bad timeout. This will cause requests to fail for no reason other than that the users are in Asia or Africa, with high latency. This will break the application when it's used or deployed across datacenters, again because of high latency. This will cause requests to fail when the server is a bit busy (a couple of seconds more to process requests). Worse, it will cause chain reactions under load, creating more retries and even more load, causing other services/servers to time out too.
Better go for a long timeout. A long timeout doesn't break the application.
I'm pretty sure an infinite timeout also breaks the application, in a way where people rarely realize that the timeout is the cause. People would rather think it "just didn't work, don't know why" than be very clever and realize "it must be low timeouts!!!"
Not all requests are the same and they need to be treated differently. Some requests are rather optional and it's probably better to time out if they don't respond in a timely manner, so as not to use up more resources than needed. Other requests, like payments for instance, you probably want to give the best chance to succeed. So, no timeout is likely a good idea. If the actual TCP connection times out or is closed by the server, we can hope the server was good enough to realise something didn't go well and roll back. So, we are probably safer to assume it didn't go through in that case.
When it comes down to UI there are even more options. Since you have a human on the other side, you can transfer to them the responsibility of deciding when to time out. The UI certainly shouldn't become completely unresponsive while a request is being made.
Yes, go ahead and set 10 minutes rather than infinite or 10 seconds. That will make it much easier to realize that things are frozen, because they will raise exceptions and logs all over the place.
To be pedantic though, infinite timeouts don't break applications except in some rare cases of resource exhaustion. If an application is completely unresponsive, it is dead for good, and not because of the timeout; you need to fix the root cause (often resource exhaustion like swapping, or it's waiting on another IO or service that's frozen).
Failing because of a too short timeout feels silly, but a stupidly large timeout leads to frustration and hazardous user actions like killing the app with the task manager.
You don't need a timeout, you need a "cancel" button.
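A tiny sketch of what user-driven cancellation could look like with `asyncio`, where the "cancel button" is simulated by cancelling the task shortly after it starts. `slow_request` and the sleep durations are stand-ins, not a real network call.

```python
import asyncio


async def slow_request():
    """Stand-in for a request that takes far longer than anyone wants to wait."""
    await asyncio.sleep(3600)
    return "done"


async def main():
    task = asyncio.create_task(slow_request())
    # Simulate the user pressing "cancel" shortly after the request starts.
    await asyncio.sleep(0.01)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled by user"
    return "finished"


outcome = asyncio.run(main())
```

The request has no timeout of its own here; it ends when the user decides it should, and the event loop (and so the UI) stays responsive in the meantime.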
Funny you mention that, this reminds me of Windows task management. Windows automatically gives a popup to terminate an application when it detects an application is unresponsive.
This happens regularly when I open large files in some app: they take a fair bit of time to load, and Windows offers a popup to kill the app after a few seconds. Have to carefully wait and not click anything.
> In Waze’s case, I suppose their priority is not on salvaging the 1% longest requests (though critical to you), but on preserving server resources for the 99% faster clients.
What resources? Buffering a response takes a minuscule amount, and if even a tiny fraction of people try again it will waste far more.
And even if it did take more in total, it would not be by much. This justification for saying it's not a wrong value is very weak.
I've seen many cases on mobile web pages where it tries to connect, hangs on a progress bar, my phone loses and regains signal during the process, it still doesn't finish loading, but a pull-to-refresh makes it load instantly.
That's an example where a timeout and retry would have fixed the problem. If it had been an API call behind an app, it would have hung indefinitely.
Some libraries sadly have their default timeouts set to infinite.
When I implement this, I typically use separate thresholds for the entire request and for the time since last progress, or a rolling average transfer rate. Letting a slow transfer complete is useful both for what you mentioned and for reducing server congestion, but you do need to detect failures where the remote end goes silent (server failure, network roaming, etc.) without tearing down the connection.
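Something like the following is one way to sketch the two thresholds with a plain socket: an overall deadline for the whole transfer plus an idle limit on the gap between chunks. The function name and parameters are assumptions for illustration, and the rolling-average-rate variant is not shown.

```python
import socket
import time


def read_all(sock, total_timeout_s, idle_timeout_s, chunk_size=65536):
    """Read until EOF, failing if the transfer stalls or exceeds its overall deadline."""
    deadline = time.monotonic() + total_timeout_s
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("total timeout exceeded")
        # Wait for the next chunk no longer than the idle limit,
        # and never past the overall deadline.
        wait = min(idle_timeout_s, remaining)
        sock.settimeout(wait)
        try:
            chunk = sock.recv(chunk_size)
        except socket.timeout:
            reason = "idle" if wait == idle_timeout_s else "total"
            raise TimeoutError(f"{reason} timeout exceeded") from None
        if not chunk:  # remote end closed the connection: transfer complete
            return b"".join(chunks)
        chunks.append(chunk)
```

A slow but steady transfer keeps resetting the idle wait and is only cut off by the (long) total deadline, while a peer that goes silent is detected after `idle_timeout_s`.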
This is probably it. Production timeouts vs developer timeouts. What a shame about Waze. I wanted to try it out but then realized it's just owned by Google now, so it's kinda pointless for me to bother.