Does it mean ISC is relying on Linux heavily? Random people on the internet say FreeBSD has a better IP stack, and somehow I was sure ISC/BIND's platform of choice was FreeBSD.
This means that ISC is relying heavily on libuv, and we want to see it improved for everyone, not just ISC and BIND 9. Nothing more and nothing less, and I would appreciate it if we could avoid a partisan discussion.
This test doesn't involve epoll in any significant fashion, and it doesn't apply to network operations. The libuv change is specifically about disk files, which Linux doesn't really support with epoll (epoll_ctl refuses regular files outright with EPERM, and poll/select always report them ready regardless of whether the requested page is in the cache, so neither helps the async case; by contrast, a socket without data available in the kernel's buffer will not report readiness).
Since you can't tell whether a disk I/O call would actually hit the disk or be served from the page cache on older Linuxes, libuv just always dispatches such operations to a worker thread pool, which calls the normal blocking function. When that returns, the worker sends a message back to the loop thread to signal completion.
If the data is already in the page cache, this adds significant overhead compared to just calling `read` directly: instead of calling `read`, it posts a request to the worker queue (which requires some synchronization; I'm not sure exactly how the implementation does it), wakes a worker thread that calls `read` (which barely blocks, since the data is indeed already present), and then posts an event back to the main thread (which picks it up via `epoll`, along with whatever else it is watching, but that's an insignificant detail at this point) to invoke the on-complete handler. All that work instead of simply transferring the data from the cache into the result. A significant slowdown.
If the data is not already in the cache, it performs exactly the same steps, except this time `read` actually blocks for a while, so the main thread can do other work in the meantime. (Of course, if the main thread has no other work to do, you've gained nothing from this!)
That's why I said in my other comment that you ought not to draw any general conclusions from this. The speedup has nothing to do with epoll vs. io_uring (except insofar as Linux file I/O is not really epoll-compatible) and everything to do with libuv's file I/O implementation specifically. It does not apply to network loads at all.
edit: in the first version I said "disk file", but it really applies to anything that goes through libuv's filesystem path, of which disk files are just the most common example; /dev/zero goes through the same path yet never actually loads anything off a disk, so it behaves as if it were "in the cache" all the time.
Yeah, I thought "file" referred to "file descriptor", IP sockets included, but it looks like that's not implemented yet...
A couple of years ago I added io_uring to my toy web server, but it was slower than epoll. I would love to see someone use io_uring in a real networking application and report more meaningful statistics than microbenchmarks.
It doesn't really make sense to have an API that tells you whether reading from a file would block, because page-cache data can be evicted at any time. Even if such an API existed, there would be a race: the page-cache entry could be evicted between the check and the read.
That sort of race condition is only relevant if it affects correctness. Here it means a read might be unexpectedly slow, but you'll still get the same data.
99.9% of the time the data will not be evicted prior to the read, so it's still going to be a win overall.
Remember that a filesystem read might be a disk read in the background. You don't really want to block the main thread for 2 seconds while sshfs does its thing.
On Windows, the normal ReadFile function can return synchronously or asynchronously (if you pass the appropriate arguments), depending on presence in the cache. Since it is all one atomic operation, there's no race condition; you just call it and see from the return value whether it succeeded immediately or queued the operation.
(If you haven't read about Windows' overlapped i/o, do so, it is actually quite nice to use.)
Depends how you define that API. It could do something like "check if available, and if so, pin it until the next read (which will happen right now)". Not saying that's a good idea, but if such an API were needed, it would be possible.
The nonblocking read just failed. The task initiates the IO; then, because the IO is async, it lets other tasks run while waiting for the data. Later, when the IO has finished, the task can read the memory.
Is this what you mean?
I think there must be a data race somewhere, or io_uring would be superfluous. Anything consisting of at least two interruptible steps at the lowest level in the CPU, where people nevertheless expect no change in state in between, is subject to some data race. It's really difficult to get this 100% correct: something works fine 99.99% of the time, and such situations can be very nasty to get right.
Perhaps the step "Later, when the IO finished, the task can read the memory" cannot be done atomically.
I think the parent comment was talking about what you would need if you weren't using io_uring. First, in the main async thread, you attempt a non-blocking read. If that succeeds, great. If not, you hand over to another thread to do a blocking read, which is what libuv currently does for all reads. Due to the race that you described, there's a chance that by the time it gets there it will be a non-blocking read, but that's OK. Overall, the main thread often gets data straight away when it's available, and is never blocked when it's not.
If you're using io_uring, the data is read into the user buffer by the kernel during the async call, so there's no race. There's no point having a separate fast path in that case, because the io_uring call can just return the data immediately if it's available.
The 8x speedup seems to come from a microbenchmark that requires no real async work from the kernel (so I think is mostly going to be stressing out context switches and the threadpool data structures in the non-uring case) but I’m still excited about the improvements to async io from io_uring.
Question I’ve not figured out yet: how can one trace io_uring operations? The api seems kinda incompatible with ptrace (which is what strace uses) but maybe there is an appropriate place to attach an ebpf? Or maybe users of io_uring will have to add their own tracing?
Yeah, it might have been better to word it as an "8x reduction in overhead" or something, to make it clear that you're highly unlikely to see this in real workloads.
Maybe this is an arrogant question, but why is adding async I/O to these libraries so slow in general, on all OSes?
I would think that writing the kernel part would be the hardest, but it's usually the event-loop implementations that don't use what the Windows/macOS/Linux kernels offer.
We tried to add io_uring support to libnbd (and indeed still hope to do so). There's a bit of an "impedance mismatch" between existing event-driven code and io_uring; not impossible to bridge, but not completely easy to just convert the code either. Especially when you want to keep the non-io_uring path working (for old Linux, the BSDs, etc.).
To give an example, libnbd does a lot of interleaved send and recv system calls for sending requests and receiving replies, with a complicated state machine to keep track of when we would block. It's still not obvious how to convert this to the io_uring style, where you add the send/recv requests to the submission queue and then separately pick up the responses from the completion queue. Dealing with possible errors while keeping the API the same is a particular problem (if you just call exit when you hit any error, it's a lot easier).
If you're writing all new code then io_uring is a great choice.
I was playing with Tokio/Rust on macOS and hit a limit of about 500k requests/second, where the CPU was mostly doing send/receive calls rather than using macOS async I/O system calls.
I know it's a different library, and asyncio is not as flexible as io_uring, but at least being able to batch send/recv calls across different sockets would be good.
I don’t know why you’re being downvoted, because this is exactly the kind of question most people who don’t work with low-level syscalls have.
From what I understand, some async operations will be faster in node on newer versions of the linux kernel, when node uses a version of libuv that contains this PR.
Not sure why they ignored sockets; it would have been a great addition to the Node.js/Python stack. Let's hope they make it happen in a near-future release.
io_uring was made primarily for filesystem operations and is most optimized for that use case. Other file descriptor types have to be explicitly handled by the kernel side of the io_uring implementation, and while network calls are among those that have been, they still all require different handling in some cases.
Your application may feel I/O bound, but in reality it must be syscall bound for io_uring to make any noticeable improvement.
In other words: if your application is already reading/writing at the max speed its hardware allows, you will see no improvement from this. That application is truly I/O bound.
An application that feels I/O bound will be doing lots and lots of little reads and writes and achieving low hardware utilization. That application is syscall bound rather than I/O bound, and io_uring will help tremendously.
If the app's workload is quite similar to whatever benchmark showed the 8x improvement, then yes.
Otherwise, i.e. in the real world, it's very unlikely to be that much. High-throughput workloads should see improvement, but each application is different.
It's the same 'for' in the title that I don't understand in 'Windows Subsystem for Linux'. English is not my native language, but it is only recently that I started to notice this usage of 'for'. Has it always been used like this?
The GitHub post uses it normally: 'Add io_uring support for several asynchronous file operations:'
I guess the first thing to note is that ‘for’ is the kind of old common word that tends to have a lot of meanings. Nevertheless:
1. I think they are not the same senses. In ‘windows subsystem for Linux’, the word means ‘having as a function’ like in the phrase ‘spanner for 1/4 inch hex bolts’.
2. I think the sense in the title is more like ‘to the benefit of’ (or perhaps ‘affecting’ or ‘having the reason’) like ‘lunch for employees’ or ‘supports for the lintel’
3. I checked a couple of dictionaries and they had the sense but definitions for words like these can be pretty hard to read even for a native speaker.
Unless the title has been changed since your comment, it's not the same "for" as in "Windows Subsystem for Linux". In that, the first thing has been modified to support the second thing. In "io_uring support for libuv", the second thing has been modified so it works with the first thing.
For what it's worth, I'm a native English speaker and I agree they both sound the wrong way round to me! But I can convince myself that they do also make sense the way round they were intended.
I think it's just that "Windows X for Y", where Y is a possibly trademarked entity, works better for Microsoft in general. Maybe it wasn't risky for Linux, but if they ever want another "Windows subsystem for Z", where Z is a trademarked name, this phrasing puts them in the clear.
That would be "io_uring: support for libuv". Which is a common way to write headlines, so assuming the headline was missing punctuation was reasonable.
(in case the title changes later, it's currently "io_uring support for libuv")
I’m a native speaker, read it wrong at first, and had to think it through step by step: “well, libuv is a JS-related library, io_uring is an OS-level concept, so it’s probably a change to libuv”, and I still had enough uncertainty that I opened GitHub to double-check.
You make a really good point I'd never heard before: "X for Y" is an ambiguous construct when the two names refer to the same class of object without an obvious relationship. E.g. "Barack for Don" is easy, but only if you know American politics and can read it as a reference to "a predecessor". But "Jon for Mary" is inscrutable.
Hm? The word “for” has the same meaning here as in “food for kids”. Read it as (io_uring support) for (libuv). “Io_uring support” is being provided. Provided for what? Provided for libuv.
I think this is a slightly different case than the WSL one. The headline is accurate, but perhaps a bit jarring. It might have been better to say "io_uring support in libuv", but I wouldn't say using "for" is incorrect.
Looking at it again I think the problem is that both X and Y are of the same type which makes the overall sentence confusing, even though it is technically correct.
In a more normal construction ("food voucher support for kids") it is obvious from context that the kids are being given the support, because the converse would be nonsensical.
Underrated by whom? Is Deno giving up C goodness for Rusty goodness a mistake? I don't think so but it seems to be another time that a big improvement comes with a big drawback.
To me libuv seems highly rated and it deserves to be highly rated. I'm not sure how to quantify it exactly to say whether it's fairly rated, underrated, or overrated, but it seems like it's in the ballpark as far as ratings go.
It allowed our shop to drop Windows almost everywhere, so depends on your worldview. The platform itself is nice, although I too don't like the company, and probably never will.
The commenter mentioning reading small chunks from /dev/zero specifically says greater than the stated 8x throughput, which makes me think it's 8x in the general sense.
I could be mistaken though, it seems odd not to post any sort of numbers/benchmarks.
I'll eat my files if this gives 8x on normal applications reading normal sized files from real SSDs or HDDs.
The commenter mentions "greater than 8x" on a very artificial benchmark of reading /dev/zero. The original PR author then copies that into the PR description as "8x has been observed" (lossily dropping the "greater", and not adding any additional data points).
The reality will be that disk IO dwarfs the overhead of syscalls and threadpools in all real cases.
That's all that's being saved here: the overhead of managing a threadpool and making a few extra syscalls.
Compared to reading a few MB off disk, that overhead will not be noticed, and definitely won't be 8x.
I would expect this to benefit IO operations on NVMe media the most, and slow media like HDD the least. The OS stack imposes a substantial part of the overall latency in tiny random operations against NVMe drives.
Just one question: what about older versions of Linux that don't have io_uring? Does it fall back gracefully to the older system calls, or are those versions of Linux no longer supported?
Does this potentially mean you could write a SQL driver using libuv in Python and benefit from async-call performance without the main Python script using any async libraries or conventions?
libuv's implementation of "async" disk i/o brings massive overhead (somewhat necessarily on linux, totally unnecessarily on other systems), so just about anything (including just switching to straight blocking read/write/seek/etc syscalls) would result in a significant speed increase.
This isn't to say that io_uring is bad; just don't draw too much of a conclusion from any benchmark of their old implementation beyond the context of that implementation specifically.
Really cool to read that thread and see Neovim devs looking at it as well. We need a sort of open-source hall of fame, and Axboe should be in it for sure.
I've been studying how to create an asynchronous runtime that works across threads. My goal: neither CPU-bound nor IO-bound work should slow down the event loops.
How do you write code that elegantly defines a state machine across threads/parallelism/async IO?
How do you efficiently define choreographies between microservices, threads, servers and flows?
I've only written two Rust programs, but in Rust you can presumably use Rayon (CPU scheduling) and Tokio (IO scheduling).
I wrote about using the LMAX Disruptor ringbuffer pattern between threads.
I am designing a state-machine formulation syntax that is thread safe and parallelises effectively. It looks like EBNF syntax or a bash pipeline: parallel steps go in curly brackets, and there is an implied inter-thread ringbuffer between pipes. It is inspired by Prolog, in that there can be multiple conditions or "facts" before a stateline "fires" and transitions. Transitions always go from left to right, but the parts within a stateline (everything between pipe symbols) can fire in any order, a bit like a countdown latch.
In io_uring and LMAX Disruptor, you split all IO into two halves: submit and handle. Here is a liburing state machine that can send and receive in parallel.