Does it mean ISC is relying on Linux heavily? Random people on the internet say FreeBSD has a better IP stack, and somehow I was sure ISC/BIND's platform of choice was FreeBSD.
This means that ISC is relying heavily on libuv, and we want to see it improved for everyone, not just ISC and BIND 9. Nothing more and nothing less, and I would appreciate it if we could avoid a partisan discussion.
This test doesn't involve epoll in any significant fashion, and it doesn't apply to network operations. The libuv change is specifically about disk files, which Linux doesn't really support with epoll (epoll_ctl refuses regular files outright with EPERM, and poll/select always report them ready regardless of whether the requested page is in the cache, so neither helps the async case; by contrast, a socket without data available in the kernel's buffer will not report readiness).
Since you can't tell whether a disk I/O call would actually hit the disk or be served from the page cache on older Linuxes, libuv just always dispatches such operations to a worker thread pool, which calls the normal blocking function. When that returns, the worker sends a message back to the loop thread to signal completion.
If the data is already in the page cache, this adds significant overhead compared to just calling `read` directly: instead of calling `read`, it posts a request to the worker queue (which requires some synchronization; I'm not sure exactly how the implementation does it), wakes a worker thread that calls `read` (which barely blocks, since the data is indeed already present), and then posts an event back to the main thread (which picks it up via `epoll`, along with whatever else it is watching, but that's an insignificant detail at this point) to invoke the on-complete handler. All that work instead of simply transferring the data from the cache into the result. A significant slowdown.
If the data is not already in the cache, it performs exactly the same steps, except this time `read` actually blocks for a while, so the main thread can do other work in the meantime. (Of course, if the main thread has no other work to do, you've gained nothing from this!)
That's why I said in my other comment that you ought not to draw any general conclusions from this. The speedup has nothing to do with epoll vs. io_uring (except insofar as Linux file I/O is not really epoll-compatible) and everything to do with libuv's file I/O implementation specifically. It does not apply to network loads at all.
edit: in the first version I said "disk file", but it really applies to anything that goes through libuv's filesystem path, of which disk files are just the most common example; /dev/zero goes through the same path yet never actually loads anything off a disk, so it behaves as if it were "in the cache" all the time.
Yeah, I thought "file" referred to "file descriptor", IP sockets included, but it looks like that's not implemented yet...
A couple of years ago I added io_uring to my toy web server, but it was slower than epoll. I would love to see someone use io_uring in a real networking application and report more meaningful statistics than microbenchmarks.
It doesn't really make sense to have an API that tells you whether reading from a file would block, because page-cache data can be evicted at any time. Even if such an API existed, there would be a race: the page-cache entry could be evicted between the check and the read.
That sort of race condition is only relevant if it affects correctness. Here it means a read might be unexpectedly slow, but you'll still get the same data.
99.9% of the time the data will not be evicted prior to the read, so it's still going to be a win overall.
Remember that a filesystem read might be a disk read in the background. You don't really want to block the main thread for 2 seconds while sshfs does its thing.
On Windows, the normal ReadFile function can return synchronously or asynchronously (if you pass the appropriate arguments), depending on presence in the cache. Since it is all one atomic operation, there's no race condition; you just call it and see from the return value whether it succeeded immediately or queued the operation.
(If you haven't read about Windows' overlapped i/o, do so, it is actually quite nice to use.)
Depends how you define that API. It could do something like "check if available, and if so, pin it until the next read (which will happen right now)". Not saying that's a good idea, but if such an API were needed, it would be possible.
The nonblocking read just failed. The task initiates the IO; then, because the IO is async, it lets other tasks run while waiting for the data. Later, when the IO has finished, the task can read the memory.
Is this what you mean?
I think there must be a data race somewhere, or io_uring would be superfluous. Anything consisting of at least two interruptible steps at the lowest level in the CPU, where people nevertheless expect no change in state in between, is subject to some data race. It's really difficult to get this 100% correct: something works fine 99.99% of the time, and such situations can be very nasty to get right.
Perhaps the step "Later, when the IO finished, the task can read the memory" cannot be done atomically.
I think the parent comment was talking about what you would need if you weren't using io_uring. First, in the main async thread, you attempt a non-blocking read. If that succeeds, great. If not, you hand over to another thread to do a blocking read, which is what libuv currently does for all reads. Due to the race that you described, there's a chance that by the time it gets there it will be a non-blocking read, but that's OK. Overall, the main thread often gets data straight away when it's available, and is never blocked when it's not.
If you're using io_uring, the data is read into the user buffer by the kernel during the async call, so there's no race. There's no point having a separate fast path in that case, because the io_uring call can just return the data immediately if it's available.
The 8x speedup seems to come from a microbenchmark that requires no real async work from the kernel (so I think is mostly going to be stressing out context switches and the threadpool data structures in the non-uring case) but I’m still excited about the improvements to async io from io_uring.
Question I’ve not figured out yet: how can one trace io_uring operations? The api seems kinda incompatible with ptrace (which is what strace uses) but maybe there is an appropriate place to attach an ebpf? Or maybe users of io_uring will have to add their own tracing?
Yeah, it might have been better to word it as an "8x reduction in overhead" or something, to make it clear that you're highly unlikely to see this in real workloads.
Maybe this is an arrogant question, but why is adding async I/O to these libraries so slow in general, on all OSes?
I would think that writing the kernel part would be the hardest, but it's usually the event-loop implementations that don't use what the Windows/macOS/Linux kernels offer.
We tried to add io_uring support to libnbd (and indeed still hope to do so). There's a bit of an "impedance mismatch" between existing event-driven code and io_uring; not impossible to bridge, but not completely easy to just convert the code either. Especially when you want to keep the non-io_uring path working (for old Linux, the BSDs, etc.).
To give an example, libnbd does a lot of interleaved send and recv system calls for sending requests and receiving replies, with a complicated state machine to keep track of when we would block. It's still not obvious how to convert this to the io_uring style, where you add the send/recv requests to the submission queue and then separately pick up the responses from the completion queue. Dealing with possible errors while keeping the API the same is a particular problem (if you just call exit when you hit any error, it's a lot easier).
If you're writing all new code then io_uring is a great choice.
I was playing with Tokio/Rust on macOS and hit a limit of about 500k requests/second, where the CPU was mostly doing send/receive calls rather than using macOS async I/O system calls.
I know it's a different library, and asyncio is not as flexible as io_uring, but at least being able to batch send/recv calls across different sockets would be good.
I don’t know why you’re being downvoted, because this is exactly the kind of question most people who don’t work with low-level syscalls have.
From what I understand, some async operations will be faster in node on newer versions of the linux kernel, when node uses a version of libuv that contains this PR.
Not sure why they ignored sockets; it would have been a great addition to the Node.js/Python stack. Let's hope they make it happen in a near-future release.
io_uring was made primarily for filesystem operations and is most optimized for that use case. Other file descriptor types have to be explicitly handled by the kernel side of the io_uring implementation, and while network calls are among those that have been, they still all require different handling in some cases.
Your application may feel I/O bound, but in reality it must be syscall bound for io_uring to make any noticeable improvement.
In other words: if your application is already reading/writing at the max speed its hardware allows, you will see no improvement from this. That application is truly I/O bound.
An application that feels I/O bound will be doing lots and lots of little reads and writes and achieving low hardware utilization. That application is syscall bound rather than I/O bound, and io_uring will help tremendously.
If the app's workload is quite similar to whatever benchmark showed the 8x improvement, then yes.
Otherwise, i.e. in the real world, it's very unlikely to be that much. High-throughput workloads should see improvement, but each application is different.
It's the same 'for' in the title that I don't understand in 'Windows Subsystem for Linux'. English is not my native language, but it is only recently that I started to notice this usage of 'for'. Has it always been used like this?
The GitHub post uses it normally: 'Add io_uring support for several asynchronous file operations:'
I guess the first thing to note is that ‘for’ is the kind of old common word that tends to have a lot of meanings. Nevertheless:
1. I think they are not the same senses. In ‘windows subsystem for Linux’, the word means ‘having as a function’ like in the phrase ‘spanner for 1/4 inch hex bolts’.
2. I think the sense in the title is more like ‘to the benefit of’ (or perhaps ‘affecting’ or ‘having the reason’) like ‘lunch for employees’ or ‘supports for the lintel’
3. I checked a couple of dictionaries and they had the sense but definitions for words like these can be pretty hard to read even for a native speaker.
Unless the title has been changed since your comment, it's not the same "for" as in "Windows Subsystem for Linux". In that, the first thing has been modified to support the second thing. In "io_uring support for libuv", the second thing has been modified so it works with the first thing.
For what it's worth, I'm a native English speaker and I agree they both sound the wrong way round to me! But I can convince myself that they do also make sense the way round they were intended.
I think it's just that "Windows X for Y", where Y is a possibly trademarked entity, works better for Microsoft in general. Maybe it wasn't risky for Linux, but if they ever want another "Windows subsystem for Z", where Z is a trademarked name, this phrasing puts them in the clear.
That would be "io_uring: support for libuv". Which is a common way to write headlines, so assuming the headline was missing punctuation was reasonable.
(in case the title changes later, it's currently "io_uring support for libuv")
I’m a native speaker, read it wrong at first, and had to think it through step by step: “well, libuv is a JS-related library, io_uring is an OS-level concept, so it’s probably a change to libuv”, and I still had enough uncertainty that I opened GitHub to double-check.
You make a really good point I'd never heard before: "X for Y" is an ambiguous construct when the two names refer to the same class of object without an obvious relationship. E.g. "Barack for Don" is easy, but only if you know American politics and can read it as a reference to "a predecessor". But "Jon for Mary" is inscrutable.
Hm? The word “for” has the same meaning here as in “food for kids”. Read it as (io_uring support) for (libuv). “Io_uring support” is being provided. Provided for what? Provided for libuv.
I think this is a slightly different case than the WSL one. The headline is accurate, but perhaps a bit jarring. It might have been better to say "io_uring support in libuv", but I wouldn't say using "for" is incorrect.
Looking at it again I think the problem is that both X and Y are of the same type which makes the overall sentence confusing, even though it is technically correct.
In a more normal construction ("food voucher support for kids") it is obvious from context that the kids are being given the support, because the converse would be nonsensical.
Underrated by whom? Is Deno giving up C goodness for Rusty goodness a mistake? I don't think so but it seems to be another time that a big improvement comes with a big drawback.
To me libuv seems highly rated and it deserves to be highly rated. I'm not sure how to quantify it exactly to say whether it's fairly rated, underrated, or overrated, but it seems like it's in the ballpark as far as ratings go.
It allowed our shop to drop Windows almost everywhere, so depends on your worldview. The platform itself is nice, although I too don't like the company, and probably never will.
The commenter mentioning reading small chunks from /dev/zero specifically says greater than the stated 8x throughput, which makes me think it's 8x in the general sense.
I could be mistaken though, it seems odd not to post any sort of numbers/benchmarks.
I'll eat my files if this gives 8x on normal applications reading normal sized files from real SSDs or HDDs.
The commenter mentions "greater than 8x" on a very artificial benchmark of reading /dev/zero. The original PR author then copies that into the PR description as "8x has been observed" (lossily dropping the "greater", and not adding any additional data points).
The reality will be that disk IO dwarfs the overhead of syscalls and threadpools in all real cases.
That's all that's being saved here: the overhead of managing a threadpool and making a few extra syscalls.
Compared to reading a few MB off disk, that overhead will not be noticed, and definitely won't be 8x.
I would expect this to benefit IO operations on NVMe media the most, and slow media like HDD the least. The OS stack imposes a substantial part of the overall latency in tiny random operations against NVMe drives.
Just one question: what about older versions of Linux that don't have io_uring? Does it fall back gracefully to the older system calls, or are those versions of Linux no longer supported?
Does this potentially mean you could write a SQL driver using libuv in Python and benefit from async-call performance without the main Python script using any async libraries or conventions?
libuv's implementation of "async" disk i/o brings massive overhead (somewhat necessarily on linux, totally unnecessarily on other systems), so just about anything (including just switching to straight blocking read/write/seek/etc syscalls) would result in a significant speed increase.
This isn't to say that io_uring is bad; just don't draw too much of a conclusion from any benchmark of their old implementation beyond the context of that implementation specifically.
Really cool to read that thread and see Neovim devs looking at it as well. We need a sort of open-source hall of fame, and Axboe should be in it for sure.
I've been studying how to create an asynchronous runtime that works across threads. My goal: neither CPU-bound nor IO-bound work should slow down the event loops.
How do you write code that elegantly defines a state machine across threads/parallelism/async IO?
How do you efficiently define choreographies between microservices, threads, servers and flows?
I've only written two Rust programs, but in Rust you can presumably use Rayon (CPU scheduling) and Tokio (IO scheduling).
I wrote about using the LMAX Disruptor ringbuffer pattern between threads.
I am designing a state-machine formulation syntax that is thread safe and parallelises effectively. It looks like EBNF syntax or a bash pipeline: parallel steps go in curly brackets, and there is an implied inter-thread ringbuffer between pipes. It is inspired by Prolog, in that there can be multiple conditions or "facts" before a stateline "fires" and transitions. Transitions always go from left to right, but the parts within a stateline (everything between pipe symbols) can fire in any order, a bit like a countdown latch.
In io_uring and LMAX Disruptor, you split all IO into two halves: submit and handle. Here is a liburing state machine that can send and receive in parallel.