So a story: I've been a kernel hack since Unix V6, made a living doing it one way or another for over half my life ... learning to think about concurrency, time, interrupts, race conditions etc is hard, very hard - I got pretty good at it ... but then my career took a diversion, I designed chips for a decade or so, everything is concurrency, at the lowest levels .... after a while I came back to doing kernel stuff and found that with this new background all that hard stuff was trivial and obvious.
Mostly you just have to steep your brain in it for long enough
That’s exactly it. It’s the only way to master something. The more varied exposure over time we have to the core ideas of a discipline, the more we come to master the thought process of comprehending its limits and possibilities to the extent where we can make it do whatever we like.
I'm self-trained in electronics. I'd started building NuBus cards for Macs and was hired as an architect for new graphics cards ... Started using C as an architectural reference language, from there it was a small step to using Verilog instead ... Pretty soon I was building CPUs .... I've always been the hardware guy who understands software, and/or the software guy who understands hardware
>learning to think about concurrency, time, interrupts, race conditions
So what books can you recommend to understand the above subjects? I know of only UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers by Curt Schimmel.
>everything is concurrency, at the lowest levels .... after a while I came back to doing kernel stuff and found that with this new background all that hard stuff was trivial and obvious
I see a lot of HDL programmers say this. But how exactly do you map the concepts since the very language semantics between HDLs and "Standard" computer languages are different?
Consider the simplest RISC execution path: from the software view, one instruction executes in one cycle, but from the hardware view, in that same cycle a different instruction is being decoded and yet another fetched alongside the execute.
Yes it's thinking about designs partly at that level - but, we don't just design CPUs, it's also understanding synchronization right down to the gate and flop level .... and you need to understand, and deal with, things like metastability ... effectively there exist things that can't always be synchronized, that sometimes fail and you need to deal with them .... and sometimes fail and can't be dealt with, all you can do is to design to minimize those failures .... (done right your design only melts down into a hot buzzy charred mess every century or so - not an issue that software ever needs to deal with)
> However, printk can block (while allocating memory)
No, printk() is magic. It can be called even in NMI context, which is a worse place. Quoting https://lwn.net/Articles/800946/, "[...] kernel code must be able to call printk() from any context. Calls from atomic context prevent it from blocking; calls from non-maskable interrupts (NMIs) can even rule out the use of spinlocks. [...]"
This is really good to know. I had assumed it could block when allocating memory for the formatted string buffer, but the rationale explained in that article makes a lot of sense. Being able to use printk simplifies things a lot.
I hadn't considered eBPF because I needed some pretty obscure information from the kernel internals (i.e. the addresses of the `struct file`s) and I didn't realize eBPF was as capable as it is. Another commenter suggested trying it, though, so I'm checking it out now!
I did use printk for debugging, but I (incorrectly) assumed it could block. Another commenter pointed out that this is not the case. TIL!
The gdb link looks very helpful and I'll try that next time. Thanks for linking that.
"But when BPF got extended, it allowed users to add code that is executed by the kernel in a safe manner in various points of its execution, not only in the network code."
Where would you look for a list of what you can do with eBPF and how? (I think maybe I’m searching for a list of hook points?) I keep seeing tantalizing hints about all of the things it lets you do, but the tutorials I’ve seen only seem to cover networking and tracing.
(The project I have in mind at the moment is making a bindfs-like filesystem without FUSE, but I’ve had a few different ideas where eBPF seemed like it might have been a good fit if I could figure it out.)
Hi HN, this was my first attempt at writing any sort of kernel code. I would love to hear your thoughts on this experience and on the fixes I applied, especially from anyone with more Linux experience than me :)
You should also check out bpftrace which is a specific DSL to write both the kernel and userspace part in one language - rather than the mixed python/C approach people mostly took before that. And you can output things potentially as text or json for parsing.
Seems like someone did try to get those functions exported, but the maintainer rejected it, saying that no driver should be poking so deep into fd internals. Makes sense. Your use case is kind of niche.
It could be handled differently. The kernel author could simply say "this isn't how the kernel works, so we cannot accept this". There isn't a need to come up with wacky insults, as humorous as they may be.
> Sounds harsh. Now for comparison try standing next to an electrician and suggest alternate ways of doing things that are dangerous and wrong.
To become an electrician you take classes and become certified. How does someone become a kernel developer? I would assume by interacting with other kernel developers, suggesting ideas, getting feedback on those ideas, etc.
An electrician wiring a house is a single person job. An open source project is a team job, and there's a reason development takes place out in the open: so that others can contribute. If outside contributions to the project aren't allowed, why not make it a source available project instead of open source?
Definitely try to get comfortable with building a kernel eventually. You don't have to run it on your bare metal machine; you can boot test kernels in a VM. The actual test / development process is not especially different between kernel and modules.
I see the word “nightmare” used a lot in this article.
I wonder if I am the only one that loves debugging difficult/weird problems. It’s something like trying to solve a puzzle. And knowing that the system will never deceive me (it will not be the system’s fault if I get deceived), and that a perfectly reasonable explanation exists for what I observe, helps me not give up.
Same. I would love a job consisting solely of jumping into big hairy systems and debugging weird issues. It's much more interesting to understand how exactly things work at every level of the stack (the bottom of the stack being OS/kernel or even hardware stuff, not a backend endpoint or database) than to write code.
> I wonder if I am the only one that loves debugging difficult/weird problems.
Same here. At times, I'd prefer to just work on debugging things for colleagues versus writing rather boring code. It can give some insights when it comes to design, as well as enabling customer support to fix certain issues.
It helps to have colleagues that break things in interesting ways. ;) Also important is a supportive manager, and a 'real job' that is usually time flexible; you might need to drop what you're working on to debug an issue when it's happening, so that needs to be mostly OK.
Or if you like networking challenges, having widely distributed users on diverse platforms and networks, and either running your own load balancers or using DNS or application level balancing, so that you can see the actual network flow, and not only the parts that make it through a load balancer.
Of course, it's a lot of frustration when you find the issue, and it's in some random router in some far off locale with no way to contact anyone. Things like the Linux large receive offloading bug that would receive larger than MTU packets because of offloading, then drop the packet (and send ICMP needs frag) because it's larger than the MTU of the destination address. I fixed the FreeBSD bad behavior when getting such an ICMP, but it would be nice if systems operating as routers would update their kernels a couple of times a decade. I could (and have, elsewhere) rant about more MTU problems, but let's just say, they're out there, they're stupid, and it's hard to get them fixed. Ugh.
You probably already did this, but for the audience: one of the best ways to make sure you're using a function reasonably is to use elixir.bootlin.com to look at other uses and make sure you're using the function similarly. For instance, check out https://elixir.bootlin.com/linux/latest/A/ident/for_each_pro... .
Elixir was extremely helpful to me! It didn't always help me understand _why_ code was written the way it was (hence my incorrect use of rcu_read_lock), but it was very helpful to see some examples.
I've not done too much kernel programming, but for sure I know that looking for existing uses of code is very helpful.
It looks like the author of the piece did something similar, and noted other people doing similar things to themselves.
I wrote some modules to experiment with the Security Module API, because working with the APIs seemed like a good way to learn how they worked, and what was possible beyond just SELinux, AppArmor, etc.:
My knee jerk reading this article and seeing a kernel module near 'nodejs' was to grumble and say "wtf they clearly didn't need a kernel module for this". But upon reading deeper I see that accessing the kernel is kinda appropriate.
Regardless of whether you end up using eBPF or a .ko like you already have, you may have a yet simpler option. By leveraging the loader you can do an interposition trick with LD_PRELOAD to hook C library accesses. Maybe this is all you need in order to "help students understand system calls such as open, close, dup2, fork, pipe, and others. "
Takes me back to the days of ATM device driver debugging.
I’ve written 9 kernel drivers.
All in all, a dedicated standalone terminal attached to the serial port of the target is still your best friend.
Great post, also love what you are trying to do with C playground, this is awesome!
I've recently been trying to build something similar, visualizing forks/execve/read/write, but using the strace output of a binary, which is much less challenging.
Great story! I've had a lot of debugging nightmares, but thankfully never anything as bad as that.
One thing that looks fishy is this branch:
    if (container_tasks_len == max_container_tasks) {
        printk("cplayground: ERROR: container_tasks list hit capacity! We "
               "may be missing processes from the procfile output.\n");
        break;
    }
Since you said printk can block, why isn't calling it in the rcu critical section a bug? Is it because you immediately break afterwards and don't try to reference the next task?
That's a good point. I'm hoping that this never gets hit, and if that line ever appears in the logs, then things are already broken. However, it's probably better to improve the failure mode where possible :)
[edit] and yes, since we break and don't follow the `next` pointer in the linked list, that also shouldn't cause any problems.
[edit 2] a sibling comment by cesarb pointed out that printk actually does not block, since it's important for it to be usable in critical sections to debug when the kernel gets into trouble
Great article! Reminds me of when I was working on a bug in a phone kernel and adding its equivalent of printk() made the bug disappear! Lauterbach time!
Back in the Windows NT/2000 days, IIS executed as part of the kernel. Debugging ISAPI extensions was an exercise in patience: every time a programming error crashed the kernel, a reboot was in order.
You can do most or all of that by reading /proc/<pid>/fdinfo/<fd> and /proc/<pid>/fd/<fd> or by making system calls on the affected fds (which you can do e.g. by injecting code with LD_PRELOAD or ptrace or with nsenter with fd namespace or equivalent C code).
Even if you write a kernel driver, iterating over all tasks in the system is a terrible design (there may be millions), not to mention "determining if a task belongs to a C playground program" in the kernel (obviously the kernel should have no knowledge about such specifics).
Of course, if a developer cannot even produce a reasonable overall design, it's not surprising that they aren't capable of writing correct code.
"Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community."
I actually cannot get enough information from doing that. Crucially, I need to be able to recognize whether two file descriptors point to the same open `struct file`. (To be clear, this isn't the same as whether they're pointing to the same file path. I need to know when the two file descriptors are sharing the same cursor.) There is no way to do this using existing APIs, because there is nothing identifying a `struct file` besides the memory address of the struct. (The "open file IDs" I mention are hashes of the `struct file` address.)
I did spend a lot of time trying to avoid writing a kernel module, and this was the only way I could find to do it :)
You can use the kcmp system call with the KCMP_FILE argument to find out whether two fds point to the same file structure (of course, you must use this as the custom comparison function of a sort algorithm so you don't end up with quadratic run time).
Linux has a project called CRIU that can save and restore processes to disk without needing additional kernel modules, so pretty much all state is already gettable and settable from user space.
I can't do that across processes, though, can I? (to see whether two processes have file descriptors pointing to the same open file) edit -- it does look like it works cross-process!
I hadn't heard of CRIU. I'll check that out. (edit: CRIU looks super useful. I think the speed/overhead of snapshotting will decide whether I can use it for this project, but I can imagine it being handy in the future regardless. Thanks for the link.)
I recommend checking out podman (or docker) - they have built-in criu support. Otherwise you’ll need some other namespacing mechanism to avoid colliding pids