My First Kernel Module: A Debugging Nightmare

Taniwha · on Nov 19, 2020

So a story: I've been a kernel hack since Unix V6, made a living doing it one way or another for over half my life ... learning to think about concurrency, time, interrupts, race conditions etc is hard, very hard - I got pretty good at it ... but then my career took a diversion, I designed chips for a decade or so, everything is concurrency, at the lowest levels .... after a while I came back to doing kernel stuff and found that with this new background all that hard stuff was trivial and obvious.

Mostly you just have to steep your brain in it for long enough

suifbwish · on Nov 20, 2020

That’s exactly it. It’s the only way to master something. The more varied exposure over time we have to the core ideas of a discipline, the more we come to master the thought process of comprehending its limits and possibilities to the extent where we can make it do whatever we like.

febed · on Nov 20, 2020

How did you pivot from kernel programming to designing chips? Did you already have a background in embedded electronics?

Taniwha · on Nov 20, 2020

I'm self-trained in electronics. I'd started building NuBus cards for Macs and was hired as an architect for new graphics cards ... Started using C as an architectural reference language, from there it was a small step to using Verilog instead ... Pretty soon I was building CPUs .... I've always been the hardware guy who understands software, and/or the software guy who understands hardware

rramadass · on Nov 20, 2020

>learning to think about concurrency, time, interrupts, race conditions

So what books can you recommend to understand the above subjects? I know of only UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers by Curt Schimmel.

>everything is concurrency, at the lowest levels .... after a while I came back to doing kernel stuff and found that with this new background all that hard stuff was trivial and obvious

I see a lot of HDL programmers say this. But how exactly do you map the concepts since the very language semantics between HDLs and "Standard" computer languages are different?

arkj · on Nov 20, 2020

Consider the simplest RISC execution path, from the software view there is an instruction executing in one cycle but from the hardware view in the same cycle along with the execute there is a different decode and fetch happening.
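To make the overlap concrete, here is a minimal sketch of an idealized three-stage pipeline (no stalls, one instruction issued per cycle). The function names `instr_in_stage` and `print_pipeline` are made up for illustration and do not describe any particular CPU.

```c
#include <stdio.h>

/* Stages of a classic three-stage pipeline, in the order an instruction
 * passes through them. */
enum stage { FETCH = 0, DECODE = 1, EXECUTE = 2, NUM_STAGES = 3 };

/* Index of the instruction occupying `stage` during `cycle`, or -1 if the
 * stage is idle (pipeline still filling or already draining).
 * Assumes an ideal pipeline: no stalls, one instruction issued per cycle. */
int instr_in_stage(int cycle, enum stage stage, int num_instrs)
{
    int idx = cycle - (int)stage;
    return (idx >= 0 && idx < num_instrs) ? idx : -1;
}

/* Print one row per cycle showing which instruction each stage holds. */
void print_pipeline(int num_instrs)
{
    static const char *names[NUM_STAGES] = { "fetch", "decode", "execute" };
    int total_cycles = num_instrs + NUM_STAGES - 1;

    for (int c = 0; c < total_cycles; c++) {
        printf("cycle %d:", c);
        for (int s = 0; s < NUM_STAGES; s++) {
            int i = instr_in_stage(c, (enum stage)s, num_instrs);
            if (i >= 0)
                printf("  %s=i%d", names[s], i);
            else
                printf("  %s=--", names[s]);
        }
        printf("\n");
    }
}
```

Calling `print_pipeline(4)` prints one row per cycle; from the third cycle on, a different instruction sits in each of the three stages at once, which is the hardware view described above.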

Taniwha · on Nov 22, 2020

Yes it's thinking about designs partly at that level - but, we don't just design CPUs, it's also understanding synchronization right down to the gate and flop level .... and you need to understand, and deal with, things like metastability ... effectively there exist things that can't always be synchronized, that sometimes fail and you need to deal with them .... and sometimes fail and can't be dealt with, all you can do is to design to minimize those failures .... (done right your design only melts down into a hot buzzy charred mess every century or so - not an issue that software ever needs to deal with)

ksml · on Nov 19, 2020

Concurrency is still hard for me, but I do find it getting much easier over the years :) thanks for the story!

cesarb · on Nov 19, 2020

> However, printk can block (while allocating memory)

No, printk() is magic. It can be called even in NMI context, which is a worse place. Quoting https://lwn.net/Articles/800946/, "[...] kernel code must be able to call printk() from any context. Calls from atomic context prevent it from blocking; calls from non-maskable interrupts (NMIs) can even rule out the use of spinlocks. [...]"

ksml · on Nov 19, 2020

This is really good to know. I had assumed it could block when allocating memory for the formatted string buffer, but the rationale explained in that article makes a lot of sense. Being able to use printk simplifies things a lot.

kanox · on Nov 19, 2020

Also: allocating memory with GFP_ATOMIC doesn't sleep.

m463 · on Nov 20, 2020

now that is technical leadership.

msla · on Nov 20, 2020

If you have to rely on printf debugging, you go to epic lengths to ensure printf always works.

lallysingh · on Nov 19, 2020

eBPF is honestly the first thing to try before writing a module.

I'm glad to see you used a VM. That's the first step in the right direction. Others have mentioned that you should've used printk(), which is true.

I'll mention that you can also run the kernel in a debugger: https://www.kernel.org/doc/html/latest/dev-tools/gdb-kernel-...

ksml · on Nov 19, 2020

I hadn't considered eBPF because I needed some pretty obscure information from the kernel internals (i.e. the addresses of the `struct file`s) and I didn't realize eBPF was as capable as it is. Another commenter suggested trying it, though, so I'm checking it out now!

I did use printk for debugging, but I (incorrectly) assumed it could block. Another commenter pointed out that this is not the case. TIL!

The gdb link looks very helpful and I'll try that next time. Thanks for linking that.

PeterCorless · on Nov 20, 2020

Yeah, my mind immediately went to eBPF too.

"But when BPF got extended, it allowed users to add code that is executed by the kernel in a safe manner in various points of its execution, not only in the network code."

Read more here:

https://thenewstack.io/how-io_uring-and-ebpf-will-revolution...

wolfgang42 · on Nov 20, 2020

Where would you look for a list of what you can do with eBPF and how? (I think maybe I’m searching for a list of hook points?) I keep seeing tantalizing hints about all of the things it lets you do, but the tutorials I’ve seen only seem to cover networking and tracing.

(The project I have in mind at the moment is making a bindfs-like filesystem without FUSE, but I’ve had a few different ideas where eBPF seemed like it might have been a good fit if I could figure it out.)

dmitris · on Nov 20, 2020

https://github.com/iovisor/bcc/blob/master/docs/reference_gu... lists the hook points where BPF code can be attached. Also take a look at https://blogs.oracle.com/linux/notes-on-bpf-1 (there are follow-ups - https://blogs.oracle.com/linux/notes-on-bpf-7 has the links at the bottom) and from the Linux source, https://github.com/torvalds/linux/blob/master/include/uapi/l.... https://docs.cilium.io/en/latest/bpf/ is an extensive reference but with an emphasis on the network-related areas (xdp, tc).

lallysingh · on Nov 20, 2020

There are 1-2 good books on it. I've skimmed them via O'Reilly Safari when I needed something in the past.

megous · on Nov 19, 2020

Linux has some debug options that could have probably helped here. It's a good idea to enable them when developing new code.

https://megous.com/dl/tmp/b6e8f550de4539a8.png

ksml · on Nov 19, 2020

Ah! This would have been really helpful!

PeterCorless · on Nov 20, 2020

Hacker News at its very best.

ksml · on Nov 19, 2020

Hi HN, this was my first attempt at writing any sort of kernel code. I would love to hear your thoughts on this experience and on the fixes I applied, especially from anyone with more Linux experience than me :)

warybeary · on Nov 19, 2020

Have you looked into using eBPF instead of writing a kernel module?

http://ebpf.io for some more insights.

At the very least, it'll provide some useful tooling for you to debug problems in kernel-space.

ksml · on Nov 19, 2020

I hadn't considered this! Can eBPF be used to access arbitrary kernel data structures, though?

warybeary · on Nov 19, 2020

Yes (to a degree) :)

Check out https://github.com/iovisor/bpftrace and the example tools/ for a taste. You'll likely want to play with kprobes/kretprobes.

ksml · on Nov 19, 2020

This is really interesting; I hadn't realized it was so capable/general. I'll look into this. Thanks for the references!

lathiat · on Nov 20, 2020

You should also check out bpftrace which is a specific DSL to write both the kernel and userspace part in one language - rather than the mixed python/C approach people mostly took before that. And you can output things potentially as text or json for parsing.

https://github.com/iovisor/bpftrace

I would also strongly recommend Brendan Gregg's book: http://www.brendangregg.com/bpf-performance-tools-book.html

ylyn · on Nov 19, 2020

Seems like someone did try to get those functions exported, but the maintainer rejected it, saying that no driver should be poking so deep into fd internals. Makes sense. Your use case is kind of niche.

https://lore.kernel.org/lkml/20180730163256.GC27761@infradea...

By the way, C Playground is really helpful for teaching an OS course!

ksml · on Nov 19, 2020

That is really interesting and good to know -- thanks for that!

I hope C Playground is helpful, and I'm building it with teaching in mind. If you teach anywhere and could find it useful, let me know!

waiseristy · on Nov 20, 2020

That entire email chain was unpleasant. Are Linux maintainers typically that combative?

Thorrez · on Nov 20, 2020

> ... and there's a perfectly sane solution to that - it's called git rm.

> The fundamental problem here (besides "who the hell thought that this Fine Piece Of Software belongs anywhere other than in /dev/null?") [...]

Lol, is this a group of people trying to write software, or a group of people having a dissing contest?

tinus_hn · on Nov 20, 2020

It’s the result of someone (appearing to be) trying to play politics to get their way, while their way is not the way the kernel works.

Sounds harsh. Now for comparison try standing next to an electrician and suggest alternate ways of doing things that are dangerous and wrong.

Thorrez · on Nov 21, 2020

> It’s the result

It could be handled differently. The kernel author could simply say "this isn't how the kernel works, so we cannot accept this". There isn't a need to come up with wacky insults, as humorous as they may be.

> Sounds harsh. Now for comparison try standing next to an electrician and suggest alternate ways of doing things that are dangerous and wrong.

To become an electrician you take classes and become certified. How does someone become a kernel developer? I would assume by interacting with other kernel developers, suggesting ideas, getting feedback on those ideas, etc.

An electrician wiring a house is a single person job. An open source project is a team job, and there's a reason development takes place out in the open: so that others can contribute. If outside contributions to the project aren't allowed, why not make it a source available project instead of open source?

tinus_hn · on Nov 21, 2020

You are closing your eyes and then asking someone to show you something.

If you think you know it all you don’t have to ask me.

You’ll probably find out for yourself how it works eventually. Good luck!

Thorrez · on Nov 21, 2020

I certainly don't know it all. Yeah, hopefully I do continue to learn, and I hope I don't have my eyes closed. Thanks, and likewise.

ylyn · on Nov 19, 2020

Here's a hack you could use to get around the functions not being exported: https://github.com/anbox/anbox-modules/blob/master/binder/de...

Soft · on Nov 20, 2020

This will stop working since kallsyms_lookup_name is no longer exported by recent kernels. See [1].

[1]: https://lwn.net/Articles/813350/

ksml · on Nov 19, 2020

Oh, that's clever! I might try that. I really don't feel comfortable building my own kernel

loeg · on Nov 19, 2020

Definitely try to get comfortable with building a kernel eventually. You don't have to run it on your bare metal machine; you can boot test kernels in a VM. The actual test / development process is not especially different between kernel and modules.

noncoml · on Nov 19, 2020

I see the word “nightmare” used a lot in this article.

I wonder if I am the only one that loves debugging difficult/weird problems. It’s something like trying to solve a puzzle. Knowing that the system will never deceive me (it will not be the system’s fault if I get deceived), and that a perfectly reasonable explanation exists for what I observe, helps me not give up.

zaptheimpaler · on Nov 20, 2020

Same. I would love a job consisting solely of jumping into big hairy systems and debugging weird issues. It's much more interesting to understand exactly how things work at every level of the stack (the bottom of the stack being OS/kernel or even hardware stuff, not a backend endpoint or database) than to write code.

zerkten · on Nov 19, 2020

> I wonder if I am the only one that loves debugging difficult/weird problems.

Same here. At times, I'd prefer to just work on debugging things for colleagues versus writing rather boring code. It can give some insights when it comes to design, as well as enabling customer support to fix certain issues.

toast0 · on Nov 20, 2020

It helps to have colleagues that break things in interesting ways. ;) Also important is a supportive manager, and a 'real job' that is usually time flexible; you might need to drop what you're working on to debug an issue when it's happening, so that needs to be mostly OK.

Or if you like networking challenges, having widely distributed users on diverse platforms and networks, and either running your own load balancers or using DNS or application level balancing, so that you can see the actual network flow, and not only the parts that make it through a load balancer.

Of course, it's a lot of frustration when you find the issue, and it's in some random router in some far off locale with no way to contact. Things like the Linux large receive offloading bug that would receive larger than MTU packets because of offloading, then drop the packet (and send ICMP needs frag) because it's larger than the MTU of the destination address. I fixed the FreeBSD bad behavior when getting such an ICMP, but it would be nice if systems operating as routers would update their kernels a couple of times a decade. I could (and have, elsewhere) rant about more MTU problems, but let's just say, they're out there, they're stupid, and it's hard to get them fixed. Ugh.

Glyptodon · on Nov 20, 2020

I enjoy it, but hate that it's almost always for something that needed to be figured out yesterday.

sweettea · on Nov 19, 2020

You probably already did this, but for the audience: one of the best ways to make sure you're using a function reasonably is to use elixir.bootlin.com to look at other uses and make sure you're using the function similarly. For instance, check out https://elixir.bootlin.com/linux/latest/A/ident/for_each_pro... .

ksml · on Nov 19, 2020

Elixir was extremely helpful to me! It didn't always help me understand _why_ code was written the way it was (hence my incorrect use of rcu_read_lock), but it was very helpful to see some examples.

stevekemp · on Nov 20, 2020

I've not done too much kernel programming, but for sure I know that looking for existing uses of code is very helpful.

It looks like the author of the piece did something similar, and noted other people doing similar things to themselves.

I wrote some modules to experiment with the Security Module API, because working with the APIs seemed like a good way to learn how they worked, and what was possible beyond just SELinux, AppArmor, etc.:

https://github.com/skx/linux-security-modules

wyldfire · on Nov 20, 2020

My knee jerk reading this article and seeing a kernel module near 'nodejs' was to grumble and say "wtf they clearly didn't need a kernel module for this". But upon reading deeper I see that accessing the kernel is kinda appropriate.

Regardless of whether you end up using eBPF or a .ko like you already have, you may have a yet simpler option. By leveraging the loader you can do an interposition trick with LD_PRELOAD to hook C library accesses. Maybe this is all you need in order to "help students understand system calls such as open, close, dup2, fork, pipe, and others. "
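For readers who haven't seen the trick, here is a minimal sketch of an interposer for `dup2`, one of the calls listed above. It assumes glibc-style `dlsym(RTLD_NEXT, ...)` lookup; the counter `dup2_hook_calls` and the log format are hypothetical, and a real teaching tool would likely report somewhere other than stderr.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#ifndef RTLD_NEXT              /* fallback if feature macros were fixed earlier */
#define RTLD_NEXT ((void *) -1L)
#endif

/* Hypothetical counter so a caller can see how often the hook fired. */
int dup2_hook_calls = 0;

/* Same signature as libc's dup2(). Once this object comes before libc in
 * the symbol search order (for example via LD_PRELOAD), callers land here. */
int dup2(int oldfd, int newfd)
{
    /* Find the next definition of dup2 in search order: the real one. */
    static int (*real_dup2)(int, int);
    if (!real_dup2)
        real_dup2 = (int (*)(int, int))dlsym(RTLD_NEXT, "dup2");
    if (!real_dup2) {
        errno = ENOSYS;
        return -1;
    }

    int ret = real_dup2(oldfd, newfd);
    int saved_errno = errno;   /* logging must not clobber the caller's errno */

    dup2_hook_calls++;
    fprintf(stderr, "[hook] dup2(%d, %d) = %d\n", oldfd, newfd, ret);

    errno = saved_errno;
    return ret;
}
```

Built as a shared object (something like `gcc -shared -fPIC hook.c -o hook.so`, adding `-ldl` on older glibc) and run as `LD_PRELOAD=./hook.so ./student_program`, every `dup2` call made through libc would be logged. The file and program names are placeholders. This only sees calls that go through the C library, so statically linked binaries and raw `syscall()` use slip past it.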

Just a suggestion. Carry on, good show.

egberts1 · on Nov 20, 2020

Takes me back to the days of ATM device driver debugging. I’ve written 9 kernel drivers. All in all, a dedicated standalone terminal attached to the serial port of the target is still your best friend.

lhoursquentin · on Nov 19, 2020

Great post, also love what you are trying to do with C playground, this is awesome!

I've recently been trying to build something similar, visualizing forks/execve/read/write, but using the strace output of a binary, which is much less challenging.

ksml · on Nov 19, 2020

Thank you! It's open source, and I'd love to hear if you have any suggestions for it. Would also love to see what you're building!

lhoursquentin · on Nov 19, 2020

Cool I'll definitely try to set it up in the coming days!

Here's my humble strace visualizer: https://lhoursquentin.github.io/visual-strace/

nosefrog · on Nov 19, 2020

Great story! I've had a lot of debugging nightmares, but thankfully never anything as bad as that.

One thing that looks fishy is this branch:

  if (container_tasks_len == max_container_tasks) {
    printk("cplayground: ERROR: container_tasks list hit capacity! We "
    "may be missing processes from the procfile output.\n");
    break;
  }

Since you said printk can block, why isn't calling it in the rcu critical section a bug? Is it because you immediately break afterwards and don't try to reference the next task?

ksml · on Nov 19, 2020

That's a good point. I'm hoping that this never gets hit, and if that line ever appears in the logs, then things are already broken. However, it's probably better to improve the failure mode where possible :)

[edit] and yes, since we break and don't follow the `next` pointer in the linked list, that also shouldn't cause any problems.

[edit 2] a sibling comment by cesarb pointed out that printk actually does not block, since it's important for it to be usable in critical sections to debug when the kernel gets into trouble

secondcoming · on Nov 19, 2020

Great article! Reminds me of when I was working on a bug in a phone kernel and adding its equivalent of printk() made the bug disappear! Lauterbach time!

pjmlp · on Nov 20, 2020

Back in the Windows NT/2000 days, IIS executed as part of the kernel, debugging ISAPI extensions was an exercise in patience every time a programming error crashed the kernel and a reboot was in order.

known · on Nov 20, 2020

Free Book https://www.tldp.org/LDP/lkmpg/2.6/html/lkmpg.html

foxhlchen · on Nov 21, 2020

nice article but I think op should use debugfs instead of /proc. debugfs is designed for this purpose.

devit · on Nov 19, 2020

You can do most or all of that by reading /proc/<pid>/fdinfo/<fd> and /proc/<pid>/fd/<fd> or by making system calls on the affected fds (which you can do e.g. by injecting code with LD_PRELOAD or ptrace or with nsenter with fd namespace or equivalent C code).
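As a rough sketch of the /proc approach, the two helpers below read the symlink target from /proc/<pid>/fd/<fd> and the `pos:` line from /proc/<pid>/fdinfo/<fd>. The helper names `fd_target` and `fd_offset` are invented for this example, and reading another user's process would need the usual permissions.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolve what descriptor `fd` of process `pid` refers to by reading the
 * /proc/<pid>/fd/<fd> symlink. Returns the target length, or -1 on error. */
ssize_t fd_target(pid_t pid, int fd, char *buf, size_t buflen)
{
    char path[64];

    if (buflen == 0)
        return -1;
    snprintf(path, sizeof path, "/proc/%d/fd/%d", (int)pid, fd);

    ssize_t n = readlink(path, buf, buflen - 1);
    if (n >= 0)
        buf[n] = '\0';         /* readlink() does not NUL-terminate */
    return n;
}

/* Read the current file offset from the "pos:" line of
 * /proc/<pid>/fdinfo/<fd>. Returns -1 if it cannot be read. */
long long fd_offset(pid_t pid, int fd)
{
    char path[64], line[256];
    long long pos = -1;

    snprintf(path, sizeof path, "/proc/%d/fdinfo/%d", (int)pid, fd);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "pos: %lld", &pos) == 1)
            break;
    }
    fclose(f);
    return pos;
}
```

Note what this does not give you: two descriptors with the same target path and the same offset may or may not be the same open file, which is the gap discussed in the replies.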

Even if you write a kernel driver, iterating over all tasks in the system is a terrible design (there may be millions), not to mention "determining if a task belongs to a C playground program" in the kernel (obviously the kernel should have no knowledge about such specifics).

Of course, if a developer cannot even produce a reasonable overall design, it's not surprising that they aren't capable of writing correct code.

nosefrog · on Nov 19, 2020

"Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community."

https://news.ycombinator.com/newsguidelines.html

ksml · on Nov 19, 2020

I actually cannot get enough information from doing that. Crucially, I need to be able to recognize whether two file descriptors point to the same open `struct file`. (To be clear, this isn't the same as whether they're pointing to the same file path. I need to know when the two file descriptors are sharing the same cursor.) There is no way to do this using existing APIs, because there is nothing identifying a `struct file` besides the memory address of the struct. (The "open file IDs" I mention are hashes of the `struct file` address.)

I did spend a lot of time trying to avoid writing a kernel module, and this was the only way I could find to do it :)
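To illustrate the distinction between "same path" and "same cursor", here is a small in-process sketch: it moves the offset through one descriptor and checks whether the other one sees the move. `shares_cursor` and `demo_shared_cursor` are hypothetical names; the probe only works on seekable files within a single process and is racy, so it demonstrates the concept rather than solving the cross-process problem described above.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if moving the offset through fd `a` is visible through fd `b`
 * (both refer to the same open file), 0 if not, -1 on error.
 * Both descriptors must be seekable. */
int shares_cursor(int a, int b)
{
    off_t a0 = lseek(a, 0, SEEK_CUR);
    off_t b0 = lseek(b, 0, SEEK_CUR);
    if (a0 == (off_t)-1 || b0 == (off_t)-1)
        return -1;

    /* Move `a` to a position that differs from where `b` currently is. */
    off_t probe = b0 + 1;
    if (lseek(a, probe, SEEK_SET) == (off_t)-1)
        return -1;
    int shared = (lseek(b, 0, SEEK_CUR) == probe);

    /* Restore `a`; if the cursor is shared this restores `b` as well,
     * and if it is not shared `b` never moved. */
    lseek(a, a0, SEEK_SET);
    return shared;
}

/* Demo on a readable regular file: a dup()ed descriptor should share the
 * cursor, a second open() of the same path should not.
 * Returns 0 on success and fills in the two results. */
int demo_shared_cursor(const char *path, int *dup_shares, int *reopen_shares)
{
    int a = open(path, O_RDONLY);
    int b = (a >= 0) ? dup(a) : -1;
    int c = open(path, O_RDONLY);
    int rc = -1;

    if (a >= 0 && b >= 0 && c >= 0) {
        *dup_shares = shares_cursor(a, b);
        *reopen_shares = shares_cursor(a, c);
        rc = 0;
    }
    if (a >= 0) close(a);
    if (b >= 0) close(b);
    if (c >= 0) close(c);
    return rc;
}
```

Descriptors inherited across `fork` behave like the `dup` case, which is why the path alone can't tell you whether two processes share a cursor.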

devit · on Nov 19, 2020

You can use the kcmp system call with the KCMP_FILE argument to find out whether two fds point to the same file structure (of course, you should use this as the custom comparison function of a sort algorithm so you don't end up with quadratic run time).
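A minimal sketch of that idea, assuming a Linux kernel built with kcmp support and sufficient ptrace permission over both processes. glibc ships no `kcmp()` wrapper, so this goes through `syscall()`. The `struct fd_ref` type and the helper names are invented for the example, and the fallback ordering used when kcmp fails is an arbitrary choice.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/kcmp.h>    /* KCMP_FILE */
#include <stdlib.h>        /* qsort */
#include <sys/syscall.h>   /* SYS_kcmp */
#include <sys/types.h>
#include <unistd.h>

/* One (process, descriptor) pair to be compared. Hypothetical type. */
struct fd_ref {
    pid_t pid;
    int fd;
};

/* Raw kcmp(KCMP_FILE): 0 = same open file, 1 = first orders before second,
 * 2 = first orders after second, 3 = different but unordered,
 * -1 = error (see errno). */
int kcmp_file(pid_t pid1, int fd1, pid_t pid2, int fd2)
{
    return (int)syscall(SYS_kcmp, pid1, pid2, KCMP_FILE,
                        (unsigned long)fd1, (unsigned long)fd2);
}

/* qsort()-compatible comparator built on kcmp's ordering. */
int fd_ref_cmp(const void *pa, const void *pb)
{
    const struct fd_ref *a = pa, *b = pb;

    switch (kcmp_file(a->pid, a->fd, b->pid, b->fd)) {
    case 0:
        return 0;
    case 1:
        return -1;
    case 2:
        return 1;
    default:
        /* Error or "unordered": fall back to (pid, fd) order so the sort
         * still terminates. Grouping is not reliable in this case. */
        if (a->pid != b->pid)
            return a->pid < b->pid ? -1 : 1;
        return (a->fd > b->fd) - (a->fd < b->fd);
    }
}

/* Sort so that descriptors sharing an open file end up adjacent. */
void sort_fd_refs(struct fd_ref *refs, size_t n)
{
    qsort(refs, n, sizeof *refs, fd_ref_cmp);
}
```

After sorting, a single linear pass comparing neighbours with `kcmp_file` is enough to group descriptors that share an open file, which avoids comparing every pair.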

Linux has a project called CRIU that can save and restore processes to disk without needing additional kernel modules, so pretty much all state is already gettable and settable from user space.

ksml · on Nov 19, 2020

I can't do that across processes, though, can I? (to see whether two processes have file descriptors pointing to the same open file) edit -- it does look like it works cross-process!

I hadn't heard of CRIU. I'll check that out. (edit: CRIU looks super useful. I think the speed/overhead of snapshotting will decide whether I can use it for this project, but I can imagine it being handy in the future regardless. Thanks for the link.)

dilyevsky · on Nov 19, 2020

I recommend checking out podman (or docker) - they have built-in criu support. Otherwise you’ll need some other namespacing mechanism to avoid colliding pids

ksml · on Nov 19, 2020

Every C Playground program runs in a Docker container, so this is already perfectly set up for CRIU. I might give it a try!

dilyevsky · on Nov 19, 2020

Also check out the kcmp man page; it totally allows you to compare fds across pids.