Some ways to get better at debugging (jvns.ca)
94 points by mfrw on Aug 31, 2022 | 47 comments


I’d like to signal boost strace. It is a lifesaver. In some ways it’s more important than your compiler: it’s how you know what wrong-ass directory your compiler is pulling the broken config from.

I only know like 10% of the tool and it’s just indispensable. Use strace!!!!!!!


It's especially great in how easy it is to just immediately drop in and get some useful feedback.

Except not on macOS. Lots of tracing is denied by System Integrity Protection; you can use Instruments.app [0], but in my experience it doesn't lend itself as well to quick exploration.

[0] - something like: `xctrace record --template 'System Trace' --launch -- /bin/ls`, then open the trace with Instruments.app.


>it’s how you know what wrong-ass directory your compiler is pulling the broken config from

Mind=blown. Of course! I've been neglecting this thing for so many years!

Question: any recommended reading for using strace effectively?


`strace` has a bunch of features that I'd like to learn thoroughly someday, but I get weekly if not daily value from a few basic invocations:

`run_my_build.whatever` -> hangs, damn.

`strace run_my_build.whatever` -> awesome i can see all the system calls so i know what `.so`s it's pulling in, and what config files it's trying to read, and if it's hanging on a network socket or whatever. This is usually where `strace` just immediately solves my problem.

If you've got like a background daemon that's misbehaving you can do `strace -p <PID>` and it will attach and start printing out the syscalls; this can also be really useful.

`strace` (on all my systems at least) logs to `STDERR` by default, so sometimes you want something like `2>/tmp/log.strace.blah` to save the trace, or you want to interleave it with the `STDOUT` of the process and search the combined stream. It's just the usual shell stuff: `strace whatever 2>&1 | rg -C ...`.

My use of the tool is very basic, but that's part of what makes it such a great tool, a few simple invocations will just save your bacon. This is especially true on a new team/company or whatever where you run the thing from the Wiki and it doesn't build or start or...
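Once the trace is saved to a file as described above, a few lines of Python can pull out which files the process tried to open and which attempts failed. This is only a sketch: the regular expression covers the common `open`/`openat` line shape, the function name `opened_paths` is made up for illustration, and the sample log is hand-written to mimic strace's format rather than captured from a real run.

```python
import re

# Matches lines such as:
#   openat(AT_FDCWD, "/etc/app.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
# A leading PID (as printed when following child processes) is tolerated.
OPEN_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?'   # optional "[pid 123] " or "123 " prefix
    r'(?:open|openat)\('
    r'(?:[^,"]+,\s*)?'                 # optional dirfd argument, e.g. AT_FDCWD
    r'"(?P<path>[^"]*)"'               # the quoted path
    r'.*\)\s+=\s+(?P<ret>-?\d+)'       # return value after the closing paren
    r'(?:\s+(?P<errno>[A-Z]+))?'       # errno name, present only on failure
)

def opened_paths(log_text):
    """Return (path, succeeded, errno_name) for each open/openat line."""
    results = []
    for line in log_text.splitlines():
        m = OPEN_RE.match(line)
        if m:
            succeeded = int(m.group("ret")) >= 0
            results.append((m.group("path"), succeeded, m.group("errno")))
    return results

# Hypothetical sample, written by hand; the paths are invented.
SAMPLE = '''\
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
close(3) = 0
openat(AT_FDCWD, "/home/me/.config/build.toml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/build.toml", O_RDONLY) = 3
'''

for path, succeeded, errno_name in opened_paths(SAMPLE):
    print("ok  " if succeeded else "FAIL", path, errno_name or "")
```

In the sample, the failed lookup followed by a successful one is exactly the "which config is it actually reading" answer.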


If you use `htop`, you can press `s` while hovering over the process. In addition to providing the `strace -p` functionality, you also get some rudimentary helper functionality like searching, filtering, or manually scrolling vs. tailing the output.


That's super cool. Thanks for the tip!


Nice. I definitely need to use this more. Thanks :)


Having strace and gdb working always is a major reason why I prefer to develop on Linux. I don't typically use much of the feature set, but I almost always have to reach for '-s' as the default string length of 32 is much too tiny.


Does anyone have alternatives that work well on Mac? Right now, I keep a virtual machine so that I can use strace (among other reasons), which is workable but tough to use quickly.


Debugging is an important skill in software engineering and it’s hardly ever taught formally in school. So much time is spent investigating red herrings like “let’s add capacity to system x because rpcs from y to x are slow”, without debugging simpler things like: “why is each rpc slower than before?” “is it because of the total number of rpcs or single rpcs?”

If there were formal education for this, I think a lot of people would sleep better.


For what it's worth, I think physical science education is closer to what you are describing.

I studied chemistry, and I'm not sure if it's a causal relationship, but I find I am much better at debugging (in development) and "firefighting" (in production) than my peers.

I guess my conclusion is CS students should be required to take more science classes, or, preferably, software engineer hopefuls should be pushed to study other disciplines.


That’s a very astute observation. I agree with utility of science education for programming. Something that science and programming share is the process of reductionism — identifying causal relationships by breaking into ever smaller subsystems.


A higher level approach that I try to follow is to design a quick experiment, log output etc., as soon as I 'think I know what the problem is related to'. Because I'm very often wrong about correlated events in the system actually being related to the cause of the bug, the quicker I can invalidate that bad idea with real evidence the better.

My longest debugging sessions were too long because I spent so much time trying to validate a hunch when I should have been trying to invalidate it. This might seem like common sense but it's taken me years to figure this out and it's really made me much more efficient at finding root causes.


Recording yourself debugging, then possibly reviewing the video later, is a good way to improve.

Maybe during debugging I consider the strategy of writing a minimal reproduction, but decide it's too time consuming. Then after reviewing the video, I can see that it would've been much faster if I had just written the reproduction to begin with.

Taking breaks does interfere with this, but I still think it's often a good idea.

Another strategy is to add assertions for your hypotheses, which can sometimes be preferable to logging.


Many decades ago I spent a few days with a colleague in one exhausting days-long continuous session, debugging a generational garbage collector. The nature of those (essentially a global graph rewriting) is such that failures will be greatly separated from their causes.

I really learned to appreciate an abundance of asserts; the sooner you catch the issue, the less time you have to spend chasing it around. Of course, asserts are in essence filtering the allowable state at a point and it's always more powerful when you can express that in the type system, which is why I love languages with rich static type systems, like Rust and Haskell.
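To make the "catch it sooner" point concrete, here is a minimal sketch in Python. The binary min-heap is just a stand-in data structure chosen for illustration (it has nothing to do with the garbage collector mentioned above), and `check_invariants` and `push` are hypothetical names. The idea is that the invariant is asserted right after every mutation, so a violation surfaces next to its cause instead of many calls later.

```python
def check_invariants(heap):
    """Assert the min-heap property: every parent is <= each of its children."""
    for child in range(1, len(heap)):
        parent = (child - 1) // 2
        assert heap[parent] <= heap[child], (
            f"heap property broken at index {child}: "
            f"parent {heap[parent]} > child {heap[child]}"
        )

def push(heap, value):
    """Append a value and sift it up to restore the heap property."""
    heap.append(value)
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] <= heap[i]:
            break
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent
    check_invariants(heap)  # fail here, next to the cause, not much later

heap = []
for value in [5, 3, 8, 1]:
    push(heap, value)
print(heap[0])  # smallest element sits at the root
```

The check is linear in the heap size, so in practice one might only enable it in debug builds or tests.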


I think I've spent more time debugging errors (hitting head against wall) that come from caching than anything else.

This is some combination of: content cached by Cloudflare / Vercel / node server (e.g. using "node cache") / the browser (figuring out SWR, why it's working, not working, working too well, etc. etc.) / cookies/local storage / PWA settings...

Aside from using browser tools like Chrome's Network and Application tabs combined with switching to different browsers and outputting a lot of console logs... what tools should I be using to get to the root of these issues?


Just spent three hours debugging a thing that turned out to be a cache issue. I think a big part of the problem is that it's not always obvious that the resource is coming from a cache. Some proxy somewhere is serving old data, but you won't know until you actually compare the data with the origin.
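One cheap check for "is this coming from a cache?" is to look at the response headers and to compare the body against what the origin returns. The sketch below assumes you have already fetched both responses by whatever means you like. The helper names are hypothetical, and the header list is only a starting point: `Age` is standard HTTP, while the others are vendor-specific and may not appear on your stack.

```python
import hashlib

# Headers that commonly reveal a cache somewhere in the path.
CACHE_HINT_HEADERS = ("age", "x-cache", "cf-cache-status", "x-vercel-cache")

def cache_hints(headers):
    """Return the cache-related headers present in a response (names lower-cased)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in CACHE_HINT_HEADERS if name in lowered}

def same_body(body_a, body_b):
    """Compare two response bodies (bytes) by SHA-256 digest."""
    return hashlib.sha256(body_a).digest() == hashlib.sha256(body_b).digest()

# Hypothetical responses: one straight from the origin, one through a CDN.
origin = {"headers": {"Content-Type": "text/html"}, "body": b"<h1>v2</h1>"}
via_cdn = {"headers": {"Age": "8123", "CF-Cache-Status": "HIT"}, "body": b"<h1>v1</h1>"}

print(cache_hints(via_cdn["headers"]))
print(same_body(origin["body"], via_cdn["body"]))
```

If the hints are non-empty and the bodies differ, the stale copy is a strong suspect.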


Read the stack trace / error message

A lot of the time the answer is there if you read it :)


I have no idea how many times someone has come and told me my code is broken, and when I run it like they did, the error message clearly indicates the input was incorrect. Sometimes I even detect the incorrect input and print a message like "input 'manufacturer' must be either 'lexus' or 'toyota'." and they still keep trying to input 'bananna' or '3.14159' or '{'lexus', 'toyota'}' or something that isn't just 'lexus' or 'toyota'!!!


Quickly parsing / perusing stack traces is sort of a skill that comes from experience (looking at you, large-ass Java stack traces with multiple "caused by:"s).

I've noticed while coaching / working with junior developers that this is a skill that some haven't quite developed yet, and others clearly have.
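The same "skip to the innermost cause" habit can be sketched in Python, whose chained exceptions play a role similar to Java's "Caused by:" sections. This is an illustration in a different language from the one mentioned above, and `root_cause`, `parse_port` and `start_server` are hypothetical names.

```python
def root_cause(exc):
    """Follow __cause__ / __context__ links to the innermost exception."""
    seen = {id(exc)}
    while True:
        nxt = exc.__cause__ if exc.__cause__ is not None else exc.__context__
        if nxt is None or id(nxt) in seen:  # end of chain, or a cycle
            return exc
        seen.add(id(nxt))
        exc = nxt

def parse_port(settings):
    try:
        return int(settings["port"])
    except KeyError as e:
        raise ValueError("missing 'port' setting") from e

def start_server(settings):
    try:
        return parse_port(settings)
    except ValueError as e:
        raise RuntimeError("server failed to start") from e

try:
    start_server({})
except RuntimeError as e:
    cause = root_cause(e)
    print(type(e).__name__, "was ultimately caused by", type(cause).__name__)
```

The outermost message says the server failed to start; the innermost exception says which key was missing, which is usually the line worth reading first.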


I’m not even talking stack traces, this case was literally someone not reading a line of text.


> Read the stack trace / error message

Unless you're using golang. There's no stack trace by default. It's left as an exercise to the programmer to wrap and add context to errors.


Some stack traces / error messages are less than awesome though


Does #5 sound like a tautology to anyone else?

"In order to know how to solve problems get more experience solving problems so you know how."

or maybe

"To know the answers to your questions learn what you need to know to answer them"


"Good judgment comes from experience. Experience comes from bad judgment."

One of my fave quotes (Attributed to "Nasrudin," but usually referenced from Will Rogers or Rita Mae Brown).


It's very much true though. A master has failed far more times than a beginner has tried.


I'll add two more that I find important to the "Strategies" category:

1. Write things down. Have an assumption? Write it down. Then write down how you will test your assumption and the results.

2. Being able to systematically enumerate possibilities and rule them out one by one, often times indirectly. Yes, one needs to understand the system but one also needs to have a good strategy to use that knowledge. By "indirectly," I mean being able to reason like so: If P, then Q. Not Q. Therefore not P.
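The two strategies above can be combined into a tiny written log: each hypothesis P is recorded together with the observation Q it predicts, and any hypothesis whose prediction is not observed gets ruled out. This is only a sketch; the `Hypothesis` class, the `rule_out` function and the example checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    name: str                     # P: the suspected cause
    prediction: str               # Q: what must be observable if P holds
    observed: Callable[[], bool]  # a check that reports whether Q is seen

def rule_out(hypotheses):
    """Split hypotheses into (surviving, eliminated), printing a written record."""
    surviving, eliminated = [], []
    for h in hypotheses:
        if h.observed():
            surviving.append(h)   # Q holds: P is still possible, not proven
            verdict = "still possible"
        else:
            eliminated.append(h)  # not Q, therefore not P
            verdict = "ruled out"
        print(f"{h.name}: expected '{h.prediction}' -> {verdict}")
    return surviving, eliminated

# Hypothetical observations gathered while debugging a slow request.
facts = {"db_query_ms": 12, "dns_lookup_ms": 900}

candidates = [
    Hypothesis("slow database", "query takes > 500 ms",
               lambda: facts["db_query_ms"] > 500),
    Hypothesis("slow DNS", "lookup takes > 500 ms",
               lambda: facts["dns_lookup_ms"] > 500),
]
surviving, eliminated = rule_out(candidates)
```

Note the asymmetry: a failed prediction eliminates a hypothesis, while a confirmed one only keeps it on the list.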


That reminds me of the Mikado method [1]:

  - Set a goal
  - Experiment
  - Visualize
  - Undo
[1] Ellnestam, O. Brolund, D. (2014). The Mikado Method. Manning Publications Co.


Or as written by George Pólya in How To Solve It

  1. See
  2. Plan
  3. Do
  4. Check


The "OODA loop" can also be applied here:

- Observe

- Orient

- Decide

- Act


I always see these types of acronyms and check lists and wonder why they exist. Is this something that human beings use?



Completely irrelevant to debugging, but the "OODA loop" concept is one that's applied often in martial arts and close-quarters combat.


- Observe: observe bug.

- Orient: check the code that may be involved, and check the values of environment-related variables.

- Decide: decide what you will change in your code to try to fix the bug.

- Act: implement the fix.

Repeat all steps until you have found a fix.

Of course you can use a different approach to debugging, but this one works quite well too.


I think debugging is the number one skill for the average programmer. It implies being quite good at reverse engineering. But as far as debugging is concerned, it is a must to know:

1) how to reset a frame

2) how to set conditional breakpoints

3) how to binary search for interesting breakpoint candidates. Stepping down from a single point isn't very efficient; it is better to put a breakpoint before and after, then narrow the instruction space by moving them both. That is for non-obvious bugs, of course.
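The "one breakpoint before, one after, then move them together" idea is ordinary binary search. Here is a minimal sketch outside any debugger, assuming the program can be viewed as an ordered list of steps, that the state is good before the first step and bad after the last, and that once it goes bad it stays bad. The names `first_bad_step`, `steps` and `bad_after` are hypothetical.

```python
def first_bad_step(n_steps, state_is_bad_after):
    """Return the smallest index i such that the state is bad after step i."""
    lo, hi = 0, n_steps - 1          # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if state_is_bad_after(mid):
            hi = mid                 # the culprit is at mid or earlier
        else:
            lo = mid + 1             # the culprit is after mid
    return lo

# Hypothetical pipeline; the bug is that the value must never go negative.
steps = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 100,   # the offending step
    lambda x: x + 3,
    lambda x: x * 5,
]

def bad_after(i):
    """Replay steps 0..i from a known starting value and inspect the state."""
    value = 1
    for step in steps[: i + 1]:
        value = step(value)
    return value < 0

print(first_bad_step(len(steps), bad_after))  # → 2
```

With n steps this needs about log2(n) replays instead of stepping through every one.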

Bonus point: use an ide or something to display objects in a custom way (if your language is OO) like IntelliJ allows.

And another thing to keep in mind is Sherlock's rule: once everything possible is ruled out, what remains is the impossible. Your bug is there.


> adding extra logging

If only that could be done without altering production code or restarting production services that don’t have any facility to change the logging levels on the fly.

Was investigating an issue today in some code that only logged that an HTTP request was made to a specific endpoint: no details on why it decided to make that request, where it was making the request from, what the request was expected to do, or whether it succeeded. The unit test approach is the only viable option at that level; isolate as little data as possible to replay until it succeeds. But that costs a lot of time and may not be safe depending on the side-effects. Maybe knowing how to snapshot and restore system state should be on the list too?
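For services you do control, a facility for changing the log level on the fly can be quite small. The following is a sketch for a Python service on a Unix-like system, using the standard `logging` and `signal` modules. The logger name `myservice`, the choice of `SIGUSR1`, and the function names are all assumptions made for illustration; it does nothing for a service that was built without such a hook.

```python
import logging
import signal

logger = logging.getLogger("myservice")  # hypothetical logger name
logger.setLevel(logging.INFO)

def toggle_debug(signum=None, frame=None):
    """Flip the logger between INFO and DEBUG without restarting the process."""
    new_level = logging.INFO if logger.level == logging.DEBUG else logging.DEBUG
    logger.setLevel(new_level)
    logger.warning("log level is now %s", logging.getLevelName(new_level))

def install_toggle():
    """Register the toggle on SIGUSR1 (must be called from the main thread)."""
    if hasattr(signal, "SIGUSR1"):       # not available on every platform
        signal.signal(signal.SIGUSR1, toggle_debug)

# At service startup one would call install_toggle(); afterwards
# `kill -USR1 <PID>` from a shell switches verbose logging on or off.
```

Calling `toggle_debug()` directly has the same effect, which also makes the behaviour easy to test.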


you can do that using macros in langs such as c, cpp or elixir


If you're in C/C++ the number 1 way is to use RR. It's gdb+time travel. Reverse watches on variables are the absolute best thing ever.


Almost as good as https://pernos.co (a commercial product built on top of rr produced by the developers of rr)


My favorite tip is "Stop Thinking and Start Looking".

Looking at the error and trying to deduce where the problem is can work.

But normally, you need to run the code and observe what happens. When what does happen differs from what should happen, you can zero in on the bug empirically.

This process is typically a binary search, which finishes in a few steps, as opposed to intellectual riddle solving.


This.


When you cannot find the origin of the bug, try to isolate the problem as much as possible. By isolation I mean removing as much code as possible while still reproducing the bug. Only then start debugging. That way you end up debugging less code and can concentrate your attention on a smaller piece of it.
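The isolation step can sometimes be automated. Below is a minimal greedy sketch, assuming the failing input can be treated as a list of independent pieces (lines, records, test steps) and that you have a function that reports whether a given subset still reproduces the bug. The names `minimize` and `still_fails` are hypothetical, and this is a simplified relative of delta debugging rather than the full algorithm.

```python
def minimize(items, still_fails):
    """Greedily drop items one at a time while the bug keeps reproducing."""
    assert still_fails(items), "the full input must reproduce the bug"
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate   # item i was not needed for the bug
                changed = True
                break                 # restart the scan on the smaller input
    return current

# Hypothetical bug: it only triggers when both "b" and "e" are present.
def still_fails(pieces):
    return "b" in pieces and "e" in pieces

print(minimize(list("abcdefg"), still_fails))  # → ['b', 'e']
```

Each call to `still_fails` stands for one run of the reproduction, so this is only practical when a run is cheap.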


Somehow they missed the obvious: "be sloppier while coding--thinking less and typing more--so you cause more bugs and are forced to get almost continual practice debugging".


Nice article, but ...

WHERE IS THE BIT ABOUT THE DUCK????


> explaining the bug to a friend and then figuring out what’s wrong halfway through

Just assume the friend is a rubber duck.


unit tests and good design eliminate the vast majority of needs for old-fashioned debugging


They really don't.



