I broadly agree with the thesis of the post, which if I understand correctly is ...

rwmj · on Nov 4, 2024

> - Is using the whole 8 bytes right for the estimate? Pushing the stack pointer is the first instruction in the prologue and it's literally 1 byte. Epilogue is symmetrical.

I believe it's because of the landing pad for Control Flow Integrity which basically all functions now need. Grabbing main() from a random program on Fedora (which uses frame pointers):

    0000000000007000 <main>:
    7000:       f3 0f 1e fa       endbr64     ; landing pad
    7004:       55                push   %rbp ; set up frame pointer
    7005:       48 89 e5          mov    %rsp,%rbp

It's not much of an issue in practice as the stack trace will still be nearly correct, enough for you to identify the problematic area of the code.
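To illustrate why a sample taken inside those first 8 bytes is only slightly off, here is a toy model of a frame-pointer walk. The addresses and the `main` -> `f` -> `g` call chain are made up for illustration, and `unwind` is a hypothetical helper, not any profiler's real code. Before `push %rbp` and `mov %rsp,%rbp` have run, `%rbp` still points at the caller's frame, so the walk skips the immediate caller:

```python
# Toy model of frame-pointer unwinding on x86-64. All addresses are hypothetical.
# memory[rbp]     holds the caller's saved frame pointer
# memory[rbp + 8] holds the return address (shown here as the caller's name)
memory = {
    0x7FF0: 0,      0x7FF8: "_start",  # main's frame (end of the chain)
    0x7FD0: 0x7FF0, 0x7FD8: "main",    # f's frame
    0x7FB0: 0x7FD0, 0x7FB8: "f",       # g's frame
}

def unwind(memory, current_function, rbp):
    """Walk the saved-rbp chain and return the function names seen."""
    trace = [current_function]
    while rbp != 0:
        trace.append(memory[rbp + 8])  # return address into the caller
        rbp = memory[rbp]              # follow the saved frame pointer
    return trace

# Sample taken in the body of g: rbp points at g's own frame.
full = unwind(memory, "g", 0x7FB0)

# Sample taken in g's prologue: rbp still points at f's frame.
early = unwind(memory, "g", 0x7FD0)

print(full)   # every frame is present
print(early)  # f is missing, the rest of the chain is intact
```

In this model the prologue sample loses exactly one frame (`f`), which matches the "nearly correct" description above.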

> - Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?

Yes, shadow stacks are limited to 32 entries on the most recent Intel CPUs (and as few as 4 entries on very old ones). However, they are basically cost-free, so that's a big advantage.

I think SFrame is a sensible middle ground here. It's saner than DWARF and has a long history of use in the kernel so we know it will work.

Sesse__ · on Nov 4, 2024

If you're limited to 32 entries, why not just use LBR, then? It has basically the same pros and cons.
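A rough sketch of what a fixed-depth record means for deep stacks, taking the 32-entry figure from this thread as an assumption. This is a toy ring buffer, not a model of how LBR or any real hardware actually records branches, and `record_calls` is a hypothetical name:

```python
from collections import deque

DEPTH = 32  # figure quoted in the thread, treated as an assumption here

def record_calls(call_chain, depth=DEPTH):
    """Keep only the most recent `depth` entries, like a fixed-size buffer."""
    buf = deque(maxlen=depth)
    for fn in call_chain:
        buf.append(fn)  # the oldest entry is dropped once the buffer is full
    return list(buf)

# Hypothetical call stack that is 40 frames deep, outermost frame first.
chain = [f"frame_{i}" for i in range(40)]
visible = record_calls(chain)

print(len(visible), visible[0], visible[-1])
```

With a 40-deep stack, the 8 outermost frames fall off, so the trace is complete near the leaf but no longer reaches back to the root.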

Sesse__ · on Nov 4, 2024

> - 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.

TBH I wouldn't be surprised on x86. There are so many registers to be pushed and popped due to the ABI, so every time I profile stuff I get depressed… Aarch64 seems to be better, the prologues are generally shorter when I look at those. (There's probably a reason why Intel APX introduces push2/pop2 instructions.)

manwe150 · on Nov 4, 2024

This sounds to me more like an inlining problem than an ABI problem. If the calls take as much time as the running, perhaps you just need a better language that doesn't arbitrarily prevent inlining due to compilation boundaries (e.g. basically any modern language that isn't in the C/C++ family, before LTO).

Sesse__ · on Nov 4, 2024

I see this in LTO/PGO binaries as well. If a function is 20 instructions long, it's not like you can inline it uncritically, yet a five-cycle prologue and a five-cycle epilogue will hurt. (Also, recursive functions etc.)
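Back-of-the-envelope version of that arithmetic, as a sketch. The one-cycle-per-instruction figure for the body is purely an assumption (real throughput varies a lot), and `prologue_epilogue_share` is a hypothetical helper:

```python
def prologue_epilogue_share(body_cycles, prologue_cycles=5, epilogue_cycles=5):
    """Fraction of a call's cycles spent in prologue plus epilogue."""
    overhead = prologue_cycles + epilogue_cycles
    return overhead / (overhead + body_cycles)

# Assumption: a 20-instruction body costs about 20 cycles.
small = prologue_epilogue_share(body_cycles=20)

# How large the body has to be before the overhead drops to 5%.
large = prologue_epilogue_share(body_cycles=190)

print(f"{small:.1%} {large:.1%}")
```

Under those assumptions a small function spends about a third of its cycles on setup and teardown, and the body needs to be roughly 190 cycles before that share falls to 5%.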

audidude · on Nov 4, 2024

> Shadow stacks are cool but aren't they limited to a fixed number of entries?

On currently available hardware, yes. But I think some of the future Intel stuff was going to allow for much larger depth.

> Is the memory overhead of lookup tables for very large programs acceptable?

I don't think SFrame is as "dense" as DWARF as a format, so you trade a bit of memory size for a much faster unwind experience. But you are definitely right that this adds memory pressure that could otherwise be ignored.
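For a feel of the size/speed trade-off, here is a minimal sketch of a table-driven lookup. This is not the SFrame encoding, just a sorted list with a binary search; the two rows for `0x7000` and `0x7005` are derived from the `main` disassembly earlier in the thread, and the third row is invented:

```python
import bisect

# Hypothetical table: (first PC of a range, offset from %rsp to the CFA).
TABLE = [
    (0x7000, 8),   # main entry: only the return address is on the stack
    (0x7005, 16),  # after `push %rbp` has executed
    (0x7040, 8),   # entry of some other (made-up) function
]
STARTS = [start for start, _ in TABLE]

def cfa_offset(pc):
    """Binary-search the table for the row covering `pc`."""
    i = bisect.bisect_right(STARTS, pc) - 1
    if i < 0:
        raise LookupError(f"no unwind info for pc {pc:#x}")
    return TABLE[i][1]

print(cfa_offset(0x7004), cfa_offset(0x7005))
```

Every range costs a row in memory, but the lookup is a single binary search, and unlike a frame-pointer walk it stays exact inside the prologue.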

Especially if the anomalies are what they sound like, just account for them statistically. You get a PID for cost accounting in the perf_event frame anyway.

quotemstr · on Nov 4, 2024

> temporary compromise until the whole ecosystem gets its act together and manages to agree on some form of out-of-band tracking of frame pointers,

Temporary solutions have a way of becoming permanent. I was against the recent frame pointer enablement on the grounds of moral hazard. I still think it would have been better to force the ecosystem to get its act together first.

Another factor nobody is talking about is JITed and interpreted languages. Whatever the long-term solution might be, it should enable stack traces that interleave accurate source-level frame information from native and managed code. The existing perf /tmp file hack is inadequate in many ways, including security, performance, and compatibility with multiple language runtimes coexisting in a single process.

audidude · on Nov 4, 2024

It's a disaster, no doubt.

But, at least from the GNOME side of things, we've been complaining about it for roughly 15 years and kept getting push-back in the form of "we'll make something better".

Now that we have frame pointers enabled in Fedora, Ubuntu, Arch, etc., we're starting to see movement on realistic alternatives. So in many ways, I think the moral hazard was waiting until 2023 to enable them.