Pipelining is the direct reason behind 1996 Quake running 30 fps on Intel Pentiu...

gary_0 · on Dec 30, 2024

Specifically, Carmack exploited the fact that on the Pentium, integer instructions could run in parallel with floating-point division[0]. This goes to show that an optimization that usually gets you ~10% might get you 200% depending on what software you're running. And that no implementation detail is safe from an ambitious software engineer.

[0] https://news.ycombinator.com/item?id=38249029

rasz · on Dec 31, 2024

No, that's not it, as explained in my reply to that TerrifiedMouse comment in 2023 :)

On x86, integer instructions _never_ waited for a floating-point opcode to retire. You can check it yourself by reading the busy flag in the FPU status register: if the FPU were blocking, you could never catch it in the BUSY state, and FWAIT would be useless :-)

Abrash exploited the fact that the Pentium was the first x86 CPU on which x87 FPU instructions could run pipelined, overlapping one another. All other x86 vendors' FPUs waited for the previous FPU instruction to retire.

https://www.agner.org/optimize/microarchitecture.pdf page 47

    While floating point instructions in general cannot be paired, many can be pipelined, i.e. one
    instruction can begin before the previous instruction has finished. Example:
     fadd st1,st0 ; Clock cycle 1-3
     fadd st2,st0 ; Clock cycle 2-4
     fadd st3,st0 ; Clock cycle 3-5
     fadd st4,st0 ; Clock cycle 4-6
    Obviously, two instructions cannot overlap if the second instruction needs the result of the
    first one. Since almost all floating point instructions involve the top of stack register, ST0, 
    there are seemingly not very many possibilities for making an instruction independent of the
    result of previous instructions. The solution to this problem is register renaming. The FXCH
    instruction does not in reality swap the contents of two registers; it only swaps their names.
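
To put numbers on the quoted example, here is a toy model (a sketch only, not from the manual; `finish_cycle` is a made-up name). It assumes fully independent instructions, one issue per cycle when pipelined, and a fixed 3-cycle latency as in the FADD listing above.

```python
def finish_cycle(n_ops, latency, pipelined):
    """Cycle on which the last of n_ops independent instructions completes."""
    if n_ops == 0:
        return 0
    if pipelined:
        # One instruction issues per cycle; the last one issues on cycle
        # n_ops and occupies `latency` cycles including its issue cycle.
        return n_ops + latency - 1
    # Non-pipelined: each instruction waits for the previous one to retire.
    return n_ops * latency


print(finish_cycle(4, 3, pipelined=True))   # 6, matches "Clock cycle 4-6"
print(finish_cycle(4, 3, pipelined=False))  # 12
```

The model ignores dependencies on ST0 and the FXCH renaming trick, so it only illustrates the best case for independent operations.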

Paradoxically, the often-mentioned texture divide every 16 pixels overlaps just fine on all CPUs and doesn't explain the performance discrepancies. Intel FDIV latency is 19 cycles compared to 24 on Cyrix, a mere ~20%, yet you need almost double the MHz on a Cyrix to match FPS numbers. The answer lies in the rest of Quake's heavily pipelined FPU code.
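
A back-of-the-envelope check of that claim, assuming the worst case where the divide is not overlapped with anything at all (the names below are made up for illustration; the latencies are the ones quoted above):

```python
PIXELS_PER_DIVIDE = 16                       # one texture divide per 16 pixels
FDIV_LATENCY = {"Pentium": 19, "Cyrix": 24}  # cycles, as quoted above


def fdiv_cycles_per_pixel(latency, pixels=PIXELS_PER_DIVIDE):
    """Divide cost amortized per pixel if nothing overlaps with it."""
    return latency / pixels


for cpu, latency in FDIV_LATENCY.items():
    print(cpu, fdiv_cycles_per_pixel(latency))
# Pentium 1.1875
# Cyrix 1.5
```

Even with zero overlap the divide accounts for roughly a third of a cycle per pixel of difference, which is far too small to explain needing almost twice the clock speed.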

gpderetta · on Dec 30, 2024

So that was not actually pipelining, but superscalar execution. Incidentally division wasn't pipelined at all.

ack_complete · on Dec 31, 2024

Pipelining did also play a role. The Quake inner rasterization loop has a decent amount of non-division math in it as well that leverages the Pentium's ability to execute FP adds/multiplies at 1/cycle. The K6 and 6x86 FPUs were considerably slower: 2 and 4 cycles, non-pipelined (http://www.azillionmonkeys.com/qed/cpuwar.html).

Additionally, the FXCH instructions required to optimally schedule FPU instructions on the Pentium hurt 486/K6/6x86 performance even more since they cost additional cycles. Hard for the 6x86 to keep up when it takes 7 cycles to execute an FADD+FXCH pair vs. 1 for the Pentium.
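
As a rough illustration of how those per-instruction numbers compound, here is a sketch using only the cycle counts quoted above. The table and function names are made up, and the model assumes a long run of independent operations, ignoring pipeline fill, dependencies, integer code, and memory.

```python
# Cycles per FP add/multiply as quoted above (Pentium pipelined, others not).
CYCLES_PER_FP_OP = {"Pentium": 1, "K6": 2, "6x86": 4}
# Cycles per FADD+FXCH pair as quoted above.
CYCLES_PER_FADD_FXCH = {"Pentium": 1, "6x86": 7}


def slowdown_vs_pentium(table):
    """Ratio of each CPU's cycle cost to the Pentium's for the same work."""
    return {cpu: cycles / table["Pentium"] for cpu, cycles in table.items()}


print(slowdown_vs_pentium(CYCLES_PER_FP_OP))
print(slowdown_vs_pentium(CYCLES_PER_FADD_FXCH))
```

Real Quake code is a mix of FPU and integer work, so the overall gap is much smaller than these per-instruction ratios; they only show where the extra cycles come from.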