> The frontend decoder hasn't been a frequent bottleneck in Intel cores for a lo...

psykotic · on Dec 20, 2020

Of course a fixed-length ISA has an inherent advantage for parallel decoding efficiency. The question is whether that is a decisive advantage in M1's impressive performance. After Intel refined their decoder and uop cache, you virtually never see that part of the frontend as a bottleneck when doing microarchitectural profiling. That's been true since Sandy Bridge but even more so since Skylake.

All the legacy junk in x86 is obviously a pain for Intel to support. Any blank-slate ISA is going to have an advantage there.

twic · on Dec 20, 2020

Presumably the importance of decode bandwidth depends on what you're decoding.

Most classic computationally intensive work (video encoding, science, but also benchmarks) spends its time in fairly tight loops or small kernels, running over large data. uop caches make decode bandwidth irrelevant here.

But general usage of a machine sees the instruction pointer wander all over the place (particularly if you have multiple tabs of JavaScript open). More decode bandwidth means more performance here.

Are compilers an example of a heavy workload with a large hot code size? It would be interesting to compare the M1's advantage in compiling to its advantage in, say, video encoding.

Symmetry · on Dec 20, 2020

It doesn't take exponential power. My understanding is that the basic approach for instructions without boundary tagging in L1I$ is to start decoding every byte in the stream in parallel, discard the ones that don't make sense, and then later propagate boundary to boundary across the length of the fetch window. Sort of similar to how a carry-bypass adder works. This is expensive but not that expensive compared to other structures.
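A minimal sketch of that two-step idea, in Python rather than hardware. The encoding is a made-up toy (the low two bits of an instruction's first byte give its length, 1 to 4 bytes), not x86, and `toy_length` and `find_boundaries` are hypothetical names. Step 1 is independent per byte, so it is the part that would run in parallel; step 2 is the boundary-to-boundary chain.

```python
def toy_length(first_byte):
    # Hypothetical toy encoding (NOT x86): low 2 bits of the first byte
    # encode an instruction length of 1..4 bytes.
    return (first_byte & 0b11) + 1


def find_boundaries(window, start=0):
    """Return (instruction start offsets, bytes spilling past the window)."""
    # Step 1: speculatively compute a length at every byte offset, as if an
    # instruction started there. Each entry depends on one byte only, so in
    # hardware all of them could be computed at once.
    lengths = [toy_length(b) for b in window]

    # Step 2: starting from a known boundary, hop boundary to boundary.
    # Speculative lengths at offsets that are never visited are discarded.
    boundaries = []
    pos = start
    while pos < len(window):
        boundaries.append(pos)
        pos += lengths[pos]
    return boundaries, pos - len(window)


if __name__ == "__main__":
    window = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF, 0x03, 0x00, 0x00, 0x00])
    print(find_boundaries(window))
```

The loop in step 2 is written sequentially here for readability; the carry-bypass comparison is about shortening exactly that chain in a real design.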

But it does mean that x86 designs tend to carefully balance the size of the decoders to other structures to make sure they're not the binding constraint too often. With ARM the approach seems to be more to make the front end 50% bigger than you think you need to be sure it's never a problem and refill the front end buffers more quickly after a mispredict.

psykotic · on Dec 21, 2020

Yeah, the algorithm for parallel decoding you outlined scales linearly in area and power with respect to the speculative look-ahead depth. This is true even if you speculate on more than the per-byte "boundary or not boundary" condition. A parallel-prefix circuit for processing a DFA with m states where you speculate on all m possible initial states for each of n bytes "only" consumes O(m n) power. [1] In absolute terms this is obviously still a problem as you crank up m or n, but the scaling is certainly not exponential. You do see local exponential scaling if the state space is large enough (think of minimax search in chess) but for these decoding problems the state space is tiny and you don't even need to speculate over all possible states (e.g. you're not going to decode all possible combinations of 4 instructions per cycle, only certain prefixes, etc).

[1] The Hillis-Steele paper on data-parallel algorithms from 1986 describes this algorithm for parallel lexing.
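A rough sketch of the parallel-prefix idea, under assumptions of my own: the same kind of toy encoding (low two bits of the first byte give a length of 1 to 4), and a DFA whose state is "bytes left to skip before the next instruction start", so m = 4. Each byte becomes a transition table over all m possible incoming states, and because composing tables is associative, the tables can be combined with a scan. All names here are hypothetical. The simple Hillis-Steele style scan below does more total work than a work-efficient scan would; it is only meant to show the speculation over every start state.

```python
M = 4  # states 0..3: bytes still to skip before the next instruction start


def transition_table(byte):
    # Toy encoding (assumption, not x86): low 2 bits give length 1..4.
    length = (byte & 0b11) + 1
    # State 0 means "this byte starts an instruction": skip length-1 more.
    # State k > 0 means "inside an instruction": one fewer byte to skip.
    return (length - 1,) + tuple(k - 1 for k in range(1, M))


def compose(first, second):
    # Table for "apply `first`, then `second`". Associative, which is what
    # makes a parallel scan valid.
    return tuple(second[s] for s in first)


def scan_transitions(tables):
    # Inclusive scan: result[i] is the composition of tables[0..i].
    # Each round's list comprehension is independent per element, so a
    # round could be evaluated in parallel; there are about log2(n) rounds.
    prefix = list(tables)
    d = 1
    while d < len(prefix):
        prefix = [
            prefix[i] if i < d else compose(prefix[i - d], prefix[i])
            for i in range(len(prefix))
        ]
        d *= 2
    return prefix


def boundaries_via_scan(window, initial_state=0):
    prefix = scan_transitions([transition_table(b) for b in window])
    # State seen *before* byte i is the prefix through byte i-1 applied to
    # the initial state.
    before = [initial_state] + [p[initial_state] for p in prefix[:-1]]
    before = before[:len(window)]
    return [i for i, s in enumerate(before) if s == 0]


if __name__ == "__main__":
    window = bytes([0x02, 0xAA, 0xBB, 0x00, 0x01, 0xFF, 0x03, 0x00, 0x00, 0x00])
    print(boundaries_via_scan(window))
```

Each table has m entries and there is one table per byte, which is where the m-by-n factor in the cost comes from; the number of states does not grow with the look-ahead length.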

FullyFunctional · on Dec 21, 2020

You are right, it's x^2, not 2^x. However, it's a bit worse still because the area grows too, which either hurts your timing (longer distances) or forces more stages (power, yet more area, and mispredict penalty).

It simply isn't practically scalable much beyond where we are; if it were, you can be sure Intel would have scaled it instead of using µOP caches.

psykotic · on Dec 21, 2020

The power scales linearly, not quadratically with the amount of look-ahead. The "m" is the number of states you're speculating on which doesn't grow with look-ahead length. In the case where you're just speculating on whether an instruction starts at a given byte offset, you would have m = 2.

I don't think anyone is saying they could scale up the decoder "for free". If they had a fixed-length ISA, I'm sure they would have increased the decoder width sooner (and using different techniques), since with high-end out-of-order cores you're always looking for cheap ways to over-provision your pipeline even if it only helps on some workloads some of the time. Their current use of the uop cache tells us that they consider it the most economical trade-off at that point in the design space (where the decoder can output up to 4 instructions and the uop cache can output up to 6 instructions); you can't infer that they've hit an impassable brick wall with instruction decoding.

amelius · on Dec 20, 2020

In any case, it would be relatively simple for Intel/AMD engineers to evaluate the effect of different parameters using their quantitative analysis tools, which include an emulation environment. I don't think it makes much sense to speculate here about these parameters.