
> The frontend decoder hasn't been a frequent bottleneck in Intel cores for a long time and they could scale it up more aggressively if they wanted.

This isn't grounded in any facts. Decoding the variable-length x86 ISA costs you exponentially in decoding width, in both power and area. You can scale it, but it will never be efficient. The way Intel and AMD combat this is by having a decoded uop cache, from which the issue width is typically twice that of the frontend decoder. Arm64 has an inherent advantage here (RISC-V does not have quite the same advantage, as RV64GC instructions are a mix of 16- and 32-bit). Arm64 is also a much more recent design than x86_64, one that learned a lot from past experience and isn't bogged down by a lot of useless legacy. This helps.

Arm64 is rather large for a RISC ISA, but it's mostly pretty good (however IMO RISC-V's lack of flags and implementation of conditional branches is superior).



Of course a fixed-length ISA has an inherent advantage for parallel decoding efficiency. The question is whether that is a decisive advantage in M1's impressive performance. After Intel refined their decoder and uop cache, you virtually never see that part of the frontend as a bottleneck when doing microarchitectural profiling. That's been true since Sandy Bridge but even more so since Skylake.

All the legacy junk in x86 is obviously a pain for Intel to support. Any blank-slate ISA is going to have an advantage there.


Presumably the importance of decode bandwidth depends on what you're decoding.

Most classic computationally intensive work (video encoding, science, but also benchmarks) spends its time in fairly tight loops or small kernels, running over large data. uop caches make decode bandwidth irrelevant here.

But general usage of a machine sees the instruction pointer wander all over the place (particularly if you have multiple tabs of JavaScript open). More decode bandwidth means more performance here.

Are compilers an example of a heavy workload with a large hot code size? It would be interesting to compare the M1's advantage in compiling to its advantage in, say, video encoding.


It doesn't take exponential power. My understanding is that the basic approach for instructions without boundary tagging in L1I$ is to start decoding every byte in the stream in parallel, discard the ones that don't make sense, and then later propagate boundary to boundary across the length of the fetch window. Sort of similar to how a carry-bypass adder works. This is expensive but not that expensive compared to other structures.
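To make the two steps concrete, here is a minimal sketch in Python. It is not real x86 length decoding: it assumes a made-up toy encoding where the low two bits of an instruction's first byte give its length (1 to 4 bytes), and the names toy_length and find_boundaries are hypothetical. Step 1 is the part hardware does for every byte at once; step 2 is the boundary-to-boundary propagation, written here as a plain sequential walk.

```python
def toy_length(stream: bytes, i: int) -> int:
    # ASSUMPTION: toy encoding, not x86. The low two bits of the first
    # byte of an instruction encode its length as 1..4 bytes.
    return (stream[i] & 0b11) + 1


def find_boundaries(stream: bytes) -> list[int]:
    n = len(stream)
    # Step 1: speculatively length-decode at EVERY byte offset.
    # In hardware these n decodes are independent and run in parallel;
    # most of the results belong to offsets that are not real starts.
    lengths = [toy_length(stream, i) for i in range(n)]

    # Step 2: propagate from one known boundary to the next, keeping
    # only the speculative results that sit on a real instruction start.
    starts = []
    i = 0  # offset 0 is assumed to be a known instruction start
    while i < n:
        starts.append(i)
        i += lengths[i]
    return starts


window = bytes([0x01, 0xAA, 0x00, 0x03, 0x01, 0x02, 0x03, 0x00])
print(find_boundaries(window))
```

The work in step 1 grows linearly with the fetch window; the sequential chain in step 2 is what a real design would shorten with carry-bypass-style logic rather than walk one hop at a time.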

But it does mean that x86 designs tend to carefully balance the size of the decoders to other structures to make sure they're not the binding constraint too often. With ARM the approach seems to be more to make the front end 50% bigger than you think you need to be sure it's never a problem and refill the front end buffers more quickly after a mispredict.


Yeah, the algorithm for parallel decoding you outlined scales linearly in area and power with respect to the speculative look-ahead depth. This is true even if you speculate on more than the per-byte "boundary or not boundary" condition. A parallel-prefix circuit for processing a DFA with m states where you speculate on all m possible initial states for each of n bytes "only" consumes O(m n) power. [1] In absolute terms this is obviously still a problem as you crank up m or n, but the scaling is certainly not exponential. You do see local exponential scaling if the state space is large enough (think of minimax search in chess) but for these decoding problems the state space is tiny and you don't even need to speculate over all possible states (e.g. you're not going to decode all possible combinations of 4 instructions per cycle, only certain prefixes, etc).

[1] The Hillis-Steele paper on data-parallel algorithms from 1986 describes this algorithm for parallel lexing.


You are right, it's x^2 not 2^x. However, it's a bit worse still because the area grows too, which either hurts your timing (longer distances) or forces more stages (power, yet more area, and mispredict penalty).

It simply isn't practically scalable much beyond where we are; if it were, you can be sure Intel would have scaled it instead of using µOP caches.


The power scales linearly, not quadratically, with the amount of look-ahead. The "m" is the number of states you're speculating on, which doesn't grow with look-ahead length. In the case where you're just speculating on whether an instruction starts at a given byte offset, you would have m = 2.

I don't think anyone is saying they could scale up the decoder "for free". If they had a fixed-length ISA, I'm sure they would have increased the decoder width sooner (and using different techniques), since with high-end out-of-order cores you're always looking for cheap ways to over-provision your pipeline even if it only helps on some workloads some of the time. Their current use of the uop cache tells us that they consider it the most economical trade-off at that point in the design space (where the decoder can output up to 4 instructions and the uop cache can output up to 6 instructions); you can't infer that they've hit an impassable brick wall with instruction decoding.


In any case, it would be relatively simple for Intel/AMD engineers to evaluate the effect of different parameters using their quantitative analysis tools, which include an emulation environment. I don't think it makes much sense to speculate here about these parameters.



