Is this still the case if I have different ALU operations. Say I have a single A...

BeeOnRope · on Dec 30, 2024

Yes, it applies to different operations. E.g., you could interleave two or three different operations, each with 3-cycle latency and 1-cycle inverse throughput, on the same port and get 1-cycle inverse throughput in aggregate across all of them. There is no restriction that they must be the same operation.

In some cases, mixing operations with _different_ latencies on the same execution port will leave you with less throughput than you expect due to "writeback conflicts", i.e., two instructions finishing on the same cycle. For example, a 2-cycle operation starting on cycle 0 and a 1-cycle operation starting on cycle 1 will both finish on cycle 2, and on some CPUs this conflict delays the result of one of the operations by 1 cycle.
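To make the conflict concrete, here is a minimal toy model of a single port, not a description of any real CPU. It assumes one op issues per cycle, only one result can write back per cycle, and the later op is the one that gets delayed; real hardware may resolve the conflict differently (for instance by delaying issue instead). The name `simulate_port` is made up for this sketch.

```python
def simulate_port(latencies):
    """Toy model: op i issues on cycle i and wants to write back at i + latency.

    Only one result may write back per cycle. If the slot is already taken,
    the later op slips to the next free cycle (an assumption of this sketch).
    Returns a list of (issue_cycle, writeback_cycle) pairs.
    """
    taken = set()          # writeback cycles already claimed
    schedule = []
    for issue, latency in enumerate(latencies):
        writeback = issue + latency
        while writeback in taken:
            writeback += 1  # conflict: result is delayed by a cycle
        taken.add(writeback)
        schedule.append((issue, writeback))
    return schedule


# Same latency everywhere: results come out one per cycle, no conflicts.
print(simulate_port([3, 3, 3]))
# Mixed latencies: the 2-cycle op and the 1-cycle op both target cycle 2.
print(simulate_port([2, 1]))
```

In this model the uniform 3-cycle stream writes back on cycles 3, 4, 5, while in the mixed case the 1-cycle op slips from cycle 2 to cycle 3.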

bjourne · on Dec 30, 2024

A modern ALU has multiple pipelined data paths for its operations. So maybe three adders with a one-cycle latency, two multipliers with a three-cycle latency, and one divider with a 16-cycle latency. Sustained throughput depends on the operation. Maybe one per cycle for add and multiply, but only one every eight cycles for divide.
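A rough back-of-the-envelope sketch of how latency and sustained throughput interact, using the illustrative numbers above (they are not measurements of any particular CPU). It assumes a single unit, an idealized pipeline, and no other bottlenecks; the function names are invented for this example.

```python
def cycles_independent(n, latency, issue_interval):
    """n independent ops on one unit: a new op starts every issue_interval
    cycles, and the last one finishes latency cycles after it starts."""
    return latency + (n - 1) * issue_interval


def cycles_dependent(n, latency):
    """n ops where each needs the previous result: no overlap is possible."""
    return n * latency


# Multiply: 3-cycle latency, one per cycle sustained.
print(cycles_independent(100, latency=3, issue_interval=1))   # 102
print(cycles_dependent(100, latency=3))                       # 300

# Divide: 16-cycle latency, one every eight cycles sustained.
print(cycles_independent(100, latency=16, issue_interval=8))  # 808
print(cycles_dependent(100, latency=16))                      # 1600
```

The gap between the two cases is why interleaving independent work matters: latency only dominates when each op has to wait for the one before it.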

barbegal · on Dec 30, 2024

You need to be far more specific than "x86". x86 is just the instruction set; the actual microarchitecture can vary massively between CPUs that implement the same instruction set.

In general, though, there is no penalty for interleaving operations.

Tuna-Fish · on Dec 30, 2024

You can interleave most operations however you like without any extra latency: each op starts as soon as all of its input results are ready, regardless of where they were computed.

There are some exceptions where "domain crossing", i.e., feeding a result into a very different kind of operation, costs extra clock cycles. The notable case is in vector registers, when an FP operation consumes the result of an integer operation or vice versa. Modern CPUs don't actually hold FP values in their IEEE 754 interchange format inside registers; instead, registers have hidden extra bits and store all values in normal form, which allows fast operations on denormals. The downside is that if you alternate between FP and INT operations, the CPU has to insert extra conversion ops between them. Typical cost is 1-3 cycles per domain crossing.
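As a minimal sketch of what that penalty does to a dependency chain: the model below simply adds a fixed number of cycles every time consecutive dependent ops sit in different domains. The per-op latencies and the 1-cycle penalty are illustrative assumptions (the penalty is the low end of the 1-3 cycle range mentioned above), and `chain_latency` is a made-up helper, not a real tool.

```python
def chain_latency(ops, crossing_penalty=1):
    """Total latency of a chain of dependent ops.

    ops is a list of (domain, latency) pairs, e.g. ("fp", 3).
    crossing_penalty cycles are added whenever an op consumes a result
    produced in a different domain (assumed constant for this sketch).
    """
    total = 0
    previous_domain = None
    for domain, latency in ops:
        if previous_domain is not None and domain != previous_domain:
            total += crossing_penalty
        total += latency
        previous_domain = domain
    return total


# Staying in one domain: just the sum of the latencies.
print(chain_latency([("fp", 3), ("fp", 3)]))              # 6
# Alternating int -> fp -> int: two crossings, two extra cycles.
print(chain_latency([("int", 1), ("fp", 3), ("int", 1)])) # 7
```

The penalty only shows up on the dependent path, so independent FP and integer work can still be interleaved freely.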