VLIW was the best implementation (20 years ago) of instruction level parallelism...

jabl · on Sept 14, 2020

I agree there is merit in the idea of encoding instruction dependencies in the ISA. There have been a number of research projects in this area, e.g. wavescalar, EDGE/TRIPS, etc.

It's not only about reducing the need for figuring out dependencies at runtime, but you could also partly reduce the need for the (power hungry and hard to scale!) register file to communicate between instructions.

floatboth · on Sept 15, 2020

Main lesson: we failed to make all the software JIT-compiled or AOT-recompiled-on-boot or something, that would allow retargeting the optimizations for the new generation of a VLIW CPU. Barely anyone even tried. Well I guess in the early 2000s there was this vision that everything would be Java, which is JIT, but lol

dragontamer · on Sept 15, 2020

Your point seems invalid, in the face of a large chunk of HPC (neural nets, matrix multiplication, etc. etc.) getting rewritten to support CUDA, which didn't even exist back when Itanium was announced.

VLIW is a compromise product: its more parallel than a traditional CPU, but less parallel than SIMD/GPUs.

And modern CPUs have incredibly powerful SIMD engines: AVX2 and AVX512 are extremely fast and parallel. There are compilers that auto-vectorize code, as well as dedicated languages (such as ipsc) which work for SIMD.

Encoders, decoders, raytracers, and more have been rewritten for Intel AVX2 SIMD instructions, and then re-rewritten for GPUs. The will to find faster execution has always existed, but unfortunately, Itanium failed to perform as well as its competition.

floatboth · on Sept 16, 2020

I'm not talking about rewrites and GPUs. I'm saying we do not have dynamic recompilation of everything. As in – if we would have ALL binaries that run on the machine (starting with the kernel) stored in some portable representation like wasm (or not-fully-portable-but-still-reoptimizable like llvm bitcode) and recompiled with optimization for the current exact processor when starting. Only that would solve the "new generation of VLIW-CPU needs very different compiler optimizations to perform, oops all your binaries are for first generation and they are slow now" problem.

GPUs do work like this – shaders recompiled all the time – so VLIW was used in GPUs (e.g. TeraScale). But on CPUs we have a world of optimized, "done" binaries.

ATsch · on Sept 14, 2020

All of this hackery with hundreds of registers just to continue to make a massively parallel computer look like an 80s processor is what something like Itanium would have prevented. Modern processors ended up becoming basically VLIW anyway, Itanium just refused to lie to you.

dragontamer · on Sept 14, 2020

When standard machine code is written in a "Dependency cutting" way, then it scales to many different reorder registers. A system from 10+ years ago with only 100-reorder registers will execute the code with maximum parallelism... while a system today with 200 to 300-reorder buffers will execute the SAME code with also maximum parallelism (and reach higher instructions-per-clock tick).

That's why today's CPUs can have 4-way decoders and 6-way dispatch (AMD Zen and Skylake), because they can "pick up more latent parallelism" that the compilers have given them many years ago.

"Classic" VLIW limits your potential parallelism to the ~3-wide bundles (in Itanium's case). Whoever makes the "next" VLIW CPU should allow a similar scaling over the years.

-----------

It was accidental: I doubt that anyone actually planned the x86 instruction set to be so effectively instruction-level parallel. Its something that was discovered over the years, and proven to be effective.

Yes: somehow more parallel than the explicitly parallel VLIW architecture. Its a bit of a hack, but if it works, why change things?

anarazel · on Sept 14, 2020

I don't understand how an increase, including the implied variability, of CPU internal parallelism and VLIW benefits go together?

dragontamer · on Sept 14, 2020

I'm talking about a mythical / mystical VLIW architecture. Obviously, older VLIW designs have failed in this regards... but I don't necessarily see "future" VLIW processors making the same mistake.

Perhaps from your perspective, a VLIW architecture that fixes these problems wouldn't necessarily be VLIW anymore. Which... could be true.

moonchild · on Sept 14, 2020

Have you seen the mill cpu?

javajosh · on Sept 14, 2020

Has anyone?

dralley · on Sept 14, 2020

At the rate they're going, all the patents they've been filing will be expired by the time they get a chip out the door.