FPGAs for native FP4 will change the entire landscape

blacklion · on March 27, 2024

Entire landscape of open graphic chips?

Not every GPU should be used to train or infer so-called AI.

Please, stop, we need some hardware to put images on the screens.

Y_Y · on March 27, 2024

Four-bit floats are not as useful as Nvidia would have you believe. Like structured sparsity, they're mainly a trick to make newer-gen cards look faster in the absence of an improvement in the underlying tech. If you're using them for NN inference, you have to carefully tune the weights to get good accuracy, and they offer nothing over fixed-point.
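
To make the comparison with fixed-point concrete, here is a rough sketch that enumerates what a four-bit float can represent. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1); the comment above doesn't name a specific format, so treat that layout and the helper name `decode_e2m1` as assumptions for illustration.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit code laid out as sign(1) | exponent(2) | mantissa(1).

    Assumes exponent bias 1 and no inf/NaN encodings (an assumed layout).
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal range: no implicit leading 1, so only 0 and 0.5.
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# +0 and -0 compare equal, so 16 codes give 15 distinct values.
fp4_values = sorted({decode_e2m1(c) for c in range(16)})

# A signed 4-bit fixed-point (integer) grid for comparison: 16 evenly spaced values.
int4_values = list(range(-8, 8))

if __name__ == "__main__":
    print("fp4 :", fp4_values)
    print("int4:", int4_values)
```

Under that assumption the float grid is denser near zero and coarser at the extremes, while the integer grid is uniform and has one more usable code. Whether that spacing helps depends on the weight distribution.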

imtringued · on March 28, 2024

The actual problem is that nobody uses these low-precision floats for training their models. When you do quantization, you are merely compressing the weights to minimize memory usage and to use memory bandwidth more efficiently. You still have to run the model at the original precision for the calculations, so nobody gives a damn about the low-precision floats for now.

Y_Y · on March 29, 2024

That's not entirely true. Current-gen Nvidia hardware can use fp8 and newly announced Blackwell can do fp4. Lots of existing specialized inference hardware uses int8 and some int4.

You're right that low-precision training still doesn't seem to work, presumably because you lose the smoothness required for SGD-type optimization.

jsheard · on March 27, 2024

Very briefly, until someone makes an ASIC that does the same thing and FPGAs are relegated to niche use-cases once again.

FPGAs only make long-term sense in applications that are so low-volume that it's not worth spinning an ASIC for them.

iAkashPaul · on March 27, 2024

Absolutely

imtringued · on March 28, 2024

How? NPUs are going to be included in every PC in 2025. The only differentiators will be how much SRAM and memory bandwidth you have or whether you use processing in memory or not. AMD is already shipping APUs with 16 TOPS or 4 TFLOPS (bfloat16) and that is more than enough for inference considering the limited memory bandwidth. Strix Halo will have around 12 TFLOPS (bfloat16) and four memory channels.

llama.cpp already supports 4-bit quantization. They unpack the quantization back to bfloat16 at runtime for better accuracy. The best use case for an FPGA I have seen so far was to pair it with SK Hynix's AI GDDR, and even that could be replaced by an even cheaper inference chip specializing in multi-board communication and as many memory channels as possible.
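
As a minimal sketch of the "store in 4 bits, compute in higher precision" idea: the scheme below is a plain symmetric per-block quantizer, not llama.cpp's actual on-disk format, and the function names are hypothetical. It uses ordinary Python floats in place of bfloat16.

```python
def quantize_block(weights):
    """Map a block of floats to 4-bit codes (0..15) plus one float scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Round to a signed integer in [-8, 7], then shift to an unsigned code.
    codes = [max(-8, min(7, round(w / scale))) + 8 for w in weights]
    return scale, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [8]  # pad with the code that decodes to zero
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize_block(scale, packed, n):
    """Unpack nibbles and rescale to floats before doing any arithmetic."""
    out = []
    for b in packed:
        out.append(((b & 0x0F) - 8) * scale)
        out.append(((b >> 4) - 8) * scale)
    return out[:n]


if __name__ == "__main__":
    block = [0.7, -0.35, 0.0, 0.1, 0.25]
    scale, codes = quantize_block(block)
    packed = pack_nibbles(codes)
    print(len(packed), "bytes for", len(block), "weights")
    print(dequantize_block(scale, packed, len(block)))
```

The point is that the 4-bit codes only live in memory; the multiply-accumulate still happens on the dequantized values, which is why memory bandwidth rather than low-precision math is the thing being saved here.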

luma · on March 27, 2024

How so?

CamperBob2 · on March 27, 2024

4-bit values (or 6-bit values, nowadays) are interesting because they're small enough to address a single LUT, which is the lowest-level atomic element of an FPGA. That gives them major advantages in the timing and resource-usage departments.
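
A toy software model of that idea, assuming a LUT with 4 inputs and 1 output bit (so a 4-bit-in, 4-bit-out function takes one LUT per output bit). The helper names and the example function are made up for illustration; real tools infer this from HDL.

```python
def build_luts(func, in_bits=4, out_bits=4):
    """Build one truth table per output bit; each has 2**in_bits entries."""
    size = 1 << in_bits
    return [[(func(x) >> bit) & 1 for x in range(size)] for bit in range(out_bits)]


def eval_luts(luts, x):
    """Use the input value directly as the address into every table."""
    return sum(table[x] << bit for bit, table in enumerate(luts))


# Any function of a 4-bit operand works; here, the low 4 bits of x squared.
square_low = build_luts(lambda x: (x * x) & 0xF)

if __name__ == "__main__":
    for x in range(16):
        print(x, eval_luts(square_low, x))
```

However complicated the function is, evaluating it is a single table lookup per output bit, which is where the timing and resource argument comes from. With 6-input LUTs the same trick covers 64-entry tables.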

iAkashPaul · on March 27, 2024

Reduced memory requirements and dropping higher-precision IP blocks, for starters.