> with a vanishingly small fraction of flops and a small fraction of memory bandwidth

Is it though?

Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPS of FP32. So the M3 Ultra is about 1/3 the speed of a Blackwell.
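A quick sanity check of that ratio, plugging in the figures quoted above (14 TFLOPS per M3 Max die, 80 TFLOPS for Blackwell):

```python
m3_max_fp32 = 14.0               # TFLOPS, M3 Max FP32, per Wikipedia [1]
m3_ultra_fp32 = 2 * m3_max_fp32  # M3 Ultra = two M3 Max dies -> 28 TFLOPS
blackwell_fp32 = 80.0            # TFLOPS, FP32, per nVidia's datasheet [2]

ratio = m3_ultra_fp32 / blackwell_fp32
print(f"M3 Ultra / Blackwell FP32: {ratio:.2f}")  # 0.35, i.e. about 1/3
```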

Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.

I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.

nVidia likes throwing around numbers like "20 petaFLOPS" for FP4, but that's not real floating point... it's closer to 1990s-vintage µ-law/A-law integer math.

[1] https://en.wikipedia.org/wiki/Apple_silicon#Comparison_of_M-...

[2] https://resources.nvidia.com/en-us-blackwell-architecture/da...

Edit: Further, most (all?) of the "Tensor FLOPS" numbers on nVidia datasheets have a little asterisk next to them saying they are "effective" TFLOPS using the sparsity feature, where half the elements of the matrix multiplication are zeroed.
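Concretely, with nVidia's 2:4 structured sparsity (two of every four weights zeroed), the "effective" datasheet number is simply double the dense figure. A minimal sketch, using the 1,750 TFLOPS dense-FP16 figure cited elsewhere in this thread:

```python
dense_fp16 = 1750.0                # TFLOPS, dense FP16 (figure quoted in the thread)
sparse_effective = 2 * dense_fp16  # the asterisked "with sparsity" datasheet number

print(sparse_effective)  # 3500.0
```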



TFLOPS means teraflops, not "tensor FLOPs".

Blackwell and modern AI chips are built for FP16. The B100 has 1,750 TFLOPS of FP16; the M3 Ultra has ~80 TFLOPS of FP16, or about 4.6% of a B100.
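Running the same arithmetic on those FP16 figures:

```python
b100_fp16 = 1750.0    # TFLOPS, dense FP16, figure quoted above
m3_ultra_fp16 = 80.0  # TFLOPS, FP16, approximate figure quoted above

share = m3_ultra_fp16 / b100_fp16
print(f"M3 Ultra / B100 FP16: {share:.1%}")  # 4.6%
```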



