But 80 Gbit/s (~10 GB/s) is way slower than even regular dual-channel RAM, or am I missing something here? That would mean the LLM would be excruciatingly slow. You could get an old EPYC for a fraction of that price and have more performance.
If I'm not mistaken, producing each token requires reading roughly the whole model from memory (the exception being MoE models, where only the active experts are read). That's why memory bandwidth is so important in the first place, or not?
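Back of the envelope, just to show where my confusion comes from (the 70 GB weight size and the DDR5 figure below are only illustrative assumptions):

```python
# Rough numbers: decoding is usually memory-bandwidth-bound, because every
# generated token needs the active weights streamed from memory once.

def tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if bandwidth is the only limit."""
    return bandwidth_gb_s / weights_gb

# Hypothetical 70 GB of weights (e.g. a 70B dense model at 8-bit):
for name, bw in [("80 Gbit/s link", 10),         # 80 Gbit/s = 10 GB/s
                 ("dual-channel DDR5", 80),       # rough ballpark
                 ("M3 Ultra unified memory", 800)]:
    print(f"{name:25s} ~{tokens_per_sec(bw, 70):6.1f} tok/s")
```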
My understanding is that if you can store 1/Nth of the weights in RAM on each of the N nodes then there's no need to send the weights over the network.
You're correct about the weights: each machine could in fact store all of the weights. However, I think you still have to transfer the activations and the KV cache while performing inference.
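For a sense of scale, here's a minimal sketch assuming a pipeline-style split where each box keeps the KV cache for its own layers, so what crosses the link per token is mostly the activation vector at the layer boundary (the hidden size and link speed are assumptions for illustration):

```python
# Minimal sketch: per-token traffic over the interconnect when the layers are
# split across two machines (pipeline style). Illustrative assumptions, not
# measured values.

hidden_size = 7168            # assumed model dimension (DeepSeek-V3-class)
bytes_per_value = 2           # bf16/fp16 activations

activation_bytes = hidden_size * bytes_per_value      # ~14 KiB per token
link_bytes_per_sec = 80e9 / 8                         # 80 Gbit/s ≈ 10 GB/s

print(f"~{activation_bytes / 1024:.0f} KiB over the wire per token")
print(f"link-only ceiling: ~{link_bytes_per_sec / activation_bytes:,.0f} tok/s")
```

So the link carries kilobytes per token, not gigabytes; it's the local memory bandwidth on each box that sets the pace.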
It's good enough to run whatever local model you want. 2x 80-core GPUs is no joke. Linking them together gives you effectively 1.6 TB/s of aggregate memory bandwidth and 1 TB of total memory.
You can run the full DeepSeek 671B Q8 model at 40 tokens/s, and the Q4 model at 80 tokens/s, because R1 is MoE and only 37B params are active at a time.
Linking 2 of these together lets you run a model (R1) more capable than GPT-4o at a comfortable speed at home. That was simply fantasy a year ago.
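Those token/s figures line up with a simple bandwidth-bound estimate (a rough sketch; it assumes the two machines' bandwidth aggregates cleanly, as in a tensor-parallel split, and ignores KV-cache reads and interconnect latency):

```python
# Sanity check of the 40 / 80 tok/s claims as bandwidth-bound ceilings.

active_params = 37e9          # R1 is MoE: ~37B params active per token
aggregate_bw = 1.6e12         # 2x M3 Ultra at ~800 GB/s each

for label, bytes_per_param in [("Q8", 1.0), ("Q4", 0.5)]:
    bytes_per_token = active_params * bytes_per_param
    print(f"{label}: ~{aggregate_bw / bytes_per_token:.0f} tok/s ceiling "
          f"({bytes_per_token / 1e9:.1f} GB read per token)")
```

That gives ~43 tok/s at Q8 and ~86 tok/s at Q4 as upper bounds, so the claimed numbers are in the right ballpark.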
> with a vanishingly small fraction of flops and a small fraction of memory bandwidth
Is it though?
Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPS of FP32. So an M3 Ultra is roughly 1/3 the speed of a Blackwell.
Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.
I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.
nVidia likes throwing around numbers like "20 petaFLOPS" for FP4, but that's not real floating point... it's just 1990s-vintage uLaw/aLaw integer math.
Edit: Further, most (all?) of the TFLOPS numbers you see on nVidia datasheets for "Tensor FLOPS" have a little asterisk next to them saying they are "effective" TFLOPS using the sparsity feature, where half the elements of the matrix multiplication are zeroed.
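If you want to de-rate those datasheet numbers yourself, a rough rule of thumb (my reading of the footnotes, not anything official) is:

```python
# The "with sparsity" figures assume 2:4 structured sparsity, i.e. half the
# weight-matrix entries are zero, which doubles the quoted throughput
# relative to dense math.

def dense_tflops(sparse_tflops: float) -> float:
    """Back out the dense number from a '2:4 sparsity' marketing figure."""
    return sparse_tflops / 2

# Illustrative only: a "20 PFLOPS FP4 with sparsity" headline is ~10 PFLOPS
# dense FP4, and the FP8/FP16 figures roughly halve again per step up in width.
print(dense_tflops(20_000), "dense FP4 TFLOPS from a 20 PFLOPS sparse figure")
```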