
I wonder how much this is a result of Strix Halo. I had a fairly standard stipend for a work computer that I didn't end up using for a while, so I recently cashed it in on the EVO-X2 and fuck me sideways: that thing is easily competitive with the mid-range znver5 EPYC machines I run substitors on. It mops the floor with any mere-mortal EC2 or GCE instance. Maybe some r1337.xxxxlarge.metal.metal or something has an edge, but it blows away the z1d.metal and c6.2xlarge type stuff (fast cores, good NIC, table stakes). And those things are $3-10K a month with heavy provisioned IOPS. This thing has real NVMe and it cost $1800.

I haven't done much local inference on it, but various YouTubers are starting to call the DGX Spark overkill / overpriced next to Strix Halo. The catch of course is ROCm isn't there yet (they're seeming serious now though, matter of time).

Flawless CUDA on Apple gear would make it really tempting, in a way that isn't true with Strix being this cheap and this good.





For the uninitiated, Strix Halo is the same as the AMD Ryzen AI Max+ 395 which will be in the Framework Desktop and is starting to show up in some mini PCs as well.

The memory bandwidth on that thing is 200GB/s. That's great compared to most other consumer-level x86 platforms, but quite far off an Nvidia GPU (a 5090 has 1792GB/s; dunno about the pro-level cards) or even Apple's best (the M3 Ultra has 800GB/s).

It certainly seems like a great value. But for memory bandwidth intensive applications like LLMs, it is just barely entering the realm of "good enough".


You're comparing theoretical maximum memory bandwidth. It's not enough to only look at memory bandwidth, because you're a lot more likely to be compute-limited when you have a lot of memory bandwidth available. For example, the M1 had so much bandwidth available that it couldn't make use of all of it even when fully loaded.

Memory bandwidth puts an upper limit on LLM tokens per second.

At 200GB/s, that upper limit is not very high at all. So it doesn't really matter if the compute is there or not.
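
As a rough back-of-the-envelope sketch of that ceiling (the model size, quantization, and bandwidth figures below are illustrative assumptions, not benchmarks): at batch size 1, every generated token has to stream essentially all of the active weights from memory once, so bandwidth divided by model size bounds tokens per second.

    # Upper bound on decode speed at batch size 1: each token streams
    # all active model weights from memory at least once.
    def max_tokens_per_sec(bandwidth_gb_s, model_size_gb):
        return bandwidth_gb_s / model_size_gb

    # Illustrative: a 70B dense model at 4-bit is roughly 40GB of weights.
    print(max_tokens_per_sec(200, 40))   # Strix Halo-class:  ~5 tok/s ceiling
    print(max_tokens_per_sec(800, 40))   # M3 Ultra-class:   ~20 tok/s ceiling
    print(max_tokens_per_sec(1792, 40))  # 5090-class:       ~45 tok/s ceiling

Real numbers land below these ceilings once KV-cache traffic and overhead are counted, but the ranking doesn't change.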


The M1 Max's GPU can only make use of about 90GB/s out of the 400GB/s they advertise/support. If the AMD chip can make better use of its 200GB/s then, as you say, it will manage to have better LLM tokens per second. You can't just look at what has the wider/faster memory bus.

https://www.anandtech.com/show/17024/apple-m1-max-performanc...


This mainly shows that you need to watch out when it comes to unified architectures. The sticker bandwidth might not be what you can get for GPU-only workloads. Fair point. Duly noted.

But my overarching point still stands: LLM inference needs memory bandwidth, and 200GB/s is not very much (especially for the higher-RAM variants).

If the M1 Max can actually only use 90GB/s, that just means it's a poor choice for LLM inference.


GPUs have both the bandwidth and the compute. During token generation, no compute is needed. But both Apple silicon and Strix Halo fall on their face during prompt ingestion, due to lack of compute.

Compute (and lots of it) is absolutely needed for generation - tens of billions of FLOPs per token on the smaller models (7B) alone - with the compute for larger models scaling proportionally.

Each token requires a forward pass through all transformer layers, involving large matrix multiplications at every step, followed by a final projection to the vocabulary.
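
As a rough sketch of that claim (using the usual ~2 FLOPs-per-parameter-per-token rule of thumb; the parameter counts are just the figures mentioned in this thread):

    # Rough decode compute for a dense model: each weight participates in one
    # multiply-accumulate per token, so FLOPs/token ~= 2 * parameter count
    # (ignoring attention over the KV cache, which grows with context length).
    def flops_per_token(params_billion):
        return 2 * params_billion * 1e9

    print(flops_per_token(7))    # ~1.4e10 -> tens of billions for a 7B model
    print(flops_per_token(104))  # ~2.1e11 for a 104B dense model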


Obviously I don't mean literally zero compute. The amount of compute needed scales with the number of parameters, but I have yet to use a model that has so many parameters that token generation becomes compute bound. (Up to 104B for dense models.) During token generation most of the time is spent idle waiting for weights to transfer from memory. The processor is bored out of its mind waiting for more data. Memory bandwidth is the bottleneck.
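
A quick sketch of that bottleneck, comparing the time to stream the weights against the time to do the math; the bandwidth and FLOP rates here are assumed ballpark figures for a Strix Halo-class part, purely for illustration:

    # Batch size 1 decode: time spent streaming weights vs. doing the math.
    # Hardware numbers are illustrative assumptions, not measured specs.
    model_bytes = 40e9    # ~70B dense params at 4-bit
    params      = 70e9
    bandwidth   = 200e9   # bytes/s
    compute     = 30e12   # sustained low-precision FLOP/s (assumed)

    t_memory  = model_bytes / bandwidth   # ~0.20 s per token
    t_compute = 2 * params / compute      # ~0.005 s per token

    print(t_memory / t_compute)  # memory traffic dominates by ~40x at batch 1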

It sounds like you aren’t batching efficiently if you are being bound by memory bandwidth.

That’s right: in the context of Apple silicon and Strix Halo, these use cases don’t involve much batching.
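
For what it's worth, a sketch of why batching changes the picture (same assumed numbers as above): each weight fetched from memory is reused for every sequence in the batch, so compute scales with batch size while weight traffic stays flat, and at some batch size the bottleneck flips.

    # Weight traffic is fixed per step; compute grows with batch size.
    def bottleneck(batch, model_bytes=40e9, params=70e9,
                   bandwidth=200e9, compute=30e12):
        t_memory  = model_bytes / bandwidth
        t_compute = batch * 2 * params / compute
        return "compute-bound" if t_compute > t_memory else "memory-bound"

    print(bottleneck(1))   # memory-bound: the single-user local case
    print(bottleneck(64))  # compute-bound: the batched server case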

Apple is just being stupid, handicapping their own hardware so they can sell the fixed one next year or the year after

This time-tested Apple strategy is now undermining their AI strategy and potential competitiveness.

tl;dr they could have done 1600GB/s


So their products are so much better, in customer-demand terms, that they don’t need to rush tech out the door?

Whatever story you want to create, if customers are happy year after year then Apple is serving them well.

Maybe not with the same feature-dimension balance you want, or other artificial/wishful balances you might make up for them.

(When Apple drops the ball it is usually painful, painfully obvious, and most often a result of a deliberate and transparent priority tradeoff. No secret switcheroos or sneaky downgrading. See: the Mac Pro for years…)


Apple is absolutely fumbling on their AI strategy despite their vertical hardware integration; there is no strategy. It's a known problem inside Apple, not a 4-D chess thing to wow everyone with a refined version in 2030.

They could have shipped a B200 too. Obviously there are reasons they don't do that.

This was nice to read, I ordered an EVO-X2 a week ago though I'm still waiting for them to actually ship it - I was waiting on a DGX Spark but ended up deciding that was never actually going to ship. Got any good resources for getting the thing up and running with LLMs, diffusion models etc.?

However excited you are, it's merited. Mine took forever too, and it's just completely worth it. It's like a flagship halo product; they won't make another one like this for a while, I don't think. You won't be short on compute relative to a trip to Best Buy for many years.

It’s pretty explicitly targeting cloud cluster training in the PR description.

If we believe that there’s not enough hardware to meet demand, then one could argue this helps Apple meet demand, even if it’s just by a few percentage points.

Do you need to copy to load a model from CPU memory into GPU memory?

How is it vs the M4 Mac mini?

> The catch of course is ROCm isn't there yet (they're seeming serious now though, matter of time).

Competitive AMD GPU neural compute has been "any day now" for at least 10 years.


The inference side is fine nowadays. llama.cpp has had a GPU-agnostic Vulkan backend for a while; it's the training side that tends to be a sticking point for consumer GPUs.
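
For anyone wanting to try the Vulkan path, here's a minimal sketch using the llama-cpp-python bindings. It assumes the package was built with the Vulkan backend enabled (e.g. CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python; the flag name can differ across llama.cpp versions) and that a quantized GGUF model sits at ./model.gguf, which is a hypothetical path.

    # Minimal llama-cpp-python sketch: offload all layers to the GPU.
    # Assumes a Vulkan-enabled build and a local GGUF model (hypothetical path).
    from llama_cpp import Llama

    llm = Llama(model_path="./model.gguf", n_gpu_layers=-1, n_ctx=4096)
    out = llm("Explain why LLM decoding is memory-bound in one sentence.",
              max_tokens=64)
    print(out["choices"][0]["text"])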


