Hacker News
Apple M5 could ditch unified memory architecture for split CPU and GPU designs (notebookcheck.net)
118 points by akyuu on Dec 30, 2024 | hide | past | favorite | 79 comments


> That being said, such a split design may help with improved AI inferencing.

Unified memory is the only reason Macs are so coveted right now for local AI. A single Mac with 192GB of RAM costs less than the equivalent amount of VRAM in standalone GPUs.


The execution speed for LLM inference gets very slow once you reach models large enough to fill a 64GB machine. I was tempted too, but then realized it was unusable past roughly 48GB, and that stacking used 3090s was the best price/perf/VRAM ratio.

What are the good use cases for very large memory amounts?


Mixture of Expert models, where all parameters must be in memory but only a subset are accessed per token, are a sweet spot for Macs.

DeepSeek v3, for instance, has 671B params, but at batch size one it only needs the memory bandwidth of a 37B dense model, since only 37B parameters are active per token.
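A rough sketch of that bandwidth math (the 4-bit quantization level and the ~800 GB/s bandwidth figure below are my own illustrative assumptions, not numbers from the comment):

```python
# Back-of-envelope for the MoE claim above. Batch-size-1 decoding is roughly
# memory-bandwidth bound, and an MoE model only reads its *active* params per token.

def decode_tok_s(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Theoretical upper bound on tokens/sec for batch-size-1 decoding."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# DeepSeek v3: 671B total params, ~37B active per token.
# Assumed: 4-bit quantization, ~800 GB/s bandwidth (M2 Ultra class).
weights_gb = 671 * 4 / 8            # GB needed just to hold all the weights
tok_s = decode_tok_s(37, 4, 800)    # bound set by the 37B active params only

print(f"weights: ~{weights_gb:.0f} GB, decode bound: ~{tok_s:.0f} tok/s")
```

So you need ~335GB to hold the model, but per-token you only pay the bandwidth cost of the active 37B, which is why big-unified-memory Macs are the sweet spot here.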


Any model you'd recommend for a Mac with 48 GB RAM?


48GB is maybe just enough to squeeze a quantized 70B model in, like Llama 3.3, but you'll need to raise the GPU memory allocation limit [1] and it might not be super fast.

You could also try Qwen 2.5 32B, which should just work with Ollama or LM Studio with no config changes.

I've got a 32GB M1 Max and a 24GB 4090, and I barely ever run models on my Mac, as the memory bandwidth and compute for prefill are much better on the 4090. But I'm essentially locked out of Llama 3 70B class models, which I only use via API.

[1] See: https://www.reddit.com/r/LocalLLaMA/comments/186phti/m1m2m3_...
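To make the "just enough to squeeze it in" constraint concrete, here's a back-of-envelope sketch. The ~75% default GPU allocation and the ~4.5 bits/weight (q4_K_M-style) quant size are my assumptions, not figures from the thread:

```python
# Rough memory math for a quantized 70B on a 48GB Mac. The ~75% default GPU
# allocation and ~4.5 bits/weight (q4_K_M-style) figures are assumptions.

ram_gb = 48
default_limit_gb = ram_gb * 0.75   # macOS caps GPU-wired memory around 75% by default
raised_limit_gb = ram_gb - 8       # raised cap, leaving ~8GB for the OS

weights_gb = 70 * 4.5 / 8          # weight memory alone, before KV cache/overhead

print(f"weights ~{weights_gb:.1f}GB vs default {default_limit_gb:.0f}GB / raised {raised_limit_gb}GB")
```

The weights alone (~39GB) already exceed the default allocation, which is why raising the limit per [1] is necessary, and why there's little headroom left for long contexts.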


It's slower than using just GPU RAM, but it's still faster than a regular PC, which has much more limited bandwidth between main memory and the GPU. It's a middle ground between how fast and how cheap you can do inference with LLMs that don't fit into a consumer GPU's VRAM.


It's definitely usable past 48GB. I have a 96GB M2 Max and regularly run models that use around 70GB, and they're very usable.

I also have a home server with 2x 3090 and 2x A4000 (80GB of VRAM). Yes, it's a lot faster, but it's a pain in the ass to build, it takes up a lot of space, it uses 10x the power, and honestly, it cost about the same as my MacBook Pro.


I ran all my research papers (about LLMs) on a Mac Studio.


Unusable in what way? Llama 3.3 70B q8 with 100k+ context runs as well as GPT-3.5 did a couple of years back, except it's local on a Mac and smarter.


Large context has a different performance profile than a large model. OP was likely thinking of running 400B models and finding the compute wasn't enough to make the memory useful.


Look on YouTube at people running various models: even a 70B model does a slow 3 tok/s.
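That figure is roughly what bandwidth alone would predict. A sketch, assuming Apple's commonly cited ~800 GB/s (M2 Ultra) and ~273 GB/s (M4 Pro) figures and a ~70GB q8 model (my assumptions, not benchmarks):

```python
# Upper bound on decode speed from bandwidth alone: each generated token
# reads every weight once, so tok/s <= bandwidth / model size.

def bound_tok_s(model_gb, bandwidth_gb_s):
    return bandwidth_gb_s / model_gb

model_gb = 70  # a 70B model at ~8 bits/weight
m2_ultra = bound_tok_s(model_gb, 800)   # ceiling on the widest-bus Mac
m4_pro = bound_tok_s(model_gb, 273)     # ceiling on a mid-tier chip

print(f"M2 Ultra ceiling: {m2_ultra:.1f} tok/s, M4 Pro ceiling: {m4_pro:.1f} tok/s")
```

On a ~273 GB/s machine the ceiling is under 4 tok/s before any compute overhead, so a measured 3 tok/s in those videos is about what you'd expect rather than a sign something is wrong.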


If you don't have tensor parallelism even 8 x A100 is going to struggle with that kind of model.


This isn't splitting the unified memory; it's splitting the SoC into a SiP. The RAM is still on the same interposer/substrate. The actual physical distance won't regress to, say, the distance from a DIMM socket to the CPU socket.

The software will still see a single memory pool.

I know I'm disagreeing with the article.


I'm sorry, I think you mispronounced market segmentation... This is Apple we're talking about.


When LLMs and diffusion models began spreading freely, it was funny that the supported hardware amounted to some beefy high-end GPUs and the MacBook Air M1.

I hope Apple sticks with the architecture. Even if it's not always practical, it's great to have it as a possibility.


I don’t think the principal market actually cares about that.


Yeah, this is the edge Macs have right now in the AI space. It's why people are looking forward to the Strix Halo from AMD as it will also have a unified memory architecture and will probably cost a good bit less than a Mac.


> Another intriguing aspect is the separate CPU and GPU design. If true, this essentially means M5 will not use a unified memory architecture (UMA) shared between the CPU and the GPU.

This does not follow. Intel is shipping unified memory processors with CPU cores and GPU cores on separate chiplets but still sharing the same memory controller (on a third chiplet, for Meteor Lake and Arrow Lake). AMD is about to launch Strix Halo, a high-end mobile processor that is rumored to consist of one or two CPU chiplets and an IO die with a big GPU and 256-bit memory controller.


Agree. Here is an article on SoIC from Anandtech [1].

Edit: [2] The tweet doesn't even mention UMA. The interpretation is entirely made up by Notebookcheck; I feel like I'm reading WCCFtech making stuff up again.

I'm just wondering if this allows Apple to do something crazy like a 1024-bit LPDDR5X or HBM3E memory solution.

[1] https://www.anandtech.com/show/21414/tsmcs-3d-stacked-soic-p...

[2] https://x.com/mingchikuo/status/1871185666362745227?ref_src=...


Plus Apple themselves are already kind of doing it with the M Ultras - those are two chiplets glued together, each of which is only connected to half of the system's memory directly, but it still behaves like unified memory even though half of the memory traffic has to be routed via the other chiplet.


And in the datacenter space, AMD has taken things even further with the MI300A:

> Twenty-four x86-architecture ‘Zen 4’ cores in three chiplets

> Six accelerated compute dies (XCDs) with 38 compute units (CUs), each with 32 KB of L1 cache, 4 MB L2 cache shared across CUs, and 256 MB AMD Infinity Cache™ shared between XCDs and CPUs

> 128 GB of HBM3 memory shared coherently between CPUs and GPUs with 5.3 TB/s on-package peak throughput


Indeed. Very similar to Grace + Hopper, Grace + dual Hopper, and Grace + dual Blackwell.


Maybe Apple has figured out something better than a unified memory architecture.

It's hard to rule out their ability to create silicon that is a step change.


I wasn't really trying to comment on what Apple could or could not pull off. Just pointing out that Notebookcheck seems to be misunderstanding what they're reporting on from Ming-chi Kuo, and the headline itself seems to be something Notebookcheck made up rather than something from Kuo's rumors. So this whole thread is even more baseless than it appears at first glance, but it would still be interesting to have an informed discussion about the pros and cons of unified memory for consumer SoCs, and alternatives.


They didn't even figure out unified memory - even the original Xbox (20+ years ago) had that.


And SGI O2 30 years ago.


Great reminder. I should have been clearer: I meant that, in the current context, the note was about Intel and AMD working on things that Apple's not.


The Nvidia Jetsons are all unified memory, as are the Raspberry Pis.


CHIP memory


I think the Apple II frame buffer was in a unified system memory.


Ha ha, right! And with interlaced rows, 7 pixels per byte, and a cockamamie color generation scheme, it had a hardware "graphics decelerator".


And yet it is legendary for running all sorts of great games (Ultima, Bard's Tale, Wizardry, Castle Wolfenstein, Robotron, Prince of Persia, Maniac Mansion, Choplifter, Lode Runner, and many, many others.)

I still want to try last year's Wizardry remake – which actually emulates the original Apple II code (or subsequent NES code), with a capability of displaying the original interface if desired.


UMA is trivial if you have so little RAM that bandwidth doesn't matter.

The original Xbox (2001) had 64MB. I think my PC from 1998 had that.


Good point - maybe this is a part of Apple's reasoning, or there's something else architecturally.


For sure, they aren't always the first, but they do seem to scale things in their own way. It got me my first laptop with a top-of-the-line CPU and meaningful battery life.


The Amiga had unified memory almost 40 years ago, at least for the first 512K to 2 megabytes (depending on hardware, which Agnus chip, etc.)


Not to mention nearly every Intel and AMD desktop that has a small iGPU on board, even for the chips not marketed that way.


The Macintosh 128K in 1984 had unified memory.


Notebookcheck is a poor source for parsing technical information.

The actual rumour from Kuo is that they’d move to a chiplet style design where the CPU tile and GPU tile are independent. This is actually in the article as linked.

That does not however mean that unified memory would go away. It’s just a new packaging system.


UMA hurts the GPU too much. Widely parallel processing wants to access memory in bigger chunks than a CPU. If you try to mix access and modification, you lose the benefit of widely parallel processing. Other GPU designers have considered and eschewed unified memory models, to the tune of hundreds of millions in research dollars.


I agree that single cache-line fetches are pretty poor for parallel vector units, but supporting the former in an environment designed for the latter doesn't seem too off-putting (the CM-5 did this).


By the way: does your user name convolvotron refer to the hardware 3D audio processing system originally developed at NASA Ames Research?

Such a cool name! And it says just what it does.

https://spinoff.nasa.gov/node/8965

https://spinoff.nasa.gov/sites/default/files/thumbnail0000_2...

https://pubs.aip.org/asa/jasa/article/92/4_Supplement/2376/7...

Body Electric supported the Convolvotron for visually programming VR simulations with 3D sound:

https://news.ycombinator.com/item?id=24266722

Did you ever meet (or better yet get a tour of Ames from) the late Ron Reisman, and see the virtual reality, flight simulator, and air traffic control systems his research lab developed?

Vertical Motion Simulator:

https://www.youtube.com/watch?v=5-lHcv_olkE

Marvin Minsky flies a simulator and wears VR goggles:

https://www.youtube.com/watch?v=mOKENF_-z8Y


no. but thank you so much for the references. that's actually really great.

I needed a username in the early 90s. I had just finished a paper where we microcoded a CM-2 to support high-throughput convolutions with spatially varying kernels for Hubble image correction (before they launched the eyeglasses mission), and I decided I could be the hero or anti-hero of convolution.


Then you would probably appreciate one of the more obscure and specialized Kai Power Tools for Photoshop: KPT Convolver!

https://www.macintoshrepository.org/724-kpt-convolver-1-0

I can't find any demos of it on youtube, but it's the kind of obscure retro thing that LGR loves to review. He's really into the better known Kai Power Goo, which is a bit more accessible to kids than KPT Convolver:

https://www.youtube.com/watch?v=xt06OSIQ0PE


It would be very interesting to dust off some of those old projects with modern affordances (like not needing one SGI machine per eye, lol) and to revisit the first-principles thinking in the problem spaces you looked at, like air traffic control, which is still the same use case. XR should have a better chance today of making it out of R&D labs and onto a shop floor. I've seen talks from Tom Furness about the early applications being tested, and it seems like we're just now getting to a place on the development curve where some of them might be practical. Thanks for all those links; they'll keep me busy for a while!


I think all the mobile GPUs use UMA. I think the tradeoff point is some complicated function of power envelopes and the benefit of more, though slower, memory vs raw performance at any power or $ cost. Though there are several dozen important algorithms that run much better on GPUs, there are really only two of them, 3D graphics and ML tensors, that have had a big consumer and broad professional appeal.


Why? It's not like there's a single memory channel. Keeping the memory controller busy with tons of pending requests is a great way to use a large fraction of the total memory bandwidth. The M2 Ultra has 32 or 64 memory channels; a pending cache line for each would allow good bandwidth utilization.


Could you give some concrete examples, including when (approximate year/decade is ok) they were considering UMA for CPU/GPU?

As a couple of others have mentioned, smartphones/tablets/laptops seem to be the driving force in UMA's spread.


Not sure why that would be true. Slow UMA (like the vast majority of Intel and AMD desktop chips, with 128-bit-wide memory) hurts GPU performance.

However, the M4 Pro has a 256-bit-wide memory bus, the M4 Max 512 bits, and the M2 Ultra 1024 bits. GPU workloads are latency tolerant and embarrassingly parallel; I don't see how allowing a CPU to make random accesses is going to hurt the GPU much.
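Those widths translate to peak bandwidth as width × transfer rate. A sketch, assuming the commonly cited LPDDR5X-8533 and LPDDR5-6400 rates for these chips (my assumption, not from the comment):

```python
# Peak bandwidth = bus width (in bytes) x transfer rate. The MT/s values are
# the commonly cited LPDDR5X-8533 / LPDDR5-6400 rates, assumed rather than confirmed.

def peak_gb_s(bus_bits, mt_s):
    return bus_bits / 8 * mt_s / 1000  # bytes per transfer * gigatransfers/sec

m4_pro = peak_gb_s(256, 8533)     # matches Apple's quoted ~273 GB/s
m4_max = peak_gb_s(512, 8533)     # matches Apple's quoted ~546 GB/s
m2_ultra = peak_gb_s(1024, 6400)  # marketed as ~800 GB/s

print(f"M4 Pro {m4_pro:.0f}, M4 Max {m4_max:.0f}, M2 Ultra {m2_ultra:.0f} GB/s")
```

The computed figures line up with Apple's spec-sheet numbers, which is why the wide-bus parts are the interesting ones for GPU-style workloads.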


The GPU has a cache on it. So does the CPU. Blow the cache and performance is gone anyway. So uniform memory access is really annoying to implement, really convenient for developers, and a non-issue performance-wise.


> So uniform memory access is really annoying to implement,

Is it really, though? It seems like almost every SoC small enough to be implemented as a single piece of monolithic silicon has gone the route of unified memory shared by the CPU and GPU.

NVIDIA's GH200 and GB200 are NUMA, but they put the CPU and GPU in separate packages and re-use the GPU silicon for GPU-only products. Among solutions that actually put the CPU and GPU chiplets in the same package, I think everyone has gone with a unified memory approach.


Indeed, much like the pending AMD Strix Halo and the already shipping AMD MI300A.

Much like dual-socket servers, where each socket can address all memory, these new servers have two memory systems, one optimized for the CPU and another optimized for the GPU. Seems like a good idea to me: why serialize/deserialize complex data structures between the CPU and GPU, bulk transfer them, and then check for completion? With NUMA you can just pass a pointer, caches help, everything is coherent, and it "just works". No more failures when you don't have enough memory for textures or an LLM; it would just gracefully page to the CPU's memory.


Is the purpose here to make Apple computers a better alternative for gaming? The market is larger than that of people trying to run local LLMs.


Not mentioned in the article, but another motivation behind this could be that with a split CPU/GPU, Apple could try to upsell on both when people purchase Macs.

The price they charge just to go from 16GB to 32GB of RAM is outrageous ($400 for a MacBook Pro).


I love the fact you can buy two 16GB/256GB Mac Minis and have cash left over compared to someone that bought a single 32GB/512GB Mac Mini. Apple's upsells are insane.


I’m seeing 16/256 for $600 and 32/512 for $1,200 on apple.com


I'll admit it's a technicality, but the 16/256 is $599.00. A 32/512 is $1199.00 and two 16/256s are $1198.00, so it is $1 cheaper.


The change was literal.


On the Apple Edu store, it's $499 for the 16/256 and $1079 for the 32/512


Hahaha wow I just checked Apple UK and the base 16GB/256GB is £600. 32GB upgrade is +£400, 512GB upgrade is +£200.

It should not cost that much! 2x Mac mini M4 16GB/256GB should not cost the same as 1x Mac mini M4 32GB/512GB!

Can someone help explain this in a way that isn't just absolute price gouging of the higher-end customer base? Are the components genuinely that much more expensive?


>Can someone help explain this in a way that isn't just absolute price gouging of the higher end customer base?

Price gouging, as a meaningful term, is restricted to:

https://en.wikipedia.org/wiki/Price_gouging

>Price gouging is a pejorative term used to refer to the practice of increasing the prices of goods, services, or commodities to a level much higher than is considered reasonable or fair by some. This commonly applies to price increases of basic necessities after natural disasters. Usually, this event occurs after a demand or supply shock.

Using the term "price gouging" anytime a potential buyer thinks a seller is asking for too much money renders it meaningless. I ask for as much money as the buyers for my labor will pay, as I assume the people selling to me do also.

It's just business, you try to earn as much as possible (and that could involve not maximizing in a specific transaction to incentivize repeat business in the future). But in no way is anyone under any duress when deciding to buy an Apple device, so if a buyer does not feel like being price gouged, they should buy something else.


> Can someone help explain this in a way that isn't just absolute price gouging of the higher end customer

It's a pretty normal pricing strategy. It's more common than not. Most products or services you buy anywhere will be sold at higher margins for more premium offerings.

It might seem strange when compared to legacy PCs with socketed components, but this isn't that, nor are most products. Even among PCs this isn't strange anymore: go take a look at MS's pricing on their first-party PCs.

Calling this "price gouging" is not really the right use of the term -- usually it refers to price increases of basic necessities in emergency situations.


Microsoft isn't a great example. They basically just crib Apple's approach. And they do at least still have socketed storage so that's very cheap to upgrade if you do it that way.


All of the big OEMs are soldering memory on at least some (if not all) of their thin-and-lights, and I haven't seen a single one priced at margins that weren't significantly above the cost of materials.

Either way, my point is that flat margin pricing is exceedingly rare. Everywhere from the grocery store to the car dealer is charging higher margins on more premium products.

Luxury cars have higher margins than economy cars. Organic milk has higher profit margins than regular milk. And Macs with 32GB of memory have higher profit margins than Macs with 16GB of RAM. The fact that the desktop PCs of our past priced RAM upgrades nearly at cost was an outlier; a courtesy, not anything normal.


This is broadly called price discrimination.

https://en.wikipedia.org/wiki/Price_discrimination

It is basic microeconomics that a seller wants to be able to get as high of a price as buyers are willing to pay, but since different buyers have different abilities and willingnesses to pay, a seller can maximize their revenue by providing options at different price points.

Especially with societal wealth gaps, the people able and willing to pay higher prices are going to be able to pay higher price premiums, resulting in higher profit margins.


Right, and a sibling comment already pointed that out, I just wanted to expand on the topic with examples.


The reason why the change to 16GB was such a big deal was at least in part because it meant people didn't feel forced into shelling out $200 (or whatever it was) for an extra 8GB of RAM.

It creates this weird dichotomy of having arguably the best-value computer on the market in the base Mac Mini with 16GB of RAM and 256GB of storage, and some of the absolute worst-value upgrades (like spending $400 on 16GB of RAM or $200 on 256GB of storage).

There's not much to explain here; they price gouge upgrades because they can. People that want/need MacOS for their work will pay for it, even if begrudgingly. I'm not necessarily happy about paying that much for these spec bumps but the benefits of using a Mac still outweigh the cons for me.


> Can someone help explain this in a way that isn't just absolute price gouging of the higher end customer

No, it's the same reason Nvidia has a vastly higher margin on datacenter cards:

https://en.wikipedia.org/wiki/Price_discrimination


You can split the CPU and GPU and still have UMA. Splitting CPU/GPU is a packaging and interconnect concern and is not mutually exclusive with UMA.


Given they control both hardware and software, could they have both, like efficiency cores + performance cores: some memory unified (up to 64GB), plus a dGPU with its own memory that they could sell against NVIDIA? Growing further is hard for Apple, but going after a trillion-dollar firm with a hardware piece you already have is much better than building an Apple Car...


I've always wondered if they'll do something for a true monster Mac Pro.. 128 cores, gobs of memory, etc.


The way they've treated the Mac Studio by simply not updating it, letting the MacBook Pro M4 nearly surpass it at the top end of performance, doesn't bode well for the future.

Seems like they think Ultras aren't worth the investment, let alone building a true "unleashed" SIP.

Apple never says "hey what's the fastest and most powerful thing we can build for X price", they always box themselves in with space or energy constraints, so they never truly compete for the high end. The existing Mac Pro body was their chance to do that, and instead they put something designed for a smaller chassis in there.


I agree 100%, though I guess holding onto the monster Power Supply in the Mac Pro has me hoping they're still working on something "big" that would justify it.


My guess is we won’t ever see anything more than what can also fit in a Studio. It’s just not worth the extra R&D for whatever fancy interposer tech they’d need to make a standout Mac Pro


The strange part is the Apple silicon Mac Pro still has the monster 1000W-class power supply from the Intel days. Why on earth did they keep that?


It's a little overbuilt, but 1000W for a 300W processor with the ability to add a lot of accessories isn't terrible, actually. It only needs a few more PCIe devices to hit the ~50% load where the PSU reaches peak efficiency.


Expansion ports. Arguably, the redesign you're looking for is the Mac Studio.


For your PCI devices and whatever they may draw.


Hackintosh on MI300A?



