> That being said, such a split design may help with improved AI inferencing.
Unified memory is the only reason Macs are so coveted right now for local AI.
A single 192 GB RAM Mac costs less than the equivalent capacity in standalone GPUs.
LLM inference gets painfully slow once you reach models large enough to fill out a 64GB machine. I was tempted too, but then realized it was unusable past 48GB-ish, and stacking used 3090s was the best price / perf / VRAM ratio.
What are the good use cases for very large memory amounts?
48GB is maybe just enough to squeeze a quantized 70B model in, like Llama 3.3, but you'll need to raise the GPU memory allocation limit [1] and it might not be super fast.
You could also try Qwen 2.5 32B, which should just work with Ollama or LM Studio with no config changes.
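For a rough sense of why 48GB is borderline for a 70B model, here's a back-of-envelope sketch. The ~4.8 bits/weight figure for a Q4_K_M-style quant and the KV-cache budget are assumptions, not measurements:

```python
# Back-of-envelope: will a quantized 70B model squeeze into 48 GB?
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

weights = model_size_gb(70, 4.8)   # Q4_K_M-style quant: ~42 GB of weights
kv_cache = 2.0                     # rough KV-cache budget at modest context
print(f"~{weights + kv_cache:.0f} GB needed vs 48 GB total")
```

That leaves almost nothing for the OS, which is why the GPU memory allocation limit has to be raised before it fits at all.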
I've got a 32GB M1 Max and a 24GB 4090, and I barely ever run models on my Mac, as the memory bandwidth and compute for prefill are much better on the 4090. But I'm essentially locked out of Llama 3 70B class models, which I only use via API.
It's slower than using just GPU RAM, but it's still faster than a regular PC, which has much more limited bandwidth between main memory and the GPU. It's a middle ground between how fast and how cheap you can do inference with LLMs that don't fit into a consumer GPU's RAM.
It's definitely usable past 48GB, I have a 96GB M2 Max and regularly run models that use around 70GB that are very usable.
I also have a home server with 2x3090 and 2xA4000 (80GB VRAM) - yes it's a lot faster, but it's a pain in the ass to build, it takes up a lot of space, uses 10x the power, and honestly - cost about the same as my MacBook Pro.
Large context has a different perf profile than a large model. OP was likely thinking of running 400B models and finding the compute wasn't enough to make the memory useful.
This isn't splitting the unified memory, it's splitting the SoC into a SiP. The RAM is still on the same interposer/substrate. The actual mm distance won't regress to, say, the distance from a DIMM socket to the CPU socket.
Yeah, this is the edge Macs have right now in the AI space. It's why people are looking forward to the Strix Halo from AMD as it will also have a unified memory architecture and will probably cost a good bit less than a Mac.
> Another intriguing aspect is the separate CPU and GPU design. If true, this essentially means M5 will not use a unified memory architecture (UMA) shared between the CPU and the GPU.
This does not follow. Intel is shipping unified memory processors with CPU cores and GPU cores on separate chiplets but still sharing the same memory controller (on a third chiplet, for Meteor Lake and Arrow Lake). AMD is about to launch Strix Halo, a high-end mobile processor that is rumored to consist of one or two CPU chiplets and an IO die with a big GPU and 256-bit memory controller.
Agree. Here is an article on SoIC from Anandtech [1].
Edit: [2] The tweet doesn't even mention UMA. The interpretation is entirely made up by Notebookcheck; I feel like I'm reading WCCFtech making stuff up again.
I'm just wondering if this would allow Apple to do something crazy like a 1024-bit LPDDR5X or HBM3e memory solution.
Plus Apple themselves are already kind of doing it with the M Ultras - those are two chiplets glued together, each of which is only connected to half of the system's memory directly, but it still behaves like unified memory even though half of the memory traffic has to be routed via the other chiplet.
And in the datacenter space, AMD has taken things even further with the MI300A:
> Twenty-four x86-architecture ‘Zen 4’ cores in three chiplets
> Six accelerated compute dies (XCDs) with 38 compute units (CUs), each with 32 KB of L1 cache, 4 MB L2 cache shared across CUs, and 256 MB AMD Infinity Cache™ shared between XCDs and CPUs
> 128 GB of HBM3 memory shared coherently between CPUs and GPUs with 5.3 TB/s on-package peak throughput
I wasn't really trying to comment on what Apple could or could not pull off. Just pointing out that Notebookcheck seems to be misunderstanding what they're reporting on from Ming-chi Kuo, and the headline itself seems to be something Notebookcheck made up rather than something from Kuo's rumors. So this whole thread is even more baseless than it appears at first glance, but it would still be interesting to have an informed discussion about the pros and cons of unified memory for consumer SoCs, and alternatives.
Great reminder. I should have been clearer: I meant that, in the current context, the note was about Intel and AMD working on things Apple isn't.
And yet it is legendary for running all sorts of great games (Ultima, Bard's Tale, Wizardry, Castle Wolfenstein, Robotron, Prince of Persia, Maniac Mansion, Choplifter, Lode Runner, and many, many others.)
I still want to try last year's Wizardry remake – which actually emulates the original Apple II code (or subsequent NES code) and can display the original interface if desired.
For sure, they aren't always the first, but they do seem to scale through some things in their own way. Got me my first laptop with a fastest-in-class CPU and meaningful battery life.
Notebookcheck is poor at parsing technical information.
The actual rumour from Kuo is that they’d move to a chiplet style design where the CPU tile and GPU tile are independent. This is actually in the article as linked.
That does not however mean that unified memory would go away. It’s just a new packaging system.
UMA hurts the GPU too much. Widely parallel processing wants to access memory in bigger chunks than a CPU. If you try to mix access and modification, you lose the benefit of widely parallel processing. Other GPU designers have considered and eschewed unified memory models, to the tune of hundreds of millions in research dollars.
I agree that single cache-line fetches are pretty poor for parallel vector units, but supporting the former in an environment designed for the latter doesn't seem too off-putting (the CM-5 did this).
Did you ever meet (or better yet get a tour of Ames from) the late Ron Reisman, and see the virtual reality, flight simulator, and air traffic control systems his research lab developed?
no. but thank you so much for the references. that's actually really great.
I needed a username in the early 90s. I had just finished a paper where we microcoded a CM-2 to support high-throughput convolutions with spatially varying kernels for Hubble image correction (before they launched the eyeglasses mission), and I decided I could be the hero or anti-hero of convolution.
I can't find any demos of it on YouTube, but it's the kind of obscure retro thing that LGR loves to review. He's really into the better known Kai's Power Goo, which is a bit more accessible to kids than KPT Convolver:
It would be very interesting to dust off some of those old projects with modern affordances (like not needing one SGI machine per eye, lol), but also to revisit the first-principles thinking in the problem spaces you guys looked at. Air traffic control is still the same use case, and XR should have a better chance today of making it out of R&D labs onto a shop floor. I've seen talks from Tom Furness about the early applications being tested, and it seems like we're just now getting to a place on the development curve where some of them might be practical. Thanks for all those links, they'll keep me busy for a while!
I think all the mobile GPUs use UMA. I think the tradeoff point is some complicated function of power envelopes and the benefit of more, though slower, memory vs raw performance at any power or $ cost. Though there are several dozen important algorithms that run much better on GPUs, really only two of them, 3D graphics and ML tensors, have had big consumer and broad professional appeal.
Why? Not like there's a single memory channel. Keeping the memory controller busy with tons of pending requests is a great way to make use of a large fraction of the total memory bandwidth. The M2 Ultra has 32 or 64 memory channels, a cache line pending for each would allow good bandwidth utilization.
Not sure why that would be true. Slow UMA (like the vast majority of Intel and AMD desktop chips with a 128-bit wide memory bus) hurts GPU performance.
However, the M4 Pro is 256 bits wide, the M4 Max 512 bits, and the M2 Ultra 1024 bits. GPU workloads are latency-tolerant and embarrassingly parallel; I don't see how allowing a CPU to make random accesses is going to hurt the GPU much.
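Those bus widths map to peak bandwidth straightforwardly. A quick sketch — the LPDDR5/5X transfer rates assigned to each chip are my assumptions, not confirmed specs:

```python
# Peak theoretical bandwidth = bus width (bits) x transfer rate (MT/s) / 8 bits per byte.
def peak_gb_s(bus_bits: int, megatransfers: int) -> float:
    return bus_bits * megatransfers / 8 / 1000  # GB/s

print(peak_gb_s(128, 6400))    # typical desktop DDR5-6400: ~102 GB/s
print(peak_gb_s(256, 8533))    # M4 Pro class:   ~273 GB/s
print(peak_gb_s(512, 8533))    # M4 Max class:   ~546 GB/s
print(peak_gb_s(1024, 6400))   # M2 Ultra class: ~819 GB/s
```

The 4x-8x gap over a 128-bit desktop part is the whole story here, regardless of whether the CPU shares the bus.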
The GPU has a cache on it. So does the CPU. Blow the cache and performance is gone anyway. So uniform memory access is really annoying to implement, really convenient for developers, and a non-issue performance-wise.
> So uniform memory access is really annoying to implement,
Is it really, though? It seems like almost every SoC small enough to be implemented as a single piece of monolithic silicon has gone the route of unified memory shared by the CPU and GPU.
NVIDIA's GH200 and GB200 are NUMA, but they put the CPU and GPU in separate packages and re-use the GPU silicon for GPU-only products. Among solutions that actually put the CPU and GPU chiplets in the same package, I think everyone has gone with a unified memory approach.
Indeed, much like the pending AMD Strix Halo and the already shipping AMD MI300A.
Much like dual socket servers, where each socket can address all memory, these new servers have two memory systems, one optimized for the CPU and another optimized for the GPU. Seems like a good idea to me: why serialize/deserialize complex data structures between the CPU and GPU, which are then bulk transferred and then checked for completion? With NUMA you can pass a pointer, caches help, everything is coherent, and it "just works". No more failures when you don't have enough memory for textures or an LLM; it would just gracefully page to the CPU's memory.
Not mentioned in the article, but another motivation behind this could be that with a split CPU/GPU, Apple could try to upsell on both when purchasing Macs.
The price they charge just to go from 16GB to 32GB of RAM is outrageous ($400 for a MacBook Pro).
I love the fact that you can buy two 16GB/256GB Mac Minis and have cash left over compared to someone who bought a single 32GB/512GB Mac Mini. Apple's upsells are insane.
Hahaha wow I just checked Apple UK and the base 16GB/256GB is £600. 32GB upgrade is +£400, 512GB upgrade is +£200.
It should not cost that much! 2x Mac mini M4 16GB/256GB should not cost the same as 1x Mac mini M4 32GB/512GB!
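Just to make the comparison concrete with the UK numbers above (list prices as quoted, ignoring any bundled extras):

```python
# Two base Mac minis vs one upgraded one, using the quoted UK prices (GBP).
base = 600                   # Mac mini M4, 16GB/256GB
ram_upgrade = 400            # +16GB RAM
ssd_upgrade = 200            # +256GB storage

upgraded = base + ram_upgrade + ssd_upgrade
two_base = 2 * base
print(two_base, upgraded)    # 1200 1200 - two whole computers for the price of one upgrade
```

The upgrade margin is effectively the full price of a second machine.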
Can someone help explain this in a way that isn't just absolute price gouging of the higher end customer base? Are the components genuinely that much more expensive?
>Price gouging is a pejorative term used to refer to the practice of increasing the prices of goods, services, or commodities to a level much higher than is considered reasonable or fair by some. This commonly applies to price increases of basic necessities after natural disasters. Usually, this event occurs after a demand or supply shock.
Using the term "price gouging" anytime a potential buyer thinks a seller is asking for too much money renders it meaningless. I ask for as much money as the buyers for my labor will pay, as I assume the people selling to me do also.
It's just business: you try to earn as much as possible (and that could involve not maximizing a specific transaction to incentivize repeat business in the future). But in no way is anyone under any duress when deciding to buy an Apple device, so if a buyer feels like they're being price gouged, they should buy something else.
> Can someone help explain this in a way that isn't just absolute price gouging of the higher end customer base?
It's a pretty normal pricing strategy. It's more common than not. Most products or services you buy anywhere will be sold at higher margins for more premium offerings.
It might seem strange when compared to legacy PCs with socketed components, but this isn't that, nor are most products. Even among PCs this isn't strange anymore: go take a look at MS's pricing on their first-party PCs.
Calling this "price gouging" is not really the right use of the term -- usually it refers to price increases of basic necessities in emergency situations.
Microsoft isn't a great example. They basically just crib Apple's approach. And they do at least still have socketed storage so that's very cheap to upgrade if you do it that way.
All of the big OEMs are soldering memory on at least some (if not all) of their thin-and-lights, and I haven't seen a single one priced at margins that weren't significantly above the cost of materials.
Either way, my point is that flat margin pricing is exceedingly rare. Everywhere from the grocery store to the car dealer is charging higher margins on more premium products.
Luxury cars have higher margins than economy cars. Organic milk has higher profit margins than regular milk. And Macs with 32GB of RAM have higher profit margins than Macs with 16GB. The fact that the desktop PCs of our past priced RAM upgrades nearly at cost was an outlier; a courtesy, not the norm.
It is basic microeconomics that a seller wants to be able to get as high of a price as buyers are willing to pay, but since different buyers have different abilities and willingnesses to pay, a seller can maximize their revenue by providing options at different price points.
Especially with societal wealth gaps, the people able and willing to pay higher prices are going to be able to pay higher price premiums, resulting in higher profit margins.
The reason why the change to 16GB was such a big deal was at least in part because it meant people didn't feel forced into shelling out 200 dollars (or whatever it was) for an extra 8GB of RAM.
It creates this weird dichotomy of having arguably the best value computer on the market in the base Mac mini with 16GB of RAM and 256GB of storage, and some of the absolute worst value upgrades (like spending $400 on 16GB of RAM or $200 on 256GB of storage).
There's not much to explain here; they price gouge upgrades because they can. People that want/need MacOS for their work will pay for it, even if begrudgingly. I'm not necessarily happy about paying that much for these spec bumps but the benefits of using a Mac still outweigh the cons for me.
Given they control both hardware and software, could they have it both ways, like efficiency cores + performance cores: keep some memory unified (up to 64 GB) and add a dGPU with its own memory that they sell against NVIDIA? Further growth is hard for Apple, but taking on a trillion-dollar firm with hardware you already have is much better than building an Apple Car ...
The way they've treated the Mac Studio by simply not updating it, letting the MacBook Pro M4 nearly surpass it at the top end of performance, doesn't bode well for the future.
Seems like they think Ultras aren't worth the investment, let alone building a true "unleashed" SiP.
Apple never says "hey what's the fastest and most powerful thing we can build for X price", they always box themselves in with space or energy constraints, so they never truly compete for the high end. The existing Mac Pro body was their chance to do that, and instead they put something designed for a smaller chassis in there.
I agree 100%, though I guess holding onto the monster Power Supply in the Mac Pro has me hoping they're still working on something "big" that would justify it.
My guess is we won’t ever see anything more than what can also fit in a Studio. It’s just not worth the extra R&D for whatever fancy interposer tech they’d need to make a standout Mac Pro
It's a little overbuilt, but a 1000W supply for a 300W processor with the ability to add a lot of accessories isn't terrible, actually. It only needs a few more PCIe devices to hit the 50% load minimum for peak efficiency on the PSU.