This is exciting. So this is using unified memory of CUDA? I wonder how well tha...

zcbenz · 2025-07-14T23:20:42 1752535242

In the absence of hardware unified memory, CUDA will automatically copy data between CPU/GPU when there are page faults.
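For anyone who hasn't seen the programming model, here is a minimal sketch, assuming a CUDA toolkit and a GPU that reports managed-memory support. The kernel and the helper name `managed_increment_ok` are made up for illustration; `cudaMallocManaged` is the real runtime call. The point is that there is no explicit `cudaMemcpy` anywhere.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: add 1.0f to every element.
__global__ void increment(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns true if every element ends up as 1.0f.
bool managed_increment_ok(int n) {
    float* data = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // One allocation, one pointer, usable from both host and device code.
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return false;

    // First touch happens on the CPU.
    for (int i = 0; i < n; ++i) data[i] = 0.0f;

    // No cudaMemcpy: the driver migrates the pages for us (on fault, or at
    // launch on platforms without GPU page-fault support).
    increment<<<(n + 255) / 256, 256>>>(data, n);
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    // Reading on the CPU again brings the data back to the host side.
    for (int i = 0; ok && i < n; ++i) ok = (data[i] == 1.0f);

    cudaFree(data);
    return ok;
}

int main() {
    std::printf("managed increment %s\n",
                managed_increment_ok(1 << 20) ? "ok" : "failed");
    return 0;
}
```

Built with something like `nvcc demo.cu -o demo` (the file name is arbitrary). Where the copies actually happen depends on the platform, which is what the rest of the thread is about.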

fenced_load · 2025-07-15T00:08:30 1752538110

There is also NVLink-C2C support between Nvidia's CPUs and GPUs that doesn't require any copy: CPUs and GPUs directly access each other's memory over a coherent bus. IIRC, they have 4 CPU + 4 GPU servers already available.

benreesman · 2025-07-15T00:35:06 1752539706

Yeah, NCCL is a whole world and it's not even the only thing involved, but IIRC that's the difference between 8xH100 PCIe and 8xH100 SXM5.

saagarjha · 2025-07-15T02:02:57 1752544977

This seems like it would be slow…

freeone3000 · 2025-07-15T02:28:22 1752546502

Matches my experience. It’s memory stalls all over the place, aggravated by the fact that (on 12.3 at least) there wasn’t even a prefetcher.
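One way to reduce fault-driven stalls is an explicit prefetch hint from the program. This is a minimal sketch, not an automatic prefetcher: the program itself tells the driver where a managed range will be used next. The helper names `prefetch_to_device` and `prefetched_scale_ok` are hypothetical. The version guard rests on the assumption that toolkits from 13.0 onward take a `cudaMemLocation` argument while earlier ones take a device ordinal, and that prefetching is only valid when the device reports concurrent managed access.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: multiply every element by `factor`.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: ask the driver to move a managed range to `device`
// before it is used. Assumption: the signature changed in CUDA 13.0.
static cudaError_t prefetch_to_device(const void* ptr, size_t bytes,
                                      int device, cudaStream_t stream) {
#if CUDART_VERSION >= 13000
    cudaMemLocation loc{};
    loc.type = cudaMemLocationTypeDevice;
    loc.id = device;
    return cudaMemPrefetchAsync(ptr, bytes, loc, 0, stream);
#else
    return cudaMemPrefetchAsync(ptr, bytes, device, stream);
#endif
}

// Hypothetical helper: returns true if every element ends up as 2.0f.
bool prefetched_scale_ok(int n) {
    int device = 0, concurrent = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return false;
    if (cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess,
                               device) != cudaSuccess) return false;

    float* data = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // first touched on the CPU

    bool ok = true;
    // Only issue the hint where the device supports concurrent managed access.
    if (concurrent) {
        ok = (prefetch_to_device(data, bytes, device, nullptr) == cudaSuccess);
    }

    if (ok) {
        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
        ok = (cudaGetLastError() == cudaSuccess) &&
             (cudaDeviceSynchronize() == cudaSuccess);
    }
    for (int i = 0; ok && i < n; ++i) ok = (data[i] == 2.0f);

    cudaFree(data);
    return ok;
}

int main() {
    std::printf("prefetched scale %s\n",
                prefetched_scale_ok(1 << 20) ? "ok" : "failed");
    return 0;
}
```

Whether the hint helps depends on the access pattern. It mostly pays off when a large range is touched once on the GPU right after being written on the CPU.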

nickysielicki · 2025-07-15T01:21:02 1752542462

See also: https://www.kernel.org/doc/html/v5.0/vm/hmm.html

ethan_smith · 2025-07-15T13:40:57 1752586857

CUDA's Unified Memory uses page migration with on-demand faulting to create the illusion of shared memory, whereas Apple Silicon has true shared physical memory, resulting in different performance characteristics despite the similar programming model.

MBCook · 2025-07-14T23:03:55 1752534235

This is just a guess, but does the higher-end hardware they sell, like the server rack stuff for AI, perhaps have unified memory?

I know standard GPUs don’t.

The patch suggested one of the reasons for it was to make it easy to develop on a Mac and run on a supercomputer. So the hardware with the unified memory might be in that class.

ajuhasz · 2025-07-14T23:12:37 1752534757

The Jetsons[1] have unified memory[2].

[1] https://www.nvidia.com/en-us/autonomous-machines/embedded-sy... [2] https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s...

tonyarkles · 2025-07-14T23:56:03 1752537363

They sure do and it's pretty amazing. One iteration of a vision system I worked on got frames from a camera over a Mellanox NIC that supports RDMA (Rivermax), preprocessed the images using CUDA, did inference on them with TensorRT, and the first time a single byte of the inference pipeline hit the CPU itself was when we were consuming the output.

patrickkrusiec · 2025-07-14T23:19:29 1752535169

The physical memory is not unified, but on modern rack-scale Nvidia systems, like Grace Hopper or NVL72, the CPU and the GPU(s) share the same virtual address space and have non-uniform memory access to each other's memory.

freeone3000 · 2025-07-15T02:30:47 1752546647

Standard GPUs absolutely do. Since CUDA 11, all CUDA cards expose the same feature set at differing speeds (based on backing capability). You can absolutely (try to) run CUDA UMA on your 2060, and it will complete the computation.
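To see what a given card actually reports, the runtime exposes device attributes. A minimal sketch, assuming a CUDA toolkit is installed: the struct `UnifiedMemoryCaps` and the helper `query_caps` are made-up names, while the `cudaDevAttr...` constants are the runtime's own. How to interpret each combination is left to the reader; the code only prints the raw values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical container for the attributes relevant to this discussion.
struct UnifiedMemoryCaps {
    int managed;     // device can allocate managed memory
    int concurrent;  // CPU and GPU can access managed memory concurrently
    int pageable;    // GPU can access ordinary pageable host memory
    int integrated;  // GPU is integrated with host memory
};

// Hypothetical helper: fills `out`, returns false if any query fails.
bool query_caps(int device, UnifiedMemoryCaps* out) {
    return cudaDeviceGetAttribute(&out->managed,
                                  cudaDevAttrManagedMemory, device) == cudaSuccess
        && cudaDeviceGetAttribute(&out->concurrent,
                                  cudaDevAttrConcurrentManagedAccess, device) == cudaSuccess
        && cudaDeviceGetAttribute(&out->pageable,
                                  cudaDevAttrPageableMemoryAccess, device) == cudaSuccess
        && cudaDeviceGetAttribute(&out->integrated,
                                  cudaDevAttrIntegrated, device) == cudaSuccess;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        UnifiedMemoryCaps caps{};
        if (!query_caps(d, &caps)) {
            std::printf("device %d: attribute query failed\n", d);
            continue;
        }
        std::printf("device %d: managed=%d concurrent=%d pageable=%d integrated=%d\n",
                    d, caps.managed, caps.concurrent, caps.pageable, caps.integrated);
    }
    return 0;
}
```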

Y_Y · 2025-07-14T23:09:19 1752534559

The servers don't, but the Jetsons do.