
freeone3000 16 days ago, on: Apple's MLX adding CUDA support

Eh, yes, but in my experience its lack of prefetch leads to significant memory stalls while waiting for the copy. It might be suitable if your entire dataset fits in VRAM after doing a "manual prefetch", but it killed performance for my application (ML training) so hard that we actually got the time to move to streaming loads.
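To illustrate what "streaming loads" can look like, here is a minimal host-side sketch: a background thread keeps a small bounded queue of batches filled, so the consumer rarely waits on a load. It is plain standard-library Python, not MLX or CUDA API code, and not necessarily what the setup described above looked like; `prefetch` and `depth` are hypothetical names chosen for the example.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches` in order, loading up to `depth` ahead.

    `batches` is any iterable whose iteration does the slow work
    (disk read, decode, host-to-device copy). Hypothetical helper.
    """
    q = queue.Queue(maxsize=depth)  # bounded, so memory use stays capped
    end = object()                  # sentinel marking the end of the stream

    def worker():
        try:
            for batch in batches:
                q.put(batch)        # blocks while the queue is full
        finally:
            q.put(end)              # always signal completion to the consumer

    # Daemon thread: if the consumer stops early, the blocked worker
    # will not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()              # waits only if the loader has fallen behind
        if item is end:
            return
        yield item
```

Usage would be along the lines of `for batch in prefetch(load_batches(), depth=4): train_step(batch)`, where `load_batches` and `train_step` stand in for your own code. A larger `depth` hides more load latency at the cost of holding more batches in memory at once.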