Only the runtime components matter, though. Nobody cares about the dev tools beyond the core compiler. What people want is to be able to recompile and run on competitive hardware, and I don't understand why that's such an intractable problem.
It's the same essential problem as with e.g. Wine - if you're trying to reimplement someone else's constantly evolving API with a closed-source implementation, it takes a lot of effort just to barely keep up.
As far as portability goes, people who care about that already have the option of using higher-level APIs that have a CUDA backend among several others. The main reason why you'd want to do CUDA directly is to squeeze that last bit of performance out of the hardware, but that is also precisely the area where deviation in small details starts to matter a lot.
However, companies may still be hoping to get their own solutions in place instead of CUDA. If they do implement CUDA, that cements its position forever. That ship has probably already sailed, of course.
Because literally the entire rest of the ecosystem is immature demoware. Rather than each vendor buying into OpenCL+SPIR-V and building a robust stack around it, they are all doing their own half-baked tech demos, hoping to lock up some portion of the market to duplicate NVIDIA's success, or at least carve out a niche, while NVIDIA continues to extend and mature its ecosystem. Intel has oneAPI, AMD has ROCm, Arm has ACL/Kleidi/etc., and there's a pile of other stacks like MLX, Windows ML, whatever. All of that is combined with a confusing mix of pure software plays like PyTorch and Windows ML.
A lot of people talk about 'tooling' quality and no one hears them. I just spent a couple of weeks porting a fairly small library to some fairly common personal hardware and hit all the same problems you see everywhere. Bugs aren't handled gracefully. Instead of returning "you messed up here", the hardware locks up, and power cycling is the only solution. Not a problem when you're writing hello world, but trawling through tens of thousands of lines of GPU kernel code to find the error is going to burn engineer time without anything to show for it. Then, once it's running, you spend weeks in an open feedback loop trying to figure out why the GPU utilization metrics are reporting 50% utilization (if you're lucky enough to even have them) while the kernel is running at 1/4 the expected performance. All because there isn't a functional profiler.
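For what it's worth, when there is no working profiler, about the only sanity check left is back-of-the-envelope arithmetic. Here is a minimal sketch of a roofline-style estimate (my substitution for a real profiler, not something any vendor stack provides): compare the measured kernel time against the larger of the pure-compute time and the pure-memory time. The function name and all the numbers are hypothetical, picked only to illustrate the "1/4 of expected" situation.

```python
def fraction_of_bound(flops, bytes_moved, peak_flops, peak_bandwidth, measured_seconds):
    """Ratio of a simple roofline-style lower bound on runtime to the measured runtime.

    A value near 1.0 means the kernel is close to the hardware limit;
    a value near 0.25 means it runs at roughly a quarter of what the
    peak numbers suggest is possible.
    """
    compute_seconds = flops / peak_flops            # time if only arithmetic mattered
    memory_seconds = bytes_moved / peak_bandwidth   # time if only memory traffic mattered
    bound_seconds = max(compute_seconds, memory_seconds)
    return bound_seconds / measured_seconds


if __name__ == "__main__":
    # Hypothetical kernel and hardware figures, for illustration only.
    ratio = fraction_of_bound(
        flops=2e9,
        bytes_moved=1e8,
        peak_flops=1e12,
        peak_bandwidth=1e11,
        measured_seconds=8e-3,
    )
    print(f"fraction of roofline bound: {ratio:.2f}")
```

This only tells you that time is being lost, not where, which is exactly the gap a functional profiler is supposed to fill.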
And the vendors can't even get this stuff working. People rant about the ROCm support list not supporting, well, the hardware people actually have. And it is such a mess that in some cases it actually works but AMD says it doesn't. And of course, the only reason you hear people complaining about AMD is that they are literally the only company with a hardware ecosystem that in theory spans the same breadth of devices that NVIDIA does, from small embedded systems to giant datacenter-grade products. Everyone else wants a slice of the market, but take Apple: they have nothing in the embedded/edge space that isn't a fixed-function device (e.g. a watch or an Apple TV), and their GPUs, while interesting, are nowhere near the level of the datacenter-grade stuff, or even top-of-the-line AIC boards for gamers.
And it's all gotten to be such an industry-wide pile of trash that people can't even keep track of basic feature capabilities. Like, a huge pile of hardware actually 'supports' OpenCL, but it's buried to the point where actual engineers working on, say, ROCm are unaware it's actually part of the ROCm stack (imagine my surprise!). And it's been the same for NVIDIA: they have at times supported OpenCL, but the support is like a .dll they install with the GPU driver stack and don't even bother to document that it's there. Or take TensorFlow, which seems to have succumbed to the immense gravitational black hole it had become, where just building it on something that wasn't the blessed platform could take days.
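If you want to check whether an OpenCL runtime is quietly sitting on your own machine, a rough probe is possible with nothing but the Python standard library. This is a sketch under assumptions: it assumes the loader is discoverable under the usual library name `OpenCL`, and the helper name `count_opencl_platforms` is mine. The only OpenCL call it makes is `clGetPlatformIDs`, which is part of the standard OpenCL API.

```python
import ctypes
import ctypes.util


def count_opencl_platforms():
    """Return the number of OpenCL platforms, or None if no loader can be used."""
    path = ctypes.util.find_library("OpenCL")
    if path is None:
        return None  # no OpenCL loader found on the library search path
    try:
        # Note: CDLL assumes the default C calling convention, which is
        # fine on 64-bit systems; 32-bit Windows would need WinDLL instead.
        lib = ctypes.CDLL(path)
        get_platform_ids = lib.clGetPlatformIDs
    except (OSError, AttributeError):
        return None  # library present but not loadable, or symbol missing

    # cl_int clGetPlatformIDs(cl_uint num_entries,
    #                         cl_platform_id *platforms,
    #                         cl_uint *num_platforms)
    get_platform_ids.argtypes = [
        ctypes.c_uint,
        ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_uint),
    ]
    get_platform_ids.restype = ctypes.c_int

    num_platforms = ctypes.c_uint(0)
    status = get_platform_ids(0, None, ctypes.byref(num_platforms))
    if status != 0:  # CL_SUCCESS is 0; anything else is treated as "no platforms"
        return 0
    return num_platforms.value


if __name__ == "__main__":
    found = count_opencl_platforms()
    if found is None:
        print("No usable OpenCL loader found")
    else:
        print(f"OpenCL loader found, platforms reported: {found}")
```

A nonzero platform count on a box where the vendor docs never mention OpenCL is the "imagine my surprise" case described above.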