
AI training is moving away from CUDA and toward TPUs anyway. DGX clusters can't keep up.


And Nvidia's GPUs now include the same type of hardware that TPUs have, so there's no reason to believe that TPUs will win out over GPUs.


The key difference between a TPU and a GPU is that a TPU has a CPU. It's an entire computer, not just a piece of hardware. Is nVidia moving in that direction?


They just bought ARM for $40 billion. I think they want to integrate CPUs, GPUs, and high-speed networking.


In terms of cutting-edge tech, they have their own GPUs, CPUs from ARM, and networking from Mellanox, so I'd say they're pretty much set to build a kick-ass TPU.


A TPU is a chip you cannot program. It's purpose-built and can't run a fraction of the workloads that a GPU can.


I don't know where all of this misinformation is coming from or why, but, as someone who has spent the last year programming TPUs to do all kinds of things that a GPU can't do, this isn't true.

Are we going to simply say "Nu uh" at each other, or do you want to throw down some specific examples so I can show you how mistaken they are?


I'm a TPU user and I'd be interested to see a specific example of something that can be done on TPU but not GPU.

Perhaps I'm just not experienced enough with the programming model, but I've found them to be strictly less flexible/more tricky than GPUs, especially for things like conditional execution, multiple graphs, variable size inputs and custom ops.


Sure! I'd love to chat TPUs. There's a #tpu discord channel on the MLPerf discord: https://github.com/shawwn/tpunicorn#ml-community

The central reason that TPUs feel less flexible is Google's awful mistake in encouraging everyone to use TPUEstimator as the One True API For Doing TPU Programming. Getting off that API was the single biggest boost to my TPU skills.

You can see an example of how to do that here: https://github.com/shawwn/ml-notes/blob/master/train_runner.... This is a repo that can train GPT-2 1.5B at 10 examples/sec on a TPUv3-8 (aka around 10k tokens/sec).
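(For the record, that repo drives the TPU with low-level TF1 session calls. If you want the rough modern shape of "TPU programming without TPUEstimator," it looks something like the sketch below; this is untested as written, and the TPU name is a placeholder.)

    import tensorflow as tf

    # Untested sketch: replace 'my-tpu' with your TPU's name or grpc:// address.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # Variables must be created under the strategy scope so they're replicated.
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        optimizer = tf.keras.optimizers.SGD(0.01)

    @tf.function
    def train_step(batch):
        def step_fn(x, y):
            with tf.GradientTape() as tape:
                loss = tf.reduce_mean(tf.square(model(x) - y))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            return loss
        # step_fn runs on every TPU core; batch is a (x, y) tuple of per-replica values.
        return strategy.run(step_fn, args=batch)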

Happy to answer any specific questions or peek at codebases you're hoping to run on TPUs.


That doesn't answer the question of what a TPU can do that a GPU can't. I think the OP means impossible for the GPU, not just slower.


You can run basically any C program on a CUDA core, even those requiring malloc. It may not be efficient, but you can do it. Google themselves call GPUs general-purpose and TPUs domain-specific. https://cloud.google.com/blog/products/ai-machine-learning/w...


Please show me the API where I can write a generic function on a TPU. I'm talking about writing something like a custom reduction or a peak search, not offloading a TensorFlow model.

I'll make it easier for you, directly from Google's website:

Cloud TPUs are optimized for specific workloads. In some situations, you might want to use GPUs or CPUs on Compute Engine instances to run your machine learning workloads.

Please tell me a workload a GPU can't do that a TPU can.


Sure, here you go: https://www.tensorflow.org/api_docs/python/tf/raw_ops

In my experience, well over 80% of these operations are implemented on TPU CPUs, and at least 60% are implemented on TPU cores.

Again, if you give a specific example, I can simply write a program demonstrating that it works. What kind of custom reduction do you want? What's a peak search?
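To give a flavor of what I mean, here's a rough, untested sketch of one possible custom reduction (a masked log-sum-exp), built purely from elementwise ops and reduce_max/reduce_sum, all of which have XLA lowerings. jit_compile=True just stands in for however the function ends up XLA-compiled (e.g. inside strategy.run):

    import tensorflow as tf

    @tf.function(jit_compile=True)  # XLA-compile; on a TPU this runs on the cores
    def masked_logsumexp(x, mask):
        # Numerically stable log-sum-exp over the entries where mask is True.
        neg_inf = tf.fill(tf.shape(x), tf.constant(x.dtype.min, x.dtype))
        x = tf.where(mask, x, neg_inf)
        m = tf.reduce_max(x, axis=-1, keepdims=True)
        return tf.squeeze(m, -1) + tf.math.log(tf.reduce_sum(tf.exp(x - m), axis=-1))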

As for workloads that GPUs can't do, we regularly train GANs at 500+ examples/sec across a total dataset size of >3M photos. Rather hard to do that with GPUs.


Well, there you go. For one, TensorFlow is not a generic framework like CUDA is, so you lose a whole bunch of the configurability you have with CUDA. So, for example, even though there is a raw FFT op, there doesn't appear to be a way to do more complicated FFTs, such as an overlap-save. This is trivial to do on a GPU, and is built into the library. The raw functions it provides are not direct access to the hardware and memory subsystem; they're a small subset of the total problem space. And if you're saying that running something on a TPU's CPU cores is in any way going to compete with a GPU, then I don't know what to tell you.

You did not give an example of something GPUs can't do. All you said was that TPUs are faster for a specific function in your case.


For one, TensorFlow is not a generic framework like CUDA is, so you lose a whole bunch of the configurability you have with CUDA

Why make generalizations like this? It's not true, and we've devolved back into the "nu uh" we originally started with.

This is trivial to do on a GPU, and is built into the library

Yes, I'm sure there are hardwired operations that are trivial to do on GPUs. That's not exactly a +1 in favor of generic programmability. There are also operations that are trivial to do on TPUs, such as CrossReplicaSum across a massive cluster of cores, or the various special-case Adam operations. This doesn't seem related to the claim that TPUs are less flexible.
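(As a rough sketch of the kind of thing I mean, assuming you're calling this from inside strategy.run on a TPU, as in the earlier snippet:)

    import tensorflow as tf

    @tf.function
    def global_grad_sum(grads):
        # Inside a TPU replica context, CrossReplicaSum adds each core's value
        # across the entire slice in a single op.
        return [tf.tpu.cross_replica_sum(g) for g in grads]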

The raw functions it provides are not direct access to the hardware and memory subsystem.

Not true. https://www.tensorflow.org/api_docs/python/tf/raw_ops/Inplac...

Jax is also going to be giving even lower-level access than TF, which may interest you.
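As a taste of what "lower-level" means there (untested sketch, assuming a float32 input vector): in Jax you write against XLA's own primitives (lax.*), e.g. an explicit scan instead of reaching for a canned op:

    import jax
    import jax.numpy as jnp
    from jax import lax

    @jax.jit
    def running_argmax(x):
        # Value and index of the maximum of a 1-D float32 array, written with
        # an explicit XLA scan rather than a prepackaged reduction.
        def step(carry, elem):
            best_val, best_idx, i = carry
            better = elem > best_val
            best_val = lax.select(better, elem, best_val)
            best_idx = lax.select(better, i, best_idx)
            return (best_val, best_idx, i + 1), None

        init = (jnp.float32(-jnp.inf), jnp.int32(0), jnp.int32(0))
        (val, idx, _), _ = lax.scan(step, init, x)
        return val, idx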

You did not give an example of something GPUs can't do. All you said was that TPUs are faster for a specific function in your case.

Well yeah, I care about achieving goals in my specific case, as you do yours. And simply getting together a VM that can feed 500 examples/sec to a set of GPUs is a massive undertaking in and of itself. TPUs make it more or less "easy" in comparison. (I won't say effortless, since it does take some effort to get yourself into the TPU programming mindset.)
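(The "easy" part, concretely: the input pipeline is just tf.data reading straight out of GCS, and the TPU hosts pull batches themselves. A rough sketch, with a placeholder bucket name and a hypothetical global batch size of 512:)

    import tensorflow as tf

    def make_dataset(ctx: tf.distribute.InputContext):
        # 'gs://my-bucket/...' is a placeholder; each input pipeline reads its own shard.
        files = tf.data.Dataset.list_files('gs://my-bucket/photos-*.tfrecord')
        files = files.shard(ctx.num_input_pipelines, ctx.input_pipeline_id)
        ds = files.interleave(tf.data.TFRecordDataset,
                              num_parallel_calls=tf.data.experimental.AUTOTUNE)
        ds = ds.shuffle(8192)
        ds = ds.batch(ctx.get_per_replica_batch_size(512), drop_remainder=True)
        return ds.prefetch(tf.data.experimental.AUTOTUNE)

    # With a TPUStrategy set up as in the earlier sketch:
    # dist_ds = strategy.experimental_distribute_datasets_from_function(make_dataset)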


I gave you an example of something you can't do, which is an overlap-save FFT, and you ignored that completely. Please implement it, or show me any example of someone implementing any custom FFT that's not a simple, standard, batched FFT. I'll take any example of implementing any type of signal processing pipeline on TPU, such as a 5G radio.

Your last sentence is pretty funny: apparently a GPU "can't do" certain workloads because the one workload it can do is too slow for you. Yet it remains a fact that a TPU cannot do certain workloads without offloading to the CPU (making it orders of magnitude slower), and that's somehow okay? It seems like where this discussion is going is that you pointed to a TensorFlow library that may or may not offload to the TPU, and it probably doesn't. But even that library is incomplete for implementing things like a 5G LDPC decoder.


Which part of this can't be done on TPUs? https://en.wikipedia.org/wiki/Overlap%E2%80%93save_method#Ps... As far as I can tell, all of those operations can be done on TPUs. In fact, I linked to the operation list that shows they can be.

You'll need to link me to some specific implementation that you want me to port over, not just namedrop some random algorithm. Got a link to a GitHub repo?

If your point is "There isn't a preexisting operation for overlap-save FFT" then... yes, sure, that's true. There's also not a preexisting operation for any of the hundreds of other algorithms that you'd like to do with signal processing. But they can all be implemented efficiently.
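To be concrete, here's a rough, untested sketch of that exact pseudocode using only tf.signal ops (frame, FFT, elementwise multiply), which should all lower through XLA. It assumes a 1-D float32 signal and a filter whose length is statically known; I haven't benchmarked this on a TPU.

    import tensorflow as tf

    def overlap_save(x, h, fft_len=1024):
        # Linear convolution of a long 1-D float32 signal x with a short filter h,
        # following the overlap-save recipe: overlapping blocks, FFT, multiply,
        # inverse FFT, discard the first len(h)-1 samples of each block.
        m = int(h.shape[0])                      # filter length (static)
        step = fft_len - (m - 1)                 # valid output samples per block
        x_pad = tf.pad(x, [[m - 1, 0]])          # prepend m-1 zeros
        blocks = tf.signal.frame(x_pad, fft_len, step, pad_end=True)
        H = tf.signal.fft(tf.cast(tf.pad(h, [[0, fft_len - m]]), tf.complex64))
        Y = tf.signal.ifft(tf.signal.fft(tf.cast(blocks, tf.complex64)) * H)
        y = tf.reshape(tf.math.real(Y)[:, m - 1:], [-1])
        return y[: tf.shape(x)[0] + m - 1]       # trim to the linear-convolution length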

Yet it remains a fact that a TPU cannot do certain workloads without offloading to the CPU (making it orders of magnitude slower), and that's somehow okay?

I think this is the crux of the issue: you're saying X can't be done, I'm saying X can be done, so please link to a specific code example. Emphasis on "specific" and "code".


Let's just leave this one alone then. I can't argue with someone who claims anything is possible, yet absolutely nobody seems to be doing what you're referring to (except you). A100 now tops all MLPerf benchmarks, and the unavailable TPUv4 may not even keep up.

Trust me, I would love it if TPUs could do what you're saying, but they simply can't. There's no direct DMA from the NIC, so I can't do a streaming application at 40+Gbps to it. Even if a TPU could do all the things you claim, if it's not as fast as the A100, what's the point? To go through undocumented pain to prove something?


FWIW, you can stream at 10Gbps to TPUs. (I've done it.)

10Gbps isn't quite 40Gbps, but I think you can get there by streaming to a few different TPUs on different VPC networks. Or to the same TPU from different VMs, possibly.

The point is that there's a realistic alternative to nVidia's monopoly.


When I can run a TPU in my own data center, there is. Until then it precludes a lot of applications.


Where did you get this from? AFAIK GPT-3 (for example) was trained on a GPU cluster, not TPUs.


Experience, for one. TPUs are dominating MLPerf benchmarks. That kind of performance can't be dismissed so easily.

GPT-2 was trained on TPUs. (There are explicit references to TPUs in the source code: https://github.com/openai/gpt-2/blob/0574c5708b094bfa0b0f6df...)

GPT-3 was trained on a GPU cluster probably because of Microsoft's billion-dollar Azure cloud credit investment, not because it was the best choice.


I checked the MLPerf website, and it looks like the A100 is outperforming TPUv3, and is also more capable (there does not seem to be a working implementation of RL for Go on TPU).

To be fair, TPUv4 is not out yet, and it might catch up using the latest processes (7nm TSMC or 8nm Samsung).

https://mlperf.org/training-results-0-7


No, they are not. Go read the recent MLPerf results more carefully, not Google's blog post. NVIDIA won 8/8 benchmarks for the publicly available SW/HW combo, and also 8/8 on per-chip performance. Google did show better results with some "research" system, which is not available to anyone other than them yet.


This is a weirdly aggressive reply. I don't "read Google's blogpost," I use TPUs daily. As for MLPerf benchmarks, you can see for yourself at https://mlperf.org/training-results-0-6 that TPUs are far ahead of competitors. All of these training results are openly available, and you can run them yourself. (I did.)

For MLPerf 0.7, it's true that Google's software isn't available to the public yet. That's because they're in the middle of transitioning to Jax (and by extension, Pytorch). Once that transition is complete, and available to the public, you'll probably be learning TPU programming one way or another, since there's no other practical way to e.g. train a GAN on millions of photos.

You'd think people would be happy that there are realistic alternatives to nVidia's monopoly for AI training, rather than rushing to defend them...


transitioning to Jax (and by extension, Pytorch)

Wait, what? Why would transition to Jax imply transition to Pytorch?


You are basing your opinion on last year's MLPerf and some stuff that may or may not be available in the future. The MLPerf 0.7 "available" category has been ghosted by Google.

Pointing this out is not aggressive.


this is just false



