I've heard images are better modeled in DCT space (which isn't based on complex numbers) because the DCT achieves better energy compaction than the FFT, and because it doesn't assume the image is periodic. Some people also argue that the FFT is insufficient even for audio, because it doesn't model time-domain hearing perception. Others say wavelets model images better than purely frequency-domain transforms, because they take spatiality more into account. From what I've heard, wavelets work well for modeling human vision (in fact, convolutional neural network input kernels tend to converge to Gabor filters, though I don't know how those differ from Gabor wavelets) and for noise reduction, but have fallen flat in image/video compression codec design.
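To make the energy-compaction point concrete, here's a small sketch (my own toy example, not from any codec) comparing how much signal energy the top few DCT vs. FFT coefficients capture for a smooth but non-periodic signal. A linear ramp is a good test case, because the FFT's implicit periodic extension gives it a jump discontinuity at the boundary, while the DCT's implicit even-symmetric extension stays continuous:

```python
import numpy as np
from scipy.fft import dct, fft

# A smooth, non-periodic signal: a linear ramp. The FFT implicitly
# treats it as periodic, so the mismatched endpoints act like a jump
# discontinuity and smear energy across many frequencies.
n = 64
x = np.linspace(0.0, 1.0, n)

# Energy per coefficient under each (orthonormal) transform.
dct_energy = np.abs(dct(x, norm='ortho')) ** 2
fft_energy = np.abs(fft(x, norm='ortho')) ** 2

def compaction(energy, k):
    """Fraction of total energy captured by the k largest coefficients."""
    e = np.sort(energy)[::-1]
    return e[:k].sum() / e.sum()

# With orthonormal transforms, total energy equals the signal's
# energy (Parseval), so these fractions are directly comparable.
print(compaction(dct_energy, 4))  # very close to 1.0
print(compaction(fft_energy, 4))  # noticeably lower
```

The DCT coefficients of the ramp decay much faster than the FFT's, so a handful of them reconstruct the signal almost exactly, which is the whole game in transform coding.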
All excellent points, and I think you should DM me on Twitter to chat about this more. (I hope you will!)
DCT is on my radar. But it has several serious limitations that I think are overlooked. For example, convolution is no longer a simple component-wise multiplication. That seems, to me, like a big deal.
In other words, you're probably right, but I'm focusing solely on FFTs on the (very low) chance that people have overlooked something there that will work well.
Sorry, I don't work on neural networks much, and my plate is too full with other projects (and my DSP is a bit rusty) to hold a conversation on this right now. And I don't use Twitter much either.