But now you're comparing median video bitrate with peak audio bitrate.
YouTube uses variable bitrate for audio, so the audio bitrate can vary dramatically from moment to moment. Your example of podcasts or "talking heads" is actually perfect. Most encoders are extremely efficient at compressing voices, since intelligible speech fits in a narrow band (roughly 300 Hz to 3.4 kHz, the classic telephone band), and voices have less data variation than images.
Image encoding is just very complex. It'll get better and better, but audio encoders of the same generation will also improve.
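
To make the median-versus-peak point concrete, here's a rough sketch. The per-second bitrates are made-up numbers for a speech-heavy VBR audio stream, not measurements from YouTube, and `summarize` is just a helper name I picked for the illustration.

```python
from statistics import median

# Hypothetical per-second bitrates (kbit/s) for a VBR audio stream:
# mostly quiet speech, with two short bursts (e.g. a music sting).
audio_kbps = [48, 52, 50, 47, 128, 49, 51, 126, 50, 48]


def summarize(samples):
    """Return (median, peak) for a list of bitrate samples."""
    return median(samples), max(samples)


audio_median, audio_peak = summarize(audio_kbps)
print(f"audio median: {audio_median} kbit/s, peak: {audio_peak} kbit/s")
```

With numbers shaped like these, the peak is more than double the median, so putting a peak audio figure next to a median video figure overstates audio's share. Compare median with median, or peak with peak.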