Using AES in ECB mode is clearly a bad choice, but honestly it's not that horrible for high entropy data like compressed audio/video. I'm sure someone could prove me wrong one day, but it seems hard to extract any useful patterns out of compressed audio/video. It does check the box of "uses encryption" for regulatory reasons (while missing the intent). It's pretty egregious considering how easy this is to get right.
The 128-bit key is not inherently wrong if they were rotating these out during the stream. That being said, there's no reason not to do it right and use a mode like GCM with a longer key - most hardware supports acceleration for AES-256 these days. It can actually be slower to use a 128-bit key on 64-bit systems.
While I respect the decision not to disclose the waiting room vulnerability, it's pretty obvious what's going on given the context. They probably shouldn't have mentioned where the vulnerability is.
I'm honestly surprised anyone with technical knowledge thought that Zoom was actually doing end-to-end encryption given how the software works. All of the video transcoding/downconversion is clearly happening on the server. Your client is not sending multiple compressed streams for varying connection bandwidths. That's the main reason a lot of people like Zoom - it actually works well with dozens or hundreds of participants.
Is it just me, or is it obvious why they've gone with ECB?
Zoom's design has a single key for everybody and for everything [ in the context of a particular video conference call ] . It's simpler and, to a layman, it sounds secure. [ We arguably contribute to this if we say e.g. "the key" implying it's just one thing when we mean something like a master secret in TLS used to derive lots of actual keys ].
Once you've committed to a single key ECB behaves exactly how you'd want.
You've got some audio, or video, ready to send? Just encrypt it with the key. Receive some encrypted data? Just decrypt it.
What happens if you have some network trouble briefly? Nothing, everybody decrypts whatever does arrive and maybe a few frames are missing.
None of the other modes work at all if you try to use them this way. They all expect you to have thought about the problem, to track a bunch more state, and then to maintain that state despite an unreliable network and other issues.
Unless there is somebody in the room who says we can't do ECB because it's fundamentally not a secure choice, ECB is what you're going to get from this design decision.
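To make the "no state" point concrete, here's a toy sketch (a hypothetical XOR-mask "cipher" stands in for AES, since Python's stdlib has none; the names are made up). The one ECB property that matters here is modeled: each block depends only on the key, so any packet that arrives decrypts on its own, and identical plaintext blocks encrypt identically.

```python
import hashlib

BLOCK = 16

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    # Toy stand-in for AES: XOR the block with a key-derived mask.
    # NOT secure -- it only models the ECB property under discussion:
    # the same (key, plaintext block) always yields the same ciphertext
    # block, with no per-block state to track.
    mask = hashlib.sha256(key).digest()[:BLOCK]
    return bytes(a ^ b for a, b in zip(block, mask))

toy_block_decrypt = toy_block_encrypt  # XOR is its own inverse

def ecb_encrypt(key: bytes, data: bytes) -> bytes:
    assert len(data) % BLOCK == 0
    return b"".join(toy_block_encrypt(key, data[i:i + BLOCK])
                    for i in range(0, len(data), BLOCK))

key = b"conference-key"
packet1 = b"A" * 16 + b"B" * 16
packet2 = b"B" * 16 + b"C" * 16   # decryptable even if packet1 is lost

ct1, ct2 = ecb_encrypt(key, packet1), ecb_encrypt(key, packet2)

# The downside: the repeated b"B"*16 block encrypts identically in both
# packets, so repetition in the plaintext is visible in the ciphertext.
assert ct1[16:32] == ct2[0:16]
```

That statelessness is exactly why ECB "just works" over a lossy network, and exactly why it leaks.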
And I've been in rooms like that as the only voice, or at least as the only person who spoke up. I've been in rooms where I was part of a chorus too, but as organisations get bigger and "security is everyone's problem" becomes a phrase people learn but don't act on, it gets lonelier.
Also, I actually can't even work out what a "correct" key rotation strategy could be for ECB with a variable number of parties all encrypting stuff at once. As a result it seems unlikely that Zoom did figure out such a strategy and then correctly implemented it. Instead it seems safe to assume there is no key rotation, everything sent by every participant for the life of the stream is encrypted with the same key, even though that's a terrible idea.
> Using AES in ECB mode is clearly a bad choice, but honestly it's not that horrible for high entropy data like compressed audio/video. I'm sure someone could prove me wrong one day, but it seems hard to extract any useful patterns out of compressed audio/video.
...you're joking, right? The Wikipedia example for why ECB is not recommended is literally an image: the ECB-encrypted Tux penguin.
It's definitely a terrible choice for uncompressed images or video. I'm arguing it probably isn't that bad for highly compressed video. That being said, if you're encrypting any data stream you should use an appropriate stream cipher.
You're forgetting the technical intricacies of compressed video. Compressed video is a mix of high- and low-entropy content, with a predictable time pattern to it. For example, one can easily use traffic analysis to find B-frames, and run analysis on those. Bam: very low entropy, due to the stationary nature of a video conference.
Yeah, the B-frames were my thought too, but ordinary sensor noise would hopefully make the individual frames different enough. If you’re doing green-screen background switching, or transmitting a static image then it’s definitely going to be a problem.
But given all the other security issues that seem to hover around Zoom like a cloud of angry bees, this is probably all moot - if you really want to crack a video stream then there are probably easier ways.
Zoom does screen sharing, right? Surely it's not transmitted uncompressed, but it is stationary for a long time and perhaps only small parts changing when they do (eg, switching slides). Is there an ECB-based weakness here?
Those unchanged pixels won't be transmitted. Only the changed part of the screen needs to be sent to the client. For example, RFB [RFC 6143] would send a 16-byte header with the size and position of a rectangular area of the screen, followed by the pixels in that rectangle. Or multiple rectangles can be sent in one update message. But in the case of text being typed, there will be a single rectangle per keystroke (or per few keystrokes).
Now I wonder if a sequence of these rectangles, all the same size and in roughly the same area of the screen, would lend to some sort of statistical analysis. At the very least the timestamps of the updates would tell you how fast the operator is typing.
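For illustration, a single-rectangle update along the lines of RFC 6143's FramebufferUpdate can be sketched like this (field layout from the RFC; the helper name is made up). The 4-byte message header plus the 12-byte rectangle header is where the 16 bytes come from:

```python
import struct

def single_rect_update(x: int, y: int, w: int, h: int, encoding: int = 0) -> bytes:
    # RFC 6143 sec. 7.6.1: message-type u8 (0), 1 byte padding,
    # number-of-rectangles u16, then per rectangle: x, y, width, height
    # as u16 and a s32 encoding type. One rectangle => 16 bytes of header.
    return struct.pack(">BxHHHHHi", 0, 1, x, y, w, h, encoding)

hdr = single_rect_update(400, 300, 9, 14)  # e.g. one typed glyph's bounding box
assert len(hdr) == 16
```

Those headers have very guessable contents (small integers, repeated positions), which is exactly the kind of structure you don't want sitting in predictable block positions under ECB.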
Compression already makes a compressed file roughly indistinguishable from random noise (modulo access to the decompressor). So the patterns have been removed.
That doesn’t make this good, but it means that one specific example isn’t immediately applicable.
There's more in the stream than just compressed data. There'll be metadata info that you can make reasonable guesses about. ECB mode lets you take that information and apply it to other blocks in the ciphertext.
This thread is an excellent illustration of why you don't want your encryption implemented by merely good coders. You need people who know what they are doing.
I’m literally one of those people who “knows what they’re doing”. This is the problem with discussing ECB on an online forum. There’s no space to have a nuanced discussion without people cargo culting “ECB bad” over every comment.
Yes, ECB is almost always the wrong choice. Yes, there are other ways it’s going to fail in this use case. Yes, compression before encryption itself often enables other attacks. No, I should not have to prefix a comment about ECB with this type of disclaimer when I’m making (what should be) an uncontroversial statement that the tux attack doesn’t directly apply to compressed image data.
Ironically, when designing a protocol for my company, one of the reasons we didn’t use ECB when it would have been entirely justified (each chunk of data was precisely one block in size and keys were only ever used once) was because of potential backlash from people who only know “ECB bad” and nothing more.
I’m not arguing that. I’m saying that for compressed data, underlying patterns in data aren’t trivially exposed by ECB. Ergo, the “tux” attack on bitmap image files doesn’t really apply here. I meant nothing more and nothing less than that.
And I'm saying that they aren't just sending compressed data, nor would virtually any practical communication application, which makes the "well, maybe they have enough entropy that it doesn't matter?" argument moot.
It's harder to extract patterns from high entropy data. I don't think anyone's saying that this is even an OK thing to rely on, at all, just that the nature of the data means that this specific weakness is likely more difficult to take advantage of.
If zoom were transmitting text this would be relatively more serious.
What about the chat system? I doubt they're intentionally compressing the text there in order to increase the entropy. I guess they could be using gzip or whatever, but we'd need to look at how the protocol works in more detail. Or do they use a different system for the chat protocol altogether?
AES-ECB isn't necessarily insecure, it's just very easy to misuse (and I agree what the article described is a misuse). I think the argument is that if there are patterns in the input data, the same patterns will show up in the AES-ECB encrypted data, just with different values. Compressed data should be high entropy and hard to predict, so there really shouldn't be structure or patterns to the input data. There's no guarantee that any given compression algorithm provides sufficient randomness, though.
No, their argument is that if you have a higher entropy per byte, there will be more variation in the aligned 16-byte chunks that are relevant for attacking AES-128-ECB. This reduces the probability of the attacker being able to find equal blocks.
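That effect is easy to see with a toy measurement (random bytes stand in for ideally compressed data; the function name is made up):

```python
import os
from collections import Counter

def dup_block_rate(data: bytes, block: int = 16) -> float:
    # Fraction of aligned 16-byte blocks that occur more than once in the
    # data -- every such repeat is directly visible through ECB.
    blocks = [data[i:i + block] for i in range(0, len(data) - block + 1, block)]
    counts = Counter(blocks)
    return sum(c for c in counts.values() if c > 1) / len(blocks)

n = 1 << 16
low_entropy = (b"\x00" * 64 + b"\xff" * 64) * (n // 128)  # flat, image-like data
high_entropy = os.urandom(n)  # stand-in for well-compressed data

assert dup_block_rate(low_entropy) == 1.0   # every aligned block repeats
assert dup_block_rate(high_entropy) == 0.0  # collisions vanishingly unlikely
```

With 4096 random 16-byte blocks, the birthday probability of even one repeat is around 2^-104, which is the sense in which high entropy per byte blunts (but does not fix) ECB.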
And my argument is that if I know the video encoding and compression scheme, I wouldn't depend on AES-ECB, because I know the patterns that show up.
If I am encrypting something, I only want to depend on the strength of the encryption. I don't want to hope that something else ensures that an adversary cannot figure out my ciphertext. That is a very bad idea.
Sure, and nobody was arguing they should have used ECB or that they shouldn’t change it. Only that the ability to exploit this given compressed data is lower than the uncompressed penguin image example.
No comment on the original claim, but that example is encryption applied to an uncompressed image. (Adjacent identical pixels are not typically represented individually when compressed, and thus encryption could not cause the banding patterns seen in those regions of the image if it were compressed prior to encryption.)
The point is that any pattern in the plaintext data shows up in the encrypted data if you use AES-ECB.
Compression does not introduce entropy to a stream. So saying the stream is compressed and calling it good is a very bad idea. Please refer to Shannon's source coding theorem. If anything, compression reduces the entropy of the information.
I think you may want to look closer at Shannon's source coding theorem: the Shannon entropy per symbol of the output of a compression algorithm will be higher than that of the source, as identifiable patterns are eliminated. Otherwise the theorem would trivially contradict itself.
Shannon's source coding theorem says that the entropy of compressed information is at most the entropy of the uncompressed information. If you add entropy to a compressed stream, you are by definition adding noise to the signal.
We’re not talking about a noisy channel here, so I’m not sure where you’re getting the SNR from. I think we’re talking about entropy of different distributions here so let’s cut to a concrete example relevant to your original claim (that compression doesn’t help reduce the impact of repeated blocks in ECB by reducing the rate of repeated blocks).
Suppose we have some string of bytes. When we split it into aligned 16-byte blocks (let's assume it divides evenly for simplicity), we find that these blocks are not evenly distributed. For example, 1% of the blocks turn out to be the same, which given the number of possible symbols in this code is massively out of proportion.
We apply a Huffman code using the 16-byte blocks present in the message as the alphabet and their observed statistics for this particular message (if that aspect bothers you, you can assume we prepend the dictionary to the message). Huffman codes are optimal for per-symbol encoding.
Suppose we re-evaluate the distribution of 16-byte blocks in the compressed data; will this distribution have higher entropy (meaning there will be fewer duplicate blocks to exploit ECB with) or not?
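You can also see the direction of the effect without the hypothetical Huffman code; here's a quick sketch with zlib standing in for the compressor, counting how many aligned 16-byte blocks repeat before and after (helper name made up):

```python
import zlib

def dup_aligned_blocks(data: bytes, block: int = 16) -> int:
    # Number of aligned 16-byte blocks that are duplicates of an
    # earlier block -- the raw material an ECB attacker works with.
    usable = len(data) // block * block
    blocks = [data[i:i + block] for i in range(0, usable, block)]
    return len(blocks) - len(set(blocks))

# Highly repetitive plaintext: masses of identical aligned blocks.
plain = b"hello world, hello world! " * 1000
packed = zlib.compress(plain)

assert dup_aligned_blocks(plain) > 1000
# Compression squeezes the redundancy out, so far fewer aligned blocks
# repeat in the compressed stream than in the plaintext.
assert dup_aligned_blocks(packed) < dup_aligned_blocks(plain)
```

Which is the whole "compressed data blunts ECB" claim in miniature: the duplicates don't have to reach zero for the repeat rate to collapse.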
> The point is that any pattern in the plaintext data shows up in encrypted data if you use AES-ECB.
No, that's false. ECB reveals repeating plaintext blocks only when the repeats land on block boundaries. "X0123456789ABCDEF0123456789ABCDEF" contains a repeating block-length sequence, but would encrypt to three distinct blocks under ECB (with padding), because the repeats are not aligned to a block boundary.
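The alignment point is easy to check without any actual encryption, just by looking at aligned blocks (toy sketch; helper name made up):

```python
def aligned_blocks(data: bytes, block: int = 16) -> list:
    # Split into the aligned 16-byte blocks that ECB would encrypt,
    # dropping any trailing partial block.
    usable = len(data) // block * block
    return [data[i:i + block] for i in range(0, usable, block)]

pattern = b"0123456789ABCDEF"       # a 16-byte repeating sequence
aligned = pattern * 2               # repeat starts on a block boundary
shifted = b"x" + pattern * 2        # same repeat, shifted by one byte

# Aligned repetition => identical blocks, which ECB exposes directly...
assert aligned_blocks(aligned)[0] == aligned_blocks(aligned)[1]

# ...but the one-byte shift makes every aligned block distinct, so the
# ECB ciphertext would show no repeats despite the repeating plaintext.
assert len(set(aligned_blocks(shifted))) == len(aligned_blocks(shifted))
```

So ECB leaks equality of aligned blocks, not "any pattern" in the plaintext.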
Untrue. Many performance trade-offs have to be made and the entropy has to vary drastically with time. See for example B-Frames vs I-frames in compressed video. Couple that with the very low entropy video conference data and bam.
Even in uncompressed video it will be hard to see that "penguin image" effect, because the pixels that make up each block will be constantly changing in a random way; unlike in that synthetically generated image, it's highly unlikely for a block to be exactly the same as any other one in a given frame.
You greatly overestimate the image quality of crappy videoconferencing streamed video and the amount of pixel-wise sensor noise left after noise reduction (pretty low, actually), while underestimating the ingenuity of cryptanalysts and the power of having a lot of data. Seriously, the only way the shitty 1mm-or-less sensors on webcams can deliver HD video is through an abject amount of noise reduction, sharpening and filtering, all of which greatly reduce entropy.
Hint: you don't need to know the plaintext exactly. You just need to be able to build a reasonably precise probability distribution.
I'm actually not sure - that's a good point. The whole point of compressing audio for video conferencing is to preserve human speech, so things that produce radically different waveforms but "sound the same" to us might show up as patterns. I guess it's better to avoid the question entirely and use an appropriate stream cipher!
There are lots of embedded processors with hardware support for AES-128 only. I have to fight to keep AES-256 out of the ciphersuite list because of the performance regression. The rest of the world will probably force the issue eventually but the saving grace is that 3DES is still considered secure.
> the saving grace is that 3DES is still considered secure.
Nobody who wants to do AES-256 rather than AES-128 thinks 3DES is "still secure". 3DES is perhaps 112 bits of useful keyspace, but it has 64-bit blocks, which were already bad news when DES was invented.
TLS 1.3 doesn't have a 3DES option at all. You can do AES 128 or AES 256 (or ChaCha20).
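On the 64-bit block point: by the birthday bound, ciphertext block collisions become likely after roughly 2^32 blocks under a single key, and each CBC collision leaks the XOR of two plaintext blocks. That's the basis of the Sweet32 attack on 3DES. A back-of-the-envelope sketch:

```python
# Birthday bound for a 64-bit block cipher (e.g. 3DES): collisions among
# ciphertext blocks become likely around 2**(block_bits / 2) blocks
# encrypted under one key.
block_bits = 64
collision_blocks = 2 ** (block_bits // 2)   # ~4.3 billion blocks
data_bytes = collision_blocks * 8           # 3DES blocks are 8 bytes

# That's only ~32 GiB of traffic on one key -- entirely reachable for a
# long-lived connection, which is why Sweet32 is practical.
assert data_bytes == 32 * 2**30
```

For a 128-bit block cipher like AES the same bound sits at ~2^64 blocks, which is why block size matters independently of key size.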