
What does "a single bad CPU instruction" mean here? That the CPU was faulty and reliably miscalculated AVX instructions? I don't understand.


Yes. On that particular CPU, an AVX instruction is broken on one core (and on its sibling hyperthread). It doesn't do what it is supposed to do.

Recently, again at Facebook, we found a few machines with CPUs where the ADCX instruction is broken on at least one core. This is especially fatal because OpenSSL and Fizz (our TLS 1.3 implementation) use it in the RSA implementation on the AMD64 architecture.


What kind of burn-in process is used to tease these out before deploying? Curious


Leaving out the specifics, we deploy new machines in batches of tens of thousands at a time, and they receive practically no production traffic for weeks while repeatedly running burn-in tests followed by baseline daemons. If they survive, they get provisioned for production.

The infant mortality rate of CPUs, especially single-socket designs, is very low compared to DIMMs and SSDs. CPUs tend to develop these issues later in their economic lives. Such failures are also rare compared to failures of other modules.
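
To make the "one bad core" idea concrete, here is a minimal sketch of a per-core consistency check. It is not the burn-in suite described above, whose details are not given. It assumes Linux (os.sched_setaffinity), and all function names are hypothetical. It pins the process to each core in turn, runs the same deterministic big-integer workload, and flags any core whose result differs from the majority. Python's integer arithmetic is not guaranteed to execute ADCX or AVX, so a real test would need hand-written code that exercises the specific instruction.

```python
import hashlib
import os
from collections import Counter


def workload_digest(rounds: int = 500) -> str:
    """Run a deterministic modular-exponentiation loop and hash the results."""
    h = hashlib.sha256()
    modulus = (1 << 521) - 1  # fixed modulus chosen for illustration only
    x = 0x1234567890ABCDEF
    for i in range(rounds):
        x = pow(x + i, 65537, modulus)
        h.update(x.to_bytes(66, "big"))  # 521 bits fit in 66 bytes
    return h.hexdigest()


def digest_per_core(rounds: int = 500) -> dict:
    """Return {cpu_id: digest}, running the workload pinned to each core."""
    if not hasattr(os, "sched_setaffinity"):
        # Assumption: without CPU pinning we can only run once, unpinned.
        return {-1: workload_digest(rounds)}
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # pin this process to one core
            results[cpu] = workload_digest(rounds)
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    return results


def suspect_cores(results: dict) -> list:
    """Cores whose digest disagrees with the most common digest."""
    majority, _ = Counter(results.values()).most_common(1)[0]
    return sorted(cpu for cpu, d in results.items() if d != majority)


if __name__ == "__main__":
    print("suspect cores:", suspect_cores(digest_per_core()))
```

Comparing against the majority rather than a value computed locally matters here: a reference computed on the faulty core would itself be wrong. An intermittent fault may also need many repetitions before it shows up.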



