How Reliable Are Modern CPUs? (theregister.com) 64
Slashdot reader ochinko (user #19,311) shares The Register's report about a recent presentation by Google engineer Peter Hochschild. His team discovered machines with higher-than-expected hardware errors that "showed themselves sporadically, long after installation, and on specific, individual CPU cores rather than entire chips or a family of parts."
The Google researchers examining these silent corrupt execution errors (CEEs) concluded "mercurial cores" were to blame CPUs that miscalculated occasionally, under different circumstances, in a way that defied prediction...The errors were not the result of chip architecture design missteps, and they're not detected during manufacturing tests. Rather, Google engineers theorize, the errors have arisen because we've pushed semiconductor manufacturing to a point where failures have become more frequent and we lack the tools to identify them in advance.
In a paper titled "Cores that don't count" [PDF], Hochschild and colleagues Paul Turner, Jeffrey Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David Culler, and Amin Vahdat cite several plausible reasons why the unreliability of computer cores is only now receiving attention, including larger server fleets that make rare problems more visible, increased attention to overall reliability, and software development improvements that reduce the rate of software bugs. "But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design," the researchers state, noting that existing verification methods are ill-suited for spotting flaws that occur sporadically or as a result of physical deterioration after deployment.
Facebook has noticed the errors, too. In February, the social ad biz published a related paper, "Silent Data Corruption at Scale," that states, "Silent data corruptions are becoming a more common phenomena in data centers than previously observed...."
The risks posed by misbehaving cores include not only crashes, which the existing fail-stop model for error handling can accommodate, but also incorrect calculations and data loss, which may go unnoticed and pose a particular risk at scale. Hochschild recounted an instance where Google's errant hardware conducted what might be described as an auto-erratic ransomware attack. "One of our mercurial cores corrupted encryption," he explained. "It did it in such a way that only it could decrypt what it had wrongly encrypted."
How common is the problem? The Register notes that Google's researchers shared a ballpark figure "on the order of a few mercurial cores per several thousand machines similar to the rate reported by Facebook."
In a paper titled "Cores that don't count" [PDF], Hochschild and colleagues Paul Turner, Jeffrey Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David Culler, and Amin Vahdat cite several plausible reasons why the unreliability of computer cores is only now receiving attention, including larger server fleets that make rare problems more visible, increased attention to overall reliability, and software development improvements that reduce the rate of software bugs. "But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design," the researchers state, noting that existing verification methods are ill-suited for spotting flaws that occur sporadically or as a result of physical deterioration after deployment.
Facebook has noticed the errors, too. In February, the social ad biz published a related paper, "Silent Data Corruption at Scale," that states, "Silent data corruptions are becoming a more common phenomena in data centers than previously observed...."
The risks posed by misbehaving cores include not only crashes, which the existing fail-stop model for error handling can accommodate, but also incorrect calculations and data loss, which may go unnoticed and pose a particular risk at scale. Hochschild recounted an instance where Google's errant hardware conducted what might be described as an auto-erratic ransomware attack. "One of our mercurial cores corrupted encryption," he explained. "It did it in such a way that only it could decrypt what it had wrongly encrypted."
How common is the problem? The Register notes that Google's researchers shared a ballpark figure "on the order of a few mercurial cores per several thousand machines similar to the rate reported by Facebook."