Serious Computer Glitches Can Be Caused By Cosmic Rays (computerworld.com) 264
The Los Alamos National Lab wrote in 2012 that "For over 20 years the military, the commercial aerospace industry, and the computer industry have known that high-energy neutrons streaming through our atmosphere can cause computer errors." Now an anonymous reader quotes Computerworld:
When your computer crashes or phone freezes, don't be so quick to blame the manufacturer. Cosmic rays -- or rather the electrically charged particles they generate -- may be your real foe. While harmless to living organisms, a small number of these particles have enough energy to interfere with the operation of the microelectronic circuitry in our personal devices... particles alter an individual bit of data stored in a chip's memory. Consequences can be as trivial as altering a single pixel in a photograph or as serious as bringing down a passenger jet.
A "single-event upset" was also blamed for an electronic voting error in Schaerbeek, Belgium, back in 2003. A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible. "This is a really big problem, but it is mostly invisible to the public," said Bharat Bhuva. Bhuva is a member of Vanderbilt University's Radiation Effects Research Group, established in 1987 to study the effects of radiation on electronic systems.
Cisco has been researching cosmic radiation since 2001, and in September briefly cited cosmic rays as a possible explanation for partial data losses that customers were experiencing with their ASR 9000 routers.
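For anyone wondering why the Belgian error was exactly 4,096 votes: 4,096 is 2^12, so inverting bit 12 of a binary counter changes it by exactly that amount. A minimal sketch follows. The tally of 514 and the helper name `flip_bit` are made up for illustration, since the article gives neither the real count nor the counter layout.

```python
# Illustrative only: the real tally and counter layout are not given in the article.
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit position inverted, as a single-event upset would."""
    return value ^ (1 << bit)

true_votes = 514                      # hypothetical tally; bit 12 is clear
corrupted = flip_bit(true_votes, 12)  # the upset sets bit 12
print(corrupted - true_votes)         # 4096
```

If the bit had already been set, the same flip would have subtracted 4,096 instead, which may well have gone unnoticed.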
This is news...? (Score:5, Funny)
Re: (Score:2)
gamma radiation
I also believed that cosmic ray trouble was about gamma radiation, but TFA says it is all about neutron radiation.
Re: (Score:2)
Both are right. Protons are deflected in the atmosphere and neutrons have no charge and no (almost no) magnetic moment so don't interact with things unless they hit a nucleus head-on (elastic interaction).
That being said, if a high energy neutron hits something the energy may be sufficient to create other particles (like protons and gamma rays). So in a way both theories in this thread are correct (protons produced in lower atmosphere from neutrons from space).
If you have the equipment, cosmic ray bit flip
Blame Canada (Score:2)
Yep, it's good news. Very useful.
Dumb user error can be blamed on IT problems
IT problems can be blamed on computer glitches
Computer glitches can be blamed on cosmic rays
As a result, dumb user errors can and shall always be blamed on cosmic rays
Re:This is news...? (Score:5, Funny)
A bit flip in the electronic voting machine added 4,096 extra votes to one candidate. The issue was noticed only because the machine gave the candidate more votes than were possible.
How could they tell this apart from standard operations on a Diebold machine?
Re: (Score:2)
In 2017 anything goes for news. I expect the Los Alamos National Lab is worried about its funding, so it will repurpose one of its old hypotheses and try to get it on Fox News so the president sees it and decides to keep its funding. These organizations, if smart, realize how manipulable the president is, and that just a few simple things can cause him to change his mind and course. Just as long as you stroke his ego you can do whatever you want.
I am sorry I didn't want to make this political, but we had a proble
Re:This is news...? (Score:5, Funny)
Whenever a user calls up to ask why his computer rebooted after I install an update, I say... drumroll, please... gamma radiation.
Computers and Incredible Hulks don't interface well together, but a Ctrl-Alt-SMASH sequence? I'd buy that.
Re: (Score:2, Interesting)
I really would like to visit your house. After I eat a lot of fiber, maybe some bran muffins or flax seeds. That way I can take a great big SHIT and put a very large, moist turd in your microwave. You will just LOVE what happens when it's in there on high for about ten minutes!
You need to have a better diet. Healthy shit has the consistency of toothpaste.
It makes for an interesting conversation piece.
I had a roommate who left a squash inside a toaster oven on low heat overnight. The squash was carbonized all the way through. Charred on the outside, charred on the inside. Now that was a conversation piece.
Re: (Score:2)
Nah do an upper decker instead.
Re: (Score:2)
I had to google the term after watching an episode of Archer - it was the first time I'd ever heard it. I have literally never heard of anyone (not even about a friend of a friend of a friend...) doing something like that, and I've heard some pretty fucked stuff from my colleagues.
Is this an actual thing in the US?
Re: (Score:2)
Fibers are for data communication, you shouldn't eat them.
Re: This is news...? (Score:2)
Lighten up, Francis.
That is why Excel crashes all the time on OSX (Score:3, Insightful)
Re: (Score:2)
Oh well, guess I need better shielding.
ECC (Score:5, Insightful)
This is why ECC is used to protect memory and data busses. At least on the good stuff :-) . One of the issues is die shrink. As the minimum detail size of the IC process gets smaller, the potential for radiation to flip a bit gets higher.
Silicon-on-sapphire is the main way to implement silicon-on-insulator, which is more resistant to radiation bit flips and less likely to latch up. But since these have historically been required only for space satellites, they have been horribly expensive. Imagine running an entire IC fabrication just to make a few chips. As there are more applications for rad-hard chips, the price could fall.
Re: (Score:2)
Re: ECC (Score:3)
We are already there:
http://www.pcworld.com/article... [pcworld.com]
http://arstechnica.com/gadgets... [arstechnica.com]
As the IBM article states, they are working with Samsung and Global Foundries, while the other article is about Intel. That is 3 of the major chip fab companies stating they are moving to silicon-germanium hybrid crystal over pure silicon for exactly this reason. Also, the fabs on a new process node take time to set up, and they need to be ready before circuit design comes in to fab prototype batches, so they are usually a coupl
Re: (Score:2)
Re: (Score:2)
I suspect the math works out the same as Shannon's noisy channel theorem [wikipedia.org]. And that as the chance of bit flips (noise) increases due to die shrinking, you can increase the error correction coding to compensate for it up to some theoretical limit.
e.g., instead of ECC memory having one parity bit for every 8 data bits, you increase it to two parity bits per 8 data bits, and it can withstand a h
Re: (Score:2)
Now if only companies like Intel would actually provide
Yes, if only [wikimedia.org] ...
Re: (Score:2)
Costs money.
It takes up more power, more die space, you need more RAM chips, etc.
Re: (Score:2)
Re: (Score:2)
The situation for AMSAT is still pretty bad, as far as I've heard. As a radio amateur group (and one that has launched quite a few satellites as space hitch-hikers) they can't afford the good stuff, but they get some donated by NASA and some of the commercial satellite companies. Only a few years ago they were still using the 1802 as their main vehicle controller, as that was their main choice in silicon-on-sapphire CPUs. They get some donations of space-qualified solar cells. They scrub their memory continuously. They use no boot ROMS. The program is loaded entirely by hardware, and then the CPU is started.
Bruce, what do you mean by "...no boot ROMS.... loaded entirely by hardware" ?
preposterous! (Score:5, Informative)
When your computer crashes or phone freezes, don't be so quick to blame the manufacturer.
If my computer crashes or phone freezes, it's almost certainly the fault of the person who released the software without properly debugging it. Cosmic rays are very low on the list of reasons why your device has malfunctioned.
Re: (Score:2)
You are right in the lottery sense : if your particular phone or app crashes, it is very unlikely that it is due to cosmic rays. However, it might be likely that it happens fairly often around the world. This is similar to the lottery : it is unlikely that you will win, but it is likely that someone will win.
It's all a matter of the cross-section of the devices, actually. If we want to compare, the iPhone 4 (an old baseline, smaller than today's generation but close to most of the low-cost devices) measures 0.00
Re: (Score:2)
Low on the list, but certainly not zero. Given the increasing number of devices out there it's probably happening around the world with some regularity. There just isn't a way for most of us to properly measure or attribute the occurrences.
Say you're driving down the interstate and your cruise control shuts off, but you're sure you didn't bump the brake. Your $1.49 bag of chips rings up as $9.49 at the grocery store, but re-scans at the correct price after a void. A few pixels go blurry in an otherwise f
Re: (Score:2)
Re:preposterous! (Score:5, Funny)
Re: (Score:2)
Someday I will be able to completely debug a piece of software. It will be a very small piece of software, I am sure.
People discount the complexity that we face when attempting to fully debug anything.
Re: (Score:2)
NOT bringing down a passenger jet (Score:3)
Follow through the links: a cosmic ray caused problems, the jets misbehaved for a bit but the duplicated systems protected them from a crash - as they are supposed to after a malfunction.
Hmm. (Score:2)
Shouldn't "News for Nerds" be news to nerds?
Stock markets and the BOFH (Score:2)
Even though market participants are warned about this by exchanges, you do have to wonder, if it makes it into the BOFH excuse calendar, can you really take it seriously?
Re: (Score:2)
Oh, that and solar flares
Goddamit where was this ... (Score:2)
... during my IT career?
I could have used this as a dodge after I fucked something up in the system.
I did the sunspot thing back in 2012.
"Russia," seems to work well, though.
Yep. Cosmic rays. (Score:2)
I'm certain it's on the list [wisc.edu] somewhere.
Yep, Cosmic rays CAN cause problems (Score:2)
But much more frequently, problems are caused by somebody f**king something up. You shouldn't be looking to cosmic rays until you're pretty sure it's not just stupidity in action.
bullshit (Score:3)
Re: (Score:2)
No it isn't. Cosmic rays most definitely have an impact on your phone.
You can take basic precautions though. I find new phones come with small amount of EM shielding that blocks cosmic rays. As time progresses this shield gets weak and more and more CPU power is dedicated to it operating properly which also slows down your phone. However it is often fixed by performing a factory reset (which also resets and recalibrates the EM shielding) making your phone fast and cosmic ray resistant again.
Of course if you
Thousands of years, same surprises (Score:3)
Is anyone surprised that if you store things once, and reference the one place alone, that you get screwed on occasion?
Is the word "corroboration" new? How about "validation", "authentication", "verification", and, oh, I don't know, "paper-trail"?
It's electronic information, not magic. The benefit of not carving into stone is that you can readily duplicate information into multiple places. Use it.
RAID.
Serious Computer Glitches Can Be Caused By IDIOTS (Score:2)
Yes, I know it's common, I use some software (from a very large company that was run by a guy you don't go hunting with) that when it hits some input data with a negative integer IT ATTEMPTS TO ALLOCATE NEGATIVE MEMORY, and of course, crashes - but things that stupid should never happen (especially since it's supposed to deal with very noisy data). If it's out of range for a bit of code to work on then don't le
We've always known this. This is why we have ECC (Score:3)
We've always known this. This is why we have ECC memory on servers.
Re:We've always known this. This is why we have EC (Score:4, Informative)
It's also why systems on spacecraft such as the Space Shuttle had what's called the Data Processing System. It consisted of four systems with identical software and an extra one with the same hardware but a different implementation with the same goals. They checked each others' decisions, and a majority "vote" would lock out the differing system.
there is a product for this! (Score:2)
ZFS (Score:2)
Isn't this why ZFS exists?
We've known this since the 1980s... (Score:2)
We've known this since the 1980s...and the more dense/smaller the transistors get the greater the likelihood of it happening.
This is news, but it's literally from the previous century.
Re: (Score:3)
Re:@Intel: Why no ECC for consumer-grade processor (Score:5, Informative)
Actually, wouldn't cosmic rays be capable of flipping bits even in ECC memory and processors, thereby making the whole ECC thing useless?
No, this is what ECC is for. If a bit is flipped, you can detect it. If you have enough parity bits, you can even detect which bit is flipped, and correct it on the fly. Computation occurs as normal and an error shows up in the syslog.
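To make the "detect which bit is flipped, and correct it" part concrete, here is a toy Hamming(7,4) code in Python. This is a sketch of the principle, not what a DIMM does: real ECC memory typically protects 64-bit words with 8 check bits and can also detect (but not correct) double-bit errors. The function names are invented for this example.

```python
from typing import List, Tuple

def hamming74_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]

def hamming74_correct(bits: List[int]) -> Tuple[int, int]:
    """Return (data nibble, syndrome). A nonzero syndrome is the position that was repaired."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1              # flip the faulty bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome

word = hamming74_encode(0b1011)
word[4] ^= 1                             # simulate a cosmic-ray hit on position 5
print(hamming74_correct(word))           # (11, 5)
```

The syndrome is what a real memory controller would log: zero means the word was clean, nonzero tells you which bit it just fixed.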
Re: (Score:2)
Are people really less knowledgeable about computers now than they were in the 80's?
Re: (Score:2)
Are people really less knowledgeable about computers now than they were in the 80's?
If you mean on average, I think the answer is probably yes. More people know how to operate them now, but then, operating them has become orders of magnitude simpler.
Re: (Score:3)
Are people really less knowledgeable about computers now than they were in the 80's?
If you mean on average, I think the answer is probably yes.
If you mean on average out of the total number of computer users or programmers, then yes (they are less knowledgable), because that pool has increased by lots and lots.
If you mean on average out of all people, then no. I suspect there are far more people that know what ECC does now than did in the 80's, and the total population count hasn't gone up as much as that number, so there are more people on average, and in total, that know about the inner workings of computers.
I think there are just far more peopl
Re: (Score:2)
It's just shifted more to professional/hobbyist knowledge than something that every operator is required to know.
Isn't that implied by the site we're on?
Re: (Score:3)
Yes, absolutely! Have you never sat down with an IT graduate from the 2000's to figure out what they actually know about computer hardware?
Re: (Score:2)
That's a good argument for Gray code.
I have to take issue with the assumption that nothing clears errors better than a hard reset. There are very many known strategies for dealing with errors on a running system, and a reset only clears persistent and cumulative errors, rather than transient ones. Since we can assume that your computer doesn't keep the same data in memory all of the time, most will be transient.
Re: (Score:3)
There probably are some, making bits for nuclear reactors and industrial, scientific, and medical users of neutron sources; but it's a niche
Re: (Score:2)
If it really is a problem, it could be easily dealt with at very modest cost by using extra memory bits for memory error detection/correction. Although as others point out, modern software is so buggy that it might not be worth the effort to actually improve the hardware a little.
Those who were around 25 years or so ago will remember that the lack of parity/ECC actually can be laid partly on Microsoft. Early PC
Re:Why not blame the manufacturer? (Score:4, Insightful)
There's something you can do about it. It's very easy, but you won't like it.
Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.
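The vote-and-rewrite step described above is easy to sketch in software. The version below votes bit by bit rather than word by word, which is a small variation on the comment: it still recovers the right value if two copies each take a hit on different bits. The names `majority` and `vote_and_repair` are made up for the example, and real triple modular redundancy does this in hardware.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)

def vote_and_repair(copies: list) -> int:
    """Vote across three stored copies, then overwrite any copy that disagrees."""
    winner = majority(copies[0], copies[1], copies[2])
    for i in range(3):
        if copies[i] != winner:
            copies[i] = winner           # rewrite the odd one out
    return winner

stored = [0b1010, 0b1010, 0b1010]
stored[1] ^= 0b0100                      # one copy takes a bit flip
print(vote_and_repair(stored), stored)   # 10 [10, 10, 10]
```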
Re:Why not blame the manufacturer? (Score:5, Interesting)
There's something you can do about it. It's very easy, but you won't like it.
Make every component in triplicate. Everything in the CPU, everything in the RAM, everything in storage, etc. If the three aren't equal, go with the value shared by two of them and rewrite the different one with that value.
Not only is this not actually all that easy (all of your triplicate systems have to be clocked together in sync, you need a shitload of extra hardware to do the comparison, etc.) it's grossly unnecessary. Standard off-the-shelf error detection and correction can (and routinely does) handle radiation induced errors. It just costs a bit more, because it's a business-level feature. It doesn't matter if that MP3 of Taylor Swift gets mildly corrupted (might even sound better that way, zing), but it very much *does* matter if that bank account gets a flipped bit.
Re: (Score:2)
I wonder if this means it's actually cheaper to use 3 separate computers, on cheap off the shelf hardware, than one armored and extra redundant computer. For example, spacecraft guidance or for an autonomous car.
Re: (Score:3)
Re: (Score:2)
So the reason the Curiosity rover uses FPGAs with ternary logic (and just 2 computers if I recall) is to save on weight. If they were going to optimal cost efficiency they'd have redundant computers and do what the FPGAs are doing in firmware.
Re: (Score:2)
In that special case saving on weight is optimal cost efficiency. I doubt that any of the rover components cost as much as their percentage of the total weight divided by the cost to get the rover on Mars.
Re: (Score:2)
Technically correct...the best kind of correct. I was implicitly referring strictly to electronic component cost and development time costs, since there's only going to be a handful of Curiosity rover style projects per decade but there are many thousands of projects to develop safe computerized control systems for cars and robots and everything else.
Re: (Score:2)
it's grossly unnecessary
That depends on the application. I agree unnecessary for a general purpose computer, probably also unnecessary for servers.
However if your electronics store critical financial information or are safety systems in control of hazardous facilities then it becomes a bit of a different story. The comparison model is actually one that is adopted by many safety systems.
Re:Why not blame the manufacturer? (Score:5, Informative)
You know that several FPGA manufacturers offer this. Xilinx offers a method where this is done in software - when you do design synthesis, more than triple the gates are needed for every circuit allocated in the design. (I think it's done at a higher level - truth tables with the triple redundant bits are generated)
Some do it in hardware, so your design synthesis is the same but the actual software programmable subunits use ternary redundancy.
Re:Why not blame the manufacturer? (Score:4, Informative)
Adding one ECC bit per byte, yes. Adding one parity bit, no. ECC != parity.
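To illustrate why ECC != parity: a single parity bit can tell you that an odd number of bits changed, but not which one, and it misses double flips entirely. A small sketch, with function names invented for the example:

```python
def even_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2

def parity_ok(byte: int, parity: int) -> bool:
    """True if the stored parity bit still matches the byte."""
    return even_parity_bit(byte) == parity

original = 0b10110010
p = even_parity_bit(original)

one_flip = original ^ 0b00000100         # single upset: detected, but not locatable
two_flips = original ^ 0b00010100        # double upset: parity still matches

print(parity_ok(one_flip, p))            # False
print(parity_ok(two_flips, p))           # True
```

A correcting code needs several check bits so that the pattern of failed checks points at the faulty position, which is why it costs more than one extra bit.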
Re: (Score:2)
Marginal extra cost? Want to look up the difference in price between an Intel Core i7 Extreme Edition on an X99 board and the equivalent Intel Xeon, where the difference between the processors is the ECC memory controller? There are a few low-end mobile and embedded processors Intel does with ECC, but the majority of their consumer range deliberately does not have it; it is a Xeon "feature", with the price tag that comes with that.
Re:Why not blame the manufacturer? (Score:5, Informative)
Probably b'cos there is nothing that manufacturers can do about cosmic rays
Except that is not true. Electronic devices can be made more resistant to cosmic rays and other radiation. The easiest way to do so is to use depleted boron [wikipedia.org] instead of "normal" boron as a semiconductor dopant. Boron-10 has a very high neutron absorption cross section while Boron-11 has a very small cross section. Use boron that has been "depleted" of the B10 isotope, and you cut way down on your neutron induced SEUs.
Another obvious countermeasure is to use ECC memory, and memory scrubbing [wikipedia.org].
The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone if it meant one fewer reboot every decade or so?
Re:Why not blame the manufacturer? (Score:5, Informative)
Another obvious countermeasure is to use ECC memory ...
The problem is not that there is nothing that manufacturers can do, but that consumers aren't willing to pay the extra cost. Would you be willing to pay an extra $100 for your phone ...
ECC memory is not that much more expensive. It's been a few years since I built the desktop I'm using, but I included 16gb of ECC memory (4x 4gb DDR3 ECC KVR1333D3E9SK2/8G). At the time, I think it was around $60. The equivalent normal memory was only a couple bucks cheaper. If Samsung started using ECC memory in all their phones, the cost would be nearly the same with the volume they would be ordering/making.
FWIW, I did try to do the same comparison just now on newegg and, while it's a bit of a mess, the situation is nearly the same today:
$34 : Kingston 4GB 240-Pin DDR3 SDRAM ECC Unbuffered DDR3 1333 Server Memory Model KVR13LE9S8/4
$52 : Kingston 8GB (2 x 4GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) Memory Model KVR16N11S8K2/8
More expensive? Yes.
$100 more? Nowhere near that much.
Re: (Score:3)
A couple years later, the price of ECC RAM had dropped to only about 50% more than the cost of regular RAM.
Northbridge (and thus memory controller) is in CPU (Score:2)
ECC Memory isn't the only added cost, you also need a motherboard and processor that supports it.
For your information, ever since AMD's Athlon 64, most x86 compatible hardware has had its Northbridge *inside the processor package*.
That means that the memory controller is inside the package of your CPU.
The mother board is basically only traces that connect your CPU and the memory slots directly.
A glorified cable/connector.
(In practice, there is a bit more, regarding powering the RAM slots, etc. but you got the general idea : not much smarts in the motherboard between RAM and CPU.
Smarts is in the "Southb
Re: Why not blame the manufacturer? (Score:5, Informative)
ECC memory doesn't do anything to help when the bits that get flipped are in the CPU. Or anywhere else that isn't a RAM chip.
Except that the RAM has hundreds or thousands of times as many bits as a CPU, and Flash may have millions of times as many, and dynamic RAM has a smaller feature size and is more susceptible to SEUs. So correcting RAM and Flash helps because that is where 99.9% of the problem is.
Even within the CPU, most transistors are used to implement cache, and cache can also be scrubbed (although not with just software).
Re: (Score:2)
I should think that it wouldn't be that hard to add parity to CPU registers, caches, etc
OTOH, I'm sure Intel could find a way to make the implementation obtuse and even further complicate their CPUs. And in any case, it's unclear to me what the device is supposed to do when it finds the number it is working with is wrong.
Re: (Score:2)
Probably b'cos there is nothing that manufacturers can do about cosmic rays, which are beyond even gamma rays in the electromagnetic spectrum in terms of wavelength and frequency.
Not the manufacturer, but the CTOs: just put all data centers into old mines. This could be a great business for rural areas.
Re: (Score:2)
Probably b'cos there is nothing that manufacturers can do about cosmic rays, which are beyond even gamma rays in the electromagnetic spectrum in terms of wavelength and frequency.
In addition to what others have pointed out, stop shrinking dies. The smaller a circuit is, the greater the risk that it will be impacted when hit by a neutron. Components from a decade or two ago are a heck of a lot more resilient against cosmic rays than today's components.
Sure, at lower speeds, but for a great many things you just need enough speed. Split out speed-requiring jobs on cutting-edge hardware, and run other critical services on more reliable hardware.
Re: (Score:2)
Your ECC RAM won't matter much if the cosmic ray hits the CPU registers. Or a cell in a block of your flash storage.
Re: (Score:2)
Your ECC RAM won't matter much if the cosmic ray hits the CPU registers.
Some modern CPUs have ECC cache RAM. Is it not possible to have ECC registers?
Or a cell in a block of your flash storage.
Filesystems can have ECC, too. And in fact, so can storage devices.
Re: (Score:2)
Your ECC RAM won't matter much if the cosmic ray hits the CPU registers. Or a cell in a block of your flash storage.
Also, your ECC RAM won't matter much if you get run over by a truck. So what? ECC RAM will help if there is a bitflip in your ECC RAM, that's what it's for and that's what the benefit is. It's not going to solve world hunger either, and nobody ever suggested that it would.
Re: ECC (Score:5, Interesting)
Re: (Score:2)
Maybe that's the case for now, but who knows what will happen with stacked 3D memory?
Odds (Score:5, Insightful)
The odds of a cosmic ray hitting your memory at the exact right spot to flip a bit are one in hundreds of millions. There are just enough computers out there that it happens from time to time. The odds of FIVE rays hitting just the right locations to flip four bits and a parity bit are, pardon the pun, astronomical.
Re: (Score:2)
Each of my systems has more than hundreds of millions of bits of RAM. Some of them have 128 thousand million bits. There are a lot of places to hit.
Re: (Score:2)
Odds are low, but the frequency is high. An oft-cited IBM study from the 90s determined that memory will get a cosmic ray bit-flip once per 256MB per month. So, an 8GB system will see about 32 bit-flips per month. Probably more with modern memory. Of course, as you mention, it's not likely that several would occur at the same time in nearly the same place.
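The arithmetic behind the 32 figure, for anyone who wants to plug in their own RAM size. The rate constant is simply the number quoted above (one flip per 256 MB per month); I have not verified it against the original study, and it certainly varies with altitude and memory technology.

```python
# Rate as quoted in the comment above; treat it as an assumption, not a measurement.
FLIPS_PER_MB_PER_MONTH = 1 / 256

def expected_flips_per_month(memory_mb: float) -> float:
    """Expected bit flips per month, assuming the rate scales linearly with memory size."""
    return memory_mb * FLIPS_PER_MB_PER_MONTH

print(expected_flips_per_month(8 * 1024))    # 32.0
```

At that rate an 8 GB machine would see roughly one flip per day, and a 64 GB server roughly eight.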
The vast majority will be in unused memory, executable code that never gets executed, or even in code or data that, while corrupted, simply doesn't have
Re: "Of course it can," says government (Score:2)
Except what is being talked about here is not low-frequency radiation but extremely high-frequency radiation, with wavelengths smaller than the gaps between atoms, that is only stopped by a direct hit, which matters if it happens to hit just the right atom in that added circuit or whatever. Now these are extraordinarily rare events if the probability of any single ray is calculated, but we are being constantly hit by these rays all day every day, making the probability of causing an issue somewhere on the planet quite high. There
Re: (Score:3)
Re: (Score:2)
What we're actually talking about is cosmic rays, which are matter particles (mostly protons), not any kind of electromagnetic radiation. Those generally slam into something in the atmosphere, producing showers of secondary particles. Occasionally some of these make it to the ground. The article mentions neutrons, but these seem to be mostly muons.
Of course Bruce Perens, to whom you replied, was talking about the radio waves from HAARP, which was mentioned by the OP.
Re: (Score:2)
The article mentions neutrons, but these seem to be mostly muons.
Neutrons are not muons. Nor are muons the problem in bit flips.
Re: (Score:2)
Really earlier than that, Fermi expected it and had equipment shielded and double-shielded when testing the first nuclear bomb. But we should not confuse cosmic rays and EMP.
Re: (Score:2)
We did precisely this for NASA as part of a system we built, and I am very familiar...or was a long time ago...with radiation damage and failure modes to electronics in space. Sometimes the shielding can make things worse. Instead of going straight through a transistor, a collision can occur upstream sending a spray of other particles with the right energy to
Re: (Score:2)
Client: ... it crashed again! What's going on with the server? .... if you're lucky ...
Me: I've recently become a Herald of Galactus. I may be difficult to reach from now on
Re: (Score:2)
Tee hee.
The legal definition of Act of God does not itself admit to the existence of a deity. Just natural phenomena which are beyond human agency to predict or prevent.