Cisco Blamed A Router Bug On 'Cosmic Radiation' (networkworld.com) 145

Network World's news editor contacted Slashdot with this report: A Cisco bug report addressing "partial data traffic loss" on the company's ASR 9000 Series routers contended that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...
"While we can't speak to this particular case," Cisco wrote in a follow-up, "Cisco has conducted extensive research, dating back to 2001, on the effects cosmic radiation can have on our service provider networking hardware, system architectures and software designs. Despite being rare, as electronics operate at faster speeds and the density of silicon chips increases, it becomes more likely that a stray bit of energy could cause problems that affect the performance of a router or switch."

Friday a commenter claiming to be Xander Thuijs, Cisco's principal engineer on the ASR 9000 router, posted below the article, "apologies for the detail provided and the 'concept' of cosmic radiation. This is not the type of explanation I would like to see presented to the respected users of our products. We have made some updates to the DDTS [defect-tracking report] in question with a more substantial data and explanation. The issue is something that we can likely address with an FPD update on the 2x100 or 1x100G Typhoon-based linecard."

    • Hey, cosmic radiation is right here on page 27 of the BOFH excuse calendar. Today's its day.
  • by Anonymous Coward

    I'm not saying it was aliens, but...

    It was aliens!

  • Bad memory... would be another explanation.
    • Re:Bad memory... (Score:5, Insightful)

      by Dutch Gun ( 899105 ) on Saturday September 24, 2016 @08:10PM (#52955297)

      Or perhaps something like a design flaw in memory [wikipedia.org] that's provable and repeatable, and has even been used for conceptual security attacks.

      Still, when you start looking at crash reports from millions of customers (I used to work on a fairly well-known MMO), you see stuff that simply shouldn't be possible, and you start wondering about things like cosmic radiation. We had to filter out what we figured were hardware-based errors due to overclocked CPUs, bad RAM, etc, or else you get flooded with impossible crash stacks.

      x = 3 * y; // Crash here! WTF?

      • x = 3 * y; // Crash here! WTF?

        Maybe y ~ 2^31 and the CPU doesn't support overflow...

      • by arth1 ( 260657 )

        x = 3 * y; // Crash here! WTF?

        Quite a few opportunities that don't involve external causes:
        Overflows.
        y is a pointer to an address that can't be read.
        y is a macro which causes a divide by zero or other exceptions.
        Using a C compiler that doesn't take C++ style comments

        • Lol, I knew I was going to get comments like that. Christ, it was just an example. Assume this is simple integer arithmetic in a well-defined range, ok? And I'm still trying to figure out what sort of C compiler that doesn't understand C++ comments would generate runtime code that crashes instead.

          Honestly, there are seriously whacked PCs in the world (especially badly overclocked gaming PCs) that try to argue that 1 + 1 == 3.

    • Quite right. And by way of actual observation on Cisco TAC engagement with a top tier Service Provider... both bad memory and cosmic radiation have surfaced in the debug and diagnostic phase of the trouble ticket. It is not easy to then get the customer to ignore mention of 'cosmic radiation' and make progress on the real issue (timeline, early 2000's). And guess what surfaces in the next service contract negotiation.

  • I wonder if this is a real thing on the surface. The Earth's magnetic sphere has a tendency to grab and divert most of these things, which manned spacecraft have a hard time maneuvering. Do they actually ever screw up processors on the ground? That's pretty crazy.
    • by Anonymous Coward

      The Earth's magnetic field is great for deflecting most of the stuff that comes from the Sun, but cosmic rays, as in the stuff from outside the solar system, include a lot of high to very high energy particles that are not deflected much by the same field. It can actually be worse in the atmosphere than in orbit too, depending on your exact setup. A single high energy particle directly hitting something in space might deflect a single atom, but a similar particle hitting the upper atmosphere deposits enough

      • Thank you. There seem to be a lot of parallels to space systems. I would imagine it is still as big of an issue on the ground because they do not put the same thought in protecting from it, as they do in orbit.
    • by sjames ( 1099 )

      I can't claim to know the cause, but I have seen one proven case of a bit flip affecting processing on a machine that ran fine before and after the incident. Since it was doing batch processing I was able to re-run the job with identical inputs. It never made the error again.

      Could have been a cosmic ray, alpha decay, power glitch that oddly didn't affect the other machines on the same circuit, who knows?

      • I'm definitely going to read up on this surface SEU more. I have always attributed things you mention to non-ECC ram in the past.
        • I have always attributed things you mention to non-ECC ram in the past.

          Non ECC ram has exactly the same number of bit-flips as ECC ram. It's just that ECC lets you know it happened (as would parity) and can fix it for you (if only one bit flipped).
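To make the ECC point concrete, here is a toy sketch of single-error correction using a Hamming(7,4) code. It is an illustration only: the function names are made up for this example, and real ECC memory typically uses a wider SECDED code over 64-bit words rather than this 4-bit toy.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) as a 7-bit Hamming codeword.

    Returns a list of 7 bits for positions 1..7, where positions
    1, 2 and 4 hold parity and positions 3, 5, 6, 7 hold data.
    """
    code = [0] * 8  # index 0 is unused so indices match bit positions
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits):
    """Decode 7 bits, correcting at most one flipped bit.

    Returns (nibble, syndrome). A syndrome of 0 means no error was seen;
    otherwise it is the 1-based position of the bit that was corrected.
    """
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # flip the offending bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1  # simulate a single-event upset at position 5
    print(hamming74_decode(word))
```

Without the three parity bits the same flip would silently change the stored value, which is the situation non-ECC RAM is in. Two flips in one codeword would be miscorrected by this toy code, which is why real memory adds an extra parity bit to detect double errors.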

  • by lsllll ( 830002 ) on Saturday September 24, 2016 @07:56PM (#52955247)
    If it's cosmic radiation, wouldn't it affect more than the ASR 9000? Or is that the only model without a lead case?
    • Re:Not buying it (Score:5, Interesting)

      by slowdeath ( 2836529 ) on Saturday September 24, 2016 @09:59PM (#52955665)

      Sorry, but cases such as this exist.

      Back around 1999/2000 I was with Cisco engineering on the GSR 12000 (the first Cisco service provider class router).

      We did send a system to a POP in Denver (altitude 5000+ ft) and saw on this system a statistically significant increase in recoverable memory ECC errors.

      When the affected board was returned to San Jose and retested (basically sea level) the errors could not be reproduced.

      So we returned the hardware back to the Denver POP, and the recoverable ECC errors returned. No amount of swapping memory DIMMs (various vendors) made a difference.

      Any satellite hardware designer will tell you that cosmic radiation is a big deal for satellite design. And lead shielding is not a cost effective option in space.
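One hedged way to check a claim like "statistically significant increase" is an exact conditional test on the two error counts. The sketch below assumes errors arrive as a Poisson process at each site. The function name is hypothetical, and the counts and hours in the usage line are invented for illustration; they are not from the Denver or San Jose tests.

```python
from math import comb


def rate_increase_pvalue(count_a, hours_a, count_b, hours_b):
    """One-sided exact p-value for 'site A has a higher error rate than site B'.

    If both sites share one Poisson rate, then given the total number of
    errors n, the count at site A is Binomial(n, hours_a / total_hours).
    The p-value is the chance of seeing count_a or more errors at site A.
    """
    n = count_a + count_b
    p = hours_a / (hours_a + hours_b)
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(count_a, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers: 30 corrected errors in 1000 h at altitude,
    # 10 corrected errors in 1000 h at sea level.
    print(rate_increase_pvalue(30, 1000, 10, 1000))
```

A small p-value only says the rates differ. It does not say why, which is where the objections in the replies come in.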

      • by dgatwood ( 11270 )

        Sure, it could be that. But it could also be:

        • A cleaning person plugging a vacuum cleaner into the power strip on the rack instead of into the wall outlet that's on an external circuit (combined with improper power filtering in the equipment).
        • Electrical noise caused by some other crappy piece of equipment in the rack (combined with improper power filtering in the equipment).
        • Errors caused by higher operating temperature.
        • Errors caused by emissions from natural Uranium or other radioactive elements in the so
        • A cleaning person plugging a vacuum cleaner into the power strip on the rack instead of into the wall outlet that's on an external circuit (combined with improper power filtering in the equipment).

          Even shitty chinese powersupplies filter this out to an acceptable level to make this a non issue.

          Electrical noise caused by some other crappy piece of equipment in the rack (combined with improper power filtering in the equipment).

          Even shitty chinese powersupplies filter this out to an acceptable level to make this a non issue.

          Errors caused by higher operating temperature.

          Unless the equipment has an appreciable difference in operating environment it would be insignificant. Also, one of the first things you do when checking failures is a quick check of the installation equipment, especially if you're in a data centre or other environmentally controlled situation.

          Errors caused by emissions from natural Uranium or other radioactive elements in the soil.

          You mean just lik

      • But that is not cosmic radiation.
        That would be occurring regardless of where you are.

        My guess is a batch of the chips in those routers has contaminated silicon. The radiation is likely coming from inside. It does not need to be contamination high enough to be a health risk, just a random increase in a phosphorus isotope or something.

        • Sorry, you are incorrect. The EXACT SAME hardware did NOT malfunction at sea level altitude, but when relocated to Denver at 5000+ ft above sea level it displayed a statistically significant increase in single bit ECC errors. This behavior has been studied by numerous organizations, including IBM, Sun Microsystems, and others. See this IEEE technical talk: http://www.ewh.ieee.org/r6/scv... [ieee.org]

    • It's not the first time they've blamed an error on this... Cisco ACE modules had the same issue years ago, CSCsv52331.
    • Maybe the router was routing in the ISS
    • It is a reasonable explanation. Memory has parity bits. There are random faults from the various sorts of noise you can get in semiconducting circuits, but if you have some safety-net that will catch the occasional flipped bit. Your computer will be catching these sorts of errors all the time. The problem with cosmic rays is that they are very energetic, so they can pass through a lot of matter, but when they collide they generate a tight cone of ionising particles that knocks out electronics in a small re

    • It is calculated that the average personal computer will be hit by this about once a year. It has been about the same since the 1960's. Cell size gets smaller, which reduces the chance, but the number of cells gets greater, increasing the chance. The chips are about the same overall size, which is what makes the "target" area.

      But operating systems such as Windows crash much more often than that, so nobody notices unless they have high-reliability equipment and track faults.

      It's not an excuse or a fairy tale,

  • even if you have a strong support organization, one slacker responding with this to a customer, and the entire brand is tarnished.

  • by w0ss ( 530552 ) <{moc.ss0w} {ta} {noslracw}> on Saturday September 24, 2016 @08:01PM (#52955261)
    I work at a fortune 500 and I had to explain this to management just a few years ago on a Cisco 6500. It was a tough sell but I recall having a similar issue in the late 90's/ early 2000's with sun hardware so it isn't new. That one was even better to explain. The Sun's cosmic rays were causing the Sun's hardware to break!
    • On a related note, there is the infamous (in narrow circles) issue of the serial consoles on old Unix systems. Many had an option to "press any key for boot menu" on the serial console. The problem was that the serial consoles would get enough static interference to occasionally detect a character while this option was available, and it would halt the boot process. On a datacenter reboot (usually due to power loss), a handful of servers would never come up because of this. It was far more reliable to re

  • by oneiros27 ( 46144 ) on Saturday September 24, 2016 @08:02PM (#52955265) Homepage

    I'm guessing that they've read the BOFH [andrews.edu], but realized that there's much more reporting on solar-induced radiation ... so just decided to go with 'galactic' instead. .... completely forgetting that if this were the case, it would happen more frequently at high latitudes, due to the magnetosphere. And we'd also see a higher incidence rate after solar x-ray flares and solar particle events.

    (and the disclaimer: I work for the Solar Data Analysis Center, but I'm not a scientist, and don't speak for my place of work, etc, blah blah blah)

    • by Anonymous Coward

      As discovered by IBM back in the 70s, if it is a radiation induced upset, you'd see higher rates in places like Colorado vs Sea Level, and on upper floors of building vs lower floors.

  • by Anonymous Coward

    Sun Microsystems already pulled this bullshit back about 15 years ago... I don't really recall if it was a bad batch of processors or if it was bad non-ecc cache memory or whatever... but I do remember plenty of folks giving them a ration of shit and generally refusing to buy hardware from them after that... though once they fessed up to the problem and replaced all of our defective systems(and gave us a couple of free systems) we never had any further issues.

    • by Agripa ( 139780 )

      The SRAM structures used for integrated high performance processor cache are orders of magnitude more sensitive than discrete DRAM to radiation induced soft errors. Some of this is simply because the bandwidth is so high which exposes a greater capture area of logic. And so high performance processor cache has included parity and ECC protection for a long time.

  • Trying to claim acts of god to get out of being responsible?

    • It reminds me of people blaming compiler bugs for non-working code. While it does happen that a compiler generates incorrect code (I've encountered a few instances over the years), unless you either have reduced the problem to a minimal test case or examined the generated assembly and located the problem there, it's far more likely that it's a case of not digging deep enough to find a bug in your own code.

  • I work at a fortune 100, we're being delayed at the moment by software bugs in Cisco's routers. Their QA has completely gone out the window in the last few years, probably related to all the staff cutbacks. I expect we will start seeing Cisco losing market share if this keeps up.
    • by Anonymous Coward

      They will be losing market share... us included... we're a global company, with a size akin to a fortune 100. We're pulling the plug and moving to Juniper. Too much horse shit from Cisco. It's a fucking nightmare to get a quote, or even get an order filled CORRECTLY. We get the wrong shit sent all the time, and Cisco says the internal RMA process is so tedious, they tell us to throw the wrong equipment in the garbage or just keep it.

      They don't stock shit. Everything takes weeks. RMA'd equipment takes weeks

    • by seoras ( 147590 ) on Saturday September 24, 2016 @08:28PM (#52955375)

      One of the VP level Engineers (title is "distinguished" or something exalted like that) told me over lunch a couple of years ago that Chambers had said to him he wasn't interested in R&D. If there was a technology he needed, he'd buy it.
      The problem is that Cisco climbed to the top using IBM strategies and thinking which were focused on delivering "end to end" solutions to customers.
      They had no interest in box shipping. Those were just lego bricks and logistics. You can imagine how soul destroying that was to be a Cisco engineer.
      Bugs were a bonus to them as they sold annual maintenance contracts for roughly the same cost as the gear they sold.
      Now that the router/switch market has matured and commoditised they care even less about the quality of those boxes they have to ship.
      Their focus is entirely on the "service" level.
      They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?

      • I think they're still selling POWER-based systems.

      • by phantomfive ( 622387 ) on Sunday September 25, 2016 @02:33AM (#52956185) Journal

        They will eventually become another IBM. I was trying to think of a real tangible product that IBM made and sold just the other day. Do they?

        They sell more mainframes now than they did in 1970, believe it or not. One z13 can support 8,000 Linux VMs simultaneously. Cool looking box, too.

      • by swb ( 14022 )

        I seem to run across a fair number of places with AS/400 stuff. What's kind of interesting is the AS/400 stuff and people seem to run in this parallel universe IT department with their own staff somehow immune from the other pressures of the rest of the IT department.

        Once in a blue moon I'll hear mention that some kind of AS/400 update or installation is happening, so it's not like they're strictly legacy systems. And at longer term clients with AS/400 I occasionally see something new/different in the "AS

  • This makes me wish I was still working for them in IOS Engineering for the opportunity just to stir some shit.
    I'd have gone into the office on Monday morning with my head covered in tinfoil.
    I used to get some good laughs in the Cisco office, I seem to be getting more on the outside these days.
    Oh how the mighty have fallen....

  • Not bloody likely (Score:5, Informative)

    by techno-vampire ( 666512 ) on Saturday September 24, 2016 @08:34PM (#52955395) Homepage
    As FOLDOC explains [foldoc.org], Intel tested this idea decades ago by putting one board in a 25 ton lead safe and another outside to see if there was a measurable difference in bit rot. There wasn't. " Further investigation demonstrated conclusively that the bit drops were due to alpha particle emissions from thorium (and to a much lesser degree uranium) in the encapsulation material." They ended up redesigning the memory to be more resistant to the effect.
    • Given that alpha emissions are trivially blocked by something as thin as a sheet of paper I take that citation with a grain of salt.

      • "Alpha radiation" is always from a nuclear decay. That is how the discoverers "named" it.

        An alpha particle is a helium nucleus: an atom without electrons, an ion.

        A cosmic alpha particle has energies that go far beyond your imagination. You don't shield them with a sheet of paper. Hence we gave them a different name, "cosmic ray".

    • Interesting. Do you know how long ago that study was done? I'm curious if smaller manufacturing geometries have made newer processors more vulnerable.

      • by Agripa ( 139780 )

        Interesting. Do you know how long ago that study was done? I'm curious if smaller manufacturing geometries have made newer processors more vulnerable.

        The sensitivity of DRAM actually leveled off a few generations ago. I think what happened is that there is a minimum capacitance needed per DRAM cell, so as the cells became smaller and the dielectric constant was increased to make up for it, the charge stored in a given volume became *greater*. An ionizing radiation impact therefore spreads its charge over a greater number of DRAM capacitors without enough charge to affect them individually.

        High performance SRAM used for integrated caches became more vulnerable and

  • Santa Claus spread chemtrails in the sky with which the easter bunny got stoned and confused causing the routers to crash!
    Hey, it's not impossible!

  • That was in his excuses rolodex.

  • I used to use that all the time. Now I'll have to think of something else..

  • Flips of a single bit in a memory or register are common enough that few modern systems would run for long without error correcting memory. Even ECC has its limitations and most systems eventually crash/panic/blue-screen or whatever and require a reboot.

    The costs to improve error resilience go up rapidly and don't have a meaningful upper bound. My working trade off was to design for an MTBF comparable to how long I wanted to keep that job.

  • I remember in the 70s some memory manufacturer used a ceramic package that had a lot of thorium. Bad trouble.
    • I guess in this case it is "the same thing" ... the silicon from which they made some of the chips involved was not pure enough, or the material for doping was contaminated.

  • ...a Cosmic Brownie?
    http://cosmicbrownies.littlede... [littledebbie.com]

  • It shouldn't be a huge expense to build in some form of error correction to catch that sort of thing.

    • It shouldn't be a huge expense to build in some form of error correction to catch that sort of thing.

      Otherwise known as ECC memory?

  • My wife was looking over my shoulder when the "Cisco Blamed A Router Bug on 'Cosmic Radiation'" headline went by, and asked:

    "What's their next excuse? Global Warming?"

  • A White House health report addressing "partial data traffic loss" on Secretary of State Hillary Clinton contends that a "possible trigger is cosmic radiation causing SEU [single-event upset] soft errors." Not everyone is buying: "It IS possible for bits to be flipped in memory by stray background radiation. However it's mostly impossible to detect the reason as to WHERE or WHEN this happens," writes a Redditor identifying himself as a former [technical assistance center] engineer...
  • When I was a physics teacher I had an ongoing memory error problem with my Fujitsu Siemens laptop which led to frequent BSOD. I replaced the memory, and it still occurred. I then noticed the memory error happened frequently at work, but never at home. I wondered whether it could be a radiation issue, as I handled radioactive sources at my desk. I got my tech to do a leak check on my desk. It showed there was higher-than-background levels of radiation (can't recall whether alpha or beta) around my desk. Thi
  • by Theovon ( 109752 ) on Sunday September 25, 2016 @08:23AM (#52956707)

    There has been assloads of research on mitigating soft errors going back to the 1970’s. I’ve published some myself. There is no shortage of workable methods on masking transient errors in logic and bit flips in DRAMs. SEUs are a major problem for supercomputers, so their memory systems have sophisticated mechanisms for catching them.

    If Cisco is blaming this on SEUs, that just proves their incompetence, since they obviously didn’t spend 5 minutes with Google Scholar looking at hundreds of GOOD papers (in the top conferences and journals) on this topic. Seriously.

    PLUS, if something goes wrong, even if it IS a transient error, it’s FAR more likely to be a fixable bug than radiation. We had a weird bug in a DRAM controller whose state kept going invalid. We had to add another circuit to fix that. We *called* it a cosmic ray deflector, but the more likely causes, in order, were (a) another bug we couldn’t find, (b) a timing violation caused perhaps by voltage or temperature fluctuation, or (c) crosstalk in the circuit. We would have kept looking, but this deflector circuit made it robust to hundreds of hours of slamming the memory system, so we let it go. (Also, it was graphics memory, so even if it did ultimately suffer a glitch some day, it would go unnoticed.)

  • I have had Cisco tell me this many times any time a router reboots from a parity error for over 15 years now, so they have been using this for a long time now.

  • by bradgoodman ( 964302 ) on Sunday September 25, 2016 @12:50PM (#52957765) Homepage
    It could indeed be possible. Alpha particles are well known to be capable of causing bit-flips in capacitive memories (DRAM). This is exactly why we have things like ECC on memory pathways. That said, it's not the only explanation. There are ways of testing this. For example, observing the general abundance and frequency of such particles in a bubble chamber, and attempting to correlate them with instances of error. Or placing equipment in a shielded environment and seeing if the frequency of errors changes. Long story short: it MAY be true, but if you want to draw a conclusion, you really have to offer more data to prove it.
  • My reaction when I first heard the "cosmic radiation" excuse for misbehaving electronics.
    With decades of experience in tech implementations in radiation fields I can personally attest to the fact that the radiation flux levels needed to cause reactions in electronics could only be high enough due to cosmic radiation at elevations higher than 20,000 feet. The levels need to be in Rad per hour rather than the microrad per hour that you get from cosmic radiation. (i.e. background at s

    • It only takes a single "cosmic ray" particle to flip a bit in memory. Readings that are averages are no good for this.

      And all this has been known for about 50 years... ;-)

  • You just need the right gadget [wikipedia.org].
  • More likely a bug in the code that the NSA has inserted into all of their routers.
