Programming Error Doomed Russian Mars Probe

Posted by Soulskill
from the to-infinite-loops-and-beyond dept.
astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined." According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."

  • by LostCluster (625375) * on Tuesday February 07, 2012 @03:25PM (#38958015)

    We've got a contradictory summary here. Chip failure isn't a programming fault, it's a hardware problem. Stop confusing hardware and software you insensitive clod.

  • by Anonymous Coward on Tuesday February 07, 2012 @03:29PM (#38958097)
    Obviously the error handling routine was poorly written.
  • by invid (163714) on Tuesday February 07, 2012 @03:33PM (#38958163)
    Is it just me, or is it the responsibility of all software engineers to find the hardware problem in order to prove to people that the cause isn't software?
  • by billcopc (196330) <vrillco@yahoo.com> on Tuesday February 07, 2012 @03:51PM (#38958477)

    Okay, we still have a respectable, though dwindling, community of commenters, so can we please get rid of these editors who can't even be bothered to read four lines of summary text before posting?

    The headline and summary do not make sense. Come on, we're supposed to be nerds, aka intelligent, focused, attentive knowledge aggregators.

    The fuck is wrong with this goddamned site?! These failures are starting to make Digg look good!

  • by vlm (69642) on Tuesday February 07, 2012 @03:52PM (#38958479)

    Fun to read the comments here. I've done embedded stuff, and you need to be defensive. You can see at a glance who here has never done defensive programming before, or embedded or safety-critical programming: they're all blaming the hardware. There are 3 states, so you've got 2 bits of input, and a disallowed state comes in. Deal with it; don't just curl up and die and blame the hardware designer. There's a 12-bit A/D conversion result stored in two bytes, and there's a 14-bit number found there. Deal with it; don't just curl up and die and blame the ... . There's a cycle start button and an emergency stop button, and both are on simultaneously. Deal with it. You reboot a mission-critical (or safety-critical!) CPU and a minor auxiliary A/D input doesn't initialize: do you burn the plant down in a woe-is-me pity party because one out of 237 sensors isn't coming online, or do you deal with it?
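    The four "deal with it" cases above can be sketched in a few lines. This is a minimal Python sketch, not flight or plant code, and nothing in it comes from the Phobos-Grunt report: the mode names, the choice to map the illegal code to a caller-supplied safe mode, and the rule that the e-stop always wins are illustrative assumptions.

```python
from enum import Enum
from typing import Dict, Optional

# Hypothetical 2-bit mode field with only three legal encodings.
class Mode(Enum):
    IDLE = 0b00
    RUN = 0b01
    FAULT = 0b10
    # 0b11 is deliberately left undefined (the "disallowed state")

def decode_mode(raw: int, safe_default: Mode) -> Mode:
    """Decode the 2-bit field; map the illegal code to a caller-chosen safe mode."""
    try:
        return Mode(raw & 0b11)
    except ValueError:
        return safe_default  # disallowed state: degrade, don't crash

ADC_MAX = (1 << 12) - 1  # largest value a 12-bit converter can report (4095)

def read_adc(hi: int, lo: int) -> Optional[int]:
    """Assemble a 12-bit reading from two bytes; reject anything wider than 12 bits."""
    raw = ((hi & 0xFF) << 8) | (lo & 0xFF)
    if raw > ADC_MAX:
        return None  # e.g. a 14-bit value showed up: flag it as invalid
    return raw

def motion_allowed(cycle_start: bool, e_stop: bool) -> bool:
    """E-stop always wins, so 'start' and 'stop' pressed together resolves to stop."""
    return cycle_start and not e_stop

def usable_sensors(readings: Dict[str, Optional[int]]) -> Dict[str, int]:
    """Keep running on the sensors that came up; skip ones that failed to initialize."""
    return {name: value for name, value in readings.items() if value is not None}
```

    For example, `decode_mode(0b11, Mode.FAULT)` returns `Mode.FAULT` instead of raising, and `read_adc(0x3F, 0xFF)` (a 14-bit pattern) returns `None`, so the caller can discard the sample and keep running.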

    Finally, radiation is a statistical phenomenon. There is no such thing as radiation-free. If they used non-rad-hardened parts, it's gonna crash maybe 10,000 times more often. That's OK; you program around that, assuming you know what you're doing. Radiation-hardened does not equal radiation-proof. If there had been a single-bit error or a latchup on a rad-hardened unit, a poorly programmed control system would have failed just the same; the rad-hardened chip would only have made that a couple of orders of magnitude less likely. A shitty design that has a 1 in 20,000 failure rate thanks to better hardware, instead of 1 in 2, is still a shitty programming design, even if the odds are "good enough" that it makes it most of the time with the better hardware.
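    One common way to "program around" single-bit upsets (offered as an illustration; the comment doesn't name a specific technique) is triple modular redundancy in software: keep three copies of critical state and take a bitwise 2-of-3 majority vote on every read. A minimal Python sketch; the `TripleStored` class and its scrub-on-read policy are hypothetical.

```python
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: an output bit is set if at least two inputs have it set."""
    return (a & b) | (a & c) | (b & c)

class TripleStored:
    """Hypothetical helper: stores a value three times and votes on every read."""

    def __init__(self, value: int) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        voted = majority3(*self.copies)
        # Scrub: rewrite all copies with the voted value so a single upset
        # is corrected now instead of lingering until a second one lands
        self.copies = [voted, voted, voted]
        return voted

# Simulate a radiation-induced bit flip in one copy; the vote masks it.
state = TripleStored(0b1010_1010)
state.copies[1] ^= 0b0000_0100
print(bin(state.read()))
```

    The vote masks any upset confined to one copy, but not two copies flipping the same bit, which is why the scrub matters: it shrinks the window in which a second upset can line up with the first.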

  • by alienzed (732782) on Tuesday February 07, 2012 @04:27PM (#38959005)
    On the other hand, this demonstrates so aptly why they failed in the first place. "Yep, it's a software problem, because the hardware failed to run any after it was damaged."
  • by ChrisMaple (607946) on Tuesday February 07, 2012 @06:29PM (#38960487)

    Many chips are never designed to meet military or space specifications: the extra certification is very, very expensive and there are design compromises between performance and ruggedness. Furthermore, the testing you suggest for space qualification, if failed, results not in a mil-spec component but a component that has been destroyed by the test. In some cases, samples of a given batch are heavily tested to verify the batch, but those devices are considered damaged and not sold.

    Some rad-hard devices are of no interest for consumer designs because of the poor performance that results from the compromises needed to achieve hardness. Rad-hard devices are designed less often because the market is small, the design is more difficult and takes longer, and certification takes time, too. Thus, the devices are older technology. Additionally, rad-hard parts (the actual transistors inside the ICs) are physically bigger than conventional devices, which also means they can be fabricated on older process equipment. So, compared with current commercial technology, space-qualified devices are often older technology.