Programming Error Doomed Russian Mars Probe 276

Posted by Soulskill on Tuesday February 07, 2012 @03:25PM from the to-infinite-loops-and-beyond dept.

astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined." According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."

Re:Excuse me... not a programmer's fault. (Score:4, Interesting)

by Rary ( 566291 ) writes: on Tuesday February 07, 2012 @03:45PM (#38958359)

To follow up, the article saying that it was a chip failure is dated yesterday, while the article claiming it was a programming failure is dated today. Presumably, this is new information to shoot down the previous claims, but TFS (in typical Slashdot "editorial" style) fails to actually make that distinction, and puts both claims together as part of a single summary.

Re:Description Fail (Score:5, Interesting)

by expatriot ( 903070 ) writes: on Tuesday February 07, 2012 @03:47PM (#38958391)

The Planetary Society entry says that two modules failed and then the main computer crashed. Whether the computer crashed or not is probably irrelevant if there were significant failures in the electronics. Perhaps if the computer had kept going there would have been some communication of what had gone wrong.
One of the commenters wrote "It is rather unlikely radiation caused the failure. Russians said the failure was due to an SRAM WS512K32V20G24M from White Electronics. This part is a module containing 4 CY7C1049 chips from Cypress and is actually screened. While the Cypress part is very susceptible to Latchup," No idea if this is true or not.

Re:Excuse me... not a programmer's fault. (Score:5, Interesting)

by icebike ( 68054 ) * writes: on Tuesday February 07, 2012 @03:54PM (#38958521)

Obviously the error handling routine was poorly written.
I'll assume your tongue was firmly planted in your cheek, and suggest a +1 Funny mod.
But on the chance you were serious, depending on where that chip was, it may have been beyond something manageable by software.
A chip in a power controller could take down any or all of the processor components, or render access to control circuits impossible.
The linked article also states
Everything was working well with the spacecraft immediately after launch, including deployment of the solar panels, until the command to start the engines was issued. When that did not happen, the spacecraft went into a safe mode, keeping the solar panels pointed to the Sun to maintain power.
How many times do you suppose they actually tested engine start IN THE SPACE CRAFT? I'm guessing ZERO.
non-space qualified parts being used in some of the electronics circuits. This is a design failure by the spacecraft engineers that might have been caught had they performed adequate component and system testing prior to flight. But they did not.

So design failure, due to radiation, prior to the craft getting near the strongest radiation belts. Unbelievable. Occam would be skeptical.
This sounds to me like some on-board internal source of radiation, or induction, or simple overload, fried a chip somewhere in some un-specified circuitry, most probably in the engine controls. This seems far more likely than an external radiation source given the shielding the physical design would provide.
I doubt space qualification made any difference at all. The window for space radiation in the brief time it was operational was small.
Rather I suspect under-spec parts, over voltage or high current draw, or internal shielding oversights.

Baloney (Score:5, Interesting)

by mbone ( 558574 ) writes: on Tuesday February 07, 2012 @03:58PM (#38958573)

What are the chances chips would fail in a 20-30 minute period just after launch but before Mars transfer orbit insertion?
No, I bet this was a programming error, coupled with a near total failure to test the software.

Re:TFS - obviously written by a hardware guy (Score:5, Interesting)

by sconeu ( 64226 ) writes: on Tuesday February 07, 2012 @04:17PM (#38958847) Homepage Journal

You laugh, but how many of you low level guys had to work around buggy hardware?
I once sent a memo to my boss that I was doing the equivalent of "working around a burnt out lightbulb in software".
E.g.: How many hardware guys does it take to change a lightbulb? None, we'll just have the software work around it.

Re:Excuse me... not a programmer's fault. (Score:4, Interesting)

by crutchy ( 1949900 ) writes: on Tuesday February 07, 2012 @04:39PM (#38959169)

to my knowledge, only the Apollo Guidance Computer has ever truly achieved hardware failure tolerance. the Apollo 11 LM radar fault overloaded the computer, but the computer was able to continue thanks to restart logic built into the AGC, which picked up critical tasks from where they were when the computer restarted and dropped non-critical tasks, all with a very small fraction of the capabilities of current technology (although I think from memory they were able to fit 2 logic gates on a single chip!). the AGC is really a marvel of (past) engineering and computer science. the reliability problem alone would be insurmountable with today's garbage. probably part of the reason why we haven't been back there since.
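The restart behaviour described above (keep critical tasks at their last checkpoint, drop the rest) can be illustrated with a toy model. This is only a sketch of the general idea, not the AGC's actual executive: the `Task` class, the `capacity` limit, and the task names in the usage note are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Task:
    name: str
    critical: bool
    checkpoint: int = 0  # last completed step, assumed to survive a restart


def restart(tasks: List[Task]) -> List[Task]:
    """Keep only critical tasks; their checkpoints are left untouched."""
    return [t for t in tasks if t.critical]


def run(tasks: List[Task], capacity: int, steps: int) -> Tuple[List[Task], int]:
    """Advance all active tasks one step at a time.

    If more tasks are active than the (hypothetical) capacity allows,
    treat that cycle as an overload: restart, shed non-critical tasks,
    and resume the survivors from their checkpoints on the next cycle.
    """
    active = list(tasks)
    restarts = 0
    for _ in range(steps):
        if len(active) > capacity:
            active = restart(active)
            restarts += 1
            continue  # the overloaded cycle does no useful work
        for t in active:
            t.checkpoint += 1
    return active, restarts
```

For example, running four tasks (two critical, two not) with `capacity=3` triggers one restart on the first cycle, after which only the two critical tasks keep advancing from where they left off.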

Re:So how much? (Score:5, Interesting)

by jd ( 1658 ) writes: <imipak@yah[ ]com ['oo.' in gap]> on Tuesday February 07, 2012 @04:40PM (#38959187) Homepage Journal

Space Micro [spacemicro.com] doesn't list the prices of their components or systems, nor can I find any from anyone else. Honeywell [honeywellm...ronics.com] don't list their prices either. Atmel seem to have dropped out of the field. Linear [linear.com] don't list the prices for their space-hardened stuff. Don't see any for BAE [baesystems.com] either, or Intersil [intersil.com]. Empire Magnetics [empiremagnetics.com] require a lot of personal data before they give you access to even the price classification information. Not the prices, just how they're classified.
You've got to allow for a year's worth of traveling outside of an atmosphere and then operating on Mars for the duration of the mission. This analysis of radiation for manned missions [esa.int] suggests you're looking at 3.5 mSv per day, then 20 rems per year [solarstorms.org] in most of the places of interest.
Converting everything to rads, it's 0.1 rads per mSv and 1 rad per rem, so that's 12.75 rads to get to Mars if you assume a year-long trip, plus 20 rads for the mission, so anything with a rating of less than 32.75 rads is pretty much guaranteed to fail. However, over the course of two years, the odds of there being a [hps.org] solar flare [nasa.gov] are not insignificant. To be safe, you want resistance to a further 400 rad. 432.75 rad is within the tolerance of most of the space-hardened components (some components can be taken up to 1000 rad, others up to 10,000). However, the cheapest space components would NOT survive. You're talking high-end on the space scale.
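The dose budget above is easy to re-run as a short script. This is a sketch that takes the quoted inputs at face value (3.5 mSv per day in transit, a 365-day trip, 20 rem at the destination, a 400 rad flare allowance) and uses the same conversions as the comment, which amount to treating 1 rem as 1 rad. The function and constant names are made up for illustration. With those inputs the transit term works out to about 127.75 rad rather than 12.75, so one of the quoted figures appears to be off by a factor of ten; a daily rate of 0.35 mSv would give roughly the 12.75 rad stated.

```python
# Figures quoted in the comment; treated as given here, not independently verified.
TRANSIT_MSV_PER_DAY = 3.5   # quoted transit dose rate
TRANSIT_DAYS = 365          # "a year-long trip"
SURFACE_REM = 20.0          # quoted dose for the mission itself
FLARE_MARGIN_RAD = 400.0    # quoted allowance for a solar flare

RAD_PER_MSV = 0.1           # conversion used in the comment
RAD_PER_REM = 1.0           # conversion used in the comment


def dose_budget_rad(msv_per_day=TRANSIT_MSV_PER_DAY,
                    days=TRANSIT_DAYS,
                    surface_rem=SURFACE_REM,
                    flare_rad=FLARE_MARGIN_RAD):
    """Return the dose budget in rad, broken down by contribution."""
    transit = msv_per_day * days * RAD_PER_MSV
    surface = surface_rem * RAD_PER_REM
    baseline = transit + surface
    return {
        "transit": transit,
        "surface": surface,
        "baseline": baseline,
        "with_flare": baseline + flare_rad,
    }


if __name__ == "__main__":
    for label, rad in dose_budget_rad().items():
        print(f"{label:>10}: {rad:8.2f} rad")
```

Either way the conclusion about needing parts rated in the hundreds of rad still holds, since the 400 rad flare margin dominates the total.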
I'm going to figure that the top-line components will cost 100x that of their conventional counterparts, due to the higher-level of precision and QA that are required. It might well be a good deal more. In Russia, you've also got to pay for smuggling decent-grade hardware out of the US, as all of this stuff will be under massive amounts of regulation.
My guess is that the cuts would have saved enough that those doing the cost-cutting could buy second homes in Switzerland.

Re:So how much? (Score:4, Interesting)

by autophile ( 640621 ) writes: on Tuesday February 07, 2012 @04:48PM (#38959281)

For want of a rad-hard chip, the board died.
For want of a board, the software couldn't cope.
For want of good software, the engine start failed.
For want of engine start, the probe died.
For want of a probe, the human race didn't detect the slimy aliens from Phobos and all perished in a hot and somewhat greasy fireball.

Re:TFS - obviously written by a hardware guy (Score:5, Interesting)

by garyebickford ( 222422 ) writes: <gar37bic.gmail@com> on Tuesday February 07, 2012 @07:22PM (#38961039)

Not even necessarily low level. I once had a weird intermittent problem in a PHP driven web system. After a couple of weeks of diagnosing (largely trying to find a case that could more-or-less reliably tickle the bug), it turned out to be an interaction of a bug in the Redhat version of that day (2001) with a bug in the particular CPU we were using. PHP code just happened to trigger it under certain conditions. Since the box was at Level 3, we had to drive an hour down there and replace the machine.
And long ago I worked on Perq workstations, which had a stack-machine CPU (the CPU was a 15x15 inch board filled with TTL). The expression stack was four chips. The system was designed around the chip spec - NEVER DO THAT!!! Chips cannot be depended on to run at exactly the design spec - some are slow, some are fast. As a result, every CPU had to be tested at installation with those four chips inserted in different locations, essentially in order of speed. If a fast one came after a slow one in the slots, the CPU would barf. Basically someone just kept swapping chips around until it worked.
We were just discussing some of the remarkable repairs done in software to accommodate problems in various interplanetary probes - truly amazing stuff.

Re:Excuse me... not a programmer's fault. (Score:4, Interesting)

by robot256 ( 1635039 ) writes: on Tuesday February 07, 2012 @07:59PM (#38961347)

Actually, darwin is kind of right. The difference between 120nm transistors and 45nm transistors is quite substantial. Between random radiation, natural wear due to thermal cycling, and periodic electrostatic discharges from handling and plugging in connectors, it is not surprising that the older chips are sturdier in general.
But he may have just invoked the "They don't make them like they used to" logical fallacy, because sure there are some 20-year-old SNES machines, but how many of them died 2 years after production? Compare that percentage to the figure for PS3s and you have your answer.

Re:So how much? (Score:4, Interesting)

by jd ( 1658 ) writes: <imipak@yah[ ]com ['oo.' in gap]> on Tuesday February 07, 2012 @08:08PM (#38961419) Homepage Journal

The links for International Rectifier, for those *#$% off with Congress and wanting to build their own damn Rover:
- Rad Hardened Single chip MOSFETs [irf.com]
- Rad Hardened Multi Chip MOSFETs [irf.com]
- Space-Rated DC-DC Converters [irf.com]
- Space-Rated Low RF Power DC-DC Converters [irf.com]
- Rad-Hardened Voltage Regulators (fixed) [irf.com]
- Rad-Hardened Voltage Regulators (variable) [irf.com]
- Rad-Hardened Gate Drivers [irf.com]
Some of their other military/avionics stuff may be space-rated or rad-hardened but it doesn't say so.

Re:Excuse me... not a programmer's fault. (Score:4, Interesting)

by EETech1 ( 1179269 ) writes: on Wednesday February 08, 2012 @05:25AM (#38964619)

I asked one of the main AVR designers from Norway if it was ok to set a configuration, or a constant in RAM during initialization and trust with 100% certainty that it would not change during operation. He said that even on the world's cleanest power supply, and in the absence of any EMI, he would still NOT recommend it.
If you run 10 AVRs for 1000 hours you will see bits flipped. Many times it only affects a RAM variable that is constantly being recalculated anyways, so it causes little if any disruption to the operation of the device.
It really sucks when it's something critical like a timer counter control register.
If anyone would like to duplicate my testing, I'd be glad to send code, but all you have to do is set everything to a known value, and then read it over and over til it changes. It doesn't take as long as you think (or hoped) it would! It also gives you a good idea of how well your PCB takes care of your Micro.
Always check, and if necessary, reset your hardware configs during runtime! Those "all of a sudden it started acting up, so I turned it off and back on again and it was fine" problems just disappear!
I still remember the time my CON_0 register read 8! Although I'm sure it'll happen again, you'll never notice it!
Cheers
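The check-and-reset advice above can be sketched as a small simulation. This is not AVR firmware and not the code offered in the comment: the register names and values are hypothetical, the "registers" are just a Python dict, and bit flips are injected artificially. On a real part the reference copy would presumably live somewhere less volatile than RAM (flash, for instance); that is an assumption here.

```python
import random

# Hypothetical register names and intended values, for illustration only.
GOLDEN = {"TIMER_CTRL": 0x05, "ADC_CTRL": 0x87, "PORT_DIR": 0xF0}


def flip_random_bit(regs, rng):
    """Simulate an upset: flip one random bit in one random 8-bit register."""
    name = rng.choice(sorted(regs))
    regs[name] ^= 1 << rng.randrange(8)
    return name


def scrub(regs, golden):
    """Compare every register against its reference value.

    Any register that differs is rewritten with the reference value.
    Returns the names of the registers that had to be repaired.
    """
    repaired = []
    for name, expected in golden.items():
        if regs[name] != expected:
            regs[name] = expected
            repaired.append(name)
    return repaired


if __name__ == "__main__":
    rng = random.Random(2012)
    regs = dict(GOLDEN)            # "set everything to a known value"
    for cycle in range(5):         # then check it over and over
        flip_random_bit(regs, rng)
        print(cycle, "repaired:", scrub(regs, GOLDEN))
```

Calling `scrub` from the main loop (or a periodic timer) is what turns a flipped configuration bit from a mysterious lock-up into a non-event; logging the returned list also gives a rough count of how often upsets occur on a given board.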
