Programming Error Doomed Russian Mars Probe
astroengine writes "So it turns out U.S. radars weren't to blame for the unfortunate demise of Russia's Phobos-Grunt Mars sample return mission — it was a computer programming error that doomed the probe, a government board investigating the accident has determined."
According to the Planetary Society Blog's unofficial translation and paraphrasing of the incident report, "The spacecraft computer failed when two of the chips in the electronics suffered radiation damage. (The Russians say that radiation damage is the most likely cause, but the spacecraft was still in low Earth orbit beneath the radiation belts.) Whatever triggered the chip failure, the ultimate cause was the use of non-space-qualified electronic components. When the chips failed, the on-board computer program crashed."
headline fail (Score:3, Informative)
Re:Excuse me... not a programmer's fault. (Score:5, Informative)
In a report to be presented to Russian Deputy Prime Minister Dmitry Rogozin on Tuesday, investigators concluded that the primary cause of the failure was "a programming error which led to a simultaneous reboot of two working channels of an onboard computer" [...] Likewise, cosmic rays and/or defective electronics are not the leading suspects behind Phobos-Grunt’s demise.
The summary is clearly bolting together two contradicting reports.
Re:headline fail (Score:2, Informative)
They probably just had someone ordering parts that didn't know to order mil spec (I'm assuming mil spec is fine for space stuff)
No, not even close. "Mil spec" is basically industrial grade with a slightly extended temperature range. Radiation-hardened stuff is in a completely different ballpark.
Contradictions (Score:5, Informative)
The summary is so contradictory because it quotes from 2 articles, and each of them is completely different. One says that the parts were space-tested and fine, and the other says they were never space-certified and were definitely bad. The first one says instead that a software bug caused parts of the system to reboot. The second doesn't know what happened and just blames faulty hardware.
Re:So how much? (Score:4, Informative)
Long story short, they probably saved more than $5 by using a COTS part, but they probably lost the probe because the part was not radiation hardened.
Re:Excuse me... not a programmer's fault. (Score:3, Informative)
In that case, the primary CPU is already up and running; it's booting additional processors.
Re:TFS - obviously written by a hardware guy (Score:5, Informative)
Try this one on your hardware guys:
"The main purpose of software is to make hardware reliable".
Drives them nuts...
Re:Excuse me... not a programmer's fault. (Score:5, Informative)
I'm not a satellite engineer, but wouldn't it be easy enough to just install a lead shield around the PCB to protect from most radiation? As long as the shield's not too thick, it shouldn't add too much weight, especially compared to using older-technology chips that'll take up more board space.
Well, that depends. Even on Earth's surface, we have to use ECC in more demanding applications. In LEO, you lose the protection of the atmosphere but you still have Earth's rather strong and large magnetosphere. But this was an interplanetary probe. Once you get out of the radiation belts, interstellar and intergalactic particles start hitting you. You can't protect against those with a lead shield of any reasonable size. Pretty much the only way is to make the chip simple and rugged, and to design it with components (transistors) large enough that a particle flying through won't bother you much. Or add redundancy. Or both, if possible (that's the usual case).
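To make the "add redundancy" option concrete, here is a minimal sketch of triple modular redundancy: keep three copies of a word and take a bitwise 2-of-3 vote. The names `majority_vote` and `flip_bit` are made up for illustration, and this is only a software model of the idea, not how any particular flight computer implements it.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority across three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Simulate a single-event upset by inverting one bit of the word."""
    return word ^ (1 << bit)


stored = 0b10110010

# One copy takes a hit; the other two outvote it.
copies = [stored, flip_bit(stored, 5), stored]
recovered = majority_vote(*copies)
print(recovered == stored)
```

The vote is per bit, so it still recovers the word when two copies are hit in different bit positions. It fails only when the same bit is flipped in two of the three copies.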
Re:Worse than on the ground... (Score:4, Informative)
There's hardware to deal with that - a watchdog timer can reboot the system quickly.
Assuming the system comes back up with a working CPU and RAM, then the main computer should be able to work around bad peripheral or components on the bus. I think that's what the article is getting at.
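For anyone who hasn't met a watchdog timer, here is a rough simulation of the idea. Real watchdogs are hardware counters; the `Watchdog` class, the tick-based loop, and the `run` helper below are all hypothetical stand-ins. Healthy software "kicks" the timer every cycle, and if it hangs, the counter runs out and forces a reset.

```python
from typing import Optional


class Watchdog:
    """Countdown timer that fires a reset if it is not kicked in time."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self) -> None:
        # Healthy software calls this regularly to restart the countdown.
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one tick; return True if the watchdog fired a reset."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.timeout_ticks
            return True
        return False


def run(ticks: int, hang_at: Optional[int], timeout_ticks: int = 3) -> int:
    """Simulate a main loop that may hang at tick `hang_at`; return reset count."""
    wd = Watchdog(timeout_ticks)
    hung = False
    for t in range(ticks):
        if hang_at is not None and t == hang_at:
            hung = True  # software stops responding
        if not hung:
            wd.kick()
        if wd.tick():
            hung = False  # assumption: the reboot clears the hang
    return wd.resets
```

Note the assumption baked into the last branch: the reboot actually fixes the problem. If the fault persists after restart, the watchdog just reboots the system over and over.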
On military aircraft, they use VMs to run the OS and software. Communication between systems is passed synchronously and requires that each module know the state of the other modules. There is never an assumption that the other system will just work - all messages require acknowledgement and verification of results.
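A toy version of "every message requires acknowledgement and verification" might look like the sketch below. This is not any real avionics protocol: the frame layout (2-byte sequence number, payload, CRC-32), the `ACK` payload, and every function name are invented for illustration. Only `zlib.crc32` is a real library call.

```python
import zlib
from typing import Callable, Optional, Tuple

Channel = Callable[[bytes], Optional[bytes]]


def make_frame(seq: int, payload: bytes) -> bytes:
    """Frame = 2-byte sequence number + payload + 4-byte CRC-32 of both."""
    body = seq.to_bytes(2, "big") + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def parse_frame(frame: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (seq, payload), or None if the frame is short or the CRC fails."""
    if len(frame) < 6:
        return None
    body, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(body) != crc:
        return None
    return int.from_bytes(body[:2], "big"), body[2:]


def receiver(frame: bytes) -> Optional[bytes]:
    """Acknowledge only frames that verify; stay silent on corrupt ones."""
    parsed = parse_frame(frame)
    if parsed is None:
        return None
    seq, _payload = parsed
    return make_frame(seq, b"ACK")


def send_with_ack(seq: int, payload: bytes, channel: Channel,
                  max_tries: int = 3) -> bool:
    """Resend until a verified ACK for this sequence number comes back."""
    frame = make_frame(seq, payload)
    for _ in range(max_tries):
        reply = channel(frame)
        if reply is not None and parse_frame(reply) == (seq, b"ACK"):
            return True
    return False  # caller must treat the other module as unavailable


def make_flaky_channel(corrupt_first_n: int) -> Channel:
    """Simulated link that flips one bit in the first N transmissions."""
    state = {"sent": 0}

    def channel(frame: bytes) -> Optional[bytes]:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            frame = bytes([frame[0] ^ 0x01]) + frame[1:]
        return receiver(frame)

    return channel
```

The point is the last line of `send_with_ack`: the sender never assumes delivery. After a bounded number of tries it reports failure and lets the caller decide what to do about a module that is not answering.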
Re:Excuse me... not a programmer's fault. (Score:4, Informative)
There are many aspects to radiation hardness. Radiation can flip one or more bits, resulting in bad data or program crash. Radiation can cause latchup, which will last until power is cycled; if the design is bad, latchup can fry a part. Rad hard parts are designed to be resistant to latchup. Really bad radiation can damage a part that isn't even powered.
A laptop can live through bit flips, and with luck it can live through latchup and be functional after power cycling. Spacecraft control generally has to be always on; power cycling is not an option. Thus the design requirements for spacecraft control must be much stricter.
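For the bit-flip case specifically, one standard way to survive a single flipped bit is an error-correcting code. Below is a small sketch of the textbook Hamming(7,4) code, which stores 4 data bits in 7 and can locate and fix any one flipped bit. The function names are made up, and real ECC memory uses wider codes in hardware, so treat this as an illustration of the principle only. It does nothing for latchup, which is an electrical fault, not a data error.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as 7 bits laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4  # syndrome is the 1-based bad position
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_pos


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a particle flipping the bit at position 5
data, pos = hamming74_decode(stored)
print(data == word, pos)
```

This code corrects one flipped bit per 7-bit word. Two flips in the same word get "corrected" to the wrong value, which is one reason multi-bit upsets are a bigger worry than single ones.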
Re:So how much? (Score:2, Informative)
I have worked (though not for long) as an electrical engineer in a team developing electronics for scientific instruments mounted aboard space probes, rovers, etc. This means interplanetary travel and operation, so this is the kind of place where you definitely want to use rad-hard components, unlike low orbit where you are still well within the magnetosphere. The Phobos-Grunt orbit-boosting stage had no good reason to use hardened components.
Concerning prices: I have done some design/prototyping but I wasn't involved with the procurement process of flight-qualified rad-hard components, so what I know is from discussion with colleagues. First, lead times can reach one year, even for quite basic components. Then, the cheapest rad-hard discrete MOSFET from International Rectifier (which is basically the only rad-hard MOSFET manufacturer - there is no room for competition in such a small market as rad-hard components) is in the vicinity of 400 €. And this is no high-power transistor, but the closest equivalent (although with higher specs most often not needed) to the 2N2222, the most basic low-power, logic-level MOSFET ever that you can buy for a few cents. The price ratio is more around 1000 here...
Re:Excuse me... not a programmer's fault. (Score:5, Informative)
As another EE with experience in rad hard space qualified design, he's not being self-contradictory. He's spot on.
If your CMOS structures are prone to latchup in the presence of single high energy events, then shielding does you no good. The amount of shielding necessary would more than consume the entire payload mass budget. Adding insufficient shielding just creates showers of secondary particles, each with more than enough energy to cause latchup alone, therefore rendering you at a statistical loss compared to no shielding whatsoever.
With this in mind, the answer is to design the CMOS structure so that shielding is unnecessary. For example, build your circuits on bulk insulators instead of bulk semiconductor.
Just because you can't understand it doesn't mean he's self contradictory. You just missed his point. And then attacked him.