Slashdot
Mars Global Surveyor Died from Single Bad Command
Posted by Zonk on Sat Apr 14, 2007 04:33 AM
from the damn-space-bugs dept.
wattsup writes "The LA Times reports that a single wrong command sent to the wrong computer address caused a cascade of events that led to the loss of the Mars Global Surveyor spacecraft last November. The command was an orientation instruction for the spacecraft's main communications antenna. The mistake caused a problem with the positioning of the solar power panels, which in turn caused one of the batteries to overheat, shutting down the solar power system and draining the batteries some 12 hours later. 'The review panel found the management team followed existing procedures in dealing with the problem, but those procedures were inadequate to catch the errors that occurred. The review also said the spacecraft's onboard fault-protection system failed to respond correctly to the errors. Instead of protecting the spacecraft, the programmed response made it worse.'"
It wasn't a single wrong command (Score:4, Informative)
Of course, these things do happen. All we can do is find out why, and stop it from happening again.
Re: (Score:2)
Re:It wasn't a single wrong command (Score:5, Funny)
I agree. Otherwise WWII was caused by Hitler's mom having one too many drinks the night she met his dad.
Re:It wasn't a single wrong command (Score:4, Funny)
How can you come up with such a woefully shortsighted and limited-in-scope analysis? Honestly. There are at least two theories to work under for the cause of World War II.
WWII was caused by a series of reactions several billion years ago between amino acids. Or it was started 5000 years ago when God created Eve for Adam. Everything else in between is just a smattering of minor details.
Re:It wasn't a single wrong command (Score:5, Interesting)
TFA:
MGS was well into bonus time in the sense that the original goals had been reached. The project was running on a reduced budget and this made a mistake inevitable. I can't help thinking that at a higher level this was considered to be a good thing. When you have new missions to run and a fixed budget to run them on, you want your old missions to stop so that you can draw a line under them and go on to the next thing.
The last thing management want is to have to decide to shut the spacecraft down because they don't have the budget for operations on the ground. Reducing the budget is a way of inducing the shutdown.
Re: (Score:3, Informative)
A nice theory, but one that fails to coincide with the facts. NASA routinely shuts down missions for lack of budget.
Re: (Score:3, Insightful)
Emmentaler vs. Gruyere (Score:5, Insightful)
It's like saying that a mid-air collision occurred because two jetliners were assigned the same altitude and airway in opposite directions at the same time. Yeah, but A) how they got that assignment is kinda complicated and B) any number of traffic control and collision avoidance systems have to fail too.
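The cheese in this comment's title alludes to James Reason's "Swiss cheese" model of accidents: every defensive layer has holes, and a catastrophe needs the holes in all the layers to line up. A minimal sketch of that idea, assuming (unrealistically) that the layers fail independently; the per-layer probabilities are invented for illustration and are not taken from MGS or aviation data:

```python
import math
import random


def chain_probability(layer_failure_probs):
    """Probability that every defensive layer fails, assuming independent layers."""
    return math.prod(layer_failure_probs)


def simulate_chain(layer_failure_probs, trials=100_000, seed=1):
    """Monte Carlo estimate: fraction of trials in which the holes in all layers line up."""
    rng = random.Random(seed)
    accidents = sum(
        all(rng.random() < p for p in layer_failure_probs)
        for _ in range(trials)
    )
    return accidents / trials


# Hypothetical layers: a bad upload slips past review, fault protection
# mis-responds, and power software misreads the battery fault.
layers = [0.05, 0.1, 0.2]
print(f"analytic:    {chain_probability(layers):.4f}")
print(f"monte carlo: {simulate_chain(layers):.4f}")
```

In practice the layers are rarely independent (a reduced budget can weaken review and fault-protection testing at the same time), so the simple product can understate the real risk.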
Re: (Score:2, Funny)
*cricket* *cricket*
What?
Re: (Score:3, Insightful)
Re: (Score:2)
So, yeah, like Ueberlingen, here you had a chain of events that results in a catastrophic failure. Was a "bad command" to blame? Only if the system has zero tolerance fo
That'll Teach 'Em (Score:5, Funny)
Re:That'll Teach 'Em (Score:5, Funny)
su..do...?
Re:That'll Teach 'Em (Score:4, Funny)
Re:That'll Teach 'Em (Score:4, Funny)
Oblig. (Score:3, Funny)
Re: (Score:2)
Re: (Score:2)
You nits (Score:3, Insightful)
*Design* flaw (Score:3, Insightful)
Re: (Score:3, Interesting)
Re:*Design* flaw (Score:5, Informative)
I looked at the actual report on the NASA website; it said "the spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current."
There was a temperature monitor on the critical, exposed component. Furthermore, the information from the sensor was used in a sensible manner: Li-poly/Li-ion batteries can catch fire under some circumstances (see also: Sony laptop batteries), so if your Li-poly battery overheats while being charged you stop charging it (because you'd rather have a flat battery than an exploded battery).
After the craft stopped charging the battery it never started charging the battery again. The battery ran down and the craft stopped working.
The obvious question is: why didn't charging resume after the battery had cooled down? It might not have cooled down (as it was hot in the first place due to being exposed to the sun) or the system might have been waiting for a 'resume charging' command from ground control, which was never received as the high-gain antenna was in the wrong position.
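The two possibilities raised above behave very differently, and a toy controller makes the contrast concrete. This is a minimal sketch, not the MGS power software: the thresholds, the class, and the `auto_resume` switch are all hypothetical. With hysteresis, charging comes back once the battery cools well below the trip point; with a latched trip, it stays off until a ground command arrives, which never happens if the antenna is mispointed.

```python
from dataclasses import dataclass


@dataclass
class ChargeController:
    """Toy over-temperature charge cutoff (hypothetical; not the MGS flight software)."""
    trip_temp_c: float = 30.0     # hypothetical cutoff temperature
    resume_temp_c: float = 20.0   # hypothetical re-enable temperature (hysteresis band)
    auto_resume: bool = True      # False: stay off until a ground command arrives
    charging: bool = True
    latched: bool = False

    def step(self, temp_c: float) -> bool:
        """Process one battery temperature sample and return whether charging is on."""
        if temp_c > self.trip_temp_c:
            # Too hot: stop charging; a flat battery beats a burning one.
            self.charging = False
            self.latched = not self.auto_resume
        elif not self.charging and not self.latched and temp_c < self.resume_temp_c:
            # Cooled well below the trip point: resume without waiting for the ground.
            self.charging = True
        return self.charging

    def ground_resume(self) -> None:
        """Simulated 'resume charging' command; useless if the antenna can't hear it."""
        self.latched = False
        self.charging = True


temps = [25, 35, 33, 25, 18, 15]
auto = ChargeController(auto_resume=True)
latched = ChargeController(auto_resume=False)
print([auto.step(t) for t in temps])
print([latched.step(t) for t in temps])
```

Note that even the auto-resume variant never recovers if the battery stays above `resume_temp_c`, which matches the first explanation: a battery still pointed at the sun may simply never cool down.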
Personally, if I were designing a spacecraft I'd duplicate the (presumably quite small) onboard computer and radio hardware, because it seems quite common for software/electronics failures to result in loss of communications. Having two processors running different software, each capable of reprogramming the other if it became broken, would seem like a sensible route to take.
Just my $0.02.
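The cross-reprogramming idea above can be sketched as a heartbeat watchdog: each processor watches its peer, and a peer that goes silent for too long is reloaded from a stored known-good image. Everything here (the `Processor` class, the timeout, the "golden image" store, the build names) is hypothetical and only illustrates the pattern; it is not how MGS or any particular spacecraft was built.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    """One of two hypothetical cross-strapped flight computers."""
    name: str
    software: str
    healthy: bool = True
    last_heartbeat: int = 0

    def tick(self, now: int) -> None:
        # Only a working processor manages to refresh its heartbeat.
        if self.healthy:
            self.last_heartbeat = now


def cross_check(a, b, now, timeout, golden_images):
    """Each healthy processor watches its peer; a peer silent for too long is reloaded."""
    actions = []
    for watcher, peer in ((a, b), (b, a)):
        if watcher.healthy and now - peer.last_heartbeat > timeout:
            peer.software = golden_images[peer.name]  # restore a stored known-good image
            peer.healthy = True
            peer.last_heartbeat = now
            actions.append((now, f"{watcher.name} reloaded {peer.name}"))
    return actions


def run(steps=10, fail_at=3, timeout=2):
    golden = {"A": "fsw-a", "B": "fsw-b"}  # two different (hypothetical) software builds
    a, b = Processor("A", golden["A"]), Processor("B", golden["B"])
    log = []
    for now in range(1, steps + 1):
        if now == fail_at:
            b.healthy, b.software = False, "corrupted"  # simulate a bad upload on B
        a.tick(now)
        b.tick(now)
        log += cross_check(a, b, now, timeout, golden)
    return a, b, log


a, b, log = run()
print(log)
print(b.software, b.healthy)
```

A real design also has to stop a faulty unit from wrongly "repairing" a good one; this sketch simply trusts any processor that still reports itself healthy.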
Re: (Score:3, Interesting)
Or maybe have a small back-up battery in the center of the probe where it cannot be heated by the sun even if the probe gets poin
Re: (Score:2)
Almost all spacecraft do have a large number of temperature sensors that are connected to the spacecraft telemetry system. They are used to detect equipment problems and thermal management issues. This has been standard procedure for many decades.
Re: (Score:3, Insightful)
And yet, it failed.
More robust == heavier (Score:4, Insightful)
So, which scientific experiment would you remove to make room for additional heat shielding? No, the thermal shielding and other protection systems are just right for a spacecraft that had to travel a hundred million kilometers.
What really failed was the ground-based software that didn't have a good enough thermal model, and the technical support team. Equipment may fail and operators may commit errors, but there should be enough experienced engineers around to do a correct analysis and catch those errors. The downgrading of the engineering team is the true problem here. Look at what happened to Columbia: it broke apart on reentry because of damage suffered during launch, which was caught on video but not analyzed correctly.
NASA isn't alone in these failures; perhaps one could say it sets the pace for the rest of the industry. The lack of a good thermal model is typical of a whole generation of engineers used to doing everything in Excel. With the CPUs now on every desktop, it wouldn't be so hard to build a correct thermal model of the spacecraft, but it would mean solving a system of partial differential equations in C++, something very few engineers are able to do, even when given an extensive library [diffpack.com].
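As a small illustration of the kind of calculation meant here, this is the textbook explicit (FTCS) finite-difference scheme for the 1-D heat equation dT/dt = alpha * d2T/dx2 across a slab with one face held hot and the other cold. It is written in Python rather than C++ for brevity, and the slab thickness, diffusivity, and temperatures are made-up values; a real spacecraft thermal model is three-dimensional and dominated by radiation, which this sketch ignores.

```python
def heat_1d(n=21, length_m=0.1, alpha=1e-5, t_end_s=600.0,
            hot_c=60.0, cold_c=-20.0, init_c=10.0):
    """Explicit FTCS solution of dT/dt = alpha * d2T/dx2 on a 1-D slab.

    Hypothetical boundary conditions: x = 0 held at hot_c (sunlit face),
    x = length_m held at cold_c (shaded face). Returns node temperatures in C.
    """
    dx = length_m / (n - 1)
    dt = 0.4 * dx * dx / alpha        # FTCS is stable only if alpha*dt/dx^2 <= 0.5
    r = alpha * dt / (dx * dx)
    temps = [init_c] * n
    temps[0], temps[-1] = hot_c, cold_c
    for _ in range(int(t_end_s / dt)):
        nxt = temps[:]
        for i in range(1, n - 1):
            # Second-difference approximation of d2T/dx2 at interior node i.
            nxt[i] = temps[i] + r * (temps[i + 1] - 2 * temps[i] + temps[i - 1])
        temps = nxt
    return temps


profile = heat_1d(t_end_s=600.0)
print(" ".join(f"{t:.1f}" for t in profile[::5]))
```

The time step is tied to the grid spacing because the explicit scheme becomes unstable when alpha*dt/dx^2 exceeds 1/2; implicit schemes remove that restriction at the cost of solving a linear system at every step.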
Re: (Score:3, Interesting)
Give NASA a break (Score:2, Interesting)
I propose that next time NASA spend $150 million on the construction phase, which is just a slush fund for defense contractors anyhow, and then issue the l
Re: (Score:3, Insightful)
MGS mapped targeted parts of the surface of Mars at much higher resolution than any previous mission. Among other things, it was responsible for finding the gullies that are probably signs of water being expelled recently at the surface. The length of the mission allowed it to detect changes at these sites, suggesting the process is still occurring today.
What you really seem to be saying is that exploration of Mars by any means is a big waste
Re: (Score:3, Insightful)
As for your 'proposal'...did you just pull those numbers out of your ass? If anything, costs increase over the years due to rising wages and inflation. $220 million was *dirt cheap* by space mission standards.
We waste far more money on subsidies and entitlements in the US than we spend on science like this.
Nope, those are the real numbers (Score:2, Informative)
I realize this was dirt cheap by space mission standards. A laptop encrusted with diamonds which costs $80,000 is dirt cheap by laptop-encrusted-with-diamonds standards. That *doesn't make it worth the money*. I know we waste far more than $40 million a year on many things -- and, logically, every one of them except one can be justified by "We waste more money on another program, don't cut *my* hobby horse!"
It's interesting that you draw the d
Re:Give NASA a break (Score:4, Insightful)
So divide that by 200 million (roughly) to get your share.
I got two quarters' worth. Heck, you can't get a comic book for fifty cents.
The actual report (Score:5, Informative)
* A modification to a spacecraft parameter, intended to update the High Gain Antenna's (HGA) pointing direction used for contingency operations, was mistakenly written to the incorrect spacecraft memory address in June 2006. The incorrect memory load resulted in the following unintended actions:
** Disabled the solar array positioning limits.
** Corrupted the HGA's pointing direction used during contingency operations.
* A command sent to MGS on November 2, 2006 caused the solar array to attempt to exceed its hardware constraint, which led the onboard fault protection system to place the spacecraft in a somewhat unusual contingency orientation.
* The spacecraft contingency orientation with respect to the sun caused one of the batteries to overheat.
* The spacecraft's power management software misinterpreted the battery over temperature as a battery overcharge and terminated its charge current.
* The spacecraft could not sufficiently recharge the remaining battery to support the electrical loads on a continuing basis.
* Spacecraft signals and all functions were determined to be lost within five to six orbits (ten-twelve hours) preventing further attempts to correct the situation.
* Due to loss of power, the spacecraft is assumed to be lost and all recovery operations ceased on January 28, 2007.
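The first bullet describes a parameter intended for the HGA being written to the wrong memory address, silently disabling the solar array positioning limits. One way ground software could make such a load fail loudly instead is to tie every upload to a named parameter and refuse writes whose address doesn't match that name or that land in protected memory. A minimal sketch; the parameter names, addresses, and protection flags are all hypothetical, since the report doesn't describe MGS's memory layout or command format.

```python
# Hypothetical parameter map: name -> (address, write_protected).
# These names and addresses are invented; the report does not give MGS's memory layout.
PARAM_MAP = {
    "hga_contingency_az": (0x1000, False),
    "hga_contingency_el": (0x1004, False),
    "solar_array_min_angle": (0x2000, True),
    "solar_array_max_angle": (0x2004, True),
}


class UploadRejected(Exception):
    """Raised when a memory load fails a pre-uplink consistency check."""


def apply_patch(memory, name, address, value):
    """Write one named parameter, refusing loads whose address doesn't match the name."""
    if name not in PARAM_MAP:
        raise UploadRejected(f"unknown parameter {name!r}")
    expected, protected = PARAM_MAP[name]
    if address != expected:
        raise UploadRejected(f"{name}: address {address:#06x} != expected {expected:#06x}")
    if protected:
        raise UploadRejected(f"{name}: {address:#06x} is write-protected")
    memory[address] = value


memory = {addr: 0.0 for addr, _ in PARAM_MAP.values()}
apply_patch(memory, "hga_contingency_az", 0x1000, 12.5)   # accepted
try:
    # Same update aimed at the wrong address: it would have clobbered a solar array limit.
    apply_patch(memory, "hga_contingency_az", 0x2000, 12.5)
except UploadRejected as err:
    print("rejected:", err)
```

The design choice is that a raw address is never trusted on its own: the operator must also say what they believe they are changing, and the check fails whenever the two disagree.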
Re: (Score:2)
Re: (Score:3, Informative)
So hang on... they *overwrote* the memory which contained the contingency operations plan and the hardware limitations data for the solar array? Surely that's bad design; you shouldn't be able to overwrite something like that (unless the hardware limits are expected to change mid-mission). NASA fault protection modules evidently don't do their job too well :-/
Actually, they had to correct a previous error by writing directly to memory. I believe that writing directly to memory is not a standard operating procedure. The PDF report linked by the GP states that:
Re: (Score:2)
Hmmm. When they went looking for it the MGS wasn't where they expected it to be. Hard to see how the failure mode they describe would have made it change its trajectory by a significant degree.
Re: (Score:3, Informative)
The preliminary official report is available from here [nasa.gov].
Thanks for the link. The report is only three pages long and very interesting to read. The cause (quoted below) is really stunning; I wonder what the probability of this sequence of events happening was.
An old error strikes back! (Score:2)
In a tragic comedy of errors, NASA accidentally sends the Mars Global Surveyor a confirmation to execute "con/con". Microsoft explains that this will be patched in TerraWindows (TM), and for the moment their only suggestion is to "...do the Microsoft '1,2 shuffle'; sigh heavily and do a hard reboot..."
John Dvorak has been contacted as a possible candidate to go manually reboot the Surveyor, but has yet to accept the proposition.
*ducks*
Impressive (Score:5, Interesting)
wrong parameter? (Score:5, Funny)
What's in a name.. (Score:3, Funny)
Somehow I find it reassuring that NASA employs someone called "Dolly Perkins". It has that warm cosy 1950's feeling of Golden Age Space Exploration. Now, if only we could get the astronauts named "Buck", "Rock", or "Trent".
Re: (Score:2)
Good code Is For Old People (Score:2, Informative)
In Soviet Russia perfect probe sends lens cap code back to you!
A wiki link to help with the lens part.
http://en.wikipedia.org/wiki/Venera_program [wikipedia.org]
The command (Score:2)
Fewer than I expected... (Score:2)
Batteries Overheating (Score:2, Funny)
Faster, better, cheaper - the reality (Score:4, Insightful)
So occasionally you get the stunning successes, e.g. the Mars rovers Spirit and Opportunity. Considering they were only supposed to last 90 sols and they're somewhere out to 1075 or more sols, Steve Squyres is currently the star of NASA.
But more likely you get the devastating failures.
It's really sad that we blow a few billion a month on our little Iraq and Afghanistan ventures yet sciences take a back seat.
Typical multiple-factor catastrophe (Score:5, Interesting)
In one major aircraft accident I know a lot about, the (Airbus) jet crashed in part because it turned into a tug of war between the human pilot and an autopilot that should have been disengaged, causing an up-and-down roller-coaster ride. There were lots of other distracting things that may or may not have been wrong, but a key part was the difficulty of knowing what state the machine was in.
It seems this accident was a similar situation, and although a metric/imperial units mix-up caused another recent accident, these incidents appear to have elements in common. It strikes me that they are also made more likely by funding pressures, and by the way operating these systems involves radical commands while the systems themselves lack the power to be self-aware enough to preserve themselves.
I am not going to do any more guessing, because the people involved can probably figure it out themselves. These combined-factor accidents at least are not costing human lives, and they are adding to knowledge about how to avoid the same accident next time.
In that regard my hope is that some of the money being spent on Mars can be used to improve autonomous robotic systems to reduce accidents both on Mars and on Earth.
Re:Bad command or filename (Score:5, Funny)
Re:Bad command or filename (Score:5, Funny)
You just made a beautifully appropriate commentary on a common fixture of my childhood. Dude.