Debugging The Spirit Rover
icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"
Space Technology (Score:5, Insightful)
Here on earth we can't even build cars that require no maintenance and last more than 10 years.
Re:Space Technology (Score:2, Insightful)
What's the big deal?? (Score:4, Insightful)
If it was the hardware that got fried and they miraculously fixed that, I would understand but this was just a software glitch.
I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.
As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.
The proper fix... (Score:3, Insightful)
The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.
Dan East
Re:Space Technology (Score:5, Insightful)
Hindsight (Score:5, Insightful)
Granted, mainstream media have to keep their coverage dumbed down if Joe Public is going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never hear about them later when they are completely understood.
A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed, killing lots of people, and an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no follow-up.
This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.
Re:do they use SSH ? (Score:1, Insightful)
would have to have access to a very large radio antenna, like the one atop a volcano in Hawaii.
Re:do they use SSH ? (Score:5, Insightful)
Pinging mars-rover with 32 bytes of data:
request timed out
request timed out
request timed out
64 bytes from mars-rover: icmp_seq=0 ttl=64 time=32400ms
If it has anything to do with current internet protocols, it would be UDP.
Re:Space Technology (Score:4, Insightful)
They make a lot of money from loyal customers. But I admit my 13-year-old '91 Honda Civic with 140k miles is getting on my nerves with repair costs. Would a '91 Ford Escort still be running today? I think not.
I will buy only Toyotas and Hondas for that reason.
It amazes me that consumers are too stupid to read Consumer Reports and instead buy cars on looks. Repair costs for things like Cadillacs and BMWs are not cheap in TCO terms! Yes, consumer products have TCO too, and we, not just businesses, should look at that as well.
What the article doesn't say (Score:5, Insightful)
Ran out of flash disk space. No, really. (Score:2, Insightful)
They fixed it by telling it to boot without using the flash (safe mode).
We have a big 1TB NetApp server where I work, and we have so much disk space that people get lazy and don't delete files or archive old projects. Then they get really confused when jobs fail, and don't think of disk space until they have checked everything else first. But it happens, and it's usually surprisingly hard to debug (they check a lot of other things first, sometimes even upgrading tool versions!). It's really kind of funny, in an expensive and mildly embarrassing way, that Spirit had the same problem.
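A cheap guard against this kind of slow surprise is a periodic check that complains before the volume fills. The sketch below is only an illustration: fs_usage_pct and check_space are made-up helper names, the threshold is whatever you pick, and it assumes a POSIX df whose -P output puts the capacity in the fifth column (a filesystem name containing spaces would break the parsing).

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not from the article.

# fs_usage_pct PATH: print the used-capacity percentage (an integer) of the
# filesystem holding PATH, parsed from POSIX "df -P" output.
fs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_space PATH THRESHOLD: succeed if usage is below THRESHOLD percent,
# otherwise print a warning to stderr and fail.
check_space() {
    pct=$(fs_usage_pct "$1")
    if [ -z "$pct" ]; then
        echo "cannot read usage for $1" >&2
        return 2
    fi
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${pct}% full (limit $2%)" >&2
        return 1
    fi
    return 0
}
```

Something like check_space /home 90 run from a scheduled job would flag the problem before jobs start failing. Running out of file entries rather than bytes, which is closer to what the article describes, would need a separate check.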
Lucky Hack? (Score:5, Insightful)
The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?
Re:Space Technology (Score:5, Insightful)
Sort of like Debian.
Cutting edge ain't always what it's cracked up to be.
KFG
Re:What's the big deal?? (Score:4, Insightful)
Re:What's the big deal?? (Score:5, Insightful)
As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.
There are some fundamental differences, my friend:
Re:What's the big deal?? (Score:5, Insightful)
Spirit was in a constant reboot cycle, and the fact that they could even communicate with it long enough to bypass the problem was an accomplishment (and lucky).
It would be more similar to your remote data-center machine suddenly going offline and you have no idea why, and you are unable to ssh to it, and you fix it by running through potential scenarios and finding that the problem could have been due to mounting a certain partition, then discovering that there's an exploit in ICMP that allows you to hack the kernel so it doesn't mount that partition.
Lucky Hack? (Score:5, Insightful)
Basically, they rebooted from a recovery image (sent via radio) and then proceeded to do low-level fixes on Flash memory and then ran a chkdisk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!
One reasonable analogy (Score:1, Insightful)
Just in case of a worst case scenario, always make sure you have physical access to the machines.
Re:What's the big deal?? (Score:5, Insightful)
There is a significant lesson to learn, here .. (Score:3, Insightful)
Re:What's the big deal?? (Score:3, Insightful)
That's just it - consider the stress those rovers are enduring or might encounter: subzero temperatures down to -200F, out-of-the-blue (red?) sandstorms, gamma radiation, and who knows what else out there that could suddenly fsck with the systems or scramble internal data? Your average Dell rack will never have to deal with any of those things.
The scariest thing I've heard in a while (Score:0, Insightful)
Yes, see my post (Score:3, Insightful)
Great trick for ssh administration (Score:5, Insightful)
sleep 600 && reboot &
Now if your risky maneuver makes the ssh session unusable, just wait 10 minutes (600 seconds) for the machine to reboot.
This is great for fiddling with firewalls by remote control... through the firewall.
Oh... You say you're not using a POSIX-like system? That's not supported. Sorry.
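A slightly fuller version of the same idea adds a way to cancel the pending reboot once the change turns out fine. This is a sketch only: arm_deadman and disarm_deadman are invented names, ignoring SIGHUP is an assumption about how to survive a dropped session, and the fallback command needs whatever privileges reboot needs.

```shell
#!/bin/sh
# Dead-man's switch sketch: run a fallback command unless disarmed in time.
# arm_deadman and disarm_deadman are hypothetical helper names.

# arm_deadman DELAY_SECONDS COMMAND [ARGS...]
# Starts the timer in the background and prints the PID to disarm later.
arm_deadman() {
    delay=$1
    shift
    # Ignore hangup so a dropped ssh session does not cancel the fallback.
    # Output goes to /dev/null so $(arm_deadman ...) returns immediately.
    ( trap '' HUP; sleep "$delay" && "$@" ) >/dev/null 2>&1 &
    echo $!
}

# disarm_deadman PID
# Cancels the pending fallback once the risky change is confirmed good.
disarm_deadman() {
    kill "$1" 2>/dev/null
}

# Example (as root):
#   pid=$(arm_deadman 600 reboot)
#   ...risky firewall change...
#   disarm_deadman "$pid"    # only if the session still works
```

Killing the saved PID stops the waiting subshell, so the fallback never runs; the leftover sleep simply expires on its own.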
Re:Oh, sure... (Score:5, Insightful)
"We're ordering this brand new hardware that you've never tested before. Can you guarantee it will never crash?"
"Will this database server handle the load of our brand new project?" (without an accurate growth estimate)
"A server 2000 miles away just went down. What happened?" (no ping, no nothing) Hmmm.. Power/NIC/CPU/CPU fan/hard disks?
It really sounds like they did some decent advance planning on those probes, but from other stories I read, they were shooting for 90 days of reliability, which in itself was a hard one to do. What if it turns the antenna the wrong way and loses connectivity? What if it gets hit by lightning? What if it falls in a hole? (go Beagle!)
Sure, relate this to your web server colocated somewhere you're not. Cross your fingers, hold your breath, and hope there aren't a few fatal systems failures, or a bit of human error. I've been responsible for a bit of that in the past, but at least my equipment wasn't a few million miles away.
They didn't just randomly delete stuff (Score:5, Insightful)
Using the low-level commands, about a thousand files and their directories -- the leftovers from the initial launch load -- were removed.
I think that means they deleted the useless stuff they wanted to delete anyways but didn't get to delete before the crash. I also remember news about science data from before the crash that was received after they got the rover working again.
As for how critical it is, well yeah, it seems the rover didn't need the contents of the flash file system. The operating system and other software was in the same flash memory but I assume that any sane designer would put in some hardware write protect interlock that's not easy to defeat accidentally.
Re:NASA should have simulated... (Score:3, Insightful)
You also realize that NASA did do a test mission, right? They built a test rover and put it out in a desert somewhere. They used the mission to test the hardware, test the software, and to help train the team.
Re:What's the big deal?? (Score:5, Insightful)
Please! Back in the day people would write programs on paper, mail them in an envelope to a computing center somewhere, and get results weeks later.
THAT was pressure not to fuck up.
What we can learn: (Score:5, Insightful)
Re:OT:lots of mem of an embedded system (Score:1, Insightful)
If there was any revisionist crap here, it was the defining of M, G, and so on to be used in base 2 (1024) in the first place. Those are standardized units!
Re:What's the big deal?? (Score:4, Insightful)
Re:do they use SSH ? (Score:4, Insightful)
Re:What's the big deal?? (Score:4, Insightful)
It was nothing like what you described, just a VERY well designed system (though it would have been somewhat better had the system been able to go straight to "safe mode" after the initial critical error (running out of memory))
Did the people with mod points RTFA? Score 5 Insightful?
And no, I'm not new to
Re:WindRiver's fault (Score:4, Insightful)
Years ago, when JPL was designing the Mars Pathfinder mission, they asked Wind River to do an "affordable" port of VxWorks to the RAD6000 (a radiation-hardened RS6000), and they agreed. Since the computers on the two MERs are very similar to the computer on the Mars Pathfinder lander, it makes sense that they'd use the same OS that they used on the MPF lander.
I would think the fact that JPL knows VxWorks very well by now would be a major factor in deciding to use VxWorks for the MERs.
Re:Verifying the software !!! (Score:1, Insightful)
Perhaps with hindsight it might have worked a little differently, but you can't foresee every combination of events, and in fact the software worked flawlessly, allowing them to recover in exactly the way they designed it.
If this had been sitting on a desk next to them they probably would have sorted this out in 20 minutes, but obviously they needed to do this in a methodical and careful way to verify what they thought the problem was, and they needed to do it over a very slow and very delayed link which was itself being affected by the problem. Which is no doubt why it took a few days.
Re:Ran out of flash disk space. No, really. (Score:2, Insightful)
"What we have here, is failure to communicate."
Mal-2
Hmmmm (Score:4, Insightful)
Funny, that's how it was explained to me by my computer science teacher my freshman year in high school. He said, "The problem with computers is that they do exactly what we tell them to."
Re:Discovered a system log ? (Score:2, Insightful)
allocation errors are the easiest to predict. even if you don't handle them gracefully (which often can be near to impossible), most of the time you can log them. of course, a reliable, redundant log facility is one of the most crucial components of such a system...
writing this from my armchair, of course i can't really judge their competence and claim i could have done better. still, the article makes me suspicious.
Logging should not be limited? (Score:4, Insightful)
I have worked on projects in which there was so much logging going on that you couldn't tell head from toe anymore. When a problem arose, scanning the logfiles proved very cumbersome indeed. Every developer had his own stuff logged, which sometimes proved interesting and sometimes proved utter crap (no one wants to know that variable XYZ was increased by 1 some 24943 times).
You should develop a well-thought-out logging strategy that increases the logging verbosity on a problem basis, rather than simply logging everything that happens and hoping you get some useful information.
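One way to read "verbosity on a problem basis" is a level threshold that stays low in normal operation and is raised only once something goes wrong. A rough sketch, where LOG_LEVEL, log and escalate_logging are names invented for illustration and the numeric levels are an assumed convention:

```shell
#!/bin/sh
# Levels: 0=error 1=warn 2=info 3=debug (an assumed convention).
LOG_LEVEL=${LOG_LEVEL:-1}

# log LEVEL MESSAGE...: print the message only if LEVEL is within the
# current threshold.
log() {
    level=$1
    shift
    if [ "$level" -le "$LOG_LEVEL" ]; then
        printf '[%s] %s\n' "$level" "$*"
    fi
}

# escalate_logging: switch to full debug output after a problem is detected.
escalate_logging() {
    LOG_LEVEL=3
}
```

Code can then call log 3 as freely as it likes, and the error path calls escalate_logging so the detail only shows up once it is actually needed.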
Except just one thing: (Score:4, Insightful)
Rule 3: Never ignore the return value from open.
Re:What we can learn: (Score:1, Insightful)
Re:What we can learn: (Score:3, Insightful)
Re:The proper fix... (Score:2, Insightful)
But this was probably intentional. The flight to Mars is a long one, so there is plenty of time to test while the rover is in transit. Before launch, you need to make sure that the hardware works and is reliable. Since they can upload new versions of software, they can do much of the testing after the launch. This is one of the things that allowed them to hit aggressive launch windows.
This looks like it was less a technical failure and more a communications failure. Other rover operations were dependent on the utilities running to clear up flash space. When that did not happen on time, the right people were not told and so they assumed there would be more space available.
Re:What's the big deal?? (Score:3, Insightful)
Certainly the JPL/NASA guys are smart, experienced with other probes, and have massive resources backing them up. But they also have some heavy odds against them.