Debugging The Spirit Rover
icebike writes "eeTimes has a story on how the Mars Rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack, and the rover could have just as likely been lost forever. Are there lessons here that we can use here on the third rock for recovery of our messed up machines which we manage from afar via ssh?"
Oh, sure... (Score:5, Funny)
As a former co-worker (hi, jwalker!) used to say when people tried to draw ridiculous analogies, "It's exactly like that...only different."
Re:Oh, sure... (Score:5, Insightful)
"We're ordering this brand new hardware that you've never tested before. Can you guarantee it will never crash?"
"Will this database server handle the load of our brand new project?" (without an accurate growth estimate)
"A server 2000 miles away just went down. What happened?" (no ping, no nothing) Hmmm.. Power/NIC/CPU/CPU fan/hard disks?
It really sounds like they did some decent advance planning on those probes, but from other stories I read, they were shooting for 90 days of reliability, which in itself was a hard one to do. What if it turns the antenna the wrong way and loses connectivity? What if it gets hit by lightning? What if it falls in a hole? (go Beagle!)
Sure, relate this to your web server colocated somewhere you're not. Cross your fingers, hold your breath, and hope there aren't a few fatal systems failures, or a bit of human error. I've been responsible for a bit of that in the past, but at least my equipment wasn't a few million miles away.
Re:Oh, sure... (Score:4, Informative)
There is a low-gain omnidirectional antenna that can be used as backup. In fact I think they use it most of the time for commands and just use the high-gain for data transfer back to Earth. Which makes sense, they never need to send large amounts of data to the rover.
No lightning has ever been detected on Mars. Tho it's not impossible, it is very very unlikely. No proper observations of the night side of Mars have been done tho, so they may just be missing it.
And Opportunity did fall into a hole
How'd they do it? (Score:5, Funny)
Re:How'd they do it? (Score:5, Funny)
Re:How'd they do it? (Score:5, Funny)
Just another day in the life of a sys admin!
Re:Oh, sure... (Score:4, Informative)
Actually though, it's not too bad an analogy. While Earth-based servers aren't absolutely unreachable like Spirit, they are often remote, and there are expenses associated with visiting them in person.
Various schemes now exist to help deal with that. Many boards have a small management processor (bmc, server management board, IPMI, whatever) that is used for remote diagnostics and reconfiguration when the main board won't even boot.
Meanwhile, LinuxBIOS supports two complete BIOS images. One 'old reliable' that once working is never changed, and one that can be upgraded freely. Coupled with a watchdog card or timer, it's decently manageable in the field. That work is continuing.
Meanwhile, IBM is pushing the 'blue button' that forces a software reload from an image partition.
In that sense, the problem is strongly analogous. Most of us will not, however, encounter the exact problem that Spirit had, though some embedded device developers just might.
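The two-image scheme can be sketched as a boot selector that keeps a count of unconfirmed boots. This is a toy model in Python, not LinuxBIOS code: the names `BootState`, `choose_image`, `confirm_boot` and the threshold `MAX_FAILED_BOOTS` are all made up for illustration.

```python
from dataclasses import dataclass

MAX_FAILED_BOOTS = 2  # hypothetical threshold before giving up on the new image


@dataclass
class BootState:
    # Boots of the upgradable image that never reported success.
    failed_boots: int = 0


def choose_image(state: BootState) -> str:
    """Pick the image for this boot; count the attempt as failed until confirmed."""
    if state.failed_boots >= MAX_FAILED_BOOTS:
        return "fallback"  # the 'old reliable' image that is never rewritten
    state.failed_boots += 1
    return "normal"        # the freely upgradable image


def confirm_boot(state: BootState) -> None:
    """Called once the system is fully up; clears the failure count."""
    state.failed_boots = 0
```

The point of counting first and confirming later is that an image which hangs halfway through boot never gets the chance to clear the counter, so a watchdog reset eventually lands on the fallback image.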
Re:One reasonable analogy (Score:5, Informative)
Have a hardware watchdog. If the machine is lost or confused, it reboots itself.
Have it come up in a known state, fire off a few broadcast packets to the sysadmins, and run sshd but basically nothing else. Stay there for a minute or so.
If nobody's tried to log in and halt the boot process, carry on booting. With luck the problem was transient. Worst case the problem still exists, you reboot, and the admins get another chance to log in.
From the description of how they got Spirit back, it looks like this is exactly how it was set up.
Who'da thunk it!!
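The "stay there for a minute, then carry on" step might look roughly like the sketch below. It is only a simulation of the decision logic in Python: `boot_with_hold` and its parameters are hypothetical, and `login_seen` stands in for whatever check a real system would use to notice an admin logging in.

```python
import time
from typing import Callable


def boot_with_hold(login_seen: Callable[[], bool],
                   hold_seconds: float = 60.0,
                   poll_seconds: float = 1.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Wait in the minimal known-good state, then decide what to do next.

    Returns "held" if an admin logged in during the window (stay in safe
    mode), or "continue" if nobody did (carry on with the normal boot).
    """
    deadline = clock() + hold_seconds
    while clock() < deadline:
        if login_seen():
            return "held"
        sleep(poll_seconds)
    return "continue"
```

The clock and sleep functions are parameters only so the window can be exercised without actually waiting a minute.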
Re:One reasonable analogy (Score:3, Interesting)
Re:One reasonable analogy (Score:4, Interesting)
well, this presupposes that what caused the problem in the first place also didn't mess up the hardware watchdog as well.
Nothing's perfect. It also presupposes that the sun didn't explode and vaporize the Earth and that God didn't get ticked off and squish it with his thumb. So what?
A watchdog is a VERY simple device. A simple countdown timer, a control register with associated address decode, etc. It's quite unlikely to fail. When the timer hits zero, it strobes reset. Any access to the port address resets the countdown timer.
Some dual processor boards are even set up to alternate which is the boot processor, so they can come up with a single failed CPU.
There is always some sort of problem that precludes recovery. No amount of software or clever design can help you if the device is destroyed. However, that doesn't mean don't even try.
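For anyone who hasn't met one, the watchdog described above (countdown timer, reset strobe at zero, any access reloads the timer) can be modeled in a few lines. This is a software simulation in Python for illustration, not a driver for any real watchdog hardware; the class and method names are invented.

```python
class Watchdog:
    """Toy model of a hardware watchdog: a countdown that strobes reset at zero."""

    def __init__(self, timeout_ticks: int):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0  # how many times the reset line has been strobed

    def kick(self) -> None:
        """Any access to the watchdog port reloads the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one timer tick; return True if reset was strobed."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.timeout_ticks  # the timer restarts after reset
            return True
        return False
```

Healthy software calls `kick()` more often than the timeout; software that is hung or lost stops kicking and gets rebooted.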
Local Debugging (Score:3, Funny)
Re:Local Debugging (Score:5, Funny)
I dont know about learning much.... (Score:4, Funny)
They didn't just randomly delete stuff (Score:5, Insightful)
Using the low-level commands, about a thousand files and their directories -- the leftovers from the initial launch load -- were removed.
I think that means they deleted the useless stuff they wanted to delete anyways but didn't get to delete before the crash. I also remember news about science data from before the crash that was received after they got the rover working again.
As for how critical it is, well yeah, it seems the rover didn't need the contents of the flash file system. The operating system and other software was in the same flash memory but I assume that any sane designer would put in some hardware write protect interlock that's not easy to defeat accidentally.
well (Score:4, Funny)
Like this? (Score:4, Funny)
Remote debugging? (Score:4, Funny)
Re:Remote debugging? (Score:5, Funny)
"Oh, you've got the on-site warranty, huh? Ok, first thing you have to do is ship it to South Dakota. .
Oh, hey, looks just like Mars.
KFG
rebooting on mars... (Score:5, Interesting)
Re:rebooting on mars... (Score:3, Funny)
I wonder how many Microsoft salesmen were pushing for putting WinXP on it..
lots of mem of an embedded system (Score:5, Funny)
Re:OT:lots of mem of an embedded system (Score:3, Interesting)
Re:lots of mem of an embedded system (Score:3, Informative)
A highly respected embedded OS.
YAW.
Space Technology (Score:5, Insightful)
Here on earth we can't even build cars that require no maintenance and last more than 10 years.
Re:Space Technology (Score:5, Insightful)
Re:Space Technology (Score:5, Insightful)
Sort of like Debian.
Cutting edge ain't always what it's cracked up to be.
KFG
Re:Space Technology (Score:3, Funny)
Ring, Ring, Ring....
"Welcome to the Mars Rover answering system. For English press 1, Para Espanol prensa 2"
BEEP
"You selected English. To leave a message for Spirit press 1. To leave a message for Opportunity press 2"
BEEP
"You selected Spirit. Transferring now." CLICK "I'm sorry, Spirit is unavailable at this time. To leave a message press 1. To return to the main menu press 2"
BEEP
"Hi this is the Spirit rover. I can't come to the
Re:Space Technology (Score:4, Informative)
And frankly, if your car isn't lasting ten years, then you bought junk in the first place. Of the four cars I've owned, not one has had a lifetime of less than ten years. Three of them were already older than that when they came to me, and none lasted me less than four years. (Other than the one that got repossessed, but I had that one three years.) But then I invest in regular maintenance, don't leadfoot, etc...
Re:Space Technology (Score:4, Insightful)
They make a lot of money from loyal customers. But I admit my 13-year-old '91 Honda Civic with 140k miles is getting on my nerves with repair costs. Would a '91 Ford Escort still be running today? I think not.
I will buy only Toyotas and Hondas for that reason.
It amazes me that consumers are too stupid to read Consumer Reports and instead buy cars on looks. Repair costs for things like Cadillacs and BMWs are not cheap for TCO! Yes, consumer products have TCO too, and we, not just businesses, should look at that as well.
Re:Space Technology (Score:3, Informative)
Re:Space Technology (Score:4, Interesting)
What you can do is make it require less maintenance, make that maintenance cheaper to perform, and make the car last until you hit something really hard so long as you maintain it. You should be able to hand your car down to your kids.
Other than that you're bang on though.
I wonder what we can learn from that about maintaining our computers?
KFG
do they use SSH ? (Score:5, Funny)
All it takes is a transmitter out in the middle of nowhere in Africa or on some island.
I wouldn't worry about signal jamming though, as that will probably be discovered easily.
Re:do they use SSH ? (Score:5, Insightful)
Pinging mars-rover with 32 bytes of data:
request timed out
request timed out
request timed out
64 bytes from mars-rover: icmp_seq=0 ttl=64 time=32400ms
If it has anything to do with current internet protocols, it would be UDP.
Re:do they use SSH ? (Score:4, Funny)
I realize that Mars is a long way away, but how many routers do you think exist between here and there?
Re:do they use SSH ? (Score:4, Insightful)
Pissed Martians (Score:5, Funny)
What's the big deal?? (Score:4, Insightful)
If it was the hardware that got fried and they miraculously fixed that, I would understand but this was just a software glitch.
I routinely reboot and reprogram machines in our data-center that is 2000 miles away from me.
As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.
Re:What's the big deal?? (Score:5, Funny)
As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.
You are too humble, friend. What you do routinely and without thinking, is nothing less than a miracle of modern science. A miracle that you take part in every day. And because of men like you, we don't have to rely on the abacus anymore. We sent a pentium to the Moon, and soon, Mars will be colonized by G5s. America salutes you, for all the things that you do.....
Like a rock! I was strong as I could be!
Ooooooohh! Like a rock!
Re:What's the big deal?? (Score:4, Insightful)
Re:What's the big deal?? (Score:5, Insightful)
As long as all hardware components are working and there is connectivity to the machine, it doesn't matter whether the machine is a few miles away or a million miles away.
There are some fundamental differences, my friend:
Re:What's the big deal?? (Score:4, Insightful)
Re:What's the big deal?? (Score:5, Insightful)
Spirit was in a constant reboot cycle, and the fact that they could even communicate with it long enough to bypass the problem was an accomplishment (and lucky).
It would be more similar to your remote data-center machine suddenly going offline and you have no idea why, and you are unable to ssh to it, and you fix it by running through potential scenarios and finding that the problem could have been due to mounting a certain partition, then discovering that there's an exploit in ICMP that allows you to hack the kernel so it doesn't mount that partition.
Re:What's the big deal?? (Score:4, Insightful)
It was nothing like what you described, just a VERY well designed system (though it would have been somewhat better had the system been able to go straight to "safe mode" after the initial critical error (running out of memory))
Did the people with mod points RTFA? Score 5 Insightful?
And no, I'm not new to
Re:What's the big deal?? (Score:5, Insightful)
Re:What's the big deal?? (Score:5, Insightful)
Please! Back in the day people would write programs on paper, mail them in an envelope to a computing center somewhere, and get results weeks later.
THAT was pressure not to fuck up.
Re:What's the big deal?? (Score:3, Interesting)
(hint - draw a diagonal line across their top edges so you can get them in order again quickly.)
Some people don't know why "batch" files were so called, it seems.
YAW.
Re:What's the big deal?? (Score:5, Interesting)
Re:What's the big deal?? (Score:3, Insightful)
That's just it - consider the stress those rovers are enduring or might encounter: subzero temperatures down to -200f, out-of-the-blue (red?) sandstorms, gamma radiation, and who knows what else out there that could suddenly fsck with the systems or scramble internal data? Your average Dell rack will never have to deal with any of thos
Bud Light Presents...Real Men of Genius. (Score:4, Funny)
Only YOU can fully appreciate the difficulty of running a format c: command, while swilling a room temperature can of Red Bull.
"Hey this stuff is hard now!"
While NASA is too preoccupied with things like faraway rovers, you take your vocational tech school fueled arrogance directly to the place where it will make the absolute least possible impact: A Slashdot discussion thread.
"Loggin' on now!"
Your unique eye for obviousness allows you to sling turds of obtuseness every which way, and then brag about how you were RIGHT as soon as one of your pronouncements hit true - regardless of how many times you were wrong before.
"See I told you sooooooo!!"
And if some idiot rocket scientist has the unmitigated gall to not bow down to your obvious Geniusdom, you unleash your fury down upon him with all the tenacity and mercilessness of a rabid pit bull with a tender buttock locked in its jaws.
"Total anonymity!"
So keep clicking away, oh Marauder of the Mousepad. Because when the results you so desire finally come about years from now, you can say it was because YOU demanded it.
"How come they haven't fired that dumbass head of NASA yet?"
(Bud Light Beer, Anheuser Busch, St. Louis Missouri.)
Uh-oh (Score:5, Funny)
Sounds like NASA forgot to empty the rover's recycle bin. =)
Re:Uh-oh (Score:3, Funny)
Regards,
Steve
Re:Uh-oh (Score:3, Funny)
Re:Uh-oh (Score:3, Funny)
Trash can? More like a neutron star, 'cause anything you put in it is totally and absolutely gone.
The proper fix... (Score:3, Insightful)
The only real bug was the inability of the system to properly handle running out of file entries (or more specifically, consuming too much RAM as the number of file entries increased). However, the software should never have stressed the filesystem to that degree in the first place.
Dan East
Re:The proper fix... (Score:5, Funny)
When you can write an embedded operating system that can gracefully and automatically recover from every possible thing that might ever go wrong, perhaps you should send your resume to NASA.
Re:The proper fix... (Score:5, Informative)
The rovers were extensively tested before launch. For example, NASA took about 100000 pictures with the test panoramic cameras under varying conditions to see how they would react. NASA put a test rover on a tilting platform to see how far over the rover could tilt before it capsized, to find out at what angle the electric motors could no longer drive the rover up a hill, etc.
This limitation of the filesystem was known about ahead of time. If you had read the article, you'd have known that. They had a utility to clean out the rover's filesystem, but a storm at the Deep Space Network site that was supposed to transmit it prevented the second half of the utility from being uploaded to the rover. And before you say anything else, the article also mentioned that the people involved had thought of this possibility ahead of time.
Hindsight (Score:5, Insightful)
Granted, mainstream media have to keep their coverage dumbed down if Joe Public is going to read it. But what really bugs me is the lack of follow-up. We hear about poorly understood events as they are unfolding, then never hear about them later when they are completely understood.
A recent example is the gangway between ship and shore at the QM2's drydock. It collapsed killing lots of people, an investigation was launched. Why did it collapse? At the time it wasn't known. I'm sure it's known now, but there's been absolutely no followup.
This article about the rover is great not so much because of the level of detail but because it reports on an event with the benefit of hindsight.
Re:Hindsight (Score:3, Informative)
Re:Hindsight (Score:5, Interesting)
I suggested we allow j-students to substitute math or hard science minors in place of the foreign language requirement. Most graduates of college foreign language programs don't translate at a level any higher than Babelfish. It seems wasteful to force people to spend so much time learning a language that most will never use, when that time could be more productively spent introducing them to the languages of math and science, which they will undoubtedly use in the future. We'd get better reporting that way, and isn't that what going to j-school is all about? Science and technology are too important to our day-to-day lives and governance to be left to illiterates.
Re:Hindsight (Score:3, Funny)
What the article doesn't say (Score:5, Insightful)
Re:What the article doesn't say (Score:4, Interesting)
Mod this "redundant" (Score:5, Informative)
'How do you diagnose an embedded system that has rendered itself unobservable?'
The way you do this is by having an exact duplicate of the remote system so you can set up a test with conditions as close as possible to those under which the remote system is currently operating. You can then run a series of carefully controlled test solutions to determine the optimum before trying it on the "live" system.
This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped) I've had perfect uptime.
(well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)
Lucky Hack? (Score:5, Insightful)
The outcome does not strike me as a "Lucky Hack." They made the system flexible, that flexibility got them into some trouble, and it's also what got them out of it. Anyone else agree?
Yes, see my post (Score:3, Insightful)
All these worlds.... (Score:5, Funny)
Yeah, that was HAL's excuse too.
Seriously, hats off to all the JPL programmers. Proving to the Martians that there is indeed intelligent life on Earth, very intelligent.
Remote debugging pet peeve (Score:5, Funny)
Peter.
Lucky Hack? (Score:5, Insightful)
Basically, they rebooted from a recovery image (sent via radio) and then proceeded to do low-level fixes on Flash memory and then ran a chkdisk. If I do something similar via recovery disk or CD, I don't get a lot of people telling me that it was a "Lucky Hack" that I could boot off of CD!!!
NASA Rocks! (Score:5, Interesting)
Seriously though, the key lessons to take away from this are:
1) Gather all of the clues you can.
2) Take those clues and build a model.
With luck and care, the model should get you closer to what may have gone wrong. And in this case it apparently did just that. Now that's geek cool!
BTW, I know that generally you want to prevent this sort of thing from happening. But in reality most software ships with bugs and launch windows to Mars are non-negotiable.
Remote safe mode (Score:4, Interesting)
There also needs to be a way to load bootstrap code remotely. For instance, having a TCP/IP enabled BIOS be able to run TFTP or some other protocol to load a netboot floppy image. Then you could give it a LILO command instructing it where to find a boot image, preferably one on a server in the same hosting center.
whoops (Score:5, Funny)
Damn! Left the floppy in!
Could an earthbound 'twin' computer help? (Score:5, Interesting)
From the article "[The] transmission that uploaded the utility was a partial failure: Only one of the utility program's two parts was received successfully. The second part was not received, and so in accordance with the communications protocol it was scheduled for retransmission on sol 19." NASA could have simulated a half-failed transfer on the twin computer on earth, and then watched carefully using traditional debugging tools to make sure the failed transmission didn't cause a software failure (which it did).
Again, from the article "The data management team's calculations had not made any provision for leftover directories from a previous load still sitting in the flash file system." However, if they had a twin computer system to watch, they would have seen the failure occur on earth as it did in space. Debugging a system you can hook a serial debugger to is bound to be much easier than debugging a system a million miles away.
Re:Could an earthbound 'twin' computer help? (Score:4, Informative)
Re:Could an earthbound 'twin' computer help? (Score:3, Informative)
When Spirit was turned around on its lander, they tested the moves on its twin here, hence the long delay getting off the lander.
Brilliant! (Score:4, Funny)
-SF
There is a significant lesson to learn, here .. (Score:3, Insightful)
Does Microsoft know about this? (Score:3, Interesting)
First wxWindows [slashdot.org], now Vx-works?
Great trick for ssh administration (Score:5, Insightful)
sleep 600 && reboot &
Now if your risky maneuver makes the ssh session unusable, just wait 10 minutes (the 600 seconds above) for the machine to reboot.
This is great for fiddling with firewalls by remote control... through the firewall.
Oh... You say you're not using a POSIX-like system? That's not supported. Sorry.
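The same dead-man's-switch idea works for things less drastic than a reboot, such as restoring the old firewall rules. Here is a minimal Python sketch using `threading.Timer` from the standard library; `arm_rollback` is a name made up for this example, and the rollback function is whatever undoes your risky change.

```python
import threading
from typing import Callable


def arm_rollback(delay_seconds: float, rollback: Callable[[], None]) -> threading.Timer:
    """Schedule `rollback` to run after `delay_seconds` unless cancelled first."""
    timer = threading.Timer(delay_seconds, rollback)
    timer.daemon = True  # don't keep the process alive just for the timer
    timer.start()
    return timer
```

Arm it before the risky change and call `.cancel()` on the returned timer once you have confirmed you can still reach the machine. The obvious caveat: the timer only helps while this process stays alive, so it must run on the remote machine and survive the loss of your session, which is what backgrounding does for the one-liner above.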
Verifying the software !!! (Score:4, Informative)
Re:Verifying the software !!! (Score:5, Interesting)
Software verification is essentially mathematically proving the software....
I've been hearing how great formal verification is since I started this gig. Three decades later, it's still not what Yourdon and his buddies thought it would be. When the first computer scientists were budded from mathematics departments, their mathematical discipline allowed them to do wonderful things, some of which we're still catching up with. But it also gave them some disturbing habits, the worst of which is the insistence that formal verification is the best way to write code, and anyone not doing so must be a fool.
Formal verification is a powerful tool, but as you say, it is expensive and applies to only a limited set of problems. If it were so cheap and so widely applicable, we'd be using it everywhere.
We've poured decades of funding into formal verification, but the useful tools keep coming from other avenues of research. I think it's time to stop beating the formal verification drum.
The only reason nasa got it back to work (Score:3, Informative)
What we can learn: (Score:5, Insightful)
Re:What we can learn: (Score:3, Interesting)
JPL (Score:3, Funny)
Hmmmm (Score:4, Insightful)
Funny, that's how it was explained to me by my computer science teacher my freshman year in high school. He said, "The problem with computers is that they do exactly what we tell them to."
Discovered a system log ? (Score:4, Interesting)
Those guys are running a very expensive experiment, are logging it and they have no idea what and where they are logging??
Logging should not be limited ? (Score:4, Insightful)
I have worked on projects in which there was so much logging going on that you couldn't tell head from toe anymore. When a problem arose, scanning the logfiles proved very cumbersome indeed. Every developer had his own stuff logged, which sometimes proved interesting, sometimes proved utter crap (no one wants to know variable XYZ is increased by 1 for 24943 times).
You should develop a well-thought-out logging strategy that increases the logging verbosity on a problem basis, not simply log everything that happens and hope you get some useful information.
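One way to get verbosity "on a problem basis" is to hold recent debug records in memory and only write them out when an error shows up. The sketch below does this in Python on top of the standard library's `logging.handlers.MemoryHandler`; the subclass `RingBufferHandler` and the helper `make_logger` are my own names, and the capacity of 200 is an arbitrary choice.

```python
import logging
from logging.handlers import MemoryHandler


class RingBufferHandler(MemoryHandler):
    """Keep only the newest `capacity` records; pass them on when trouble appears."""

    def shouldFlush(self, record: logging.LogRecord) -> bool:
        # The stock MemoryHandler flushes when the buffer fills up.
        # Here a full buffer silently drops its oldest records instead.
        while len(self.buffer) > self.capacity:
            del self.buffer[0]
        # Flush only when a record at or above flushLevel arrives.
        return record.levelno >= self.flushLevel


def make_logger(name: str, target: logging.Handler, capacity: int = 200) -> logging.Logger:
    """Build a logger that stays quiet until something at ERROR or above happens."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.addHandler(RingBufferHandler(capacity, flushLevel=logging.ERROR,
                                        target=target, flushOnClose=False))
    return logger
```

With this setup the log stays empty during normal operation, and an error arrives together with the last few hundred debug lines that led up to it.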
Launching with incomplete code is common (Score:5, Interesting)
I'm sure the rovers did the same thing... Develop the launch/cruise software before you launch (and of course try to get as much of the entry/landing code done as you can!), and then uplink the final code before it's needed. Therefore it doesn't surprise me one bit that the JPL engineer knew there were shortcomings in the launch software.
Hell, I develop BIOS for servers and we do it all the time. The BIOS image we give the hardware engineers for initial bringup is usually *way* short of features that will be there when it actually gets used by the customers!
you, too, can have this capability on earth... (Score:4, Informative)
It's not that hard to pull off this sort of seemingly amazing remote recovery with pure off-the-shelf tech if you plan for it in advance and are willing to pay a modest premium.
You need remote serial console access -- ideally including firmware/bios serial console access -- and remote power cycling, controlled by a small embedded system, either in separate units (APC masterswitch, terminal servers) or as part of the system unit (common on Sun gear as "LOM"/"ALOM"/etc.; some of this is also creeping into x86 mobos). All this lets you regain control of the system remotely.
Then it becomes a matter of hardening the system to let you recover from various other insults. Never let go with both hands: Mirrored disks (protecting against hardware failure) and multiple bootable partitions (protecting against software or human error) can both be used; netbooting is also a nice capability to have when you've got a bunch of servers in the same place.
Disclaimer: I bet you can do much of the above with other people's gear, but I work for Sun and I know it works for me...
not really... (Score:4, Informative)
Another factor in this is the safety of the flash ram. It is rad-hardened and built with tons of extra error correction which again, requires years of testing and special design considerations. And is extremely expensive.
Re:only 120 megs ram? (Score:5, Informative)
The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.
Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.
Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
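To make the "number of files, not amount of data" point concrete, here is a back-of-the-envelope model in Python. The function names and every number used with them are hypothetical; none of them come from the rover or from the article.

```python
def mount_ram_needed(file_count: int, bytes_per_entry: int) -> int:
    """RAM for the in-memory directory structure: grows with file count, not file size."""
    return file_count * bytes_per_entry


def can_mount(file_count: int, bytes_per_entry: int, ram_budget: int) -> bool:
    """True if the directory structure fits in the RAM set aside for it."""
    return mount_ram_needed(file_count, bytes_per_entry) <= ram_budget
```

With a made-up 512 bytes per entry and a 1,000,000-byte budget, 1000 files fit and 3000 do not, no matter how much free flash is left.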
Re:NASA should have simulated... (Score:4, Informative)
Re:NASA should have simulated... (Score:3, Insightful)
You also realize that NASA did do a test mission, right? They built a test rover and put it out in a desert somewhere. They used the mission to test the hardware, test the software, and to help train the team.
Ran out of INODES. No really. (Score:5, Informative)
To me, if this were a Unix-like system, it sounds like they ran out of inodes [webopedia.com]. Running out of inodes is very different than running out of disk space.
If you think running out of disk space can be hard to troubleshoot, try running out of inodes.
Re:Ran out of INODES. No really. (Score:4, Funny)
That's why I always keep a spare bag or two of inodes on hand, just in case. They're small so they don't take up too much space in the closet. I store them next to those f-stops I used to use for photography.
Re:Ran out of flash disk space. No, really. (Score:3, Informative)
After a few days on Mars, they were starting to fill up the flash, so they planned to go ahead and delete the old la
Re:Ran out of flash disk space. No, really. (Score:4, Informative)
It was the inability to build the RAM-based directory structure of the files in the Flash memory.
Why couldn't they build the directory structure? They had too many files; the size of the files doesn't matter here, only the number of files.
In other words, they ran out of RAM, not Flash.
Exercise left for the readers: Why can a Unix file system that is out of inodes have much less than 100% disk usage and still not be able to create a file?
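A hint for the exercise: on filesystems such as ext2/ext3, the number of inodes is fixed when the filesystem is created, so blocks and inodes are two separate budgets. The Python sketch below reports both using `os.statvfs` (Unix only); the helper names `percent_used` and `usage` are invented for this example.

```python
import os


def percent_used(total: int, free: int) -> float:
    """Percentage of a resource in use; 0.0 if the filesystem reports no total."""
    if total <= 0:
        return 0.0
    return 100.0 * (total - free) / total


def usage(path: str = "/") -> dict:
    """Report block and inode usage for the filesystem containing `path`."""
    st = os.statvfs(path)
    return {
        "block_pct": percent_used(st.f_blocks, st.f_bfree),
        "inode_pct": percent_used(st.f_files, st.f_ffree),
        "inodes_free": st.f_ffree,
    }
```

A filesystem full of tiny files can show a low `block_pct` while `inode_pct` sits at 100, at which point creating a new file fails even though there is plenty of space.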
Re:WindRiver's fault (Score:3, Informative)
Re:WindRiver's fault (Score:4, Insightful)
Years ago, when JPL was designing the Mars Pathfinder mission, they asked Wind River to do an "affordable" port of VxWorks to the RAD6000 (a radiation-hardened RS6000), and they agreed. Since the computers on the two MERs are very similar to the computer on the Mars Pathfinder lander, it makes sense that they'd use the same OS that they used on the MPF lander.
I would think the fact that JPL knows VxWorks very well by now would be a major factor in deciding to use VxWorks for the MERs.
Except just one thing: (Score:4, Insightful)
Rule 3: Never ignore the return value from open.
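In C, that means checking whether open returned -1 and looking at errno. In Python the built-in `open` raises `OSError` instead, so the equivalent of Rule 3 is never to let that exception go unhandled by accident. A small sketch, where `open_checked` is a hypothetical helper name:

```python
import errno
from typing import IO, Optional, Tuple


def open_checked(path: str, mode: str = "r") -> Tuple[Optional[IO], Optional[str]]:
    """Open a file and return (file, None), or (None, symbolic errno name) on failure."""
    try:
        return open(path, mode), None
    except OSError as exc:
        # Map the numeric errno to its name (ENOENT, ENOSPC, EMFILE, ...).
        return None, errno.errorcode.get(exc.errno, "UNKNOWN")
```

The caller is then forced to look at the second value before touching the file, and gets a reason it can log or act on.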