Debugging The Spirit Rover

icebike writes "EE Times has a story on how the Mars rover was essentially reprogrammed from millions of miles away. 'How do you diagnose an embedded system that has rendered itself unobservable? That was the riddle a Jet Propulsion Laboratory team had to solve when the Mars rover Spirit capped a successful landing on the Martian surface with a sequence of stunning images and then, darkness.' The outcome strikes me as an extremely Lucky Hack; the rover could just as easily have been lost forever. Are there lessons we can use here on the third rock to recover the messed-up machines we manage from afar via ssh?"
  • Mod this "redundant" (Score:5, Informative)

    by Penguinshit ( 591885 ) on Sunday February 22, 2004 @02:06AM (#8354221) Homepage Journal

    'How do you diagnose an embedded system that has rendered itself unobservable?'

    The way you do this is by having an exact duplicate of the remote system, so you can set up a test under conditions as close as possible to those under which the remote system is currently operating. You can then run a series of carefully controlled tests to determine the best fix before trying it on the "live" system.

    This is the way I set up all my production systems and, barring catastrophic hardware failure (self-immolating disks and a router which just folded when its power supply burped), I've had perfect uptime.

    (well, ok.. there was that one time, late at night, when I typed "reboot" in the wrong window.. but that happens...)

  • not really... (Score:4, Informative)

    by rebelcool ( 247749 ) on Sunday February 22, 2004 @02:17AM (#8354270)
    On projects such as this, the design specs would've been frozen several years ago, and even then would've been conservative for the time, using proven technology.

    Another factor in this is the safety of the flash memory. It is rad-hardened and built with tons of extra error correction, which, again, requires years of testing and special design considerations. And it is extremely expensive.

  • Re:Space Technology (Score:2, Informative)

    by afidel ( 530433 ) on Sunday February 22, 2004 @02:21AM (#8354290)
    Just gave a '93 Ford Taurus to my brother's fiancée; it runs great, and in the 5 years I owned it I had to replace a seal on the radiator and that was IT, other than oil and gas. My current car is a '99 Taurus with 158K miles and I haven't put a dime into it other than oil and brakes. I need to do spark plugs, as the fuel economy has gone down this winter and that's the most likely cause =)
  • Re:Hindsight (Score:3, Informative)

    by Jeremy Erwin ( 2054 ) on Sunday February 22, 2004 @02:31AM (#8354327) Journal
    I'm sure there will be at least some mention of the results of the investigation when it is completed and various persons are prosecuted. In the meantime, here's a relatively recent article [yahoo.com] on the investigation into the collapse.
  • by updog ( 608318 ) on Sunday February 22, 2004 @02:34AM (#8354337) Homepage
    The fact that they filled up the flash memory with too many files that were accumulated during the cruise phase of the mission between Earth and Mars was something that they should have known would happen.

    Apparently you didn't read the article. Because of a communication failure, a utility that was supposed to delete the old files didn't get completely uploaded. The utility was scheduled for retransmission, but the filesystem filled up before it got retransmitted.

  • by Anonymous Coward on Sunday February 22, 2004 @02:51AM (#8354383)
    Uhmm... we DID build a 'twin' of the rover, hardware and all. Give us a bit more credit, will ya? :-P What you may not realize is that exposure to radiation on the surface of Mars, solar wind while in transit and other factors such as thermal expansion / contraction, etc. are slowly degrading the rovers in nondeterministic ways. It is not nearly as simple as 'running the commands in the testbed' at JPL to diagnose any problems which occur.
  • by Gogo Dodo ( 129808 ) on Sunday February 22, 2004 @02:51AM (#8354384)
    They do have a twin system here, but having one here isn't quite the same as the two on Mars. You can't replicate everything on the two Mars rovers, such as the science data files.

    When Spirit was turned around on its lander, they tested the moves on its twin here, hence the long delay getting off the lander.

  • by dorko ( 89725 ) * on Sunday February 22, 2004 @02:51AM (#8354385) Homepage
    If you RTFA you will realize that I'm not lying in the least when I say that, effectively, they ran out of flash-based "disk" space!
    Well, I did read the article, and I wouldn't say it quite like that. The article says: "Spirit attempted to allocate more files than the RAM-based directory structure could accommodate." Furthermore, the article says that the low-level file manipulation commands "worked directly on the flash memory without mounting the volume or building the directory table in RAM."

    To me, if this were a Unix-like system, it sounds like they ran out of inodes [webopedia.com]. Running out of inodes is very different from running out of disk space.

    If you think running out of disk space can be hard to troubleshoot, try running out of inodes.
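
    If you want to see the difference on a Unix box, both counters are one statvfs() call away. Here's a minimal sketch (plain POSIX C, obviously nothing to do with the rover's VxWorks code):

```c
/* Print free data blocks vs. free inodes for a filesystem.
 * A volume can have plenty of one and none of the other. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/";
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("free blocks: %lu of %lu\n",
           (unsigned long)vfs.f_bavail, (unsigned long)vfs.f_blocks);
    printf("free inodes: %lu of %lu\n",
           (unsigned long)vfs.f_favail, (unsigned long)vfs.f_files);
    return 0;
}
```

    Out of blocks with inodes to spare, you can still create zero-length files; out of inodes with blocks to spare, you can't create a file at all, even though df says there's room.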

  • Re:The proper fix... (Score:5, Informative)

    by KewlPC ( 245768 ) on Sunday February 22, 2004 @03:15AM (#8354449) Homepage Journal
    Score: -1, Didn't Read Article

    The rovers were extensively tested before launch. For example, NASA took about 100,000 pictures with the test panoramic cameras under varying conditions to see how they would react. NASA put a test rover on a tilting platform to see how far over the rover could tilt before it capsized, to find out at what angle the electric motors could no longer drive the rover up a hill, etc.

    This limitation of the filesystem was known about ahead of time. If you had read the article, you'd have known that. They had a utility to clean out the rover's filesystem, but a storm at the Deep Space Network site that was supposed to transmit it prevented the second half of the utility from being uploaded to the rover. And before you say anything else, the article also mentioned that the people involved had thought of this possibility ahead of time.
  • by vinit79 ( 740464 ) on Sunday February 22, 2004 @03:30AM (#8354489)
    What really surprises me is that NASA did not verify the software. Software verification [google.com] is essentially mathematically proving the software correct. It is tedious and expensive, but we are talking about NASA and Mars. In fact, even beloved MS formally verifies device drivers [microsoft.com] before use (believe it or not!). If the original program had been correct, they wouldn't have had to re-upload it, and the entire problem... gone.
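
    For a taste of what that looks like in practice, tools like Frama-C let you annotate C with contracts that a prover then checks against the code. A toy example (purely illustrative; the function, bounds, and annotations are mine, nothing from the rover):

```c
/* ACSL contract (Frama-C style): the prover must show the
 * postcondition holds for every input meeting the precondition. */
/*@ requires 0 <= n < 1000;
    ensures \result == n * (n + 1) / 2;
*/
int sum_to(int n)
{
    int total = 0;
    /*@ loop invariant 1 <= i <= n + 1;
        loop invariant total == (i - 1) * i / 2;
        loop variant n - i;
    */
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}
```

    The catch, as the article makes clear, is that the bug here wasn't in any one routine; it was a resource limit hit by an unplanned pile of files, which is exactly the kind of whole-system property that's hard to state as a contract, let alone prove.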
  • by KewlPC ( 245768 ) on Sunday February 22, 2004 @04:05AM (#8354557) Homepage Journal
    You realize that the onboard computer is basically the same one as used on the Mars Pathfinder lander, right? Same CPU, same amount of RAM, even the same OS. I wouldn't be surprised if they used the same (or similar) circuit diagrams for certain things.

    The point is to use well known and well tested hardware. The whole point of Mars Pathfinder was to develop a system whose design could be re-used for other Mars landers and rovers.

    Lastly, what exactly are you going to do with greater flash capacity? The point of having any flash memory on the rovers at all is not for long term storage, but rather just to hold onto data until it can be transmitted to Earth, after which it gets deleted.

    Despite what some idiot posted a few posts up, they did NOT run out of room on the flash drive. Rather, the problem is more akin to running out of i-nodes. Mounting the flash filesystem, reading all its metadata and whatnot, took up more RAM than was allocated for it, due to the high number of files it had to deal with (most of which were accumulated on the way to Mars, and were going to be deleted).
  • by PipoDeClown ( 668468 ) on Sunday February 22, 2004 @04:13AM (#8354577)
    ...is because when the batteries got drained, the OS went into a stable "safe mode" state. If they had made a longer-lasting power supply, this project would have been doomed, and they never would have found out what the real problem was.
  • Re:WindRiver's fault (Score:3, Informative)

    by KewlPC ( 245768 ) on Sunday February 22, 2004 @04:17AM (#8354590) Homepage Journal
    Actually, they used VxWorks because it was the same OS used for the lander on the Mars Pathfinder mission. Since they were using the same CPU and same basic computer design as the Mars Pathfinder lander, they probably figured, "Why not use the same OS?"
  • by SiliconEntity ( 448450 ) on Sunday February 22, 2004 @04:28AM (#8354607)
    Here's what happened according to the article. They launched the ship with an OS image in flash, and soon realized that they needed to update it. So shortly after launch they sent another complete OS image. They knew they'd have to delete the first image, but they didn't do it right away. At that point there was plenty of room in the flash memory so having two OS images was not a problem.

    After a few days on Mars, they were starting to fill up the flash, so they planned to go ahead and delete the old launch OS image, its directories and files. This is a complicated process so they uploaded a special program to do it on Sol 15. And apparently they informed the rest of the team that the memory would be free and available after that point, so the rest of the team made plans to start filling it up with pictures.

    However, the upload on Sol 15 failed, and was rescheduled for Sol 19. Now, here's the big mistake (which the article glosses over): they forgot to tell the rest of the team that all that memory wasn't going to be freed up as planned, not for a few more days. So instead, Spirit is moving around, taking lots of pictures, storing them in flash, and all the people involved with that think they have plenty of room. Little do they know that they are running out of flash space. Finally, on the morning of Sol 19, shortly before the memory-cleaning program was going to be uplinked, it happened. The flash memory was exhausted. This triggered a sequence of events which put the craft into a failure loop.

    The big problem here, then, was the failure on the part of the group which was supposed to clean out the launch OS image to tell the rest of the team that it wasn't going to happen as scheduled, so the memory wasn't going to be available. It wasn't really Murphy's Law, but rather a failure to communicate among the team. This is an institutional problem which will hopefully be fixed.
  • Re:Space Technology (Score:3, Informative)

    by Zakabog ( 603757 ) <john&jmaug,com> on Sunday February 22, 2004 @04:39AM (#8354628)
    I have a '90 Ford Mustang, you insensitive clod. Still runs strong today, has like 107,000 miles on it, and I'm sure it'd destroy your Civic in a race ;-P. The only money I've really been spending is on a tune-up and new tires (old tires were crappy and leaking air). And besides, when someone buys a Cadillac or BMW (and god damn it, it's Toyota, what the hell is Toyata) they don't care about the price. When you're going to spend $30,000 on a "cheap" BMW 3 series, you're not gonna care that it costs x amount more than a cheap Japanese car.

    Cadillacs I don't really know too well, but I know a BMW doesn't need a whole lot of repairs. Most German cars are VERY well built, much better than Japanese cars too. And what good's a car that'll last you forever if you don't like the piece of shit in the first place?

    I just bought a new car (my Mustang's in NY; my sister drives it now, since my grandmother didn't like the idea of me driving across the country in a 1990 Mustang with 300+ rwhp on such long straight roads; top speed's 145, btw). I could have gone with a VERY cheap Honda Civic, and it would probably last me most of my adult life, but why would I want such a piece...? I bought a fully loaded 2004 Nissan Sentra SE-R Spec V: it's a quick car, with low insurance and great looks. I wouldn't have bought anything less, and I didn't look into TCO at all; it didn't really matter to me. I don't want a car that'll last me forever if I don't like it.

    And most people let the dealer pick out the car they want. They don't really realize it, but they don't care: the average person wants to get from point A to point B, and the salesman is gonna try to sell them a car that costs a lot of money, not caring about the life of the car.
  • by Anonymous Coward on Sunday February 22, 2004 @04:48AM (#8354645)
    1. It's not a known broken OS. It's an OS that doesn't have any failsafe to protect against running out of storage, and user error caused it to allocate too many files. The people who were keeping track of old files from a failed transfer weren't talking to the guys that allocated new files, so nobody knew how many files were actually allocated and they ran out.

    2. That's not what "begs the question" means. http://skepdic.com/begging.html [skepdic.com]

    3. Based on 1 and 2, it is proved by example that you=monkey puppet.
  • by Anonymous Coward on Sunday February 22, 2004 @04:57AM (#8354683)
    IIRC, VxWorks' native filesystem is FAT16. These are not the same flash chips you'd put in your camera; rather, they're radiation-hardened components built to insane reliability specs. The batteries in the rover will probably fail before the flash memory.
  • by dorko ( 89725 ) * on Sunday February 22, 2004 @05:15AM (#8354722) Homepage
    [T]hey are running out of flash space. ... The flash memory was exhausted.
    No, no, NO!

    It was the inability to build the RAM-based directory structure for the files in the flash memory.

    Why couldn't they build the directory structure? They had too many files. The size of the files doesn't matter here, only the number of files.

    In other words, they ran out of RAM, not Flash.

    Exercise left for the readers: Why can a Unix file system that is out of inodes have much less than 100% disk usage and still not be able to create a file?
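
    (If you'd rather watch it happen than reason it out, here's a quick C demo. Run it only in a scratch directory on a throwaway filesystem, since it deliberately eats every inode:)

```c
/* Create empty files until the filesystem refuses. On a volume that
 * is out of inodes, open(O_CREAT) fails with ENOSPC even though df
 * still shows free blocks: every file costs an inode, however small. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char name[64];

    for (long i = 0; ; i++) {
        snprintf(name, sizeof name, "inode_eater.%ld", i);
        int fd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0600);
        if (fd < 0) {
            fprintf(stderr, "failed after %ld files: %s\n",
                    i, strerror(errno));
            return 1;
        }
        close(fd); /* zero-length: consumes an inode, no data blocks */
    }
}
```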

  • by zcat_NZ ( 267672 ) <zcat@wired.net.nz> on Sunday February 22, 2004 @05:15AM (#8354723) Homepage
    If you're really worried about your remote server being unreachable, here's what I would suggest doing:

    Have a hardware watchdog. If the machine is lost or confused, it reboots itself.

    Have it come up in a known state, fire off a few broadcast packets to the sysadmins, and run sshd but basically nothing else. Stay there for a minute or so.

    If nobody's tried to log in and halt the boot process, carry on booting. With luck the problem was transient. Worst case the problem still exists, you reboot, and the admins get another chance to log in.

    From the description of how they got Spirit back, it looks like this is exactly how it was set up.

    Who'da thunk it!!
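
    The software half of the watchdog part is tiny, by the way. On Linux, assuming the board's hardware watchdog driver exposes /dev/watchdog, a sketch of the userspace side looks like this (illustrative only):

```c
/* Pet the hardware watchdog periodically. If this process -- or the
 * whole machine -- wedges and the writes stop, the hardware resets
 * the box into its known-good boot path all by itself. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }
    for (;;) {
        if (write(fd, "\0", 1) != 1)
            perror("watchdog write");
        sleep(10); /* must stay well under the watchdog timeout */
    }
}
```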
  • by You're All Wrong ( 573825 ) on Sunday February 22, 2004 @06:05AM (#8354785)
    VxWorks

    A highly respected embedded OS.

    YAW.
  • by Dan East ( 318230 ) on Sunday February 22, 2004 @08:19AM (#8354966) Journal
    I did read the article, and my comments are completely accurate. Unfortunately you must not have made it to the 3rd paragraph, and neither did the mods that modded you up and me down.

    The problem was discovered after launch. The first few fixes made the problem worse by stressing the filesystem even further.

    It doesn't matter that they were trying to fix the problem. THAT WAS NOT MY POINT. The problem should have been identified and fixed before the craft was launched.

    Yes, they may have taken "around" 100,000 pictures. Does that mean they sequentially stored every picture in an actual rover file system? I get the impression they were only testing the cameras or the capture software, not the system as a whole.

    Did they first simulate filling the filesystem with files generated during the actual trip to Mars? Apparently not, because the system would have failed if they had actually put the rover software through a launch-to-end-of-mission simulation here on Earth when the software was developed.

    Dan East
  • by edesio ( 93726 ) on Sunday February 22, 2004 @08:48AM (#8355014)
    It seems to have two different flash banks: a larger one for data files and a smaller one for programs. This would make them easier to manage.

    "...Separately, about 230 Mbytes are used to implement a flash file system..."
  • Re:Oh, sure... (Score:4, Informative)

    by FrostedWheat ( 172733 ) on Sunday February 22, 2004 @09:05AM (#8355049)
    What if it turns the antenna the wrong way and loses connectivity? What if it gets hit by lightning? What if it falls in a hole? (go Beagle!)

    There is a low-gain omni-directional antenna that can be used as backup. In fact, I think they use it most of the time for commands and just use the high-gain for data transfer back to Earth. Which makes sense; they never need to send large amounts of data to the rover.

    No lightning has ever been detected on Mars. Though it's not impossible, it is very, very unlikely. No proper observations of the night side of Mars have been done, though, so they may just be missing it.

    And Opportunity did fall into a hole :)
  • Re:Oh, sure... (Score:4, Informative)

    by sjames ( 1099 ) on Sunday February 22, 2004 @11:22AM (#8355459) Homepage Journal

    Actually though, it's not too bad an analogy. While Earth-based servers aren't absolutely unreachable like Spirit, they are often remote, and there are expenses associated with visiting them in person.

    Various schemes now exist to help deal with that. Many boards have a small management processor (BMC, server management board, IPMI, whatever) that is used for remote diagnostics and reconfiguration when the main board won't even boot.

    Meanwhile, LinuxBIOS supports two complete BIOS images: one 'old reliable' that, once working, is never changed, and one that can be upgraded freely. Coupled with a watchdog card or timer, it's decently manageable in the field. That work is continuing.

    Meanwhile, IBM is pushing the 'blue button' that forces a software reload from an image partition.

    In that sense, the problem is strongly analogous. Most of us will not, however, encounter the exact problem that Spirit had, though some embedded device developers just might.

  • by Anonymous Coward on Sunday February 22, 2004 @11:26AM (#8355481)
    No, if sleep finishes successfully it'll reboot. If you interrupt sleep, it'll exit with some big code (on Linux, 128 plus the signal number, so 130 for Ctrl-C's SIGINT). sleep exiting with code 130 will cause && to not execute the consequent.

    Ergo, if everything works and you don't want to reboot anymore, you just do a little % followed by a little ctrl-C and it's all good.

    Incidentally, sleep 600 will make the machine sleep for 10 minutes, not 5 minutes, as the OP said :)
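
    Spelled out in C, the whole trick is just "run the second command only if the first exits 0" (illustrative; the shell one-liner is all you actually need):

```c
/* A dead-man's switch, equivalent to `sleep 600 && reboot`:
 * reboot fires only if sleep runs to completion with status 0.
 * Ctrl-C kills sleep by signal, so reboot is never exec'd. */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        execlp("sleep", "sleep", "600", (char *)NULL);
        _exit(127); /* exec failed */
    }

    int status;
    waitpid(pid, &status, 0);

    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        /* nobody cancelled: assume we're locked out, force a reboot */
        execlp("reboot", "reboot", (char *)NULL);
    }
    return 0;
}
```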

  • by sommerfeld ( 106049 ) on Sunday February 22, 2004 @11:58AM (#8355642)

    It's not that hard to pull off this sort of seemingly amazing remote recovery with pure off-the-shelf tech if you plan for it in advance and are willing to pay a modest premium.

    You need remote serial console access -- ideally including firmware/bios serial console access -- and remote power cycling, controlled by a small embedded system, either in separate units (APC masterswitch, terminal servers) or as part of the system unit (common on Sun gear as "LOM"/"ALOM"/etc.; some of this is also creeping into x86 mobos). All this lets you regain control of the system remotely.

    Then it becomes a matter of hardening the system to let you recover from various other insults. Never let go with both hands: Mirrored disks (protecting against hardware failure) and multiple bootable partitions (protecting against software or human error) can both be used; netbooting is also a nice capability to have when you've got a bunch of servers in the same place.

    Disclaimer: I bet you can do much of the above with other people's gear, but I work for Sun and I know it works for me...

  • Re:Space Technology (Score:4, Informative)

    by DerekLyons ( 302214 ) <fairwater@gmaLISPil.com minus language> on Sunday February 22, 2004 @01:51PM (#8356235) Homepage
    That's the thing that amazes me. Any technology having to do with space seems that much more advanced.

    Here on Earth we can't even build cars that require no maintenance and last more than 10 years.
    Most of the stuff in space that lasts ten years usually has no moving parts, which is what generates much of the maintenance requirements on your car. Nor does it have parts to get fouled, corroded, or otherwise mucked up by the environment or the operation of the car.

    And frankly, if your car isn't lasting ten years, then you bought junk in the first place. Of the four cars I've owned, not one has had a lifetime of less than ten years. Three of them were already older than that when they came to me, and none lasted me less than four years. (Other than the one that got repossessed, but I had that one three years.) But then I invest in regular maintenance, don't leadfoot, etc...
  • by rossumtech ( 597772 ) on Sunday February 22, 2004 @03:50PM (#8356875) Homepage
    Here's a link to the NASA press release describing all the details of that fix to the Galileo orbiter. I remembered it because I sometimes work at JPL and walked into a lab where a JPL-er was packing up what looked like a home-brew old-time reel-to-reel tape player. It turned out that it was the sister device to the Galileo flight system, and the guy I was talking to was one of the brains who had figured out the fix! JPL press release [nasa.gov]
  • Re:do they use SSH ? (Score:3, Informative)

    by Phil Karn ( 14620 ) <karn AT ka9q DOT net> on Sunday February 22, 2004 @05:05PM (#8357224) Homepage
    As challenging as the links are, they are very well modeled; the signal-to-noise ratio can usually be accurately predicted to a fraction of a dB. This allows the telecom team to confidently schedule downlink sessions at the highest data rate that the link can handle without a significant risk of data loss.

    Because very strong forward error correction coding is used, the link tends to be "brittle"; as long as you stay just under the maximum allowable data rate, it will work perfectly. So a lot of work goes into making those accurate link predictions.

    But data can still be lost if the signal-to-noise ratio takes an unexpected dip. The most likely cause is rain at the earth station site, as the weather is not as easily predicted and water is a strong absorber of X-band radio energy. Most of the DSN sites are in deserts for just this reason. But even if data is lost, it can be retransmitted later as it is stored on the rover until explicitly deleted.
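
    The arithmetic behind those predictions is simple once you have the numbers; it's getting the numbers right that's hard. A toy link-budget calculation (made-up figures, not actual DSN values):

```c
/* Max data rate from a predicted carrier-to-noise density:
 *   Eb/N0 [dB] = C/N0 [dB-Hz] - 10*log10(R)
 *   =>   R_max = 10^((C/N0 - required Eb/N0 - margin) / 10)
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double cn0_dbhz    = 45.0; /* predicted C/N0 (made-up)             */
    double ebn0_req_db =  1.0; /* threshold for the FEC code (made-up) */
    double margin_db   =  0.5; /* headroom against unexpected SNR dips */

    double r_max = pow(10.0, (cn0_dbhz - ebn0_req_db - margin_db) / 10.0);
    printf("max data rate: %.0f bit/s\n", r_max); /* ~22 kbit/s here */
    return 0;
}
```

    A fraction of a dB of unexpected loss (say, rain over the ground station) pushes the operating point past the FEC threshold, which is exactly the "brittle" behavior described above.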

  • by Phil Karn ( 14620 ) <karn AT ka9q DOT net> on Sunday February 22, 2004 @05:35PM (#8357398) Homepage
    An earlier example was Voyager 2. This spectacularly successful mission almost didn't make it even to Jupiter. Its primary command receiver failed, and the AFC (automatic frequency control) in its backup also failed. That meant the receiver was listening only to a single frequency with almost no tolerance for error. And the precise frequency was a function of component drift, which was in turn mainly a function of receiver temperature.

    The failed components never recovered, but JPL was able to work around it. They constructed an elaborate thermal model of the spacecraft to predict the precise temperature (and therefore the operating frequency) of the command receiver. Everything but the kitchen sink went into this model: the effect of attitude on solar heating, the self-generated heat from the electronics, the effect of turning various instruments on or off, the time lags due to structural heat capacities, everything. And it has worked fine ever since.

    JPL doesn't get nearly the credit they deserve for their track record in rescuing missions from seemingly fatal failures like these. There's still a pervasive public myth (sustained by the human space flight side of NASA) that only humans in space can fix things when they break. But they seriously overestimate the astronauts' abilities, and they greatly underestimate what a bunch of really smart people can often do from the ground.

  • by swillden ( 191260 ) * <shawn-ds@willden.org> on Sunday February 22, 2004 @10:04PM (#8358928) Journal

    You still haven't learned the lesson. Those are not errors or mistakes, they are malfunctions. A properly designed computer system can easily detect malfunctions

    Guess you'd better get over to NASA and set up a series of lectures so that you can impart your vast expertise and wisdom.

    but that same system will happily execute any human-designed code containing massive errors.

    Interesting that you point out the code as being human-designed. Who designed the hardware? God?

    You're just the kind of computer geek I abhor, always looking for excuses instead of solutions to your own mistakes.

    And you're just the kind of self-assured idiot who amuses me endlessly with your clueless but oh-so-confident assertions.

    In the real world, hardware defects do exist, some designed into the hardware, others induced by external effects or damage. Software errors are certainly far more common, but that's mostly just because there's vastly more software.

    Even without the effects of space travel, hardware contains flaws and, indeed, much of the job of low-level software is to work around those flaws. It's not uncommon for a significant percentage of the code in a device driver to be dedicated to working around various hardware defects.

    Anyone who's spent considerable time working around custom and embedded computing hardware knows that defects often turn out to be *both* hardware and software-based. Insignificant hardware bugs interact with insignificant software bugs to produce major problems. Hardware defects aren't limited to those environments, either. Spend a little time searching the LKML archives for "ACPI" and reading what you find, or even just look through the Linux kernel configuration help and see how many configuration options you find that implement software hacks to work around problems with particular pieces of hardware.
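
    To make that concrete, here's the shape such workarounds usually take. This is a made-up device and a made-up erratum, purely to illustrate the pattern:

```c
/* Hypothetical erratum workaround of the kind drivers accumulate:
 * on early silicon revisions, the status register returns a stale
 * value on the first read after an interrupt, so read it twice. */
#include <stdint.h>

#define STATUS_REG 0x04u /* hypothetical register offset */

static inline uint32_t read_reg(volatile uint32_t *base, uint32_t off)
{
    return base[off / sizeof(uint32_t)];
}

uint32_t read_status(volatile uint32_t *base, int chip_rev)
{
    uint32_t status = read_reg(base, STATUS_REG);

    if (chip_rev < 3)                 /* erratum fixed in rev 3 */
        status = read_reg(base, STATUS_REG);

    return status;
}
```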

    When you factor in the rather unique and harsh operating environment of this hardware and software, and consider the amount and depth of testing that certainly went into the development process, it's not in the least bit unusual that the programmers were surprised the flaw turned out to be purely a software error. If I'd been in those engineers' shoes, I also would have expected something far more complex. I'm sure they went into it, quite reasonably, assuming that some hardware component had failed and that they were going to have to implement a software workaround.

    I'm sure the prevailing sentiment when they finally discovered the actual nature of the problem was "Hallelujah! This is something we can fix!", not "Uh, oh, I can't blame this on anyone else." That's certainly how I would have felt, anyway.
