Curiosity Rover On Standby As NASA Addresses Computer Glitch


Posted by samzenpus on Sunday March 03, 2013 @02:55PM from the fixing-the-glitch dept.

alancronin writes "NASA's Mars rover Curiosity has been temporarily put into 'safe mode,' as scientists monitoring from Earth try to fix a computer glitch, the US space agency said. Scientists switched to a backup computer Thursday so that they could troubleshoot the problem, said to be linked to a glitch in the original computer's flash memory. 'We switched computers to get to a standard state from which to begin restoring routine operations,' said Richard Cook of NASA's Jet Propulsion Laboratory, the project manager for the Mars Science Laboratory Project, which built and operates Curiosity."


This discussion has been archived. No new comments can be posted.


Glitch or flash memory failure? (Score:5, Interesting)

by AmiMoJo ( 196126 ) * writes: on Sunday March 03, 2013 @03:03PM (#43062771) Homepage Journal

Are we talking a temporary issue that can be resolved by re-flashing the memory in question or is one of the cells damaged in some un-recoverable way? Either way there are solutions but the latter is far more serious.

Re:Glitch or flash memory failure? (Score:5, Interesting)

by Brett Buck ( 811747 ) writes: on Sunday March 03, 2013 @04:15PM (#43063109)

I wonder why this is not something that is kept up to date anyway. I can see keeping B an update or two behind A to prevent a single programming error taking both of them down. But after you are satisfied with A's software load, why keep B so far back-level that transition takes so much time. And since the computers are said to be identical, why the desire to move back to A?

I can easily imagine this happening. I work on a very similar, perhaps nearly identical spacecraft (that's just a tad more critical AND expensive than this thing...) and we haven't necessarily maintained this. You underestimate the overhead associated with generating the necessary uploads.
The reason they probably want to go back to the Prime is that their failure isolation system database is keyed to using the prime units only, and altering it to start on the "B" side and have it switch back to "A" is prohibitive, or at least it's easier to get around that by switching back to A. This last is also something we do in the rare case of a temporary failure. There's less justification for doing that than for leaving the backup program image alone, but having to completely retest the entire redundancy management system for a new configuration is generally avoided. If it fails hard, it doesn't really matter, since there's no Prime to switch back to.
Brett

Remote fixes always a hair raiser (Score:5, Interesting)

by Celarent Darii ( 1561999 ) writes: on Sunday March 03, 2013 @04:23PM (#43063153)

I once had to fix a server some 6000 km away due to a corrupted disk. Doing pdisk and modifying fstab over ssh and then a reboot. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.
Can't imagine how these guys feel. 45 min ping and it isn't like they could ask someone to go turn it off and on again.
Good luck to the guys working on this.
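
For a sense of where a figure like "45 min ping" comes from, here is a rough light-time calculation. The two distances are illustrative assumptions (roughly the near and far ends of the Earth-Mars separation), not the actual distance on the day of the glitch, and the function name is made up for this sketch.

```python
# Rough round-trip signal delay between Earth and Mars.
# Distances passed in are assumed, illustrative values.
SPEED_OF_LIGHT_KM_S = 299_792.458   # km/s
AU_KM = 149_597_870.7               # kilometres per astronomical unit


def round_trip_minutes(distance_au: float) -> float:
    """Return the two-way light travel time in minutes for a given distance."""
    one_way_seconds = distance_au * AU_KM / SPEED_OF_LIGHT_KM_S
    return 2 * one_way_seconds / 60


for label, au in [("assumed near case, 0.4 AU", 0.4),
                  ("assumed far case, 2.6 AU", 2.6)]:
    print(f"{label}: {round_trip_minutes(au):.1f} min round trip")
```

This only counts light travel time; it ignores relay scheduling and any processing time on either end.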

Re:Gotta love Armchair Quarterbacks and their simp (Score:5, Interesting)

by Tablizer ( 95088 ) writes: on Sunday March 03, 2013 @04:57PM (#43063329) Journal

The Galileo Jupiter atmosphere probe actually had a parachute-related part put on backward. It almost ruined the mission. They got lucky and the shaking from atmospheric drag eventually shook the high-altitude parachute off the bad lock barely in time before it could have damaged the probe.
Doesn't hurt to ask, although knowing more about the hardware may allow you to give more specific advice, such as "part X could be put in backward and still mostly work without early detection according to simulation Y."

The design is very robust (Score:5, Interesting)

by chalker ( 718945 ) writes: on Sunday March 03, 2013 @05:01PM (#43063353) Homepage

Check out the official rover press kit for a summary of the computer design (http://mars.jpl.nasa.gov/msl/news/pdfs/MSLLanding.pdf) Page 42 in particular:
"Curiosity has redundant main computers, or rover compute elements. Of this “A” and “B” pair, it uses one at a time, with the spare held in cold backup. Thus, at a given time, the rover is operating from either its “A” side or its “B” side. Most rover devices can be controlled by either side; a few components, such as the navigation camera, have side-specific redundancy themselves. The computer inside the rover — whichever side is active — also serves as the main computer for the rest of the Mars Science Laboratory spacecraft during the flight from Earth and arrival at Mars. In case the active computer resets for any reason during the critical minutes of entry, descent and landing, a software feature called “second chance” has been designed to enable the other side to promptly take control, and in most cases, finish the landing with a bare-bones version of entry, descent and landing instructions.
Each rover compute element contains a radiation-hardened central processor with PowerPC 750 architecture: a BAE RAD 750. This processor operates at up to 200 megahertz speed, compared with 20 megahertz speed of the single RAD6000 central processor in each of the Mars rovers Spirit and Opportunity. Each of Curiosity’s redundant computers has 2 gigabytes of flash memory (about eight times as much as Spirit or Opportunity), 256 megabytes of dynamic random access memory and 256 kilobytes of electrically erasable programmable read-only memory.
The Mars Science Laboratory flight software monitors the status and health of the spacecraft during all phases of the mission, checks for the presence of commands to execute, performs communication functions and controls spacecraft activities. The spacecraft was launched with software adequate to serve for the landing and for operations on the surface of Mars, as well as during the flight from Earth to Mars. The months after launch were used, as planned, to develop and test improved flight software versions. One upgraded version was sent to the spacecraft in May 2012 and installed onto its computers in May and June. This version includes improvements for entry, descent and landing. Another was sent to the spacecraft in June and will be installed on the rover’s computers a few days after landing, with improvements for driving the rover and using its robotic arm."
And according to a release they issued after landing, both computers receive the same updates and are running the same software (not a version or 2 behind like others have suggested): http://mars.jpl.nasa.gov/news/whatsnew/index.cfm?FuseAction=ShowNews&NewsID=1305 [nasa.gov]
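
As a toy illustration of the "one side active, spare in cold backup" arrangement the press kit describes, here is a minimal sketch. The class and method names are hypothetical, and this is in no way the actual MSL flight software; it only shows the idea that exactly one side is powered at a time and a swap brings up the other.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeElement:
    """One side of the redundant pair (hypothetical model)."""
    name: str
    healthy: bool = True
    powered: bool = False


@dataclass
class RedundantPair:
    """Two compute elements; only the active one is powered."""
    a: ComputeElement = field(default_factory=lambda: ComputeElement("A"))
    b: ComputeElement = field(default_factory=lambda: ComputeElement("B"))
    active: str = "A"

    def __post_init__(self) -> None:
        # The spare stays unpowered (cold backup) until a swap.
        self._side(self.active).powered = True

    def _side(self, name: str) -> ComputeElement:
        return self.a if name == "A" else self.b

    def swap(self) -> str:
        """Power down the active side and bring up the spare."""
        old = self._side(self.active)
        new = self._side("B" if self.active == "A" else "A")
        if not new.healthy:
            raise RuntimeError(f"side {new.name} is not healthy; refusing to swap")
        old.powered = False
        new.powered = True
        self.active = new.name
        return self.active


pair = RedundantPair()
pair.a.healthy = False      # pretend A's flash memory is misbehaving
print("now running on side", pair.swap())
```

Note the sketch refuses to swap back to a side still marked unhealthy, which is one simple way to express "fix A before returning to it".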

Re:Remote fixes always a hair raiser (Score:4, Interesting)

by evilviper ( 135110 ) writes: on Sunday March 03, 2013 @07:52PM (#43064059) Journal

I once had to fix a server some 6000 km away due to a corrupted disk. Doing pdisk and modifying fstab over ssh and then a reboot. You just check and recheck to make sure you did it right and just hope you get a ping a few minutes later.
It's called out-of-band management. You can bring up a server from bare metal with no working OS installed. Damn near every server out there comes with at least IPMI, and often DRACs/iLOs/RSAs with some additional features. All you need to do is give the OoBM interface an IP address (perhaps a DHCP reservation) and you're good to go.
Even if you're running on desktop-class hardware, you can still fake OoBM pretty well with a serial port. Linux/BSD/etc. will bring up the serial port as the console as soon as the bootloader starts up, if configured to do so. And if the disk has failed, or your bootloader otherwise doesn't work, hopefully your BIOS is set to PXE boot, and your pxelinux configuration will give you a serial console as soon as that kicks in. Throw in magic SysRq to allow you to reboot a system that's not responding, and you've got something reasonably close to OoBM just about free. You could also supplement this with a watchdog timer and make things even more reliable.
But as cheap as server-class hardware is, and given the ubiquity of IPMI, it's probably not worthwhile going the cheap route.
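
The watchdog timer idea mentioned above can be sketched in a few lines. This is a software-only toy with hypothetical names: a real watchdog is normally separate hardware that resets the machine, precisely so it still works when the software being watched has hung. The injectable clock is just an assumption made here so the logic can be exercised without waiting.

```python
import time
from typing import Callable


class Watchdog:
    """Fires on_expire once if kick() is not called within timeout_s."""

    def __init__(self, timeout_s: float, on_expire: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.clock = clock
        self.last_kick = clock()
        self.expired = False

    def kick(self) -> None:
        """Called periodically by healthy software to postpone the reset."""
        self.last_kick = self.clock()

    def poll(self) -> bool:
        """Check the deadline; trigger the expiry action at most once."""
        if not self.expired and self.clock() - self.last_kick > self.timeout_s:
            self.expired = True
            self.on_expire()
        return self.expired


# Usage with a fake clock so the example runs instantly.
now = [0.0]
wd = Watchdog(5.0, lambda: print("no heartbeat: would reboot here"),
              clock=lambda: now[0])
now[0] = 4.0
wd.kick()        # software is alive, deadline moves out
now[0] = 10.0
wd.poll()        # 6 s since last kick, so the expiry action runs
```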

Re:Glitch or flash memory failure? (Score:5, Interesting)

by BradleyUffner ( 103496 ) writes: on Sunday March 03, 2013 @09:29PM (#43064475) Homepage

They sent the update once, didn't they?
Wait till you are satisfied it worked, and shunt it over to computer B.
I'm fairly sure that they purposely keep the computers out of sync to avoid a single bug taking out both systems. If I recall, it actually has 3 computers: 2 of them have identical hardware and run different versions of the same software, and a 3rd computer is based on completely different hardware running yet another software package. Each system is able to assume command of the mission and issue updates to the other systems.
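
The "wait till you are satisfied it worked, then shunt it over to B" approach quoted above amounts to a staged update. Here is a minimal sketch of that idea only; the function name, the dictionary of images, and the self-test callback are all hypothetical, and nothing here reflects how NASA actually manages its software loads.

```python
from typing import Callable, Dict


def staged_update(images: Dict[str, bytes], new_image: bytes,
                  self_test: Callable[[bytes], bool]) -> bool:
    """Install on side A first; copy to side B only if A passes its test.

    Returns True if both sides end up on new_image, False if A was
    rolled back and B was left untouched.
    """
    previous = images["A"]
    images["A"] = new_image
    if not self_test(images["A"]):
        images["A"] = previous   # roll back; B never saw the bad image
        return False
    images["B"] = new_image
    return True


images = {"A": b"v1", "B": b"v1"}
ok = staged_update(images, b"v2", self_test=lambda img: img.startswith(b"v"))
print("update applied to both sides:", ok)
```

The point of the ordering is that a bad image can only ever reach one side, which is the same concern behind keeping the two computers deliberately out of sync.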
