Reformatting a Machine 125 Million Miles Away
An anonymous reader writes: NASA's Opportunity rover has been rolling around the surface of Mars for over 10 years. It's still performing scientific observations, but the mission team has been dealing with a problem: the rover keeps rebooting. It's happened a dozen times this month, and the process is a bit more involved than rebooting a typical computer. It takes a day or two to get back into operation every time. To try and fix this, the Opportunity team is planning a tricky operation: reformatting the flash memory from 125 million miles away. "Preparations include downloading to Earth all useful data remaining in the flash memory and switching the rover to an operating mode that does not use flash memory. Also, the team is restructuring the rover's communication sessions to use a slower data rate, which may add resilience in case of a reset during these preparations." The team suspects some of the flash memory cells are simply wearing out. The reformat operation is scheduled for some time in September.
Hey, Bob, this is Jim (Score:5, Funny)
We're gonna need you to go out to the rover and reboot it. Yeah, it got stuck. You should probably leave ASAP.
Deploy the Paperclip! (Score:2)
Comment removed (Score:3)
Re: (Score:2)
It's running on solar power, that's how it lasts 10 years. Though the rechargeable battery must be tough to take so many rechargings.
Ideally, you have redundant systems for such a situation, where you can take one of them down and use the other to do the booting, formatting, programming, as if there were a user sitting right next to it. They say it has a flashless mode of operation, but the way I think of it, as in a regular PC, with a BIOS, you can reformat the harddrive without booting off of and using th
Re: (Score:1)
Wow. Talk about missing an obvious joke and over-thinking the response. Seriously epic *WHOOSH*
Re: Simple fix (Score:2, Funny)
Ass-burgers.
Re: Simple fix (Score:1)
Re: (Score:2)
Re: (Score:2)
what's this step 4??
Press the reset button.
Who the hell designed this stuff?
And I thought I was cool... (Score:2)
When I reboot machines in Asia or UK/EU using IPMI from the US.
Re:And I thought I was cool... (Score:5, Funny)
And I thought I was cool when I reboot servers around the world thinking I am rebooting mine.
Err, if you're a system admin.. (Score:5, Funny)
... you're not cool. Period. Sorry.
Re: (Score:1)
Everybody is a system admin when linux is involved, and I like it that way. But I digress.
alias halt='echo Use shutdown instead'
alias reboot='echo Use shutdown instead'
Re: (Score:2)
Err, if taking a server offline, no matter the reason, is a serious problem, then you are not a good - or properly funded - sysadmin.
sad (Score:2)
can parent be modded funny not insightful? Insightful is too depressing...
Do unto others...
Re: (Score:1)
Typing out "Period" makes you look retarded.
Re:Err, if you're a system admin.. (Score:4, Funny)
Re: (Score:1)
If there is a problem and need to call "support" (Score:1)
do they get somebody in or from India?
Re: (Score:2)
I'll be glad to help you with that Sir.
Re: (Score:1)
Re: (Score:1)
Sometimes when I sound mocking, ironic and sarcastic, I'm actually serious, as in ironic-ironic, or sarcastic-sarcastic. A lot of Americans simply smack the phone down on Indian tech support, saying gimme somebody who speaks English. I patiently listen to them struggle through it.
Re: (Score:2)
Send someone (Score:2)
ECC? (Score:5, Funny)
They didn't do any ECC on the flash memory? I thought these people were rocket scientists.
Re: (Score:3, Insightful)
As it happens, for flash, read errors are often transient. A better model than DRAM style ECC is to treat it more like a disk drive with checksums on each block. If you get an error, reread the block. And if you have a problem writing a block (e.g. the readback is wrong), just use a new block. Surely you've noticed that your USB thumbdrive gradually gets smaller with time as blocks wear out. (In space hardware, back in 2000, wear leveling was done manually.. still is as far as I know.. there's no nice rad
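A rough sketch of the scheme described in the comment above: checksum each block, reread on a bad checksum, and move to a fresh block when read-back after a write is wrong. This is a toy in Python, not the rover's actual flash driver; `FlashSim`, `BlockStore`, the block count, and the choice of CRC-32 are all assumptions made for illustration.

```python
import zlib


class FlashSim:
    """Toy flash device (hypothetical): a list of blocks, some worn out."""

    def __init__(self, n_blocks, bad=()):
        self.blocks = [b""] * n_blocks
        self.bad = set(bad)  # assumed worn-out physical blocks

    def write(self, idx, raw):
        # A worn-out block corrupts the first byte it stores.
        if idx in self.bad and raw:
            raw = bytes([raw[0] ^ 0xFF]) + raw[1:]
        self.blocks[idx] = raw

    def read(self, idx):
        return self.blocks[idx]


def pack(payload):
    """Append a CRC-32 so corruption can be detected on read."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack(raw):
    """Return the payload, or None if the checksum does not match."""
    if len(raw) < 4:
        return None
    payload, stored = raw[:-4], int.from_bytes(raw[-4:], "big")
    return payload if zlib.crc32(payload) == stored else None


class BlockStore:
    """Maps logical names to physical blocks, retiring blocks that fail."""

    def __init__(self, flash, n_blocks):
        self.flash = flash
        self.free = list(range(n_blocks))  # physical blocks not yet used
        self.retired = []                  # blocks that failed read-back
        self.map = {}                      # logical name -> physical block

    def write(self, logical, payload):
        # Try free blocks in order until one reads back correctly.
        while self.free:
            phys = self.free.pop(0)
            self.flash.write(phys, pack(payload))
            if unpack(self.flash.read(phys)) == payload:
                self.map[logical] = phys
                return phys
            self.retired.append(phys)  # usable capacity shrinks here
        raise IOError("no usable blocks left")

    def read(self, logical, retries=3):
        # Read errors are often transient, so retry before giving up.
        phys = self.map[logical]
        for _ in range(retries):
            payload = unpack(self.flash.read(phys))
            if payload is not None:
                return payload
        raise IOError(f"block {phys} failed {retries} reads")


flash = FlashSim(4, bad={0})
store = BlockStore(flash, 4)
phys = store.write("telemetry", b"hello mars")
```

In this run block 0 fails read-back and is retired, so the data lands in block 1. That retirement is the "drive gets smaller over time" effect the comment mentions.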
Re: (Score:2)
Re: (Score:2)
NOR Flash does not normally use ECC and has reliability closer to that of EEPROM than NAND Flash.
Re: (Score:2)
Flash is another form of floating gate memory. Wouldn't the known long duration performance of EPROM and EEPROM apply?
Re: (Score:2)
The rocket scientists did their job ten years ago. They're working at McDonalds now.
Re: ECC? (Score:1)
This would make an interesting movie plot where they have to recall all the older, laid off rocket scientists working at McDonald's and bagging groceries at the supermarket to reboot an idle probe on a far away planet because it's the only one that can be repurposed to save the earth from an asteroid impact. But only the old guys know the hardware and can reprogram the firmware.
Yeah I'm a laid off old guy. Get off my lawn!
Re: ECC? (Score:2)
And add in the volunteer group that decided to save the project, working out of an abandoned McDonald's.
Oh, wait....
Re: (Score:2)
Odder than usual, I mean.
Re: (Score:3)
Well, in their defense, ECC on the flash memory isn't exactly rocket science.
Re: (Score:1)
If you're so smart, why aren't you advocating using BCH codes or Reed Solomon codes or some form of forward error correction code over code and data stored in flash so random bit errors in flash won't affect the code that is stored in the flash? What is your super clever alternative?
Re:ECC? (Score:5, Insightful)
You're a poster child for Dunning-Kruger [wikipedia.org]: some random on the Internet who thinks he's smarter than the folks who designed a Mars rover that lasted over 10 years past its 90-day expected life.
Re: (Score:2)
Re: (Score:2)
>You can detect single bit errors with a simple parity bit
You can detect (2^32-1)/(2^32) of every possible failure pattern with a CRC. With a combination of a multiple bit error correction algorithm (with most correction schemes n bits can be corrected with 2n redundant error correction bits) and then the CRC can be used to tell if you correctly corrected the data.
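To make the parity-versus-CRC point concrete, here is a small Python sketch. It only shows detection, not the correction half of the scheme described above, and it uses the standard library's `zlib.crc32` as a stand-in for whatever check a real flash controller would use. The message and the flipped bit positions are arbitrary choices for the demo.

```python
import zlib


def parity(data):
    """Even-parity bit computed over every bit of data."""
    return sum(bin(b).count("1") for b in data) % 2


def flip_bits(data, positions):
    """Return a copy of data with the given bit positions inverted."""
    out = bytearray(data)
    for p in positions:
        out[p // 8] ^= 1 << (p % 8)
    return bytes(out)


original = b"opportunity"
one_flip = flip_bits(original, [3])
two_flips = flip_bits(original, [3, 42])

# A single parity bit notices an odd number of flipped bits only.
parity_detects_one = parity(one_flip) != parity(original)
parity_detects_two = parity(two_flips) != parity(original)

# A 32-bit CRC catches both cases on a message this short.
crc_detects_one = zlib.crc32(one_flip) != zlib.crc32(original)
crc_detects_two = zlib.crc32(two_flips) != zlib.crc32(original)
```

For arbitrary random corruption a 32-bit check lets through roughly one pattern in 2^32, which is where the (2^32-1)/(2^32) figure comes from.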
Re: (Score:2)
Re: (Score:2)
Most of the hardware cost is the launch vehicle, not the rover.
Most of the people (salary) cost is the people working on the data generated (all across the universities around the world who analyze the data and write papers), not the designers.
Underspeccing it wouldn't have saved much.
There's one that breaks this rule, the JWST. Just the endless redesigns have gobbled up so much money, I don't believe there will be enough Science generated by it to cover the build & launch costs.
Re: (Score:2)
You're a poster child for Dunning-Kruger: some random on the Internet who thinks he's smarter than the folks who designed a Mars rover that lasted over 10 years past its 90-day expected life.
Not too often but occasionally the stupid get lucky and in some perverted way lack of knowledge and consideration of detail can lead to better outcomes.
After awhile one has to admit having to be careful when you transmit for fears it would even be possible for commands to be misinterpreted or designing something which knowingly continually writes to flash memory using DOS era FAT filesystems is not a winning play no matter how much you throw the reliability arguments at the wall and expect them to stick.
And
Re: (Score:2)
>You're a poster child for Dunning-Kruger
Actually I am an engineer who has designed many error correction circuits for communication and storage systems. I think I know how much I know about error correction systems, which is plenty for this conversation.
While the statement was made in Slashdot jackass style, the question is legitimate. Why didn't they do any or more ECC on the flash that is failing. There is probably a perfectly fine answer like "We knew the expected error rate and It was designed to la
Re: (Score:1)
On the other hand, it is all still working. It reboots occasionally. My computer does that. By reformatting, they will map out any bad sectors, which is probably the issue, and it'll run for another 10 years. Sounds like a smart technology tradeoff to me. Use cheap, off the shelf hardware, and KISS it to death. Write a special driver, or build special hardware to do ECC, and you end up with a bug that causes the system to freeze in an unrecoverable way.
Re:Remote management (Score:5, Informative)
Not really...
The chances are that "reformat" isn't what we think and includes one or more of:
1) Rewriting cells and allowing wear-levelling and sector-replacement to take place, and mark bad sectors as bad.
2) Write-testing and manually avoiding those sectors that don't perform as expected.
3) Rewriting all the critical storage functions to avoid the already-known bad sectors.
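The options listed above can be sketched roughly like this in Python: write test patterns to every block, note which ones fail read-back, then build a map that skips them. This is a guess at the general shape, not what the Opportunity team actually runs. The simulated device, the `STUCK` set, the block size, and the 0x55/0xAA patterns are all assumptions. Note the test is destructive, which fits with downloading all useful data first.

```python
def scan_for_bad_blocks(read_block, write_block, n_blocks, block_size=16):
    """Write-test every block; return the indexes that fail read-back."""
    patterns = [b"\x55" * block_size, b"\xAA" * block_size]
    bad = []
    for idx in range(n_blocks):
        for pattern in patterns:
            write_block(idx, pattern)
            if read_block(idx) != pattern:
                bad.append(idx)
                break  # one failure is enough to distrust the block
    return bad


def build_remap(n_blocks, bad):
    """Map logical block numbers onto the remaining good physical blocks."""
    bad_set = set(bad)
    good = [i for i in range(n_blocks) if i not in bad_set]
    return dict(enumerate(good))


# Simulated device (hypothetical): blocks 2 and 5 have a stuck-at-zero byte.
storage = {}
STUCK = {2, 5}


def write_block(idx, data):
    if idx in STUCK:
        data = b"\x00" + data[1:]
    storage[idx] = data


def read_block(idx):
    return storage[idx]


bad = scan_for_bad_blocks(read_block, write_block, n_blocks=8)
remap = build_remap(8, bad)
```

After the scan, logical block 2 lives on physical block 3 and so on, so the code above the map never has to know which cells wore out.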
It's the kind of thing that anyone can play with. Not saying it's not risky on a remote device, but BadRAM etc. patches have been in place for years and that's a way to run Linux on machines with faulty ***RAM****, not just long-term storage.
Many years ago, a bad sector on your hard drive was something you found out with scandisk (or previous tools) and then it was marked as bad and that was the end of that. Your PC wouldn't use it and so long as it wasn't the boot sector, that was the end of that. It was only the "creeping" bad sectors, where you got more bad sectors over time, that would really worry anyone.
I imagine that it's not at all difficult to make sure that multiple boot sectors were in place if you really wanted to but why bother? The chances are billions to one. Chances are this hardware has MUCH better fault tolerance and multiple hardware watchdogs, firmware, and boot attempts to make sure it eventually gets back up SOMEHOW.
There's a reason that even FAT stores two copies of the allocation table, why Linux ext filesystems store multiple copies of the superblock, etc. They come from a legacy where the occasional bad sector wasn't a problem and where 20MB of hard drive cost more than the computer did so it was better to cope with the fault than just tell people to buy a new one. And their predecessors were (and still are) mainframes with hardware that's just that fault-tolerant in the first place anyway.
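The redundant-copy idea behind the FAT and superblock backups is easy to sketch. The Python below keeps several checksummed copies of one critical record and reads back the first copy that still verifies. It is not how FAT or ext lay things out on disk: the dict-backed `device`, the offsets, and the CRC-32 seal are assumptions for illustration (FAT itself stores no checksum on its table copies).

```python
import zlib


def seal(record):
    """Prefix the record with its CRC-32 so damage can be detected."""
    return zlib.crc32(record).to_bytes(4, "big") + record


def unseal(raw):
    """Return the record, or None if the stored checksum does not match."""
    if len(raw) < 4:
        return None
    stored, record = int.from_bytes(raw[:4], "big"), raw[4:]
    return record if zlib.crc32(record) == stored else None


def write_copies(device, offsets, record):
    """Store the same sealed record at every backup offset."""
    for off in offsets:
        device[off] = seal(record)


def read_first_valid(device, offsets):
    """Walk the copies in order and return the first one that verifies."""
    for off in offsets:
        record = unseal(device.get(off, b""))
        if record is not None:
            return record
    raise IOError("all copies damaged")


device = {}                  # hypothetical storage: offset -> bytes
offsets = [0, 8192, 32768]   # assumed backup locations
record = b"root-dir@block-17"
write_copies(device, offsets, record)

# Damage the primary copy by flipping one bit in its last byte.
device[0] = device[0][:-1] + bytes([device[0][-1] ^ 1])
recovered = read_first_valid(device, offsets)
```

The primary copy fails its checksum here, so the read falls through to the copy at the second offset.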
It's not at all hard to write a filesystem that can cope with not only damage, but even recurring damage. You've seen PAR files presumably? The same could easily be done on a filesystem-level basis (and I imagine, somewhere, already is for some specialist niche).
It's not that big a deal once they KNOW that's the problem. The biggest problem is that they only "suspect" that's the problem.
Re: (Score:2)
Re: (Score:1)
Hah, I remember running the DOS debugger, poking into a certain address in the memory to access the MFM BIOS, then you could do a low level format where you could enter the sectors to mark as bad. Those were the days...
Re: (Score:1)
"g=c800:5." :)
Hah, I almost remembered that one, good old Seagate controllers. I had the 800 but not the rest.
Re: (Score:1)
I always thought that the disk controller should do idle scrubbing. Are there any modern SATA disks that do this?
No, the drives themselves don't do this because it pulls the head away from where the host wants/expects it to be. This would result in a lot of unexpected thrashing. If scrubbing is to be done, it is best done by the OS as a background task.
Re: (Score:2)
You've seen PAR files presumably? The same could easily be done on a filesystem-level basis (and I imagine, somewhere, already is for some specialist niche).
While all hard drives now do their own Hamming error correction (or something better), RAID2 is the same idea for "raw" storage that doesn't: you write explicit ECCs to redundant volumes to allow recovery from both drive loss and bad sectors.
RAID5 with modern drives gives all the same resiliency, as the drives do the block-level ECC themselves, so you never see RAID2. But for a pile of flash memory, that's the filesystem-level equivalent of PAR files.
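For anyone who hasn't seen how the parity trick works, here is the single-parity version in Python: XOR the data blocks together to get a parity block, and any one lost block can be rebuilt from the rest. This is the RAID5-style scheme only. PAR2 uses Reed-Solomon codes, which can recover more than one lost block and are not shown here. The block contents are made up for the demo.

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


data = [b"AAAA", b"BBBB", b"CCCC"]
parity_block = xor_blocks(data)

# Pretend block 1 is unreadable: XOR the survivors with the parity block.
survivors = [data[0], data[2], parity_block]
rebuilt = xor_blocks(survivors)
```

The cost is one extra block per stripe, and it only covers a single loss per stripe. Two bad blocks in the same stripe are unrecoverable with plain XOR parity.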
Re: (Score:1)
You mean like RAID-5? Because RAID-5 was part of the inspiration for the PAR2 format.
Alternative Title (Score:5, Insightful)
Re: (Score:2)
Re:Alternative Title (Score:5, Interesting)
Not modem reset. The filesystem on Spirit had a bunch of temp files and other stuff from the Earth-Mars flight, and apparently it just ran out of inodes. So basically they had to remote into whatever constitutes a bootloader with 20 mins of latency and remove some of the no-longer-needed files.
See http://science.slashdot.org/st... [slashdot.org]
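As a sanity check on the latency figure, here is the light-time arithmetic in Python, taking the headline's 125 million miles as the distance. The real Earth-Mars distance changes constantly as the planets move, so treat this as a ballpark and not the value for any particular day.

```python
MILES_TO_METERS = 1609.344          # international mile
SPEED_OF_LIGHT_M_S = 299_792_458    # defined value, metres per second


def one_way_light_time_s(distance_miles):
    """Seconds for a radio signal to cover the given distance."""
    return distance_miles * MILES_TO_METERS / SPEED_OF_LIGHT_M_S


one_way_s = one_way_light_time_s(125e6)
one_way_min = one_way_s / 60
round_trip_min = 2 * one_way_min
print(f"one way: {one_way_min:.1f} min, round trip: {round_trip_min:.1f} min")
```

That works out to a bit over 11 minutes each way, so roughly 22 minutes between sending a command and seeing any result, in line with the "20 mins of latency" above.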
Re:Alternative Title (Score:4, Insightful)
I would imagine that the system probably boots itself off of a ROM chip that has a routine for receiving data from Earth and storing it in RAM and then flashing that data onto the flash chip.
If the rover does not boot from ROM then it is a miracle that it hasn't bricked itself yet.
Re: (Score:2)
I wonder if the ROM would actually be a floating gate ROM instead of mask ROM or fuse based PROM in which case it would be more like EPROM or NOR Flash.
Does anybody even make mask ROM or fuse based PROM any more?
Re: (Score:2)
I checked and it is EEPROM. And there are two EEPROMs, I presume those are for redundancy in case one gets zapped.
2.5 billion? (Score:2)
I dunno so much these days. It's 10 years old and got a few miles on the clock plus collection for the new owner would be an issue. On the plus side vandalism won't be a worry. For a few centuries anyway.
Alternative Title (Score:3)
Re: (Score:2)
Re: (Score:2)
And the state of the hardware. Some unknown number of systems on the real curiosity are degraded to the point of malfunctioning; And they have little to no way of exactly measuring what and where.
Opportunity. Curiosity is on the other side of Mars, nurturing holes in its wheels and looking for cats to kill.
Re: (Score:2)
Never attribute to malice that which is adequately explained by stupidity.
Is it running Windows? (Score:3)
Is it?
Re:Is it running Windows? (Score:5, Informative)
You're probably joking, but the OS is VxWorks.
2014-- year of Linux on Mars? (Score:2)
???
Why is it not trivial? (Score:1)
Why didn't they plan ahead for this sort of operation in the beginning, making it painless and 'reliable' ( as possible ).
Re: (Score:2)
Who says they didn't?
Re: (Score:3)
Why didn't they plan ahead for this sort of operation in the beginning, making it painless and 'reliable' ( as possible ).
That's a joke, right? We are talking about one of the two rovers [xkcd.com] that was sent to Mars on a mission planned to only last 90 days. They didn't see "flash memory wearing out from use" as a contingency they needed to plan for.
Re: (Score:1)
You're a poster child for Dunning-Kruger [wikipedia.org]: some random on the Internet who thinks he's smarter than the folks who designed a Mars rover that lasted over 10 years past its 90-day expected life.
Re: (Score:2)
Put up or shut up: who are you, really?
Re: (Score:2)
If you're going to troll and be full of shit, save us both time and say so up front.
Re: (Score:2)
Boring troll is boring. Put your back into it, boy.
Re: (Score:2)
I don't believe a word of it.
Re: (Score:2)
Further, if this is you:
http://www.fiero.nl/cgi-bin/fi... [fiero.nl]
You're conclusively an idiot. Only an idiot believes in homeopathy.
Re: (Score:1)
Sorry, not the same person. No "nickname" is overly unique in the online world these days. It's not 1980 anymore.
But I don't want to blow my karma.
Re: (Score:2)
Yeah, bullshit. Your nickname has enough entropy that it's exceedingly unlikely this is not you.
Re: (Score:1)
Nah, it's not. But believe what you wish, it is a free country. ( assuming a US citizen here )
Re: (Score:2)
You really need to level-up your trolling. This is 101-level shit, son.
Re: (Score:1)
You are more than welcome to get a court order and demand IP addresses, then compare them. I'm sure both places log IP+posts.
Re: (Score:2)
*snore*
Re: (Score:1)
Don't need to prove anything to a potty mouthed coward.
anything starting with "why didn't they just..." (Score:3)
shoot the asker?
Re: (Score:2)
Re: (Score:2)
Re: (Score:1)
Ya, I figured that out, a bit too late.
Re: (Score:1)
What kind of handwaving armchair wannabe are you?
One that plans ahead well enough that this would not be considered 'news'. Instead it would be just SoP.
Assumptions (Score:2)
Re: (Score:3)
I believe you're assuming that the flash used on a rover that went to mars, and encounters all kinds of crazy radiation, is in some way similar to the crappy OCZ thing you stuck in your PC 10 years ago.
Re: (Score:1)
You're a poster child for Dunning-Kruger [wikipedia.org]: some random on the Internet who thinks he's smarter than the folks who designed a Mars rover that lasted over 10 years past its 90-day expected life.
Re: (Score:1)
Assigned a macro to this reply, haven't you? Clever boy, want a medal?
Re: (Score:1)
Don't forget, we don't hear what the techies are talking about. What we're hearing is what the techies told to the PR guy distilled down to a journo, being summarized in The Register (!) and some other soft-tech sites, finally an inaccurate summary on the frontpage of Slashdot.
I wouldn't be surprised if it were just a "fsck.ext4 -cc" (I know it's not ext4: that wasn't even released when Opportunity soft-crashed and bounced around on Mars, nor does the rover run Linux).
Failing flash cells? (Score:2)
Re: (Score:1)
It was designed to last 3 months and failed after 10 years. ;)
If OCZ was involved, it'd be the other way around.
I'd hate to be the guy (Score:2)
I'd hate to be the guy a) pitching this operation at the change control meeting, and b) the guy signing off on this change.
It worked on Spirit (Score:5, Interesting)
They had to do this type of thing on Spirit shortly after it arrived on Mars.
Read more here: http://trs-new.jpl.nasa.gov/ds... [nasa.gov]
or the PDF linked therein here: http://trs-new.jpl.nasa.gov/ds... [nasa.gov]
It's got all sorts of awesome details.
We commanded a shutdown, which terminated the current communication window, and the loss of signal occurred at the predicted time. Fifty minutes later, we commanded a beep at 7.8125 bps to alert us if the shutdown command did not work, and much to our disappointment, the beep was received!
Really a fun read. I'm guessing they'll be doing a lot of similar stuff.
stressful job (Score:1)
Capacitors (Score:1)
Sort of like ... (Score:2)
Now if only we could get a Martian to IM during the process: "Yes. The little red LED is blinking ....."
The Real Rover Problem Explained (Score:1)
Re: Protip (Score:1)
They should have used what genius?