Mars Rover Spirit Back Online
Skyshadow writes "Just in time for the arrival of its twin, the Spirit Mars Rover is back in working order. Programmers at the JPL have traced the problem to the rover's flash RAM, which it uses to maintain its filesystems. They are using a ramdisk in the rover's RAM to bypass the bad flash memory, and are working on a workaround for the bad flash. Good news, but the rover is still potentially weeks away from full operational status."
They found the problem (Score:5, Funny)
Weeks away? (Score:5, Funny)
You mite listen to Jimmy, But you can't hear Jimmy (Score:5, Funny)
Wonder where they got their flash RAM [newegg.com] from?
--
Warranty (Score:5, Funny)
I think they should return the bad flash part to where they got it and exchange it for a new part... although getting the memory back to the store by the 30 day warranty might be a little difficult.
I hope they bought the extended warranty.
Re:Warranty (Score:5, Funny)
Re:Warranty (Score:3, Funny)
Re:Warranty (Score:5, Insightful)
The NASA report mentioned the problem seems to be revolving around the software that accesses the flash RAM. It could be filesystem corruption, or a physical problem with the flash RAM itself, or even a broken interface to the flash RAM. At the moment, it's about the equivalent of having a machine a thousand miles away and just seeing that a certain drive won't mount. Finding out whether there's a problem with the SCSI card it's connected to, or the drive itself, or a filesystem corruption, or a head crash... that comes in the next few weeks.
Radiation hardened Flash (Score:5, Informative)
Maybe just maybe... (Score:2, Funny)
heh... /. was right! (Score:5, Interesting)
Re:heh... /. was right! (Score:4, Funny)
Re:heh... /. was right! (Score:5, Insightful)
Now let's all sing the company song...
Re:heh... /. was right! (Score:5, Funny)
"Oh, say can you see..."
Re:heh... /. was right! (Score:5, Funny)
Thousands of
Re:heh... /. was right! (Score:3, Insightful)
The epitome of remote administration (Score:5, Interesting)
Some people may take this sort of thing for granted, but I for one find it remarkable that we can essentially reboot and perhaps even fix a system that is on a whole other planet.
Just wait until we have Interplanetary, Interstellar, Intergalactic Remote Desktop. I'm only half-joking.
Re:The epitome of remote administration (Score:5, Funny)
Re:The epitome of remote administration (Score:5, Interesting)
You joke, but newer servers can do this remotely too.
We have a bunch of Compaq servers at work, and one of the really cool features of the remote administration software is that you can send a virtual floppy image to the machine from anywhere in the world that can open a web browser connection to the server's remote administration board.
A few months ago one of our servers in Denver died, and I had to boot it up in Windows 2000's command prompt only safe mode... but the local admin password had never been written down. I was able to make virtual floppy images of a tool that resets the local admin password, send them over the wire, and boot off of them from the remote administration system.
Okay, it's not fixing a super-expensive robot on another planet, but I thought it was pretty cool.
Re:The epitome of remote administration (Score:2, Funny)
Well, okay, I didn't slap him, but I wanted to. Badly.
But on your response -- that works. I mean, if you're doing something that you could just about do on another planet, it should count. Maybe not so glorious, but still.
Re:The epitome of remote administration (Score:3, Interesting)
Mmmm... hackalicious.
(I've actually used a similar remote kvm system with lights out boards but until you write it down it just doesn't sound that risky!)
Re:The epitome of remote administration (Score:2)
Ended up having to call someone who worked at the machine room to track down the crashed system and restart it
Re:The epitome of remote administration (Score:2, Informative)
kernel.panic = 120
That will tell the kernel to reboot itself 2 minutes (120 seconds) after a panic. It has saved me before.
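For anyone who wants to check what a box is set to, here is a minimal Python sketch. It assumes the setting lives in a sysctl.conf-style file; the helper name `panic_timeout` and the sample text are made up for illustration. On a running Linux system the live value is also readable from /proc/sys/kernel/panic.

```python
def panic_timeout(conf_text):
    """Return the kernel.panic value in seconds from sysctl.conf-style text, or None."""
    timeout = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "kernel.panic":
            timeout = int(value.strip())  # a later entry overrides an earlier one
    return timeout


# Hypothetical file contents, matching the line quoted above
sample = "# reboot automatically after a panic\nkernel.panic = 120\n"
seconds = panic_timeout(sample)
print(seconds, "seconds =", seconds / 60, "minutes")
```

A value of 0 is the usual default and means the kernel does not reboot by itself after a panic.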
It's a good thing the Spirit had an F8 key (Score:4, Funny)
You think that's neat (Score:5, Informative)
They fixed it. The fact there was a lisp REPL running on the spacecraft helped.
That's cool:
(unwind-protect
(progn (do-science)(talk-to-earth))
(wait-in-repl-for-earth))
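For readers who don't speak Lisp: unwind-protect runs its cleanup form whether or not the protected form exits normally. A rough Python analogue is try/finally. The function names below simply mirror the snippet above and are stand-ins, not anything from the actual flight software.

```python
log = []


def do_science():
    log.append("science")
    raise RuntimeError("instrument fault")  # simulate a failure mid-task


def talk_to_earth():
    log.append("downlink")


def wait_in_repl_for_earth():
    log.append("waiting in REPL")


def mission():
    try:
        do_science()
        talk_to_earth()
    finally:
        # Like the cleanup form of unwind-protect: runs on success or failure
        wait_in_repl_for_earth()


try:
    mission()
except RuntimeError:
    pass  # the error still propagates out of mission(); it is caught here only for the demo

print(log)
```

Even though do_science() blows up and talk_to_earth() never runs, the system still ends up waiting in the REPL, which is the whole point of the pattern.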
Re:You think that's neat (Score:4, Interesting)
A quote from his site: "It is incredibly frustrating watching all this happen... I can't even say the word Lisp without cementing my reputation as a crazy lunatic who thinks Lisp is the Answer to Everything"
I feel his pain. I was introduced to Lisp not too long ago, and within a short time, a Lisp-derived language (Dylan) became my favorite. I also found that many of the features I loved from Python were very Lisp-y in nature. Now, I see Java and C# either neglecting all the knowledge garnered from the Lisp-family of languages, or reinventing it --- badly. The features in C# 2.0 have either been in Lisp for decades (lambdas, closures) or are not necessary in Lisp (iterators, enumerators --- which, btw, are theoretically not necessary in C# 2.0 either because of lambdas and closures!) This new "Xen" (or X#) language Microsoft Research is pushing takes a great idea (extending the language to fit the problem domain) that has been a part of Lisp for decades, and chops it off at the knees. Instead of having proper macros, so you can extend the language to fit *your* problem domain, they hack support for a single problem domain (back-end business programming) into the language itself!
That said, the Lisp community is to blame as well. Part of the reason people stop listening the moment somebody says Lisp is that the Lisp community is *so* rabid and *so* unyielding. Especially some high-profile members who are highly respected within the community despite the fact that they are completely obnoxious and lack any human sense of manners.
Re:You think that's neat (Score:3, Interesting)
---
I bow down to your ignorance, oh mighty King of the Clueless!
Seriously, though, please research lambdas. They don't just save typing. They are *everything*. All of computation can be described just with lambdas of a single parameter. Everything else is just syntactic sugar. If you ease one restriction of the lambda calculus (no side-effects), lambdas can do procedural code, functional code, and even object-oriented programm
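To make the "lambdas of a single parameter" claim concrete, here is a small sketch of Church encodings in Python. It only shows booleans and arithmetic; the helper `to_int` is not part of the encoding and exists purely to display results.

```python
# Church booleans: a boolean selects one of two arguments
TRUE = lambda a: lambda b: a
FALSE = lambda a: lambda b: b

# Church numerals: the numeral n applies f to x exactly n times
ZERO = lambda f: lambda x: x
SUCC = lambda n: lambda f: lambda x: f(n(f)(x))
ADD = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
MUL = lambda m: lambda n: lambda f: m(n(f))


def to_int(n):
    """Convert a Church numeral to a Python int (display helper only)."""
    return n(lambda k: k + 1)(0)


TWO = SUCC(SUCC(ZERO))
THREE = SUCC(TWO)

print(to_int(ADD(TWO)(THREE)), to_int(MUL(TWO)(THREE)))  # prints: 5 6
print(TRUE("yes")("no"), FALSE("yes")("no"))  # prints: yes no
```

Every definition above is a function of exactly one argument; multi-argument functions are obtained by currying.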
We learn from our mistakes... (Score:5, Interesting)
You know how NASA works. The Space Shuttle running on 486's and whatnot. I understand the science behind that reasoning, as sad as a 66 MHz processor seems to us geeks nowadays, but I wonder if MRAM will prove more flexible and stable for future space missions.
Re:We learn from our mistakes... (Score:3, Informative)
Remote nonsense (Score:3, Insightful)
No you're not. All these Mars glitches are exactly why real space exploration entails sending an actual carbon-based unit, not a glorified laptop.
Consider that an interstellar probe will take years to receive updated instructions. By which time, any fix will probably be irrelevant. Plus if they're more than 30 light-years away (practically next door by galactic standards) the guy who sent out the i
Re:The epitome of remote administration (Score:3, Funny)
Snicker. Meet the new iron, same as the old iron.
So basically... (Score:5, Funny)
Where is the redundancy? (Score:3, Interesting)
Re:Where is the redundancy? (Score:5, Insightful)
Also keep in mind that this isn't a $5 flash ROM chip. When you consider the hostile environment, the testing, the power, and the fuel required to get everything to Mars, that flash ROM probably cost at least fifty thousand dollars.
Re:Where is the redundancy? (Score:2)
Re:Where is the redundancy? (Score:2, Funny)
I laugh at your puny triple redundant systems!! They should have QUADRUPLE redundant systems!!!
Re:Where is the redundancy? (Score:2)
I don't know anything about the architecture of the computers on the Rover(s), but I suspect when the term "Flash RAM" is used, they are talking about the redundant Flash memory, the mux/demux and arbitration circuitry. This means that if something on the Flash memory subsystem fails, it is simply described as a "Flash RAM" problem. I would suspect that the Flash memory would be consid
Re:Where is the redundancy? (Score:2, Informative)
Using redundant low reliability components is the cheap office solution, not the space exploration solution.
Re:Where is the redundancy? (Score:2)
Seriously, you just shouldn'
Re:Where is the redundancy? (Score:2, Interesting)
Re:Where is the redundancy? (Score:3, Insightful)
I think you folks all missed the point completely. They have full dual redundancy on EVERYTHING in the MER program. The computer systems aren't the only issue; there are little issues like landing in one piece, etc. To that end, they built 2 full systems, packaged them on 2 different rockets, and fired them off a month apart from each other. This gave full dual redundancy to every system and every component, from the initial launch igniters, to ev
2 years ago, back at NASA R&D... (Score:5, Funny)
Engineer 2: Yah, sure... Hey, remember that employee last month who got laid off within a week?
Engineer 1: Who? Vincent?
Engineer 2: Yeah, Vinnie... With the Italian accent?
Engineer 1: Yeah, him. What about the guy?
Engineer 2: Well, he has this offer on cheap RAM we just CAN'T resist!
Engineer 1: Really now? But-
Engineer 2: Look, our budget is already comparable to social welfare. We need to save some loot.
Engineer 1: Fair enough, buy the crap and hand me the other twisty-turny thingy over there? I need to screw on this name tag reading... "Spirit"?
Engineer 2: Look, it's either that or my wife's name.
Re:2 years ago, back at NASA R&D... (Score:3, Insightful)
If only we spent so much on NASA. They only get 12 billion.
Monday morning quarterback (Score:5, Insightful)
Re:Monday morning quarterback: RTOS tradeoffs (Score:4, Informative)
Actually, they might have protected memory if they use VxWorks AE RTOS/Tornado Tools 3.0 [findarticles.com]. Spirit uses VxWorks, but I don't know what version they used or when they had to commit to a particular version of VxWorks.
Also, as the article mentions, memory protection adds overhead and can affect real-time performance. Hard real-time software cannot afford to have a complex layered structure and lots of conditional code that adds unpredictable delays. For that reason, many really real-time applications run very close to the hardware (for better or for worse.)
Re:Monday morning quarterback: RTOS tradeoffs (Score:5, Insightful)
This is the conventional wisdom, and in my experience, this particular nugget causes more embedded and real time software projects to fail than any other.
First off, on a modern PowerPC processor, memory protection (that is, without virtual memory support) can be implemented very cheaply. If you can do it just with the IBAT/DBAT registers, it should be a constant-time overhead, which is good enough for hard real time. Oddly enough, I can't find a single reference on the net that measures the cost of memory protection alone on a modern CPU. Anyone? Anyone?
Secondly, though the rover certainly may have some software components that have hard real-time requirements, that doesn't mean that every single line of code does. Typically, less than 1 percent of the code in a real-time system is hard real time. In that case, you can run the real-time code in ISRs, or perhaps in a dual-mode system, like RT-Linux, or in high-priority kernel threads (as with QNX). In any of these situations, you can run all the rest of the code in protected memory space.
Re:Monday morning quarterback: RTOS tradeoffs (Score:2)
Maybe I'm crazy but the systems I run that have ECC are incredibly stable even when using Windows. My Alpha got 100+ days uptime with daily use on Windows NT4. I have an old Xeon that easily did 40 days, and I shut it down by mistake.
Re:Monday morning quarterback: RTOS tradeoffs (Score:2, Interesting)
I don't know that much about VxWorks, but I heard that one of its main assets is having a very small tight multitasking kernel.
They were able to regain the system, despite loss of a major computational component. Remotely. Through a debug link. That sure says a helluva lot for the robustness of the OS and how they configured it.
Good job, JPL.
Re:Monday morning quarterback: RTOS tradeoffs (Score:4, Insightful)
With VxWorks you can often get away without any filesystem because all the code is linked together in one big monolithic file. Separate tasks are not separate files (although you can have loadable object files).
Yes, AE does provide memory protection domains, but it still doesn't clean up after a task dies. Sure, you can free the memory, but not open files, semaphores, pipes, or other things. Malloc in AE is improved over the braindead implementation in standard VxWorks, but AE still has a long way to go. For example, it can't free up open file descriptors, semaphores, or other items a task was using, because those items usually aren't recorded as belonging to any task. So if you have a task that acquired a semaphore and dies, that semaphore will never be released.
Hell, Wind River couldn't even get malloc right! Their malloc has got to be the worst implementation I've ever seen! They place free blocks in sorted order (smallest to largest) in a linked list after attempting to combine a new free block with neighboring free blocks. The next time you allocate, it walks the entire linked list until it finds a block large enough! In our case we wound up with tens or even hundreds of thousands of small blocks causing our watchdog timer to kick in because malloc became impossibly slow. AE improves this to use a tree instead of a list, but it still fragments. I ripped out the Wind River implementation and replaced it with Doug Lea's dlmalloc and all our malloc problems were solved, and the fragmentation went from tens of thousands of fragments to only a few dozen.
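To illustrate why walking a size-sorted free list gets slow, here is a toy Python model. It is not Wind River's code: the list of block sizes, the function names, and the binary search (standing in for the tree that AE reportedly uses) are all assumptions made for the sketch.

```python
import bisect


def find_block_linear(free_sizes, request):
    """Walk a size-sorted free list from the smallest block.

    Returns (index of first block that fits, number of nodes visited).
    """
    for visited, size in enumerate(free_sizes, start=1):
        if size >= request:
            return visited - 1, visited
    return None, len(free_sizes)


def find_block_bisect(free_sizes, request):
    """Find the same block with a binary search over the sorted sizes."""
    i = bisect.bisect_left(free_sizes, request)
    return i if i < len(free_sizes) else None


# A badly fragmented heap: 100,000 tiny free blocks and one large one
free_sizes = [16] * 100_000 + [4096]

index, visited = find_block_linear(free_sizes, 1024)
print("linear walk visited", visited, "nodes")
print("same block via binary search:", find_block_bisect(free_sizes, 1024) == index)
```

With the list sorted smallest first, every large allocation has to step over all the small fragments, so allocation time grows with the amount of fragmentation. A faster lookup helps with the search cost but does nothing about the fragmentation itself, which matches the complaint above.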
For an RTOS being pushed for networking it isn't very good there either. It comes with an ancient BSD TCP/IP stack. If you have a device and want to see if it runs VxWorks, just run nmap against it. If it says TCP sequence number guessing is trivial, you can bet it's probably running VxWorks.
In today's world, VxWorks doesn't cut it any more. Any complex project should choose a real OS like QNX or even embedded Linux over VxWorks. For realtime, Linux usually isn't very good, but Timesys appears to have solved that problem nicely.
VxWorks isn't even that good at realtime. Usually you can't get any better resolution than twice the system tick period (usually 10ms), so you can't get better than 20ms of resolution in many cases.
I've also heard many rumours that Wind River is dropping AE, or at least not pushing it. We're not the only ones to have been burned by it. I've heard of only one other company that used it, and they were also burned. I think it was a startup that went out of business.
In VxWorks, all tasks share the same memory space. Think of every "task" as really a thread and you get the idea. In other words, if a "task" dies, the only way to clean up the system is to reboot.
Also, VxWorks doesn't scale. The more tasks you have, the slower it runs (i.e. no O(1) scheduler). And with the shared memory, the more complex the code, the harder it is to debug and develop a stable system.
QNX would have been a much better solution. In QNX, the core OS is very small, and if a task dies it can easily be restarted. In QNX, everything is a task with memory protection. The TCP/IP stack is separate from the core OS, for example, as are all the other drivers. If a driver crashes, it won't take the OS with it. Context switching in QNX is also very fast, faster than VxWorks even though memory protection is involved.
-Aaron
Re:Monday morning quarterback: RTOS tradeoffs (Score:3, Informative)
As a VxWorks programmer for the last 5 years, I can honestly say VxWorks is a PoS that is losing market share at a tremendous rate to the likes of embedded Linux and QNX. Wind River decided to spend tons of money buying add-in products like Routerware instead of improving their RTOS. It was
Re:Monday morning quarterback (Score:3, Informative)
Because they're fucking HUGE.
The uClinux kernel for 68k, which is more compact than SPARClite but maybe less so than x86, is 512K [uclinux.org].
That's a stripped-down kernel with no MMU support and the special uClibc C standard library designed to take less space.
I'm working on a digital camera with 512K of flash and 8MB of SDRAM. That flash is divided into 7 64K sectors and 8+16+16+32K little sectors.
Huh? Flash? (Score:2)
Re:Huh? Flash? (Score:2, Informative)
Title: Rad Hard Flash Technology
Abstract: The highest density radiation hardened non-volatile (NV) memory currently available is a 256 kbit EEPROM based on SONOS technology. One of the major limitations in developing rad hard NV memory has been the cost in bringing up the NV technology in a dedicated rad hard process facility, especially when weighed against the limited market size. One way to bring radiation hardening to an advanced electronic
Static Discharge? (Score:5, Interesting)
Re:Static Discharge? (Score:2, Insightful)
I'd hope that the RAM is in a shielded box given the amount of radiation it's getting from the sun and the rest of space.
Could be Soft Errors caused by Alpha particles though - depends on the technology used in the flash - unlikely, but possible...
Re:Static Discharge? (Score:3, Interesting)
The sequence of events that led up to this was, IIRC,
1. Rover extends arm ready to take a grinder to a rock.
2. Contact with Rover lost due to bad weather in Australia.
3. Rover bad.
So it had just moved part of its structure closer to the rock just before this happened.
Cosmic rays... (Score:5, Interesting)
Re:Cosmic rays... (Score:5, Interesting)
Maybe he was gung-ho about anti-radiation redundancy because he already knew the likely problem of the Spirit. Who knows?
- sm
Re:Cosmic rays... (Score:3, Informative)
So I think the rover's electronics are well protected from at least the Sun's radiation. I think Mars is 1.3 AU from the Earth, making it 2.3 AU from the Sun, so it should receive less than a quart
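A quick check of the inverse-square arithmetic, as a Python sketch. Two distances are computed: the 2.3 AU figure from the comment above, and roughly 1.52 AU, which is the value usually quoted for Mars's mean distance from the Sun. The function name is made up for illustration.

```python
def relative_flux(distance_au):
    """Solar flux relative to the flux at 1 AU, by the inverse-square law."""
    return 1.0 / distance_au ** 2


for label, d in [("comment's figure", 2.3), ("approx. Mars mean distance", 1.52)]:
    print(f"{label}: {d} AU -> {relative_flux(d):.2f} of the flux at Earth")
```

At 2.3 AU the flux would be about 0.19 of Earth's, which is indeed under a quarter. At about 1.52 AU it is closer to 0.43, so the Sun's output at Mars is reduced by less than the comment suggests.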
It's simple actually (Score:3, Informative)
You have two or more running in parallel. While one is running, the next reloads from ROM. When it's loaded and synchronized, you switch to it, and load the next one. You do that in series, over and over, so you're only using any particular FPGA for a couple of seconds at a time, and their configurations are constantly being refreshed. It's a very simple idea that can
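The rotation described above can be sketched as a simple schedule. This is a toy Python model under stated assumptions: the number of devices, the slot-based timing, and the function name are illustrative and not taken from any real flight design.

```python
def rotation_schedule(n_devices, n_slots):
    """Return a list of (active, reloading) device indices, one pair per time slot."""
    schedule = []
    for slot in range(n_slots):
        active = slot % n_devices
        reloading = (slot + 1) % n_devices  # next in line refreshes its configuration from ROM
        schedule.append((active, reloading))
    return schedule


for slot, (active, reloading) in enumerate(rotation_schedule(3, 6)):
    print(f"slot {slot}: device {active} active, device {reloading} reloading")
```

In each slot the device being reloaded is the one that becomes active in the next slot, so no device runs for more than one slot on a stale configuration.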
Software / Hardware Breakthrough? (Score:5, Insightful)
I do seriously wonder if these types of projects will tell us anything more than esoteric wonders of Mars, but from a strictly engineering standpoint, perhaps it's worth it after all.
Re:Software / Hardware Breakthrough? (Score:4, Informative)
The Full Story (Score:5, Informative)
Nice (Score:5, Interesting)
Somebody here on Slashdot nailed it... (Score:2)
You're all so damn smart. Sometimes I don't think I'm worthy of posting here.
Re:Somebody here on Slashdot nailed it... (Score:5, Funny)
Yeah, in the future NASA should just submit an Ask Slashdot whenever something goes wrong..
Re:Somebody here on Slashdot nailed it... (Score:3, Funny)
IANANE (I am not a NASA engineer), but.....
Inquiring minds want to know, (Score:2)
flash ram is known to fail on writes after a while (Score:2, Interesting)
Was NASA writing to that flash or just reading? A ram drive in flash sounds like it will access/write thousands of times a ?minute? This should wear it out quickly.
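A back-of-the-envelope version of that worry, as a Python sketch. The numbers are assumptions picked for illustration (100,000 erase/write cycles per block and 1,000 writes per minute); nothing here reflects the actual part or write pattern on the rover.

```python
def minutes_until_worn(endurance_cycles, writes_per_minute, blocks_shared=1):
    """Minutes until the rated write endurance is used up.

    blocks_shared models spreading the writes evenly over that many blocks.
    """
    return endurance_cycles * blocks_shared / writes_per_minute


# All writes hitting one block
print(minutes_until_worn(100_000, 1_000))  # prints: 100.0

# The same write rate spread evenly over 1024 blocks
print(minutes_until_worn(100_000, 1_000, blocks_shared=1024))  # prints: 102400.0
```

Under these assumed numbers a single hammered block would wear out in under two hours, while spreading the writes over many blocks stretches that to roughly ten weeks, so how the filesystem distributes its writes matters a great deal.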
Good News Everyone! (Score:2)
Well, bad news anyway. Bad flash? Maybe it was the solar storms. Can't they knock out flash, at least in space?
Steal SOME (Score:5, Funny)
Seriously, can you imagine the first manned expedition seeing the Beagle jacked up, tagged, up on little Martian cinderblocks? All that, and we already got a head start on building Martian cities.
Re:Steal SOME (Score:2)
Seriously, can you imagine the first manned expedition seeing the Beagle jacked up, tagged, up on little Martian cinderblocks? All that, and we already got a head start on building Martian cities.
What, like this [jacco2.dds.nl]? (from a comment [slashdot.org] in an earlier article).
Information on the MER hardware. (Score:5, Interesting)
From what I've pieced together, the MER system is something like this:
One RAD6000 PowerPC CPU.
Connected, probably via CompactPCI, to 128 MB of ECC SDRAM.
256 MB of flash. No info on what make of flash, but likely Intel since they are the biggest. There was some info from the press conference that there are actually two flash chips and that the flight software is redundantly stored on each. So does this mean that there is actually 128 MB of redundant flash? Also it was said that they had problems even with the redundancy; could they possibly have overwritten something? We all know that even a redundant RAID does not stop filesystem corruption.
No information on how the flash is connected (parallel or serial?) or how the redundancy works.
Btw, I guess flash is rather radiation hard since it requires 10 - 20V to erase / write.
Re:Information on the MER hardware. (Score:2, Informative)
RAD6000 6U Compact PCI [baesystems.com] page at BAE Systems.
It's not great, but there are more detailed links around the BAE website.
It doesn't list how the FLASH is connected; that's not a standard built-in on the RAD6000 computer. I would guess, hung off the FPGA interface device, but I don't know that for sure.
Re:Information on the MER hardware. (Score:2)
As far as I know, the biggest in Flash RAM is AMD, with the Atmels and Winbonds coming distant second. And Intel is among them.
Wrong colours again... (Score:2, Funny)
They should stick with purple next time.
Good news but lost time (Score:2)
Knowing about the problem before the twin lands is probably a good thing because they might anticipate the problem.
But if it takes weeks to fix, the solar panels on the lander will be degrading in the Martian atmosphere. They will miss the down time for Spirit's task list.
It must be so frustrating to sit on a possible fix and wait for a communication window, or computer response to see if you're right.
Salute the Helpdesk (Score:5, Funny)
last photo from Spirit (Score:5, Funny)
Re:last photo from Spirit (Score:2)
Why are they sending its twin so early? (Score:2, Interesting)
Re:Why are they sending its twin so early? (Score:2)
Relative positions of Earth and Mars (Score:3, Insightful)
Shame on my fellow American who said we should strip Beagle 2 and leave it up on cinderblocks. If Beagle is ever discovered to have soft landed, I would think the only proper thing to do
Flash? (Score:2)
Re:Flash? (Score:2)
Opportunity (Score:4, Interesting)
that line from armageddon comes to mind... (Score:5, Funny)
Re: Technically... (Score:3, Informative)
Oh, and he has another quote I liked too:
But maybe I just like it because that's how I tend to fix things too ;)
can't go shopping (Score:2)
As someone else said (Score:5, Funny)
(Posted by Jane Slee and John Stracke in separate usenet postings.)
Thank God (Score:2, Funny)
Nasa TV (Score:4, Informative)
We need open source rover software (Score:4, Interesting)
Besides, robot exploration software would be handy right here. It would be neat to be able to send a research bot out in the deserts, deep oceans and jungle canopies of the world. Machines can go where we can't.
Individually you can be damn annoying sometimes, but I'm constantly amazed and delighted by the collective intelligence of the /. pack.
Hear! Hear! I couldn't say it any better! (Score:4, Interesting)
It would be nice to be able to have some folks at JPL throw down the source code and engineering schematics and say to the geek/space/engineering community at large "We have a problem here and could use your suggestions to see if we can get this fixed."
This (the Mars missions) is obviously a big hit, as measured by replies on Slashdot, the number of hits on the website at JPL, stories in mainstream media, and other reasonable metrics to gauge the popularity of a project. I'm sure that there are several geeks out there that wouldn't mind digging into the source code.
The only reason I could see the engineers not wanting to do that is that it would open them up to obvious scrutiny for poor engineering and coding. (Whadda you mean the global variable named temp is the only variable. We also have temp2, temp3, and temp4. What do the numbers in those mean? You can get it from context, can't you?) That and some people just aren't used to allowing others into their "domain".
Being 100% funded by public money should also be further reason for why this should be opened up. I also totally agree.
Cut it out! (Score:4, Insightful)
Re:Not "online" at all... (Score:3, Insightful)
Nicely karma-whored. That's the link from the article. :)
Re:Follow the status? (Score:3, Funny)
This is like a reality TV show, I love NASA TV!
With the exception that this is actually real...
Re:Follow the status? (Score:3, Funny)
I've noticed that a few people stand facing the cameras a lot, gesticulating wildly as if talking about something important.
I also saw one guy go from reading a magazine and sipping a martini to furiously typing away at a keyboard as the camera panned across the room!
Re:Redundancy? (Score:2)
Sheesh... so you are going to condemn NASA based on a blurb on Slashdot? I don't suppose that you considered the possibility that the blurb is inaccurate or leaves out some critical details?
Re:Checksums (Score:5, Informative)
There are two separate flash memories on Spirit. At the moment, part of the problem appears to be in software: the system can evidently read part of the flash memories, since some of the operational software that is kept in flash RAM seems to be coming up before the system reboots.
The system is rebooting no matter which flash memory is being accessed, it has the same bug both ways, so the flash ram itself looks to be OK, but the interface between the flash ram and the software looks to be causing resets.
Even if there were more backup flash RAMs, it looks like they'd still have this problem. Perhaps many, all on different controllers, and even an entire backup computer would have prevented this. At 100 watts total power available for the rover, an entire extra computer may be a bit much to fit. But then sending two rovers would also negate problems, and that's just what they've done.
It seems most likely at the moment, according to NASA, that the problem lies in the family of components involved with the hardware addressing of the flash memories.
Re:The mission is not yet out of danger (Score:4, Funny)
Opportunity will most likely have the same problem since they are twin brothers and had an identical build process.
I quote from my post [slashdot.org] a couple of days ago:
Parent: So even if Spirit gives up the ghost, her kin can carry on the flame (albeit in a less interesting location).
Me: Not if the problem is due to a design fault. That's the drawback of sending multiple identical probes: if one is intrinsically fucked, they all are.
I now bask, contented, in the glow of my own brilliance....