ISS Computer Failure
A number of readers wrote us with news of the computer problems on the International Space Station. Space.com has one of the better writeups on the failure of Russian computers that control the ISS's attitude and some life-support systems. Two out of six computers in a redundant system cannot be rebooted. The space shuttle Atlantis may have its mission extended until the problem is fixed. A NASA spokesman was optimistic that the problem can be resolved; worst-case scenario would be for the shuttle to evacuate everyone onboard the ISS. Engineers are working on the theory (among others) that the failure may have been triggered by new solar panels installed earlier in Atlantis's mission.
DFMEA (Score:5, Interesting)
Hopefully they're starting with their DFMEA documentation... "guessing" at the problem and having "theories" is probably not a good way to go. Also, it's apparently a common-mode failure, which you shouldn't have in a safety-critical system; generally this is avoided by having different computer hardware and/or completely different code to do the same tasks.
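To make the "different code to do the same tasks" idea concrete, here is a minimal sketch of majority voting across independently written channels. The channel functions and names are hypothetical illustrations, not anything known to run on the ISS.

```python
from collections import Counter


def majority_vote(outputs, n_channels):
    """Return the value reported by a strict majority of all channels, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > n_channels / 2 else None


# Three hypothetical, independently written implementations of the same task.
def channel_a(x):
    return x * 2


def channel_b(x):
    return x + x


def channel_c(x):
    return x << 1


def redundant_double(x, channels=(channel_a, channel_b, channel_c)):
    """Run every channel and accept a result only if most channels agree."""
    results = []
    for channel in channels:
        try:
            results.append(channel(x))
        except Exception:
            # A crashed channel contributes no vote.
            continue
    return majority_vote(results, len(channels))
```

A single wrong or crashed channel is outvoted, but if the same bug (or the same power glitch) hits most channels at once, the voter has nothing to work with. That is the common-mode case.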
Quite unfortunate that it seems like systems engineering is lacking in more and more disciplines recently, although I suppose it makes good systems engineers more valuable.
My list for this would be something like: "Computer doesn't boot." Possible reasons: "No Power", "Insufficient power", "Corrupt memory", "Broken circuits", etc. Then you go down that tree further and find the root cause. The most disturbing thing is that they had such a major common-mode failure...whatever happened to the "no single points of failure" mantra?
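As a rough sketch of walking that tree, the snippet below uses only the four causes listed above. The structure and the function name are assumptions for illustration; a real DFMEA would go several levels deeper and carry severity and likelihood ratings.

```python
# Top-level failure and its candidate causes, as listed in the comment above.
# Empty dicts mark leaves; a real tree would have further levels under each.
FAULT_TREE = {
    "Computer doesn't boot": {
        "No Power": {},
        "Insufficient power": {},
        "Corrupt memory": {},
        "Broken circuits": {},
    }
}


def open_causes(tree, ruled_out=frozenset(), path=()):
    """Yield root-to-leaf paths for every leaf cause not yet ruled out."""
    for cause, children in tree.items():
        if cause in ruled_out:
            continue  # skip this cause and everything beneath it
        here = path + (cause,)
        if children:
            yield from open_causes(children, ruled_out, here)
        else:
            yield here
```

Each test on the hardware rules out a branch, and whatever is left is where you keep digging.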
* sigh *
Hey, here is a crazy idea (Score:3, Interesting)
No, no--I know it sounds crazy. But hear me out. Maybe we could actually pursue something NEW--you know, dare to violate that 30-year-old sacrosanct NASA policy of just repeating themselves over and over again and wasting trillions of $ on contractors and grandiose promises which never amount to squat.
Just a thought.
How bad a worst case? (Score:3, Interesting)
The stated worst case scenario is that the ISS will need to be evacuated, but if the remaining gyros are being overwhelmed, might the station enter an unrecoverable spin state before the problem is resolved?
Re:Hey, here is a crazy idea (Score:5, Interesting)
It's kinda like finding out the house you're currently building will cost twice as much as normal.
Do you just leave it half finished and abandon it or do you keep pumping money into it?
Re:System Wide Reboot? (Score:3, Interesting)
Re:DFMEA (Score:5, Interesting)
These problems are not easy to diagnose even when you have hands-on access, let alone from 200 miles above Earth.
I do hope that it is sorted out swiftly and the ISS and its occupants remain safe.
Re:OS? (Score:5, Interesting)
No.
On NASA's manned space equipment you will find no software that is not controlled by NASA. These folks don't just run a few tests. They spend thousands of dollars per SLOC in testing. They actually mathematically prove their software's correctness. Perhaps the Russian agency's quality isn't quite as high, but I still doubt their (or anyone else's) systems onboard the ISS have any OS at all. Most likely they are all custom embedded systems.
I'd counsel against jumping to conclusions about the cause of this solely based on the Russian origin of these systems. I remember a lot of people did that with the early Ariane crash [embedded.com] based on it being written in Ada, and ended up looking pretty silly when the problem turned out to be some ported code that wasn't rewritten properly for the new platform.
Re:How bad a worst case? (Score:5, Interesting)
Evacuating ISS is always a last resort, because should something happen to it while unoccupied, it'd be a total loss. We won't have another shuttle ready for a month or so, and I believe the Russians just recently did a Soyuz exchange, so there'd be no quick return, even if the problems were fixed. With attitude control in question, it could become too unstable for even a shuttle or Soyuz docking to occur.
Just for the record (Score:5, Interesting)
The first piece of the space station was Zarya, the Russian control module that was launched into orbit November 20, 1998. A few weeks later, on December 4, 1998, the U.S. module Unity was launched into space. On December 7, 1998, the two modules were connected.
That makes the ISS just over 8 years in service.
How old is Atlantis?
Space Shuttle Atlantis has completed 27 flights, spent 220.40 days in space, completed 3468 orbits, and flown 89908732 miles in total, as of September 2006. Atlantis visited Mir in 1997!
Atlantis is 23 years old as of last April. 21 years in service. More than twice as old as the ISS.
Now, tell me again - which is the real bucket of bolts? ISS or Atlantis?
NASA uses 30-year old UNIX derivative (Score:3, Interesting)
Re:OS? (Score:1, Interesting)
Try here [kasperd.net].
Re:OS? (Score:2, Interesting)
The kernel holds a lot of information, such as which processes are running, memory allocation, drivers etc. For a true in-place switchover to a new kernel (i.e., all programs keep running as if nothing happened), all that information has to be copied over.
The other option is to load the new kernel image to memory, shut down all processes and unload drivers, jump to new kernel and start a standard initialization. That would be the same as doing a 'shutdown -r', except that the new kernel is loaded by the old kernel instead of by the BIOS.
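On Linux, that second option is roughly what kexec-tools does. The sketch below only builds the command lines and prints them; it does not run anything. The kernel and initrd paths are placeholders, and using `systemctl kexec` for the shutdown-then-jump step assumes a systemd-based machine.

```python
import shlex


def kexec_reboot_commands(kernel, initrd=None, reuse_cmdline=True):
    """Build (but do not run) the commands for a kexec-style restart."""
    # Step 1: the running kernel loads the new kernel image into memory.
    load = ["kexec", "-l", kernel]
    if initrd:
        load.append(f"--initrd={initrd}")
    if reuse_cmdline:
        # Reuse the boot parameters of the currently running kernel.
        load.append("--reuse-cmdline")
    # Step 2: stop services cleanly, then jump into the loaded kernel,
    # skipping the BIOS/firmware stage.
    restart = ["systemctl", "kexec"]
    return [load, restart]


# Placeholder paths, for illustration only.
for cmd in kexec_reboot_commands("/boot/vmlinuz", "/boot/initrd.img"):
    print(shlex.join(cmd))
```

Note that this is still a restart as far as user processes are concerned; nothing here attempts the first option of carrying kernel state across.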