
Space Station BSOD

Posted by michael
from the reinstall-life-support-[Y/N]? dept.
Lostman writes: "CNN has an article that details a computer glitch that has occurred at the International Space Station. The problem disrupted all communication from the command computers on the station. Although NASA knows that this was because an onboard server had crashed, the cause of this was not immediately known." See also space.com, the BBC, or NASA's status update. NASA is using Windows for most of their computing functions, as mentioned here.
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    considering that we have NO idea what servers actually crashed!

    This just gives credence to my theory that the Troll High Council is the Slashdot editors themselves.

  • It was in the news a while back that they took a player and DVDs aboard the ISS. What region were they? Or did only the US crew get to watch movies?
  • I wouldn't want to oversimplify whatever situation is going on up there. Somehow, I doubt that they are running Windows on their main computers, but stranger things have happened. A quick skim through the crew logs shows that they have had problems with the network before. I wouldn't be terribly surprised if some of the computers are among the cheapest bits on the station.

    At least there is a separation for life support and some of the other more critical systems (though you'd think that satellite tracking and rudimentary communication would be separate as well..)

    This reminds me of a couple of things. I recall that one time, the space shuttle didn't launch because a bunch of computers (8 or 9) detected some sort of fault, and called the launch a `no-go.' There was another computer made by a different company that was looking at the same data, but it put up a `go' status. It turned out that the other computers were wrong -- the situation had indeed been a `go.' The parallel here is that the station has three command and control computers that are basically identical, and apparently running the same software. The software is probably a single point of failure..

    I wonder if the problem is because they are running some sort of monolithic application that can pretty much do everything. It's probably better to have a number of individual processes -- that way, if one thing crashes or goes completely nuts, the operating system can prevent them from knocking out other processes.

    I also heard from one report or another that the issue was with connecting to the database -- another potential single point of failure.

    Hmm.. Maybe we can still blame this on Microsoft ;-) I think a lot of people have succumbed to the idea that software failure is normal, and that there isn't anything you can do about it.. That's definitely an attitude that should change..

    Of course, if you have mostly-good software interacting with mostly-good hardware, some really bad things can occasionally happen, as we've seen with the hard disk corruption problems that have been cropping up with Linux 2.4 and VIA motherboard chipsets..
    --
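The point above about preferring several individual processes to one monolithic application can be sketched roughly as follows. This is only an illustration in Python, assuming nothing about the station's real software: the worker names and their one-line bodies are hypothetical, and the only thing shown is that a crash in one OS process does not take its neighbours (or the supervisor) down with it.

```python
import subprocess
import sys

# Hypothetical workers, each a tiny script run in its own OS process.
# "telemetry" simulates a crash by exiting with a non-zero status.
WORKERS = {
    "telemetry": "raise SystemExit(3)  # simulated crash",
    "lights": "print('lights ok')",
}


def run_isolated(workers):
    """Run each worker in a separate process and return its exit code by name."""
    results = {}
    for name, source in workers.items():
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
        )
        results[name] = proc.returncode
    return results


if __name__ == "__main__":
    for name, code in run_isolated(WORKERS).items():
        status = "ok" if code == 0 else f"failed (exit {code})"
        print(f"{name}: {status}")
```

The supervisor keeps running after the simulated crash and can decide per worker whether to restart, log, or ignore the failure, which is the separation a single all-in-one application cannot offer.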
  • I think they've actually been having trouble for a few days. The test of the arm had already been pushed back due to software troubles..
    --
  • For those young-uns in the audience, I can explain Michael's reference in the department ("reinstall-life-support-[Y/N]?"). It's an allusion to DOS's Choice command [easydos.com]. To get the prompt he gave as an example, you'd use:
    • CHOICE /T:Y,5 Reinstall life support
    The question mark and the key choices of [Y/N] appear by default (though the key choices are configurable). And, the "/T:Y,5" bit configures CHOICE to automatically default to "Y" if no key is chosen within 5 seconds (I figured that would be kinda helpful, given that it is life support we're talking about here).

    P.S. I'm looking for a new job in Web Development. I invite you to check out my portfolio [vt.edu] of hand coded HTML / JavaScript / CSS.

    Alex Bischoff
    ---
  • by Eg0r (704) on Thursday April 26, 2001 @06:26AM (#264478)
    They probably use one of them $10million, computer controlled, robot arm [nasa.gov] to press the ISS mainframe's reset button from earth.

    Oh... wait a sec! :-)

    ---

  • by leoc (4746) on Thursday April 26, 2001 @09:23AM (#264482) Homepage
    The Canadian company [mdrobotics.ca] who built this new robotic arm is using a space-hardened 386/387 system with all custom software, including the operating system.

    There is no mention what OS the thinkpad in the picture is running. For all we know that might be the "server" they are talking about... http://www.mdrobotics.ca/rws.htm [mdrobotics.ca]

    The web site runs linux [netcraft.com], though... :)

  • The Linux machine they were using was only for a single experiment, studying plant growth, I believe.


  • Can you imagine what MS marketing will make out of this if this turns out to be a Linux box ? (they have been aboard shuttles, so why not on the station).

    My point here is that mentioning MS now should absolutely not be considered MS bashing, but rather just stating the obvious: MS servers *do* crash for no apparent reason, a fact that you can't find mentioned anywhere on this site. [microsoft.com]


    --
    Don't use nuclear weapons to troubleshoot faults. [cryptome.org]

  • My immediate thought was that it was a Solaris based server. I remember when I first heard that IBM Thinkpad 760's were going on the ISS (I use a 760 still) and, being an OS/2 advocate, I wanted to see what OS was going up there. Sure enough they put Windows on the x86 laptops, but Solaris was on the mission critical systems and Windows was just for email and other comm uses. I think there was an NT server though. It was for the Windows laptops. If this was a Solaris based server and they didn't put much effort into the redundancy issues, then Sun really should take most of the heat, since my experience is that they really don't go down much at all. Now if it was a Microsoft OS based server and they still didn't do much redundancy work, then shame on NASA for using a product KNOWN to not stay running for very long.

    Note: When we were building a Solaris based system for Atlas V launch systems (used Java too) we had an OS configuration/hardware issue that had all the Microsoft advocates chanting about using Windows. Then I mentioned that this was the first OS based issue we'd had in the entire development effort. They shut up. Funny how it's common and accepted for Windows to screw up and management doesn't care. Because Microsoft apps don't run on *nix systems they want to rip it out at the first chance....Ignorance or what!

    I hope we find out because this kind of PR will only force the offending parties to do better work next time. Unless it really is Microsoft, they'll say NASA needs to put up a new space station running Windows 2000 or heXPee. IMHO.

    LoB
  • At least he will be leaving. Microsoft's operating systems and software aren't leaving any time soon. I'll bet there won't be too much of a delay in pulling it all out and replacing it with Linux after they realize how disruptive IT really is.

    My guess is that it'll take about 3 crashes before the server is replaced with Linux or Solaris. Probably 5-10 issues with the client machines before many of those go.

    Then again they could just not use them and just have a screen saver running. They'll still have to reboot them every few days or so but that could be put into the regular schedule. ;)

    LoB
  • by Dino (9081) <jd_dino@@@yahoo...com> on Thursday April 26, 2001 @05:41AM (#264490) Homepage
    I interviewed at Boeing for doing Space Station networking work.....here's the surprising part, the Space Station is all run off of 386s!!! They do most of the low level programming in assembly to squeeze out as much performance as possible.

    It totally blew my mind. This was about 14 months ago.
    ---------------------------
  • Let's just hope they have the original 'install disks' ready and at hand when they come to swap out anything that breaks.
    And a good comms method to get their new install key.
  • and if this whole crash is NT related like the article suggests, why hasnt anyone pointed out that you CANNOT run NT on a 386

    For the same reason they haven't pointed out that you can't run NT on a 486; because it's not true.

    -
  • There was a bit of news a couple of years ago about some weenie at NASA who issued an edict that only Windows systems should be used. Of course, all of those tried-and-true applications that were successfully running on Macintosh, UNIX, and other systems were destined for the trash can after that order was issued. Looks like our space program is now beginning to see the fruits of that wise decision.


    --

  • From a quick scan of the logs, I found:

    SHIP'S LOG 29 DEC [nasa.gov]

    We are apparently out of memory space on the disk, although we're not sure exactly how NT manages its memory.

    SHIP'S LOG 22 FEB [nasa.gov]

    At about 2200, we were reconfiguring some mail files which, with a lot of help from Windows NT, got put in the wrong place during the backup procedure.

  • Quote: (Roger Baker, marketing manager for CAE Electronics Ltd. in Quebec)

    "NT played no role in the Yorktown's LAN crash, Baker said."

    Surely you read the paragraphs that immediately follow your quote:

    Some outside observers, however, said they are not convinced NT is blameless.

    "It still boggles the mind that any divide by zero error on NT would cause a system to crash, let alone 27 end-user terminals," said Gil Young, corporate network engineer for a systems integration firm in Orlando, Fla. "I don't care what operating system, computer or application I'm using, I should be able to type in a zero and expect the computer not to crash, especially if that zero is to represent a closed valve."

  • by Dr.Dubious DDQ (11968) on Thursday April 26, 2001 @10:57AM (#264500) Homepage
    [...]shake the ISS around until the US system thought it was out of control and went into what is called Free Drift Mode.

    Great...so the ISS is really a giant pinball machine with one of the flippers locked up, so we need to get it to go "TILT" and shut down so we can reset it? :-)


    ---
  • Microsoft did a study of NT 4.0 downtime causes, and the results were split about evenly between "Hardware/Drivers", "Internal OS Problem", with quite a bit of "Administrator Error" thrown in.

    So, on NT4, at least, 99% of BSODs were not caused by hardware or driver problems. More like 50% of the non-preventable stuff.

    For more information, you'll have to dig out the 1999-era copy of InfoWorld where this was published.
    --
  • by IntlHarvester (11985) on Thursday April 26, 2001 @08:28AM (#264505) Journal
    Netware 3.12

    Yeah, memory protection is for wusses.

    Seriously, tho, in a former life as a network guy in the early 90s, I saw far more NetWare ABENDs than I've seen NT Bluescreens. It was generally OK for file+print, but if you tried to run any slightly non-standard NLM (AppleShare, OS2 namespace, backup software, btrieve, CD-ROM drivers, etc) you had to keep your fingers crossed. I guess that goes to show if you keep a product in maintenance for 10 years or more, anything can become rock stable.
    --
  • by IntlHarvester (11985) on Thursday April 26, 2001 @08:45AM (#264506) Journal
    XFree86 drivers run as root and have full access to your system's memory. Poorly coded user space X drivers could easily crash your system.

    NT servers don't use the Nvidia drivers and aren't expected to do things like optimize video playback. They generally run a rather generic unaccelerated SVGA driver. I've seen lots of bluescreens on servers, and none of them that I recall could be traced to the video drivers. There are the usual SCSI and NIC driver issues that could crash any OS, and for a long time in the NT 4.0 series, there was some issue in NTFS.SYS that caused systems to fall over.

    I'll accept that it's somewhat stupid to have a mandatory GUI on a server, but I don't think this is the stability issue that the NT-haters club makes it out to be. NT has/had plenty of larger reliability problems.
    --
  • It sounds really rather scary to me. Apart from the fact that three redundant computers going down at once just should NOT happen - if Endeavour hadn't happened to be docked, they'd have no voice/data uplink /at all/.

    Three redundant computers did not, actually, go down. ONE of the Command and Data Handling computers shut itself down, and Cmdr. Helms was unable to shunt functions it performed through the other two computers on the first day of troubleshooting. So, only one was actually down; the other two were part of the problem, or part of the solution, depending on your point of view, but they were not actually "down".
    ----
    lake effect [lakefx.nu] weblog
  • It's too late now, but at least this will be in the story when it gets archived.

    There are more than 100 computers on the space station, just counting built-in. Indeed, each individual experiment rack -- about the size of an apartment fridge -- will include its own computer and custom software written for that experiment, all intended to link into the ISS network for data transmission and science interface. Many of the racks in Destiny (and future modules like Columbus and Kibo) provide station functions such as robot arm control, and each of these has its own computer as well.

    But the core functions are called CDH (Command and Data Handling) [honeywell.com], including everything from navigation to turning the lights on and off: really, it's just the network infrastructure. Cabling is Thinnet. These computers are provided to NASA under contract by Honeywell, and are called MDMs, for Multiplexer/Demultiplexer. Think of a rack-mount swappable-processor system and you'll be close. These run the RTOS (Real Time Operating System) called VxWorks [windriver.com] (from Wind River) -- the same RTOS used on the successful Mars Pathfinder mission, and custom software written by Honeywell and specific system vendors using Matrixx [windriver.com] from the same vendor.

    The crew use laptops, and there are quite a number of them judging by photographs, many seemingly permanently linked into one or more MDM functions. Since the MDMs have no other interface to the crew, this makes sense. The laptops that link to the MDMs use Sun Solaris and a custom client that provides data feedback and a semi-graphical user interface, depending on function. These laptops go by the generic name PCS (Portable Computer System) [spaceref.com] and conform to specifications set during the mid-1990s. The PCS model in use is the IBM Thinkpad, and contrary to popular belief, these models have evolved along with the Shuttle and Station programs -- just more slowly than the commercial market. Models need to be constructed with higher-quality components and undergo flight qualification. The laptops available to Expedition One were (I believe) at least Pentium I-MMX class machines.

    Some of these laptops are dual-boot with Windows NT on the other partition. Windows NT does have a function on the space station, but it is in no way linked to the command and control systems as outlined above. The major purpose it serves seems to be e-mail, but probably also record-keeping and recreation in the form of games or playing portable media such as CDs or DVDs. (There is also a built-in DVD player in one module that the astronauts can gather around for "movie night".) Windows NT can behave perfectly well when given a known, well-defined set of hardware and a well-tweaked configuration. The astronauts have access to spare hard drives that have images created on Earth using Norton Ghost. In one incident during Expedition One this was insufficient, and a spare hard drive was sent up during the current shuttle mission in order to bring that laptop back into service. But since they have plenty, it probably did not materially affect operations to be missing one.
    ----
    lake effect [lakefx.nu] weblog
  • by DHartung (13689) on Thursday April 26, 2001 @06:38PM (#264511) Homepage
    sllort asks:
    Now what do you guys make of this?

    ... This would have been much easier with some bootable media that could run Windows. (Or if Shep was not indoctrinated by that "other" operating system).


    According to this Expedition One crew debriefing [google.com], Shep answered a provocative question thus:

    Ops LAN
    Was the service pack distribution system easy to follow?
    Shep: Yes. No problems.
    Sergei: I'd like to have a little more explanation of what is in the service pack.
    Shep & Sergei: That way we would have known if it was really critical to load the new version or not.
    Was the desktop configuration (SSC Client, SSC File Server) easy to navigate? Any suggestions on how to improve the desktop layout?
    Shep (joking): Go to a Mac OS.


    This fits with the wording: Shep is a Mac user. The log is tweaking him for being less technical because he uses a Mac. It's unclear if this section of the log was written by one of the cosmonauts, or possibly Shep tweaking himself. But he's known to have a real sense of humor.
    ----
    lake effect [lakefx.nu] weblog
  • There is no indication of an actual BSOD, since there is no indication of MS Windows being used. And how exactly would you get a BSOD screenshot unless you were using VMWare or something? Seems rather impossible to me.
  • I'm curious what/if any Linux document editing programs can display all the Russian characters? It sounds like that is part of the reason for using Windows at least on some of the systems that the Astro/Cosmonauts use for workstations.
  • Gee, and I was going to speculate that the missing Martian Lander was using Windows.

    Caution: Now approaching the (technological) singularity.
  • I find it quite weird that they operate a space station on a normal consumer point and click OS.

    I would have thought that with the resources needed to build an orbiting space station they'd have enough human resources to either build their own specialised OS, or customize some existing one (perhaps something like QNX).

    One can only fear what happens when they upgrade to one of the new Microsoft lease-based licenses, so that when their link goes down and they can't contact Microsoft's license server the entire space station shuts down :)

  • Hmmm. The ISS. Man I can't wait to see a Beowulf cluster of these.

    --
  • I haven't seen a BSOD on our Win2000 Pro installs yet, but I have seen it crash hard in two separate, reproducible instances.

    Trying to change my NT Domain password and my Netware 5 NDS password at the same time using the password change function did not BSOD, but simply rebooted the PC without changing the password anywhere.

    HP PrecisionScan 3.x and the HP official Win2000 drivers for the HP LaserJet 5/5M Standard printer do not get along. If the printer drivers have been on the PC at ANY TIME, even if they have been removed, running PrecisionScan reboots the PC. In addition, HP's crappy little MacroMedia installer program will reboot the PC when Autorun grabs it. If you skip the MacroMedia crap like anyone who is sick to death of HP's dog-shit quality installers, you can't get the install to complete. It will not accept any install location as valid. Granted, the HP problem has to do with their arrogance and sloppy coding, but Win2000 is supposed to be so incredibly stable, it shouldn't let a minor program like that crash the PC.

    --

  • Have to stay with Novell for this year. We merged with a company running Applied, and only have a Novell license for Applied.

    --
  • Perhaps it was a Mir sympathy crash...

    Space Stations of the World, Unite!

    Kevin Fox
    --
  • 1) The article says nothing about Windows or any other OS.
    2) Yes, NASA uses Windows.. they use windows 95 on their laptops aboard the station.... because they have long-standing procedures on how to use these notebooks reliably. When they crash, they know how long it takes to reset them, and just what to do, etc.

    But please don't just make it out like Windows fucked up the ISS. That's silly.
  • There's going to be a lot of Windows bashing on this story, but folks, remember we're not talking about the Windows 9x kernel here.

    I wouldn't want to trust windows 2000 with my life, but I haven't yet seen a BSOD on it

    I think the odd thing is that they have three systems, but they're all the same OS. Usually, these control systems are implemented three different ways, so that whatever bugs are present don't affect all of them.

    Windows 2000 would be a much saner choice, IMHO, if backup #1 was linux and backup #2 was another unix.
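The "three different implementations" approach mentioned above is usually paired with some form of majority voting. What follows is a minimal sketch in Python, not a description of how the station's computers actually work: the three `channel_*` functions, the pressure threshold, and the deliberately buggy third channel are all hypothetical.

```python
from collections import Counter


def majority_vote(readings):
    """Return the value a strict majority of channels agree on, else None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count * 2 > len(readings) else None


# Three hypothetical, independently written go/no-go checks on one input.
def channel_a(pressure):
    return "go" if pressure <= 100 else "no-go"


def channel_b(pressure):
    return "no-go" if pressure > 100 else "go"


def channel_c(pressure):
    # Hypothetical bug: wrong threshold, so this channel disagrees near 90-100.
    return "no-go" if pressure >= 90 else "go"


def decide(pressure):
    """Ask all three channels and return the majority answer (or None)."""
    return majority_vote([check(pressure) for check in (channel_a, channel_b, channel_c)])


if __name__ == "__main__":
    print(decide(95))
```

Voting only masks a fault when the channels fail independently. Three copies of the same software share the same bugs, so they will tend to agree with each other even when they are all wrong.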
  • by Webmonger (24302) on Thursday April 26, 2001 @05:37AM (#264527) Homepage
    Man, it is really bizarre to see a press release about an organization cold booting into safe mode. The way they write it up, you'd think it was rocket science. . .
  • CNN (http://www.cnn.com/2001/TECH/space/04/25/shuttle.spacestation.02/index.html) and the BBC [bbc.co.uk] report that all three Command & Control computers on International Space Station Alpha failed yesterday. They were either not working or not communicating, although life support and navigation were not affected.

    Apparently a single server is malfunctioning. Problems include not being able to communicate with the Station, command the new robot arm, nor turn off the Station navigation system. The Shuttle also cannot lift the orbit while the Station navigation system is flying the Station.

    A NASA page [nasa.gov] says:

    The primary result of today's computer problem was a loss of communication and data transfer between the Space Station Flight Control Room and the station. Communication capability was routed through Endeavour enabling the crew and flight controllers to talk to one another.

    Despite the difficulties encountered with the computer system today, all systems on board the spacecraft continued to function properly.

    We discussed some of the ISS computers in an April 4 article about ISS logs [slashdot.org], although not the C&C computers. Apparently there is a malfunction of the Control & Data Handling [erau.edu] C&C MDMs, not merely communications to the PCS C&C laptops. The 6MB PDF ISS overview [nasa.gov] describes CDH in Section 2.

  • What the hell does crappy video drivers have to do with the OS stability?

    Errrr ... the fact that they *CAN* crash the operating system.

    Now, while this may be acceptable behaviour in a high performance workstation (maybe ...), it is completely unacceptable in a mission critical server.

    IIRC, this started in NT4, prior to that (ie 3.51) it was not possible. It is certainly not possible in many other well designed mission critical OS's.

    Basically, the driver should not be in kernel space where it can cause that damage.

  • So, it seems, the LAN is using thinnet. Makes sense -- it's shielded.

    We should be careful about jumping to conclusions about this being an NT BSOD problem. That usually isn't an all day affair to fix. Now, a bad terminator resistor on thinnet segment or a crimped cable, or a slightly wacky transceiver could cause a tricky to diagnose problem. One of the big wins for 10*BaseT*, aside from using standard phone cable, is that error detection and isolation is much easier in a hub and spoke topology than it is in a bus.

  • So, what do you do with the shielding? Ground it? To what?
  • by overshoot (39700) on Thursday April 26, 2001 @06:20AM (#264538)
    Of course, the fact that NASA had just installed a bunch of critical hotfixes from Microsoft's FunLove-infected update site is purely coincidental.
  • Apparently, the station came back online [bbc.co.uk] before having to "rock the casbah." I guess that's good because we didn't have to intentionally break things. It's unfortunate because I love silly, brute force solutions like this.
  • Also, according to that same article, it is sounding more like a software glitch again. Of course, that still doesn't mean it is Microsoft.
  • The thinkpads are called the PCS's (Portable Computer System). They run Solaris and use a custom graphical program to interface with the computer systems of the ISS. They are only interfaces and don't actually "control" the station. They CAN send signals requesting certain control items, but all the control system software is on a separate system.

    I have no idea what the arm and stuff is running and how it communicates with everything else.

    I think there are also a couple thinkpads that are Windows only, but they are just used for email and reading documents and stuff (nothing mission critical).

  • by boarder (41071) on Thursday April 26, 2001 @08:45AM (#264545) Homepage
    That is not what happened at all. The IBM thinkpads are just INTERFACES for the control system. They don't actually control things. They just allow the astronauts to see what is going on in the station and send commands. All of the actual control (autonomous and commanded) is done by other machines: three Command and Control Multiplexor/DeMultiplexors (not running windows).
  • by boarder (41071) on Thursday April 26, 2001 @01:46PM (#264546) Homepage
    Yes, I saw the humor, but I didn't think it was relevant or correct.

    In this case, the problem was not with the interface software OR interface computer (thinkpad) but with the core system (they were still not sure whether it was software or hardware last I checked). Not only that, but the software of the Thinkpad was not provided by a "monolith^H^H^H^Hpoly" unless you consider Sun Solaris a monopoly.

    I guess I always did think of HAL as an OS and not an interface. That is an interesting revelation to me, but that still doesn't change the fact that the interface didn't cause the problem and the fact that the interface wasn't supplied by a monopoly.

  • by boarder (41071) on Thursday April 26, 2001 @08:57AM (#264547) Homepage
    First off, Windows almost definitely did not cause the crash; /. personnel are the only people saying that. It was a hardware failure in all likelihood, occurring in the US control module (probably in the Command and Control MDMs). I can't believe the kind of reporting going on here; it reads like a M$ FUD press release. Blue Screen of Death my ass!?!

    What really happened is the US control module computers stopped responding to any inputs from the ground. They weren't able to control the station or tell it to shutdown or anything. Their plan to fix it (last I heard) was to have the Russian control module move and shake the ISS around until the US system thought it was out of control and went into what is called Free Drift Mode. In this mode, it can be completely controlled by the Russian module and we can debug the system and bring it back online.

  • by p3d0 (42270) on Thursday April 26, 2001 @09:17AM (#264548)
    IIRC, the stated reason for using Windows is that astronauts (who are not necessarily computer experts) can manage it. Well, is it worth the risk?

    Wouldn't it be better to use whatever system is best for the job, and send a computer guy up there to maintain it?

    (Yes, I admit it, I'm only suggesting this because it increases my chances of getting into space from zero to negligible.)
    --
  • The operation of the 386 is well known and studied; any bugs in the chip are well documented and can be programmed around.

    And yet these are the same people who chose a Microsoft product as their OS... Scary.
  • There's *nothing* in the CNN article ... implying that Windows is the reason for the server crash

    Micro~1.oft spent a lot of time, energy and money to ensure that their OSes were dominant on the ISS. They have spent millions of $$$ just to place a few hundred copies on the ISS, in the space flight centre, and in the russian control centres. The reason for this massive cost was to use the ISS as a giant marketing tool, and they even created a whole marketing campaign around it.

    Windoze is not the only OS on the ISS, but it is dominant. There are some *nixes running critical communication processes, such as the main link from the station to ground points, and these have not had many problems at all.

    When the M$ servers started crashing [theregister.co.uk], the whole micr~1.oft in space campaign was put on hold. If you read the logs created by the station crew, they are pretty upset having to spend entire days trying to fix micr~1.oft problems. NASA has a direct line into the best and brightest engineers at M$, but even they are clueless as to why certain processes hang, why backups fail to happen, why entire directories are blown away with no trace, or why new patches cause driver conflicts.

    Since the Register article highlighting the ISS problems in the logs, micr~1.oft has been putting pressure on NASA to redact all mention of micr~1.oft. Certainly someone has been archiving copies of the logs since they appeared, so they can diff them later and see when NASA bows to micr~1.oft pressure.

    As you noticed, none of the mainstream reporting now mentions micr~1.oft by name; that is due to a pressure campaign by one of the largest advertising budgets in the US. But when the logs are posted for these events, you will notice a great many references to the machines running micr~1.oft, even if the name of OS is redacted out. If you do a little research, you will see these machines are running either DoS [nasa.gov] or windoze.

    the AC
  • I had heard it was a combination of factors:

    user error - entering an illegal value in a DB

    app error - DB accepting illegal value

    OS error - OS crashing because of divide by 0 because of previous errors.

    The first is understandable.
    The second is unacceptable (should have been caught in test).
    The third is unforgivable.
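The first two layers in that list (bad input, and an application that accepts it) are the ones ordinary validation is meant to catch. Here is a rough Python sketch, assuming a made-up calculation: the function name, the 0-to-1 valve range, and the choice to report zero flow for a closed valve are all hypothetical and are not taken from the Yorktown software.

```python
def flow_per_unit_opening(flow_rate, valve_opening):
    """Return flow divided by valve opening, treating a closed valve (0) as valid input.

    valve_opening is assumed to be a fraction from 0.0 (closed) to 1.0 (fully open).
    """
    # Reject values that are not plain numbers (bool is excluded on purpose).
    if isinstance(valve_opening, bool) or not isinstance(valve_opening, (int, float)):
        raise ValueError(f"valve opening must be a number, got {valve_opening!r}")
    # Reject values outside the assumed physical range.
    if not 0 <= valve_opening <= 1:
        raise ValueError(f"valve opening must be between 0 and 1, got {valve_opening!r}")
    # A closed valve is legal data, so handle it explicitly instead of dividing by zero.
    if valve_opening == 0:
        return 0.0
    return flow_rate / valve_opening


if __name__ == "__main__":
    print(flow_per_unit_opening(10.0, 0.5))
    print(flow_per_unit_opening(10.0, 0))
```

The design point is that zero is handled as a meaningful value rather than an error, while genuinely illegal input is rejected at the boundary with an exception the caller can catch, so nothing further down ever sees a division by zero.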

  • by sconeu (64226) on Thursday April 26, 2001 @09:31AM (#264568) Homepage Journal
    Dude, I was referring to the Yorktown discussion thread. I never said it BSOD'ed. I said crashed. There's a difference.

    Here's the article [gcn.com] about the Yorktown.

    I used to work for a defense contractor, so I know how these things should be tested. You don't just test on good inputs, you test with bad ones. That's why I said that the app crashing was unacceptable. However, nothing should ever cause an OS to crash, especially in a military environment.

    It doesn't have to be a BSOD, it could be some other failure mode, which is what appeared to happen to the Yorktown.
  • ISS: Ahhh control the NT CD won't boot.
    Control: Yeah, that's right, that old 386 BIOS doesn't support CDROM booting but damn it's got some rad rad protection!! Just run winnt /b to make boot floppies.
    ISS: OK, so how do I get to the CDROM?
    Control: Well you boot off the MSDOS boot floppy that has CDROM drivers of course.
    ISS: OK, done that... ahh control it started copying over that boot disk and is now complaining that it cannot find command.com and is asking me where to find it. It asked me to label these disks 1 to 3 NT something or other, so I'll boot off this first one...
    Control: No ISS, that won't work, it actually copies those floppies 3, 2 and then 1 (boot). Do you have another MSDOS boot floppy with CDROM drivers?
    ISS: No, but Igor says he has the "Deb-Ian(?)" boot floppies and that I can "ftp install it from his Notebook?!?!????", he reckons that we can install "Lee? Nooks?", do something to the NT CD, create a FAT partition, copy the i386 dir to it, set it bootable with eff-disk and "sys it" from FreeDOS(?), reboot and then install NT from there????
    Control: Stand-by ISS, we have MS support on the line, we have been placed in a queue and will be answered by the first available opperator, she's saying something about us being a valuable customer, just a sec... ummm, you guys would'nt happen to have a Visa or Mastercard handy?

  • Is there any way to implement a system whereby we moderate /. editors? Ideally our moderation would affect salary directly. :)

    In all seriousness though, I know that we can change our preferences to ignore articles from certain editors, but perhaps an editor moderation system would increase the quality of headlines and submissions around here. While most publications make headlines inflammatory and eye-catching, few blatantly lie like /. has lately.
  • Then Katz would certainly be the high priest to this council.

    The majority of rants . . . er articles on slashdot are often incorrect, biased, pure propaganda, reactionary, immature and half-baked.
  • the Space Station is all run off of 386s!!!

    Not true! The article suggests that this was an NT failure. From the M$ site the minimum system requirements for NT are "At least a 486-DX2 33MHz processor"

  • Probably an impossibility... but is it possible to contact Shep regarding a clarification of this?
  • by RollingThunder (88952) on Thursday April 26, 2001 @01:10PM (#264586)

    Not that I believe this at all, but it occurred to me and I figure it's amusing enough to share.

    CNN:
    A delay in the departure of Endeavour could mean a delay in the launch of space tourist Dennis Tito aboard a Russian Soyuz craft. Tito was scheduled to lift off on Saturday, but that mission would have to be delayed if the computer problem is not corrected, NASA spokesman Doug Peterson told Reuters.

    "Sorry, Dennis. That darn computer system crashed again, we just can't let ya launch right now. We figure it'll be fixed by... oh... October." <sotto voce: Frank, have you finished the bluescreen plan for Friday yet?>

  • by jbridge21 (90597) <jeffrey+slashdot AT firehead DOT org> on Thursday April 26, 2001 @08:07AM (#264590) Journal
    It is specifically Solaris x86 running on a laptop.
    -----
  • ...maybe they were really Really REALLY stupid and got infected with Chernobyl [yahoo.com]. The articles say the crash happened Wednesday in USA time, but what time zone does the ISS use for its computer clocks?

    Plus there's that M$ support site infected with FunLove [theregister.co.uk]. Or maybe it was just a hardware failure...

  • I'd much rather use a 386 than a new 1.33GHz machine. Those 386 chips were built like tanks; you can run over them with a car and still use them. The fan could die and it would run for months without it. The operation of the 386 is well known and studied, and any bugs in the chip are well documented and can be programmed around.
    =\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\=\= \=\=\=\=\
  • by hardcode (105714) on Thursday April 26, 2001 @08:32AM (#264601) Homepage
    Try http://www.theregister.co.uk/content/2/18540.html to find NASA's rebuttal of that Register story. Seems it's not only /. that froths at the mouth at the thought of bashing IBM and Microsoft.

    hc
  • by criticalrealist (111008) on Thursday April 26, 2001 @08:56AM (#264602) Homepage
    In addition to Microsoft Windows NT, they're probably using Coax, or 10Base-2, also known as thinnet. They probably have BNC connectors on the backs of the NIC's. The logs say they fixed their network problem by jiggling the cables. That's an indication of 10Base-2 if I ever saw one. The logs said they had to cold boot. This is frequently the case after a coax network crash.

    Coax would have the advantage of plenty of shielding from electromagnetic interference. Otherwise, no advantage.

    If you're reading this NASA, here's some advice. Buy some little metal doohickeys for the back of each networked computer. These doohickeys fit around a coax cable, can be screwed into the back of a power supply, and cost about 5 cents. In my experience, using these helps stabilize the cables a lot, and you get more uptime that way.

  • Pertinent quote from CaptianZapp's The Register link [theregister.co.uk]:
    NASA hasn't said what the problem machines are but all a strong body of evidence points to IBM Thinkpads featuring older Intel processors, when the project began around two years ago these machines used 486 chips.

    Back in February we obtained exclusive pictures on a crashed IBM Thinkpad on board the space station. Subsequent emails from our readers revealed these machines were involved in far more than playing space invaders. It seems the laptops were running most of the main functions on board the station, including the communications functions that have failed.

    Also see this link [theregister.co.uk] for more confirmation that ISS depends on MS products (whether DOS or Windows) for more than Leisure Suit Larry...
  • by rjamestaylor (117847) <rjamestaylor@gmail.com> on Thursday April 26, 2001 @08:04AM (#264607) Homepage Journal
    Let's get this straight: a space station built with international cooperation has a computer error threatening to cut off communication with earth-based command-control? The computer is an IBM Thinkpad? The year is 2001?

    That's a space odyssey, er, oddity.

    Open the pod-bay doors, HAL.

    I'm sorry, Dave, I'm afraid I can't do that

    And the software in question is provided by a huge monolith^H^H^H^Hpoly...
  • How about a heterogeneous OS environment?

    Set up your main e-mail server to be Sparc Solaris running sun's sendmail, your secondary e-mail server as Alpha Linux running sendmail, and your tertiary e-mail server to be Intel OpenBSD running qmail.

    No trivial task for the ISS people, but if they had 3 programming groups working on 3 implementations of the same communications code, each for a different platform/OS, a single software bug would be much less likely to take out all of the redundant systems at once.

    Going 3 times over budget isn't bad, is it?

    -f
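
    A rough sketch of the idea, assuming three independently written implementations of the same function and a simple majority voter. The three checksum routines are made-up stand-ins for "the same communications code", and one of them carries a deliberate bug so the voter has something to mask.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def checksum_a(data: bytes) -> int:
    """Hypothetical team A: built-in sum."""
    return sum(data) % 256


def checksum_b(data: bytes) -> int:
    """Hypothetical team B: explicit loop."""
    total = 0
    for byte in data:
        total = (total + byte) % 256
    return total


def checksum_c(data: bytes) -> int:
    """Hypothetical team C: has a bug, crashes on empty input."""
    return (data[0] + sum(data[1:])) % 256


def vote(implementations: Sequence[Callable[[bytes], int]],
         payload: bytes) -> Optional[int]:
    """Run every implementation and return the strict-majority answer.

    A crashing implementation simply loses its vote. Returns None when
    no answer is shared by more than half of the implementations.
    """
    results = []
    for impl in implementations:
        try:
            results.append(impl(payload))
        except Exception:
            continue  # this version failed; the others may still agree
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(implementations) else None


VERSIONS = [checksum_a, checksum_b, checksum_c]
```

    Here `vote(VERSIONS, b"")` still produces an answer even though one version crashes. Three copies of `checksum_c` would all fail on the same input, which is the failure mode that running identical software on every machine leaves open.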

  • That is incorrect. As an orbiting object decreases speed, it falls in its orbital path.

    This is why satellites eventually lose their orbit and burn up in the atmosphere. They experience deceleration due to air resistance.

  • First of all, this has been hashed before on /. [slashdot.org]

    Second, it's not that the P3/P4 is more sensitive to radiation. It's that the i386 and i486 [google.com] have been around long enough to have had the military and NASA pay for radiation hardening, not to mention low power consumption, which is just as important in space.

    hardened cmos device [intel.com] - with actual rad specifications.
  • by revbob (155074) on Thursday April 26, 2001 @06:36AM (#264624) Homepage Journal
    You were either misinformed or you misunderstood what your interviewer said.

    Real time software for mission critical systems is written in Ada. That's a no-brainer. If there is any assembler, it's tiny, of severely limited scope, and meticulously tested. In fact, having worked with some very low level networking code for ISS (in Ada), I doubt there's any assembler in there at all.

    As to the 386's, they're rad hardened and known reliable. And, unlike the home computer I bought a couple of months ago that's state of the art, whether I need state of the art or not, the jobs these CPUs had to do simply didn't require anything faster than a 386, even given a hefty allowance of spare cycles and memory for future growth.

    We bought what we needed (in space, rad hardening is not optional) and we didn't buy what we didn't need. That's not $400 hammers, that's the definition of responsible stewardship of the public's money.

  • The launch of space tourist Dennis Tito aboard a Russian Soyuz craft on Saturday will have to be delayed if the computer problem is not corrected, NASA spokesman Doug Peterson told Reuters.

    It's no secret that NASA isn't too keen on Tito's planned visit to the station. Looks like their choice of Windows will help them out in this regard!

    --

  • by tibbettsatmit (157338) on Thursday April 26, 2001 @06:30AM (#264627) Homepage
    It is not particularly scary. Software systems don't benefit from redundancy in the same way that hardware systems do. Most software bugs are systemic (i.e., an uncommon code path that just doesn't work). So redundant software systems (even ones that are multiple separate "clean room" implementations) frequently go down at the same time when in the same operating environment. For more information check out the work of Nancy Leveson [mit.edu] and the other people in her group.
  • Maybe there is a reason that the MSnbc article doesn't mention anything about Operating Systems... Have you forgotten what the "MS" in MSNBC stands for? (Here's a hint: Microsoft!) Though the last line of the /. article says: NASA is using Windows for most of their computing functions, as mentioned here. [nasa.gov]
  • by b0r1s (170449) on Thursday April 26, 2001 @07:35AM (#264631) Homepage
    ONE server went down... the THREE you speak of were clients, which of course are useless because of it.

  • Just in from The Register [theregister.co.uk]:

    Steve Husty, a senior software engineer who works for NASA on the portable computer systems used on the International Space Station, has written to correct us on aspects of our story about the failure of computers aboard the International Space Station. In the process he's provided us with an interesting explanation of the technology on the space station which we've published below.

    The IBM Thinkpad laptops to which you refer, called PCS (Portable Computer System) are used throughout the station. They are indeed 486 based laptops. However, they are running Sun's Solaris OS for x86, and the OpenWindows WM, and a custom application that provides a graphical interface to the various on-board systems.

    He also writes that the computers that crashed were not the laptops:

    The computers that crashed (the C&Cs) and the PCS laptops are not the same computers and that the latter, while important are not responsible for running the station's operations.
  • This reminds me of the US Navy ship that had its operational systems running on WIN NT. When they had a BSOD, the ship was dead in the water, and had to be towed in. There is this government news article [gcn.com], which has the details of that old story.

    We simply cannot have people's lives being dependent on software that can crash. In a business context, we can get used to crashes; after all, it is only data, and it is only the livelihood of the business at stake. It is only maybe millions of dollars. In space, it is lives.

    Which OS would you be willing to literally bet your life on?

    Check out the Vinny the Vampire [eplugz.com] comic strip

  • There is no indication of an actual BSOD, since there is no indication of MS Windows being used. And how exactly would you get a BSOD screenshot unless you were using VMWare or something? Seems rather impossible to me.

    You use a camera. Check out this short Register story [theregister.co.uk], which has a link to a very high rez photo where you can sorta make out the error messages, especially if you are familiar with the system.

    Check out the Vinny the Vampire [eplugz.com] comic strip

  • A NASA engineer wrote in to the Register [here [theregister.co.uk]], and supplied extra info on the systems onboard. Here are the essential bits:

    The IBM Thinkpad laptops to which you refer, [are] called PCS (Portable Computer System) [and] are used throughout the station. They are indeed 486 based laptops. However, they are running Sun's Solaris OS for x86, and the OpenWindows WM, and a custom application that provides a graphical interface to the various on-board systems.

    It is not unusual for a project of this size and scope to be using technology that seems dated to the man-on-the-street. [...] The PCS runs its own applications, which have very little to do with the actual main function operations in a module. [...] The laptop's processor is not involved in the calculation, monitoring or execution of the station's processes. [...] The computers that crashed (the C&Cs) and the PCS laptops are not the same computers

    So our original assumption was wrong. But that still leaves us with the other question of what *are* they running on the main system.

    And the Original question of what you would bet your life on is also still interesting.

    Check out the Vinny the Vampire [eplugz.com] comic strip

  • Almost all processors, modern or old, are vulnerable to cosmic ray glitches. Modern electronics have not necessarily proven to be any more sensitive to radiation, but many "experts" would love for you to think so.
  • You are correct that it takes a smaller amount of charge to cause a bit flip, but the cell also collects less charge because it has less volume. Furthermore, because cosmic rays have extremely high energy, their ion tracks have a very dense core surrounded by a less dense outer region. The likelihood that any individual cell will be hit by the densest portion of the track is lower... if the threshold for a bit flip is higher than the charge deposited by the less dense portion of the ion track, there will be fewer upsets per cell.
  • In some respects, I could be, but the generally accepted definition of radiation hardening is to build the electronics with another layout and/or foundry.

    I just wanted to point out that the other alternative is sometimes the better path. See what issues you have and then use good engineering to make them non-issues. For many years, the process has been: "Well, we are going to put this into space. Okay, well let's have Lockheed Martin (now BAE NA), or Honeywell or Sandia make a radiation hardened version and we will fly that."

    It really is not that difficult to build simple circuits that perform EDAC, measure current levels, and reset units.

    The other thing is that a number of times, the result of radiation hardening is not that the device is less susceptible to most SEEs, but merely to total dose. That was the case with the ADSP21020, and that is pretty useless in my opinion. You can put some simple shielding around the device (like the SEi (now Maxwell) RadPack(tm), but simpler) and significantly decrease the dose that the device will see in space.
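
    For readers who have not met EDAC (error detection and correction) before, here is a minimal software sketch of the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word. It is an illustration of the principle only; nothing in this thread says which code, if any, the ISS hardware uses, and real EDAC is normally done in the memory controller, not in Python.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word: List[int]) -> Tuple[List[int], int]:
    """Correct up to one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position that was repaired, or 0 if the word was clean.
    """
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # binary address of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


# Simulate a single-event upset: flip one stored bit, then read it back.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1
recovered, position = hamming74_decode(stored)
```

    The cost is three extra bits per four data bits, and a double bit flip in one word is miscorrected, which is one reason memory is usually scrubbed periodically before errors accumulate.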
  • by boing boing (182014) on Thursday April 26, 2001 @07:47AM (#264647) Journal
    I just want to contradict one point you made: "in space, rad hardening is not optional".

    That is incorrect.

    Microprocessors (and electronics in general) have a wide variety of radiation responses out of the box. For instance, the AMD K6 is known to be pretty bad for single event latch-up and not very usable. On the other hand, the PC603 is actually not too bad right off a commercial foundry line.

    With this in mind, there are also a number of ways to mitigate radiation effects, including latch-up protection circuits, EDAC, redundancy, cold sparing, etc. These methods can reduce the number of effects that propagate to the subsystem or system level.

    Radiation hardening in many instances can also succeed in preventing effects from reaching the system level, but there are a number of penalties to pay. Schedule is often the biggest (as you know, many rad hard processors are very old), cost (this stuff isn't cheap since it is boutique), performance (many rad hard processors can't perform to the speed of their commercial brothers because of layout changes, extra resistance etc.), and also many times the required power and size can be affected.

    Now we are presented with two paths: 1) radiation harden a processor, 2) measure the rad effects of a commercial processor and mitigate them with extra circuitry (which has its own extra liabilities in cost, power, size, but typically are much lower).

    In some instances, rad hard is the right choice (in human flight missions, it tends to be a good choice, but not always), and in others, commercial products with some workarounds are best.

    Simplifying the issue to "rad hardening is not optional" is wrong...it is optional, but if you say "radiation effects must be dealt with", then I agree with you.
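
    As an illustration of the latch-up protection idea (watch the supply current, power-cycle the part when it spikes), here is a small software sketch. In practice this is a hardware circuit; the class name, the current limit, and the `power_cycle` callback are all assumptions made up for the example.

```python
from typing import Callable


class LatchupMonitor:
    """Power-cycle a device when its supply current exceeds a limit.

    Hypothetical sketch: a single-event latch-up shows up as a sudden,
    sustained jump in supply current, and removing power clears it.
    """

    def __init__(self, limit_amps: float, power_cycle: Callable[[], None]):
        self.limit_amps = limit_amps
        self.power_cycle = power_cycle
        self.trips = 0  # how many times the device has been reset

    def sample(self, amps: float) -> bool:
        """Process one current reading; return True if a reset was issued."""
        if amps > self.limit_amps:
            self.power_cycle()
            self.trips += 1
            return True
        return False


# Usage: feed readings to the monitor and record the resets it requests.
events = []
monitor = LatchupMonitor(limit_amps=1.5,
                         power_cycle=lambda: events.append("reset"))
readings = [0.8, 0.9, 2.3, 0.8]  # the third sample looks like a latch-up
outcomes = [monitor.sample(a) for a in readings]
```

    A real design would also need to debounce noisy readings and bound how quickly it retries, which is part of the "extra liabilities" the mitigation path carries.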
  • here [theregister.co.uk]
  • Just an addendum... according to the article at The Register (posted elsewhere), the fault was possibly due to the actual IBM Thinkpads used... so the implication that it's Windows is even related to this problem is probably wrong.
  • by Abcd1234 (188840) on Thursday April 26, 2001 @06:01AM (#264651) Homepage
    I'm no fan of Windows... frankly, I use Linux whenever I get the chance. And it's great that Slashdot is evangelical about my favorite OS. But that's no excuse for bad reporting. There's *nothing* in the CNN article (or any of the others, for that matter) implying that Windows is the reason for the server crash. Implying that it is related (with the little tagline "NASA is using Windows for most of their computing functions"... why add this, except to add sensationalism to the article?), is just bad, bad form. If any other publication did this, I'm sure people here would be complaining about poor journalism, bias, etc, etc, et al, ad nauseam. Frankly, I think that little line should be removed, and the post should be allowed to stand on its own. Please, don't put these little editorial comments into the stories. There's no need. All it does is damage Slashdot's (already shaky) credibility.
  • by Abcd1234 (188840) on Thursday April 26, 2001 @06:51AM (#264652) Homepage
    I totally agree. Slashdot posts stories with the author's opinion thrown in. However, an opinion is one thing... warping the facts, implying something that's not true... that's entirely another. The comment (and the title of the story) implies that Windows was the reason for the crash... however, not even NASA knows why the crash occurred. Now, if we'd had a confirmation that, yes, Windows caused the problem, and then we had a little MS bashing comment in the story, well, so be it. Or if the title of the story was "Severe server crash on ISS", and the comment was something like "I wonder if Windows had anything to do with it...", that'd be fine, too. But this isn't the case... the author tried to imply causation when there is no proof of it. That's irresponsible.

    Now, I've been around Slashdot for a long time, as well... like you, before the Andover buy-out. But that doesn't mean I'm not going to be objective. The author fscked up here. I'm not saying /. should praise M$... frankly, M$ has absolutely NOTHING to do with it. I simply think that Slashdot should try to report *true*, *accurate* stories. Is that so much to ask? A little journalistic integrity (I know, I know... naive... :)

  • by Tyrannosaurus (203173) on Thursday April 26, 2001 @06:40AM (#264658)
    One can only fear what happens when they upgrade to one of the new Microsoft lease-based licenses, so when their link goes down and they can't contact Microsoft's license server the entire space station shuts down :)

    The worst part is that whenever they upgrade a piece of hardware, they have to re-register with Microsoft. Since their comm is no longer working, they have to use Morse Code by blinking a flashlight out the window.

    ---

  • by micromoog (206608) on Thursday April 26, 2001 @06:06AM (#264660)
    Everyone seems to be jumping to the conclusion that this is somehow Microsoft's fault. Where's the article that even says the systems were running NT/2000? If that is known, is there anything stating that the problem was caused by an OS defect?

    I mean really, people. Sure, we've all had bad M$ experiences, but blame the NASA engineers for a poorly designed redundancy, and let them blame their supplier.

  • by mr.ska (208224) on Thursday April 26, 2001 @07:12AM (#264661) Homepage Journal
    Geez, would it kill CNN, or any other American news feed, to mention that the robotic arm is known as the "Canadarm2", because it was designed and built by Canadians? We may be 1/10th the size of The United States Of America, but for crying out loud, you're allowed to mention that OTHER nations are contributing to the station. Especially when the contribution is the feature that will allow the station to be built over the next 5 years!

    While they're at it, maybe add the fact that the Canadarm2 is the big brother of the Canadarm that each space shuttle has. Maybe that it has 2 "hands", one on each end, that will allow it to "inchworm" its way along the outside of the station. Perhaps mention that Canadian Chris Hadfield, the first Canadian spacewalker (as of this mission) is the one who installed the arm??

    You'd think every American news editor has a spark plug up their GI orifice that gives them a shock anytime they allow "Canada" to get into print. Sheesh.

    Mr. Ska

    I slit a sheet
    A sheet I slit

  • by Random Utinni (208410) on Thursday April 26, 2001 @09:34AM (#264662)
    Well, AFAIK, it's "Klaatu, Barata, N..." ergh. Necktie... Nickel... it's definitely an 'N' word.

    Hmmm... "Klaatu, Barata, N<cough>" There you go. Works like a charm... : )
  • would this be news (here) if it had been Linux...or BSD...or XFree that had crashed?
  • Putting "BSOD" in the title falls under this too. I concur this was bad form. An aside: most of the crashes I have seen in Windows are from non-Microsoft drivers. Iff the crash was in Windows, whose fault was it? If the driver that crashed was nasa.sys, then maybe their engineers accessed pageable mem at an elevated IRQL or something. No, I'm not trolling... I am quite serious.
  • by rabtech (223758) on Thursday April 26, 2001 @08:44AM (#264670) Homepage
    The password change is a well-known bug in the Novell client that they refuse to fix. Novell has suspended pretty much all work on their client software. Netware is dying, jump now while you can.

    Your HP situation highlights 99% of Windows 2000 BSODs: faulty drivers. If you only use HCL-approved hardware and signed drivers, you aren't going to get any BSODs, unless you have faulty hardware.

    I believe that the ISS is using NT4.0, in which case I'm not surprised. While somewhat stable, it pales in comparison to Windows 2000.
    -------
    -- russ

    "You want people to think logically? ACK! Turn in your UID, you traitor!"
  • by ec_hack (247907) on Thursday April 26, 2001 @06:34AM (#264675)
    The ISS computers that have been crashing (the MDMs) don't use Windows. The MDMs and other embedded computer systems are based on Intel 386 chips. If they have a kernel, it is probably VxWorks or other commercial RTOS. AFAIK, the only ISS computers that use Windows are some of the laptops, however, some use the Intel version of Solaris.

    Why 386 chips? Because they have been tested and been found to be relatively radiation tolerant. More current chips are likely to be subject to more radiation-induced faults due to smaller transistor size.

  • by imipak (254310) on Thursday April 26, 2001 @05:40AM (#264680) Journal
    It sounds really rather scary to me. Apart from the fact that three redundant computers going down at once just should NOT happen - if Endeavour hadn't happened to be docked, they'd have no voice/data uplink /at all/.

    As far as I can see, wouldn't that put the crew into a really hairy position? Without support from the ground, they'd have no way to know how to try diagnosing / fixing the problem. And if they couldn't get it going... well, perhaps they'd all just goof off for a while, like when the boss takes a day off sick ;) ... but wouldn't they have serious problems, say, preparing for the next shuttle or Soyuz docking?
    --
    If the good lord had meant me to live in Los Angeles

  • by cavemanf16 (303184) on Thursday April 26, 2001 @06:57AM (#264689) Homepage Journal
    ...I'm sure people here would be complaining about poor journalism, bias, etc...

    This isn't a journalism site, it's a bulletin board system. Jon Katz is the only one who really writes stories of his own, each time. Most of the rest of the stories are just links to other sites. So yes, that's why slashdot evangelizes about Linux 24/7 and bashes Microsoft. Sure, we all realize that NASA didn't just pick Windows to run space shuttle operations just cause it was easy to use. I'm sure plenty of considerations went into how well it would work versus other OS's. But it's still fun to discuss whether they made the best choice possible, which is what slashdot is so popular for. Discussion.

  • by Cranston Snord (314056) on Thursday April 26, 2001 @06:48AM (#264698) Homepage
    What really gets me is the following quote...

    The computers were running, but were unable to access data in their memory banks because of the downed server.

    Danger Will Robinson! Danger Will Robinson! Memory banks unreachable!
  • by JediTrainer (314273) on Thursday April 26, 2001 @08:43AM (#264699)
    NASA is using Windows for most of their computing functions,

    In that case forget it. I'm not setting foot on that death trap! I think I'd rather take my chances on Mir! Oh wait, too late....


    Personally, I'd still rather take my chances on Mir!
  • by Spamalamadingdong (323207) on Thursday April 26, 2001 @07:51AM (#264703) Homepage Journal
    As an orbiting object decreases speed, it falls in its orbital path.
    Which is correct as far as it goes (it only applies to single-impulse velocity changes). However, after losing speed the object falls into a lower orbit (it no longer has the velocity to maintain its original orbit), and the trade of potential energy for kinetic energy increases the orbital speed.

    Total energy/mass of an object in orbit is 1/2 v^2 - GM(earth)/r; you get a circular orbit when the kinetic energy is equal to half the (negative) potential energy, i.e. v = sqrt(GM(earth)/r). The total energy of an object in an orbit (as opposed to an escape trajectory) is always negative.
    --
    spam spam spam spam spam spam
    No one expects the Spammish Repetition!
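
    The energy formula above is easy to check numerically. This is a small sketch using the standard value of GM for Earth; the mean Earth radius and the two altitudes (400 km and 350 km) are assumptions picked only for illustration.

```python
import math

GM_EARTH = 3.986004418e14  # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6          # mean Earth radius in metres (assumed)


def circular_speed(r: float) -> float:
    """Speed of a circular orbit of radius r: v = sqrt(GM / r)."""
    return math.sqrt(GM_EARTH / r)


def specific_energy(v: float, r: float) -> float:
    """Total energy per unit mass: (1/2) v^2 - GM / r."""
    return 0.5 * v * v - GM_EARTH / r


r_high = R_EARTH + 400e3  # assumed starting altitude of 400 km
r_low = R_EARTH + 350e3   # assumed altitude after drag has done its work

v_high = circular_speed(r_high)
v_low = circular_speed(r_low)
e_high = specific_energy(v_high, r_high)
e_low = specific_energy(v_low, r_low)

print(f"400 km: v = {v_high:.0f} m/s, energy = {e_high:.3e} J/kg")
print(f"350 km: v = {v_low:.0f} m/s, energy = {e_low:.3e} J/kg")
```

    Both speeds come out at roughly 7.7 km/s, with the lower orbit the faster of the two even though its total energy is lower (more negative): drag removes energy, yet the object ends up moving faster.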

  • by pavonis (415389) on Thursday April 26, 2001 @08:31AM (#264706)

    For gods' sakes, someone with some karma mod this thing up. /.'s reaction to this story, in the complete absence of the relevant facts, was kind of distressing- so many instant Windoze bashers popping up, the usual modding-separating-wheat-from-chaff system failed completely. The only systems aboard ISS running Windows that I am aware of are some of the laptops, which are not the sole interfaces to any critical system, and servers for some relatively minor tasks, like e-mail I believe.

    I assume this choice was made for the sake of simplicity. I don't agree with running windows at all, but so far as I know they're being fairly sensible about it. Those referring to NASA decisions that 'everything would run windows', or massive M$ marketing campaigns, please provide some sort of reference if at all possible...

    Side note: there are other means of communication with the ground, even if Endeavour weren't parked there. They just switched to the shuttle as the simplest thing. If all else fails, amateur radio should always be usable...

    Repeat of question I posed in an earlier article: Apart from simple answers like 'More testing' and 'be more careful', do any of you have suggestions for how NASA's software might be made more robust? Of late software problems have caused more trouble than hardware, which seems odd.

  • by sllort (442574) on Thursday April 26, 2001 @06:14AM (#264707) Homepage Journal
    The link that specifically mentions Windows, for those of you wondering, is here [nasa.gov].

    Now what do you guys make of this?

    "Used the startup disk in the onboard software suite, but could not find a particular file while hunting around with DOS. This would have been much easier with some bootable media (CD-ROM?) that could run Windows. (Or if Shep was not indoctrinated by that "other" operating system). We may need an emergency boot capability again. After 5+ attempts, finally got the hard drive to take an image off the ghost CD. One of the Autoloader floppies went down, but SSC 2 is now running normally. ( 3+ hours troubleshooting). "

    Guesses? Bets?
