NASA Mars

NASA's Mars Helicopter Goes On 'Stressful' Wild Flight After Malfunction (theguardian.com) 49

A navigation timing error sent Nasa's Mars helicopter on a lurching ride, its first major problem since it took to the Martian skies last month. The Associated Press reports: The experimental helicopter, named Ingenuity, managed to land safely after the problem occurred, officials at the Jet Propulsion Laboratory said on Thursday. The trouble cropped up about a minute into the helicopter's sixth test flight on Saturday at an altitude of 10 meters (33ft). One of the numerous pictures taken by an onboard navigation camera did not register in the system, confusing the craft about its location. Ingenuity began tilting back and forth by as much as 20 degrees and suffered power consumption spikes, according to Havard Grip, the helicopter's chief pilot.

A built-in system to provide extra margin for stability "came to the rescue," he wrote in an online status update. The helicopter landed within five meters (16ft) of its intended touchdown site. Grip wrote: "Ingenuity muscled through the situation, and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways. While we did not intentionally plan such a stressful flight, Nasa now has flight data probing the outer reaches of the helicopter's performance envelope."

This discussion has been archived. No new comments can be posted.

  • by greytree ( 7124971 ) on Saturday May 29, 2021 @02:32AM (#61433416)

    It was nearly lost because of a single bad image FFS

    If I was paying $85 million for a drone, I would expect its software's response to a single bad image to have been tested. Repeatedly.

    Amid all the whooping, will anyone be asking the engineers to account for their failure to do proper testing?

    • Hard to test for something like this on Earth. I imagine this is the testing for future machines as this data is invaluable since we can't reproduce Mars environment easily.

    • by blahbooboo ( 839709 ) on Saturday May 29, 2021 @02:54AM (#61433444)

      It was nearly lost because of a single bad image FFS

      If I was paying $85 million for a drone, I would expect its software's response to a single bad image to have been tested. Repeatedly.

      Amid all the whooping, will anyone be asking the engineers to account for their failure to do proper testing?

      They did. Read the source blog for the article. https://mars.nasa.gov/technolo... [nasa.gov]

      • Re: (Score:3, Informative)

        by geekmux ( 1040042 )

        Amid all the whooping, will anyone be asking the engineers to account for their failure to do proper testing?

        They did. Read the source blog for the article. https://mars.nasa.gov/technolo... [nasa.gov]

        Yeah. I did.

        The answer is No. They didn't.

        "This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps. From this point on, each time the navigation algorithm performed a correction based on a navigation image, it was operating on the basis of incorrect information about when the image was taken."

        In other words, someone didn't test this, much less stress test it.

        • Re: (Score:2, Informative)

          by Anonymous Coward

          In other words, someone didn't test this, much less stress test it.

          From TFA:

          Despite encountering this anomaly, Ingenuity was able to maintain flight and land safely on the surface within approximately 16 feet (5 meters) of the intended landing location. One reason it was able to do so is the considerable effort that has gone into ensuring that the helicopter’s flight control system has ample “stability margin”: We designed Ingenuity to tolerate significant errors without becoming unstable, including errors in timing. This built-in margin was not fully nee

          • NASA does vastly more testing than the typical software dev doing continuous integration, yet that software dev seems to instantly jump on any failure. SHIP IT! Customers can't complain if we don't answer the phones!

        • by JoeRobe ( 207552 ) on Saturday May 29, 2021 @06:34AM (#61433708) Homepage

          Just my opinion, but given the complexity of the entire drone system and their concern that it might not fly at all, I'm not that bothered by a software bug at this level. I'm sure they tested the drone ad nauseam on Earth, but in reality it was a $0.085 billion optional piece of a $2.2 billion mission. So if NASA had to dedicate their efforts somewhere, I'm glad it was on the mission-critical rover and lander that appear to have worked well thus far.

        • In other words, someone didn't test this, much less stress test it.

          Even in the physical world of real stuff, problems often fail to manifest during testing. The problem just won't occur while a technician is examining, stress testing, etc. And the same definitely happens with software. As complexity increases the chance of some special combination of factors combining in ways which are ever more difficult to predict also increases.

          Would more testing have caught this bug? Maybe. Maybe not.

        • They are stress testing it. That is literally the mission they are on.

        • In other words, someone didn't test this, much less stress test it.

          From TFA: "One reason it was able to do so is the considerable effort that has gone into ensuring that the helicopter’s flight control system has ample “stability margin”"

          So your claim: Someone didn't test something.
          Reality: Not only did they know exactly what would happen but compensated for it by introducing a margin of error and the end result is that everything worked successfully without incident.

          Geekmux, pull your head out of your arse. You're not smarter than the people at NASA, you

        • The time-stamps should come from the camera itself, not a later process. This reduces the chance of bad time-stamps on images. Was the separation a resource or weight trade-off?

    • by Amiga Trombone ( 592952 ) on Saturday May 29, 2021 @04:51AM (#61433616)

      They tested it well enough that it completed all of the objectives it was scheduled to perform. Everything at this point is just gravy. Remember, this is a test. It was uncertain whether it would fly at all.

      • They tested it well enough that it completed all of the objectives it was scheduled to perform.

        This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps. From this point on, each time the navigation algorithm performed a correction based on a navigation image, it was operating on the basis of incorrect information about when the image was taken.

          Crashing is not a planned objective. And unless this was a known problem/limitation, they didn't test this, much less stress test it. We got lucky that backup systems kicked in and salvaged the hardware. Kudos to them for redundancy at least.

        • Crashing is not a planned objective.

          Wow, so this is the state of computer science in 2021. Inclusion in one set does not imply anything about what other sets contain. For example: all the objectives can be completed, but at the same time, there can be things that happen that are not in the set of objectives.

          We got lucky that backup systems kicked in

          they didn't test this, much less stress test it.

          But they did test it. The test went like this: "what happens when the sensor data doesn't match the image data, make sure the backup systems kick in".

        • by caseih ( 160668 )

          Too funny. With your perfect 20/20 hindsight you know all the failure paths and can get perfect test coverage. But you're dead wrong of course. All of the mission objectives were met. The new mission objective is to see what else it can do. They are to push the envelope more and more, and it might just crash. That's acceptable and will be very informative if it happens. However it appears that the copter will be flying for some months yet. Likely it may never crash but just eventually be unable to char

    • by Dixie_Flatline ( 5077 ) <vincent@jan@goh.gmail@com> on Saturday May 29, 2021 @06:48AM (#61433736) Homepage

      This is the test. This whole situation is the test. They weren't even sure Ingenuity would take off the first time. They're working on a 7 minute light delay on a planet with nearly no atmosphere and a fraction of Earth's gravity and they built in enough redundancy to recover from what would otherwise be a fatal error.

      As anyone that's actually shipped a product will tell you, you can test all you want, but actually having the product in the wild is where you really find out all the mistakes you made. Being able to build a failsafe that worked that well is a bigger triumph than the navigation error was a failure.

    • Testing is the reason why it is there. You realize your notion of "proper testing", is a halting problem, right?

      https://en.wikipedia.org/wiki/Halting_problem

    • $85M sounds like a lot but it goes very quickly. Remember, it's not $85M for software but for the entire project. It's developing a new type of aircraft to operate in a new environment. Testing is very difficult because there is no practical way to simulate the low gravity. Apple spends something like $15B/year on engineering, or 200X the cost of the helicopter. If you divide that among their new products each year, it's clear that developing modern hardware is very expensive.
    • This glitch caused a single image to be lost, but more importantly, it resulted in all later navigation images being delivered with inaccurate timestamps.

      It was not a single image being lost. All the timestamps after that point were off by 33.3333 milliseconds.

      • by tap ( 18562 )

        It was not a single image being lost. All the timestamps after that point were off by 33.3333 milliseconds.

        That seemed like the real design flaw to me. Since the image timestamp is critical, it should not be computed relative to previous timestamps. Done that way, any large error will persist forever, as happened here. But any small errors will also accumulate and eventually become large errors. They should have found a way to generate an absolute timestamp for each image that is independent of any previous timestamps.
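
A minimal sketch of the point above, assuming a 30 Hz camera (which matches the 33.3333 ms offset quoted earlier in the thread). It illustrates the general failure mode only and is not Ingenuity's actual code; all function and variable names are hypothetical. One frame is dropped, and the two stamping schemes are compared against the true capture times.

```python
FRAME_PERIOD_S = 1.0 / 30.0  # assumed 30 Hz frame rate, about 33.33 ms per frame


def relative_timestamps(capture_times, start_time):
    """Stamp the k-th *delivered* frame as start + k * period.

    This counts frames instead of reading a clock, so one lost frame
    shifts every later stamp by a full period.
    """
    return [start_time + k * FRAME_PERIOD_S for k in range(len(capture_times))]


def absolute_timestamps(capture_times):
    """Stamp each frame with the clock value read at capture time."""
    return list(capture_times)


def max_error(stamps, capture_times):
    """Largest gap between an assigned stamp and the true capture time."""
    return max(abs(s - t) for s, t in zip(stamps, capture_times))


# Ten frames are captured, but frame 4 never reaches the navigation code.
true_times = [k * FRAME_PERIOD_S for k in range(10)]
delivered = [t for k, t in enumerate(true_times) if k != 4]

rel_err = max_error(relative_timestamps(delivered, 0.0), delivered)
abs_err = max_error(absolute_timestamps(delivered), delivered)

print(f"relative scheme, worst error: {rel_err * 1000:.2f} ms")
print(f"absolute scheme, worst error: {abs_err * 1000:.2f} ms")
```

In the relative scheme the error appears at the dropped frame and never goes away; in the absolute scheme a dropped frame costs one sample and nothing more.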

        There's a similar problem maintaining A/V sync in video playback. A naive app

        • by tlhIngan ( 30335 )

          The problem is Mars' atmosphere is around 2% as thick as Earth's. So the rotors have to spin very fast while carrying very little weight, most of which is consumed by the motor and battery. This limits what you have for a processor, and it's likely they made the choices they did based on the limitations.

          After all, there are good practices, and perhaps they realized if they did it properly, it would overwhelm the processor and the chances of having a bad image are limited so shortcuts may be taken to achieve

          • by tap ( 18562 )

            It's a snapdragon 801 running Linux. There's zero reason they couldn't timestamp their images properly. It comes down to "didn't know how" or "bad design".

    • It was nearly lost because of a single bad image FFS ... Amid all the whooping, will anyone be asking the engineers to account for their failure to do proper testing?

      Don't know if it was a lack of testing or that their specs are simply too tight.

      ... and while the flight uncovered a timing vulnerability that will now have to be addressed, it also confirmed the robustness of the system in multiple ways.

      From what I've seen on NOVA (etc) they definitely have people on their teams who think "outside the box", but perhaps they need people who think outside different boxes than their regular hires.

    • by caseih ( 160668 )

      You do realize that Ingenuity is a technology demonstrator, right? If it had failed after the first flight they'd have still learned a great deal and got their money's worth. In fact no one, even the project engineers, were even sure it would fly at all. There were many unknowns. Would it survive the cold nights? Would there be enough energy to heat and also to fly? How long would it take to charge the batteries? How would the batteries perform? None of these questions could be answered by testing on e

      • So, do you have something relevant to say?

        I repeat my statement:

        "It was nearly lost because of a single bad image FFS
        If I was paying $85 million for a drone, I would expect its software's response to a single bad image to have been tested. Repeatedly."

        Try again.

    • by N1AK ( 864906 )
      It wasn't a bad image that caused the issue, it was incorrect timestamps on all images from a certain point in the flight (due to an image not registering); when it happened the other controls on the device brought it back down safely. Always cracks me up how people who've achieved nothing of any interest to anyone else in their life feel the need to try and twist events to demean people who have; like getting a fricking aircraft flying successfully on Mars.
      • Always cracks me up how people who've achieved nothing of any interest to anyone else in their life think that because clever people have done something, no-one anywhere is allowed to point out obvious faults in their methodology, because "these people are clever, so can do no wrong".

        "It wasn't a bad image that caused the issue, it was incorrect timestamps on all images from a certain point in the flight (due to an image not registering)"

        So you claim it wasn't a bad image, but it was an image that badly reg

    • It is not rocket science to make a list of the things your craft's flight depends on:
      Its calculated location, the generation from that location of instructions for the control system, and the transmission of those instructions to the control system.

      It is not rocket science to make from that a list of the inputs those things depend on:
      The first, we are told, depends on the images and their timestamps.

      It is not rocket science to make from that a list of things to test:

  • by nospam007 ( 722110 ) * on Saturday May 29, 2021 @05:28AM (#61433650)

    "confusing the craft about its location."

    A female one would just have asked for directions.

  • I wonder if people are missing the fact that they were able to update the software to fix the problem. I don't know about you, but I wouldn't want to be the guy in charge of a 208 million mile firmware update.
    • I wonder if people are missing the fact that they were able to update the software to fix the problem. I don't know about you, but I wouldn't want to be the guy in charge of a 208 million mile firmware update.

      If they are clever and also have space then their firmware loader has two separate boot images on two separate memory devices, and their main storage has two OS images which can be updated independently. This takes much of the pucker factor out of the equation...
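
A rough sketch of the two-image idea described above, under the assumption that each image carries a validity flag (for example, a checksum result) and a boot-attempt counter. Nothing here reflects how Ingenuity's loader actually works; `Slot`, `MAX_ATTEMPTS`, and `choose_slot` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    name: str
    valid: bool              # hypothetical: True if the image passed its integrity check
    boot_attempts: int = 0   # boots tried since the slot was last marked good


MAX_ATTEMPTS = 3  # hypothetical limit before a slot is considered bad


def choose_slot(primary: Slot, fallback: Slot) -> Optional[Slot]:
    """Pick the first slot that is valid and has attempts left.

    The attempt counter is bumped before booting, so an image that keeps
    crashing during startup eventually gets skipped in favour of the other one.
    """
    for slot in (primary, fallback):
        if slot.valid and slot.boot_attempts < MAX_ATTEMPTS:
            slot.boot_attempts += 1
            return slot
    return None  # neither image is usable


# A freshly uploaded image failed its check, so the known-good one is used.
new_image = Slot("B (new upload)", valid=False)
known_good = Slot("A (known good)", valid=True)
chosen = choose_slot(new_image, known_good)
print(chosen.name if chosen else "no bootable image")
```

The update is written only to the inactive slot, so a bad upload leaves the known-good image untouched.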

      • Very likely they have a backup image that provides enough support to load another new image. The big risk is when they fly it with the new firmware since there is no "undo" from crashing.
        • by caseih ( 160668 ) on Saturday May 29, 2021 @07:16PM (#61435384)

          Correct. Tim Canham says they not only have a way of flashing the flight controller, they can also access a shell on the embedded Linux box, which can be used to stage updates to the controller, although at 115200 baud it takes a while to get the file from the rover to the copter, and communications with Ingenuity are low priority right now. There's a lot of imagery and data that will yet slowly trickle in as they are permitted to use the bandwidth. A variety of watchdog timers can reboot the OS if needed as well. What's so interesting is the copter is built out of mostly off-the-shelf, non-hardened bits. It's hilarious to think about a couple of slashdotter troll posts about software testing when the copter is using off-the-shelf batteries that were never really designed for -200 nights (they are heated), and they are using off-the-shelf bits from Sparkfun for their flight controller, battery charge circuit, servos, etc. Odds are these low-cost components will fail long before some software bug ends the mission.

  • You're supposed to do a software FMEA and look at the logic. If this variable, or that one, corrupts, what happens and how does the system auto correct?

    "What if the update to nav data stops or is delayed?" "Dunno." Really?
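
One way to make the question above concrete is a staleness check on the navigation update, sketched here under assumed values. The 0.1 s age limit and the function name are hypothetical and not taken from any NASA software.

```python
def nav_update_status(last_update_s: float, now_s: float, max_age_s: float = 0.1) -> str:
    """Classify the most recent navigation update by its age.

    Returns "fresh", "stale" (update stopped or delayed), or
    "clock_error" (the update claims to come from the future).
    """
    age = now_s - last_update_s
    if age < 0:
        return "clock_error"
    if age > max_age_s:
        return "stale"
    return "fresh"


# Each failure-mode question becomes an explicit case with a defined outcome.
print(nav_update_status(last_update_s=1.00, now_s=1.05))  # fresh
print(nav_update_status(last_update_s=1.00, now_s=1.50))  # stale
print(nav_update_status(last_update_s=2.00, now_s=1.00))  # clock_error
```

What the system does in the "stale" and "clock_error" cases (hold attitude, fall back to inertial data, land) is the part an FMEA is meant to pin down.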

  • by stikves ( 127823 ) on Saturday May 29, 2021 @06:36PM (#61435336) Homepage

    So one system failed, another (landing) corrected that failure.

    Don't go blaming the NASA engineers. They used an experimental side project to test flight on another planet, and even this is a great achievement. And they used "modern" (i.e., not space-tested) hardware which can actually be found in a cheap phone (Qualcomm Snapdragon 801). Bugs are okay; in-flight correction is much better.

    How many of you can handle *every possible* scenario?
    If you said you can, I should talk to your "theory of computation" professor for not being able to teach you that that is literally not possible.

    • by N1AK ( 864906 )

      How many of you can handle *every possible* scenario?

      Let's face it: The vast majority of people likely to criticise the people responsible for this on here wouldn't be trusted writing code to warn a driver that their car's washer fluid is low. The fact that they "think" they could do something means jack.
