


Computer Science Tools Flood Astronomers With Data

purkinje writes "Astronomy is getting a major data-gathering boost from computer science, as new tools like real-time telescopic observations and digital sky surveys provide astronomers with an unprecedented amount of information — the Large Synoptic Survey Telescope, for instance, generates 30 terabytes of data each night. Using informatics and other data-crunching approaches, astronomers — with the help of computer science — may be able to get at some of the biggest, as-yet-unanswerable cosmological questions."
This discussion has been archived. No new comments can be posted.

  • I gots to nitpick (Score:1, Insightful)

    by Anonymous Coward

    methinks it's the sensors that are doing the flooding. Not "computer science tools"

    • The sensors wouldn't be picking up anything interesting if they weren't automatically being pointed at interesting things. There aren't enough astronomers to do the pointing manually.
      • by mwvdlee ( 775178 )

        Yeah, but the "computer science tools" wouldn't know what the interesting things were unless the astronomers told them. So basically this is just the astronomers flooding themselves.

  • by Anonymous Coward

    But I think "tools" is a bit offensive. They're trying to help the astronomers in a meaningful way.

  • My biggest issue would be if there is too much information. What if the scientists are using the wrong search queries and missing something important? Or maybe something important is just buried on page 931 of a 2,000 page data report. Still, it's better than the opposite problem, of just not having the data to search.
    • Re:too much? (Score:5, Insightful)

      by SoCalChris ( 573049 ) on Tuesday July 19, 2011 @10:14PM (#36818658) Journal

      There's no such thing as too much data in a case like this, assuming that they can store it all. Even if it's too much to parse now, it won't be in a few years. Get as much data as we can now, while there's funding for it.

      • Disk I/O and the ability to back up that data can be a bitch. Especially if the delta changes overlap within a 24-hour period. Of course, there are ways of addressing this problem with multiple servers, but that comes at a financial cost. Also, SAN and DAS technology still lags behind in I/O compared to the explosive growth in storage capacity.

        Personally, I have clients that deal with 30+ TB worth of science data. Data retention is a major headache for me because as of four years ago, they only needed 2TB of

      • by mwvdlee ( 775178 )

        30TB a night, for a single telescope. The cost of storing such amounts of data would be astronomical *wink*.

    • What if the scientists are using the wrong search queries and missing something important? Or maybe something important is just buried on page 931 of a 2,000 page data report?

      Which is pretty much the same problem astronomy has had since roughly forever... Looking in the wrong place. Looking at the wrong time. Looking in the wrong wavelength. Looking for the wrong search terms. Looking on the wrong page... It's all pretty much the same.

      The sky and the data will be there tomorrow and they'll try ag

      • For a lot of things... but obviously not all. But the odds against catching that one transient that will help you are astronomical as well.

  • True in all fields (Score:4, Interesting)

    by eparker05 ( 1738842 ) on Tuesday July 19, 2011 @09:38PM (#36818424)

    Many sciences are experiencing this trend. A branch of biochemistry known as metabolomics is a growing field right now (in which I happen to be participating). Using tools like liquid chromatography coupled to mass spectrometry we can get hundreds of megabytes of data per hour. Even worse is the fact that a large percentage of that data is explicitly relevant to a metabolomic profile. The only practical way of analyzing all of this information is through computational analysis, either through statistical techniques used to condense and compare the data, or through searches on painstakingly generated metabolomic libraries.

    That is just my corner of the world, but I imagine that much of the low-hanging fruit of scientific endeavor has already been picked. Going forward, I believe that the largest innovations will come from the people willing to tackle data sets that a generation ago would be seen as insurmountable.

    • Many sciences are experiencing this trend.

      Yes, the piracy sciences have been particularly hard hit. Modern piracy engineering can easily generate the equivalent of 10 blu-rays, or 500 gigabytes, per day. Modern data reduction tools such as x264 have been developed to deal with this data overload, and can frequently reduce a 50GB bluray by more than 10:1 down to 8GB or less without a significant loss of information in the processed data.

    • We got Bioinformatics, now what would this field be called? Astroinformatics? The Square Kilometre Array project is another example of this.
    • Hm, small world--I'm also in metabolomics (more on the computational end than the biological side of things, what I like to call computational metabolomics). I was going to write a post similar to your own, but more generalized for those who aren't familiar with the biology behind it. The issue now is that well established informatics/statistical/computer science approaches are used as general tools in biology/astronomy/biochemistry, and there is a great need to formulate novel algorithms to take advantage
    • by mwvdlee ( 775178 )

      I download Linux distro torrents faster than "hundreds of megabytes per hour".
      At that speed, a full day's worth of data is only a few GB, or roughly 10,000 times less than discussed in TFA.
      Still, analysing even a few GB of data a day is no task for mere men.
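The comparison above can be checked with a few lines of arithmetic. This is only a rough sketch: "hundreds of megabytes per hour" is assumed here to mean 200 MB/hour, the instrument is assumed to record around the clock, and decimal units are used throughout. None of those choices come from the thread.

```python
# Compare the two data rates quoted in the thread, using decimal units.
MB = 10**6
GB = 10**9
TB = 10**12


def bytes_per_day(rate_per_hour_bytes: int) -> int:
    """Daily volume for an instrument that records 24 hours a day."""
    return rate_per_hour_bytes * 24


# Assumption: "hundreds of megabytes per hour" taken as 200 MB/hour.
lcms_daily = bytes_per_day(200 * MB)

# From the story summary: 30 TB per night for the LSST.
lsst_nightly = 30 * TB

ratio = lsst_nightly / lcms_daily
print(f"LC-MS: {lcms_daily / GB:.1f} GB/day")
print(f"LSST produces roughly {ratio:,.0f}x more per day")
```

Under these assumptions the gap comes out in the thousands, the same order of magnitude as the "roughly 10,000" figure; a lower hourly rate pushes it closer to that number.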

      • by Xest ( 935314 )

        "Still, analysing even a few GB of data a day is no task for mere men."

        Unless it's a word document or power point presentation in which someone has embedded an uncompressed video or bunch of uncompressed images. Then you can get through it in about 5 minutes flat, not counting the half hour it takes Word/Powerpoint to load.

        No, in all seriousness though, it really depends what the data is. That's why I'm not keen on this arbitrary "many gigabytes of data" metric which articles like this are supposed to wow u

  • I'm not an expert in Astronomy, but in general, I don't think you can collect too much data, as long as it's stored in an at least somewhat intelligible format. This way, even if professional astronomers miss something today, amateurs and/or future astronomers will have tons of data to pick apart and scavenge tomorrow.

    Plus, more data should make it easier to test hypotheses with more certainty. Hopefully, the data will be made publicly available after the gatherers have had a shot or two at it.

  • 30TB per day works out to about 10 petabytes per year. If you compare this to the total amount of data produced in a year (from all human sources), around a zettabyte, it's not that huge. In fact, IIRC, the yearly transfer rate of the internet is around 250 exabytes. The people with the really hard job of data processing are internet search engines. Not only do they have to go through several orders of magnitude more data, they have to do it faster, and with much less clearly defined queries.
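For anyone who wants to check the yearly figure, here is a small sketch. It assumes the telescope observes every single night of the year, which is an upper bound (weather and maintenance would lower it), and it takes the zettabyte and 250-exabyte figures from the comment at face value rather than verifying them.

```python
# Decimal (SI) units.
TB, PB, EB, ZB = 10**12, 10**15, 10**18, 10**21

nightly = 30 * TB

# Assumption: observing all 365 nights, so this is an upper bound.
yearly = nightly * 365

print(f"{yearly / PB:.2f} PB per year")
print(f"{yearly / ZB:.1e} of one zettabyte")
print(f"{yearly / (250 * EB):.1e} of 250 exabytes")
```

The first line prints just under 11 PB, so "about 10 petabytes" is a fair round number, and the fractions show the survey sitting several orders of magnitude below either comparison figure.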

    I sometimes wonder

    • Sounds like another task for IBM's Watson. The way I understand the problem, most scientists must be in cahoots with skilled CS folk to generate the types of answers from such large datasets, or they must be half CS folk themselves in order to traverse such scales of data. Quite an undertaking when professionals should be focused in one area, let alone conveying the ideas of either field to the other the way they themselves see/understand it. However the dawn of asking Watson or Enterprise to figure something
  • For some reason, that word scares me [marketinginformatics.com]..

  • by oneiros27 ( 46144 ) on Tuesday July 19, 2011 @10:24PM (#36818720) Homepage

    *WILL* generate. LSST isn't operating yet.

    And yes, 30TB is a lot of data now, but we have some time before they finally have first light.

    Operations isn't supposed to start 'til 2019 : http://www.lsst.org/lsst/science/timeline [lsst.org]

    We just need network and disk drive sizes to keep doubling at the rate they have, and we'll be laughing about how we thought 30TB/night was going to be a problem.

    SDO finally launched last year with a data rate of over 1TB/day ... and all through planning, people were complaining about the data rates ... it's a lot, but it's not insurmountable as it might've been 8 years ago, when we were looking at 80 to 120GB disks.

    Although, it'd be nice if monitor resolutions had kept growing ... if anything, they've gotten worse the last couple of years.

    (Disclaimer : I work in science informatics; I've run into Kirk Bourne at a lot of meetings, and we used to work in the same building, but we deal with different science disciplines)

    • by Carnivore ( 103106 ) on Tuesday July 19, 2011 @11:09PM (#36818990)

      In fact, they just started blasting the site. I actually live next door to the LSST's architect, which is pretty cool.

      Astronomers generate a tremendous amount of data, bested only by particle physicists. Storing it all is a challenge, to put it mildly. Backup is basically impossible.
      The real problem is that the data lines that go from the summit to the outside world are still not fast. The summits here are pretty remote and even when you get to a major road, it's still in farm country. And then getting it out of the country is tough--all of our network traffic to North America hits a major bottleneck in Panama, so if you're trying to mirror the database or access the one in Chile, it can be frustratingly slow.
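To put the network problem in numbers, here is a rough sketch of the sustained link speed needed to ship one night of data before the next night arrives. The 30 TB figure is from the story summary; the 1 Gbit/s link is a made-up example, not a statement about the actual connection from the summit, and protocol overhead is ignored.

```python
TB = 10**12  # decimal terabyte


def required_gbps(volume_bytes: float, window_hours: float) -> float:
    """Average link speed in Gbit/s needed to move a volume within a time window."""
    bits = volume_bytes * 8
    seconds = window_hours * 3600
    return bits / seconds / 1e9


def transfer_hours(volume_bytes: float, link_gbps: float) -> float:
    """Hours needed to move a volume over a link of the given speed."""
    return volume_bytes * 8 / (link_gbps * 1e9) / 3600


nightly = 30 * TB  # figure quoted in the story summary

print(f"{required_gbps(nightly, 24):.2f} Gbit/s sustained to keep up over 24 hours")

# Hypothetical 1 Gbit/s link, chosen only for illustration.
print(f"{transfer_hours(nightly, 1.0):.1f} hours to move one night at 1 Gbit/s")
```

On the hypothetical 1 Gbit/s link, one night of data takes longer than a day to move, so the backlog would grow every night.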

      • by dkf ( 304284 )

        Astronomers generate a tremendous amount of data, bested only by particle physicists.

        Earth scientists will merrily generate far more — they're purely limited by what they can store and process, since deploying more sensors is always possible — but they're mostly industrially funded, so physicists and astronomers pretend to not notice.

        • by csrster ( 861411 )
          Theoreticians surely generate most because they're only limited by how far a CPU can churn out floating-point numbers.
        • by mwvdlee ( 775178 )

          Deploying more telescopes is always possible as well.
          This isn't a race about who can fill up storage space the quickest.

      • At least you aren't at Dome A. You might have to use some tropospheric (to not pay outrageous SAT usage rates).

    • by Shag ( 3737 )

      *WILL* generate. LSST isn't operating yet.

      This, unless they have a time machine. ;)

      The first Pan-STARRS scope with its 1.3-gigapixel camera has been doing science for a little while now, and I think it might do something like 2.5TB a night. That's still a lot of disk (and keep in mind that they originally planned to have 4 of those scopes), but I think their pipeline reduces it all to coordinates for each bright thingy in the frame and then throws away the actual image (though I could be wrong).

      Where I work, our highest-resolution toy is 80 megap

  • Ok, I know this doesn't solve the problem of actually ANALYZING the data, but for storing and moving the data around, what's the best compression algorithm for astronomical (I mean the discipline, not the size!) data?

    I used to work for a company that developed a really good compression algorithm using wavelets. At the time it was the only one to be accepted by A-list movie directors (the people with the real power in Hollywood); they refused to go with any of the JPEG or MPEG variants (this was before JPEG

    • by dargaud ( 518470 )

      Do they "clean up" the images first to make it easier to compress?

      Normally they don't. Compression algorithms, almost by definition, create artifacts that are difficult if not impossible to distinguish from potentially interesting data. So science imagery is almost always saved in 'raw' format, unless you have no other option like with your Galileo example. Imagine applying a dead pixel detection to an astronomy image: 'poof!', all the stars magically disappear!

      • by mwvdlee ( 775178 )

        Not all compression algorithms are lossy, though the lossless ones aren't nearly as space-efficient.
        But some form of lossy compression might work too; it would be easy to filter the images so, for instance, any "nearly-black" pixel is set to black. Add some RLE and you have compression.
        The key to lossy compression is having a way to determine what type of data isn't as important and approximating that data.
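A minimal sketch of the "clip nearly-black pixels, then RLE" idea, just to make it concrete. The function names, the threshold, and the sample scanline are all invented for illustration; this is not how any real observatory pipeline is known to work.

```python
from typing import List, Tuple


def clip_dark(pixels: List[int], threshold: int) -> List[int]:
    """Lossy step: set every pixel at or below `threshold` to 0 (black)."""
    return [0 if p <= threshold else p for p in pixels]


def rle_encode(pixels: List[int]) -> List[Tuple[int, int]]:
    """Lossless step: collapse runs of equal values into (value, count) pairs."""
    runs: List[Tuple[int, int]] = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1] = (p, runs[-1][1] + 1)
        else:
            runs.append((p, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, count) pairs back into a flat list of pixels."""
    out: List[int] = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# One made-up scanline: noisy dark sky with a single bright star.
row = [2, 1, 3, 0, 2, 200, 1, 2, 0, 1]

clipped = clip_dark(row, threshold=3)
encoded = rle_encode(clipped)

print(encoded)
print(len(rle_encode(row)), "runs without clipping,", len(encoded), "with clipping")
```

Without the clipping step the noisy sky has no repeated values, so RLE alone gains nothing; all of the savings come from the lossy step. That also means anything fainter than the threshold is gone for good.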

        • by dargaud ( 518470 )

          The key to lossy compression is having a way to determine what type of data isn't as important and approximating that data.

          The problem with research is that until you've looked, you don't know what you are looking for...
