Getting Students To Think At Internet Scale

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think in terms of Internet scale: petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."

  • by razvan784 ( 1389375 ) on Tuesday October 13, 2009 @05:19AM (#29729651)
    Science has always been about extracting knowledge from thoughtfully-generated and -processed data. Managing enormous datasets is not science per se, it's computer engineering. It's useless to say 'hey I'm processing 30 TB' if you're processing them wrong. Scientific method and principles are what count, and they don't change.
  • by Rosco P. Coltrane ( 209368 ) on Tuesday October 13, 2009 @05:48AM (#29729777)

    They just need to think. That's what they study for (ideally). Thinking people with open minds can tackle anything, including the "scale of the internet".

    When I was in high school, I used a slide rule. When I entered university, I got me a calculator. Did maths or problem solving abilities change or improve because of the calculator? No. Students today can jolly well learn about networking on small LANs, or learn to manage small datasets on aging university computers. So long as what they learn is good, they'll be able to transpose their knowledge to a vaster scale, or invent the next Big Thing. I don't see the problem.

  • Wrong (Score:3, Insightful)

    by Hognoxious ( 631665 ) on Tuesday October 13, 2009 @05:54AM (#29729819) Homepage Journal
    Summary uses data and information as if they are synonyms. They are not.
  • by Trepidity ( 597 ) <delirium-slashdot.hackish@org> on Tuesday October 13, 2009 @05:55AM (#29729821)

    I agree, and don't think it's anywhere near the science/CS-education bottleneck either. It's true that it can be useful to work with some non-trivial data even in relatively early education: sifting through a few thousand records for patterns, testing hypotheses on them, etc., can lead to a way of thinking about problems that is hard to get if you're working only with toy examples of 5 data points or something. But I think there's very little of core science education that needs to be done at "internet-scale". If we had a generation of students who solidly grasped the foundations of the scientific method, of computing, of statistics, of data-processing, etc., but their only flaw was that they were used to processing data on the order of a few megabytes, and needed to learn how to scale up, well, that'd be a good problem for us to have.

    Apart from very specific knowledge, like actually studying scaling properties of algorithms to very-large data sets, I don't see much core science education even benefiting from huge data sets. If your focus in a class isn't on scalability of algorithms, but on something else, is there any reason to make students deal with an unwieldy 30 TB of data? Even "real" scientists often do their exploratory work on a subset of the full data set.
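
One way to picture the "exploratory work on a subset" point above is reservoir sampling, which draws a uniform random sample from a stream too large to hold in memory. This is only a minimal sketch: the function name `reservoir_sample`, the fixed seed, and the stand-in data source `range(1_000_000)` are illustrative assumptions, not anything described in the thread.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    The stream is read once and only k records are kept in memory,
    so the full data set never has to fit on the machine doing the sampling.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


# Hypothetical stand-in for a large data set: one million integer "records".
subset = reservoir_sample(range(1_000_000), k=1000)
```

The sample can then be explored with ordinary tools on a modest machine, and only the analyses that survive need to be rerun on the full data.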

  • Huge Misstatement (Score:4, Insightful)

    by Jane Q. Public ( 1010737 ) on Tuesday October 13, 2009 @06:11AM (#29729875)
    "Science these days has basically turned into a data-management problem," says Jimmy Lin.

    This is about the grossest misstatement of the issue that I could imagine. Science is not a data-management problem at all. But it does, and will, most certainly, depend on data management. They are two very different things, no matter how closely they must work together.
  • by ghostlibrary ( 450718 ) on Tuesday October 13, 2009 @06:12AM (#29729881) Homepage Journal

    I wrote up some notes from a NASA lunch meeting on this, titled (not too originally, I admit) 'The Petabyte Problem'. It's at []. It's not just a question of thinking on the 'Internet scale', but about massive data handling in general.

    What makes it different from previous eras (where MB was big, where GB was big) is that, before, the storage was expensive, yes, but bandwidth wasn't as much of a problem for transmission, even if only locally. You could store MBs or GBs on tape, ship it, and extract the data rapidly-- bus and LAN speeds were high. Now, with PB, there's so much data that even if you ship a rack of TB drives and hook it up locally, you can't run a program on it in reasonable time. Particularly for browsing or inquiries.

    So we're having to rely much more on metadata or abstractions to sort out which data we can then process further.
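
A back-of-the-envelope calculation makes the scan-time argument above concrete. This is a rough sketch under stated assumptions: the 100 MB/s sequential read rate per drive, the decimal units, and the 0.1% metadata selection are hypothetical figures chosen for illustration, not numbers from the comment.

```python
PB = 10**15  # decimal petabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

# Assumed sequential read rate per drive; a hypothetical figure, not from the thread.
ASSUMED_DRIVE_RATE = 100 * MB


def scan_time_seconds(total_bytes, bytes_per_second, parallel_readers=1):
    """Seconds to read total_bytes once, assuming perfectly parallel readers."""
    return total_bytes / (bytes_per_second * parallel_readers)


# Full scan of 1 PB through a single drive, then through 1000 drives in parallel.
one_drive = scan_time_seconds(1 * PB, ASSUMED_DRIVE_RATE)
thousand_drives = scan_time_seconds(1 * PB, ASSUMED_DRIVE_RATE, parallel_readers=1000)

# If metadata narrows the query to 0.1% of the data (assumed), one drive suffices.
metadata_filtered = scan_time_seconds(PB // 1000, ASSUMED_DRIVE_RATE)

print(f"one drive, full scan:       {one_drive / 86_400:.0f} days")
print(f"1000 drives, full scan:     {thousand_drives / 3_600:.1f} hours")
print(f"one drive, 0.1% of the data: {metadata_filtered / 3_600:.1f} hours")
```

Under these assumptions a single pass over a petabyte takes roughly 116 days on one drive and about 2.8 hours on a thousand, which is why selecting data through metadata first matters so much for browsing-style queries.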

  • by adamchou ( 993073 ) on Tuesday October 13, 2009 @07:17AM (#29730149)
    A LOT of research has been put into improving algorithms for working on large scales. By not teaching our youth all that we have learned in school, they are just going to have to figure it out themselves and continue to reinvent the wheel. How are we supposed to advance if we don't put them in a situation to learn and apply our newfound knowledge?
  • by Interoperable ( 1651953 ) on Tuesday October 13, 2009 @07:37AM (#29730247)

    Yeah no kidding. I don't know if maybe that quote ('Science these days has basically turned into a data-management problem') was taken out of context, but I'm surprised a professor would say something that ignorant. I recently did a Master's in physics and it certainly didn't involve huge quantities of data; I ended up transferring much of my data off a spectrum analyzer with a floppy drive. (When we lost the GPIB transfer script I thought it would take too long to learn the HP libraries to rewrite it. That was a mistake: after 4 hours of shoving floppies in the drive, I sat down and wrote a script in 2 hours. Ah well.)

    But the point is, a 400 data point trace may be exactly what you need to get the information you're looking for. Just because we can collect and process huge quantities of data doesn't mean that all science requires you to do so, nor is simply handling the data the critical part of analyzing it.

  • by FlyingBishop ( 1293238 ) on Tuesday October 13, 2009 @07:49AM (#29730289)

    It's also useless to say 'hey I'm analyzing this graph' if you're analyzing it wrong. I think you're missing the big picture. It's incredibly naive to think that the fundamental laws are simple enough to be grasped without massive datasets. It is possible, but all the data gathered thus far suggests that the fundamental laws of nature will not be found by someone staring at an equation on a whiteboard until it clicks. That is why CERN's data capacity is measured in terabytes, and they want to grow it as much as possible. That's why we have so much genetic data.

    Scientific method and principles count, but they are not enough.

  • by WarpedMind ( 151632 ) on Tuesday October 13, 2009 @09:51AM (#29731203)

    I'm afraid you are limited by a short time horizon. I remember working and computing on systems where 100MB was just as difficult and expensive to deal with as 100PB is today. 2MB was the amount of mountable storage on small systems. Anything larger and you had to go to "big iron".

    Real work was done on those small systems and good scientific principles and methods were the key then and the key now.

    Just remember that the "laptop" 10 years from now will have over 8TB local SSD.

    I operate an archive for the university. 10 years ago when we started it, a 10MB file was considered a pretty big file. Today it is the smallest size file we like to see stored in the archive. We store several PB and I consider ours a small archive. 100PB in a few years will be nothing. But those 100Exabyte files... now those will be difficult to work with. It will be "difficult to find hardware capable of storing that much data."


  • by Anonymous Coward on Tuesday October 13, 2009 @09:52AM (#29731219)

    I don't think you're going against the spirit of normalized tables. You've added a persistent cache which happens to be implemented in the database, that's all. Most high-end databases support what you're doing via materialized views (or materialized query tables, or summary tables, or whatever; the name varies). The RDBMS basically just writes the triggers for you, but provides the added benefit of using the MQTs for optimization somewhat like an index. Properly done, you can write your queries against the (normalized) base tables, and the query planner will use the MQT instead if it can.

    Really, the reason to push normalized tables is the whole "Code first; optimize later, if at all" thing. Put all your source data in the database because you never know exactly how much of it you need or can benefit from using. Normalize the tables because you never know exactly how you will be using them. Only when your code is quite stable will you know what queries are too slow or complex, and then you can optimize them by creating summary tables. Optimizing too soon will result in a lot of wasted effort and make your job harder down the road.
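
The "summary table maintained by triggers" idea in the comment above can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views and its planner will not substitute a summary table automatically, so this only illustrates the hand-written trigger variant; the `orders` and `customer_totals` tables and their columns are a hypothetical schema invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized base table (hypothetical schema).
CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL,
    amount_cents INTEGER NOT NULL
);

-- Summary table acting as a persistent cache of per-customer totals.
CREATE TABLE customer_totals (
    customer_id INTEGER PRIMARY KEY,
    total_cents INTEGER NOT NULL
);

-- Keep the cache in step with inserts on the base table.
CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    INSERT INTO customer_totals (customer_id, total_cents)
    SELECT NEW.customer_id, 0
    WHERE NOT EXISTS (
        SELECT 1 FROM customer_totals WHERE customer_id = NEW.customer_id
    );
    UPDATE customer_totals
    SET total_cents = total_cents + NEW.amount_cents
    WHERE customer_id = NEW.customer_id;
END;

-- Keep the cache in step with deletes on the base table.
CREATE TRIGGER orders_after_delete AFTER DELETE ON orders
BEGIN
    UPDATE customer_totals
    SET total_cents = total_cents - OLD.amount_cents
    WHERE customer_id = OLD.customer_id;
END;
""")

conn.executemany(
    "INSERT INTO orders (customer_id, amount_cents) VALUES (?, ?)",
    [(1, 500), (1, 250), (2, 1000)],
)
conn.execute("DELETE FROM orders WHERE customer_id = 1 AND amount_cents = 250")

# The cached totals should match an aggregate recomputed from the base table.
cached = conn.execute(
    "SELECT customer_id, total_cents FROM customer_totals ORDER BY customer_id"
).fetchall()
recomputed = conn.execute(
    "SELECT customer_id, SUM(amount_cents) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
```

A trigger for `UPDATE` on `orders` is left out to keep the sketch short; a real cache would need one. In databases with materialized views or MQTs, as the comment notes, the system generates this bookkeeping itself.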

  • by Anonymous Coward on Tuesday October 13, 2009 @11:20AM (#29732317)

    Google has invented things like Map/Reduce

    Yikes. Google has been good about applying existing parallel and distributed computing concepts into their engineering, but they didn't invent the CS fundamentals. Map-reduce constructs are a basic idiom of most functional programs and parallel programs (whether functional or not) in scientific computing. What Google may have invented was a way to finally teach such basics to the hipsters who otherwise think the CS literature starts with their own first programming task.

    Similarly, their Python guru Guido did not invent a bunch of programming language concepts so much as cherry-pick and apply some into his own bastard language. In this regard, he has more in common with Larry Wall creating Perl than with the real programming language theorists who made all the breakthroughs since the early days of the lambda calculus.
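
The claim above that map-reduce is a basic functional idiom can be illustrated in a few lines of plain Python. This is a single-process sketch of the idiom only, not Google's MapReduce system; the names `map_phase` and `reduce_phase` and the sample lines are made up for the example.

```python
from collections import Counter
from functools import reduce


def map_phase(line: str) -> Counter:
    """Map step: turn one input line into partial word counts."""
    return Counter(line.lower().split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results by adding their counts."""
    return left + right


# Hypothetical input; in a distributed setting each line could live on a different node.
lines = [
    "big data big science",
    "science is not data management",
]

# map() applies map_phase to every line; reduce() folds the partial counts together.
word_counts = reduce(reduce_phase, map(map_phase, lines), Counter())
```

Because merging counters is associative, the fold can be split across workers and the partial results combined in any grouping, which is the property that parallel and distributed versions rely on.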

  • by Hognoxious ( 631665 ) on Tuesday October 13, 2009 @12:52PM (#29733483) Homepage Journal

    Surely a chemist should know about chemistry, a biologist about biology and so on.

    If either needs to do computation beyond his own capabilities, he needs to get a CS person to help him. That's what specialists do, they specialise.
