
Getting Students To Think At Internet Scale

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think at Internet scale: petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big-data questions. Students are beginning to work with data sets like that of the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."
  • by razvan784 ( 1389375 ) on Tuesday October 13, 2009 @05:19AM (#29729651)
    Science has always been about extracting knowledge from thoughtfully-generated and -processed data. Managing enormous datasets is not science per se, it's computer engineering. It's useless to say 'hey I'm processing 30 TB' if you're processing them wrong. Scientific method and principles are what count, and they don't change.
    • Quite. Dr Snow didn't need squillobytes of data to discover the cause of cholera, just a few hundred cases, some keen observation and a bit of intuition.

    • by Interoperable ( 1651953 ) on Tuesday October 13, 2009 @07:37AM (#29730247)

      Yeah no kidding. I don't know if maybe that quote ('Science these days has basically turned into a data-management problem') was taken out of context, but I'm surprised a professor would say something that ignorant. I recently did a Master's in physics and it certainly didn't involve huge quantities of data; I ended up transferring much of my data off a spectrum analyzer with a floppy drive. (When we lost the GPIB transfer script I thought it would take too long to learn the HP libraries to rewrite it. That was a mistake, after 4 hours of shoving floppies in the drive I sat down and wrote a script in 2 hours, ah well.)

But the point is, a 400 data point trace may be exactly what you need to get the information you're looking for. Just because we can collect and process huge quantities of data doesn't mean that all science requires you to do so, nor is simply handling the data the critical part of analyzing it.

    • Re: (Score:3, Insightful)

      It's also useless to say 'hey I'm analyzing this graph' if you're analyzing it wrong. I think you're missing the big picture. It's incredibly naive to think that the fundamental laws are simple enough to be grasped without massive datasets. It is possible, but all the data gathered thus far suggests that the fundamental laws of nature will not be found by someone staring at an equation on a whiteboard until it clicks. That is why Cern's data capacity is measured in terabytes, and they want to grow it as muc

    • by jawahar ( 541989 )

      Getting Students To Think At Internet Scale

      After reading the headline, I thought this is an extension to []

Everybody can capture a ridiculous amount of data; doing it smartly and managing it is what makes a genius.
Everybody can capture a ridiculous amount of data; doing it smartly and managing it is what makes a genius.

      Go ahead and manage it, genius. The rest of us just use Azureus for that ;)

      • by HNS-I ( 1119771 )

Allow me to introduce you to mutorrent [], poor chap.

The article mentions Hadoop, which is an open-source version of Google's map-reduce framework (I think you can call it that). This is great and all, but it is a fairly static mechanism and hardly the end-all of distributed computing. Shouldn't university students be working on the next generation?
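For readers unfamiliar with the model, here is a toy word-count in plain single-process Python (the documents and names are made up) showing the map/shuffle/reduce pattern that Hadoop distributes across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit (word, 1) pairs for each word in one document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "big problems"]
mapped = chain.from_iterable(map_phase(d) for d in docs)
print(reduce_phase(shuffle(mapped)))
# {'big': 3, 'data': 1, 'ideas': 1, 'problems': 1}
```

The point of the pattern is that map and reduce are stateless per key, so the framework can shard them across machines; the shuffle is the only global step.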

    • by CompMD ( 522020 )

Managing large amounts of data was a problem for the chief engineer for a project I worked on. This guy had a PhD in Aerospace Engineering and lots of professional and academic honors. I was running a wind tunnel test that was capturing eight 24-bit signals at 10kHz and writing the data to a CSV. Now, he bought good hardware, but refused to pay for decent analysis software, mainly because he didn't know any. So I had to write a program to break up the data into files small enough that Excel could open them,
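A sketch of the kind of splitter program described (the file naming is invented; the 65,000-row cap assumes Excel 2003's 65,536-row limit):

```python
import csv

ROWS_PER_FILE = 65000  # stay under Excel 2003's 65,536-row limit

def write_chunk(path, part, header, rows):
    # Each chunk repeats the header so it opens as a standalone CSV.
    with open(f"{path}.part{part:03d}.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)

def split_csv(path, rows_per_file=ROWS_PER_FILE):
    """Split one large CSV into numbered chunks Excel can open."""
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                write_chunk(path, part, header, chunk)
                chunk, part = [], part + 1
        if chunk:  # flush the final partial chunk
            write_chunk(path, part, header, chunk)
```

Streaming row by row keeps memory flat no matter how large the source file is, which is the whole trick with data that outgrows the tools.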

  • The LSST? (Score:5, Informative)

by aallan ( 68633 ) on Tuesday October 13, 2009 @05:44AM (#29729755) Homepage

    Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night.

Err, no it doesn't, and no they aren't. The telescope hasn't been built yet. First light isn't scheduled until late in 2015.


    • Re: (Score:3, Funny)

      by Thanshin ( 1188877 )

      You clearly aren't prepared to think in a future frame of reference.

      That's the consequence of studying with equipment that existed at the moment you were working with it.

      Future generations won't have that problem, as they're already studying with equipment that will be paid for and released to the university several years after their graduation.

    • Re: (Score:3, Interesting)

      by Shag ( 3737 )

      What aallan said - although, 2015? I thought the big projects (LSST, EELT, TMT) were all setting a 2018 target now.

      I went to a talk a month and a half ago by LSST's lead camera scientist (Steve Kahn) and LSST is at this point very much vaporware (as in, they've got some of the money, and some of the parts, but are nowhere near having all the money or having it all built.) Even Pan-STARRS, which is only supposed to crank out 10TB a night, only has 1 of 4 planned scopes built (they're building a second), and

    • That was my first thought in reading this, too.

There *are* large data systems online now, even if they're not of the scope of LSST. The big difference is that the EOS-DIS (earth science) has funding to cover stuff like building giant unified data centers (I think they pull 2TB/day ... per satellite), while the rest of us in the "space sciences" are trying to figure out how to get enough bandwidth to serve our data, and using various distributed data systems (PDS, the VxOs, etc.). Once SDO finally la

  • A fantastic idea (Score:2, Interesting)

    by Anonymous Coward

This is a great idea.
Even in business we often hit problems with systems that are designed by people who just don't think about real-world data volumes. I work in the ERP vendor space (SAP, Oracle, PeopleSoft and so on) and their in-house systems aren't designed to simulate real-world data, so their performance is shocking when you load real throughput into them. And so many times have I seen graduates think Microsoft systems can take enterprise volumes of data - and are shocked when the

  • by Rosco P. Coltrane ( 209368 ) on Tuesday October 13, 2009 @05:48AM (#29729777)

    They just need to think. That's what they study for (ideally). Thinking people with open minds can tackle anything, including the "scale of the internet".

When I was in high school, I used a slide rule. When I entered university, I got me a calculator. Did my maths or problem-solving abilities change or improve because of the calculator? No. Students today can jolly well learn about networking on small LANs, or learn to manage small datasets on aging university computers; so long as what they learn is good, they'll be able to transpose their knowledge to a vaster scale, or invent the next Big Thing. I don't see the problem.

    • Re: (Score:3, Insightful)

      by adamchou ( 993073 )
A LOT of research has been put into improving algorithms that work at large scales. If we don't teach our youth in school all that we have learned, they are just going to have to figure it out themselves and continue to reinvent the wheel. How are we supposed to advance if we don't put them in a situation to learn and apply our newfound knowledge?
They can do it the same way that we geezers have had to do it: by figuring out that something is important and studying it on your own. Says the guy with grey hair and an accounting degree who is building a Hadoop-based prototype to test replacing mainframe processing systems with a map-reduce approach.

    • Re: (Score:3, Informative)

      I don't see the problem.

      ^Maybe this illustrates the point?

      Really really big numbers can be hard for the human brain to get a grip on. But more to the point, operating at large scales presents problems unique to the scale. Think of baking cookies. Doing this in your kitchen is a familiar thing to most people. But the kitchen method doesn't translate well to an industrial scale. Keebler doesn't use a million gallon bowl and cranes with giant beaters on the end. They don't have ovens the size of a cru
    • by Yvanhoe ( 564877 )
      Shhhh, let them start their One Supercomputer Per Child program. It can only be good.
      • by jc42 ( 318812 )

        We might note that in 1970, a computer with the capacity of the OLPC XO would have been one of the biggest, fastest supercomputers in the world. And you couldn't even buy a computer terminal with a screen that had that resolution. Now it's a child's (educational) toy.

The first computers I worked with had fewer bytes of memory+disk and a slower processor than the "smartphone" in my pocket. (Which phone doesn't matter; it'd be true for all of them. ;-)

  • Add me to the list of people who think this is a solution in search of a problem.

    Oh, who the hell am I kidding. I'm sure the problem they have in mind has something to do with spying on people.

  • Wrong (Score:3, Insightful)

    by Hognoxious ( 631665 ) on Tuesday October 13, 2009 @05:54AM (#29729819) Homepage Journal
    Summary uses data and information as if they are synonyms. They are not.
Because of the computing power needed to generate the higher-level data products, some data systems are serving level 1 data (calibrated data), not the raw sensor recordings (level 0).

Knowledge of the sensor's characteristics is thus encoded into the products being served, and thus, from an Information Science standpoint, you could characterize the higher-level data products as "Information", not "Data". ... see, I *did* actually read the first chapter of Donald Case's book []. (although, I proved that by criticizin
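As a toy illustration of that level 0 → level 1 step (the linear response and the gain/offset numbers here are invented), applying the instrument's calibration before serving the product looks like:

```python
def calibrate(raw_counts, gain, offset):
    # Level 0 -> Level 1: apply the sensor's (hypothetical) linear response,
    # so the served product already encodes instrument knowledge.
    return [gain * c + offset for c in raw_counts]

level0 = [100, 102, 98]                             # raw detector counts
level1 = calibrate(level0, gain=0.5, offset=-2.0)   # physical units
print(level1)  # [48.0, 49.0, 47.0]
```

Anyone downstream sees only `level1`; the gain and offset, i.e. the sensor knowledge, have been baked in.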

  • Indeed (Score:5, Interesting)

    by saisuman ( 1041662 ) on Tuesday October 13, 2009 @06:07AM (#29729861)
    I worked for one of the detectors at CERN, and I strongly agree with the notion of Science being a data management problem. We (intend to :-) pull a colossal amount of data from the detectors (about 40 TB/sec in case of the experiment I was working for). Unsurprisingly, all of it can't be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales. In short, there is a lot that happens with the data before it is even seen by a physicist. Having said that, I agree that very few people have a real appreciation and/or understanding of these kinds of systems and even fewer have the required depth of knowledge to build them. But this tends to be a highly specialized area, and I can't imagine it's easy to study it as a generic subject.
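The "only relevant information is extracted" step can be sketched as a trivial software trigger (the event fields and threshold here are invented; a real trigger is hardware plus layered software):

```python
def trigger(events, threshold):
    # Keep only "interesting" events; the rest are discarded before storage,
    # which is how a detector reduces an unstorable stream to a storable one.
    return [e for e in events if e["energy"] > threshold]

events = [
    {"id": 1, "energy": 3.2},
    {"id": 2, "energy": 110.5},
    {"id": 3, "energy": 0.7},
]
print(trigger(events, threshold=100.0))  # [{'id': 2, 'energy': 110.5}]
```

The hard engineering is not the filter itself but making a decision like this at 40 TB/sec without dropping the events you actually wanted.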
    • Unsurprisingly, all of it can't be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales.

      I didn't know they needed perl coders at CERN. No wonder everyone is afraid of the LHC destroying the world...


This is nothing new. I worked at a university back in the early 90s and the center for remote sensing and optics was pulling in more data every single day than most department servers could hold. Their setup was both amazing and frightening: just a massive pile of machines with saturated SCSI controllers. One of their big projects was to build a 4TB array, but 9.6 gig drives were just trickling into the market at that time. You'd need over 400 of those just to provide 4TB of raw storage. Nevermind pa

  • As an Internet user, I really can't imagine how I can download / upload petabytes of data, in my whole life.
  • Huge Misstatement (Score:4, Insightful)

    by Jane Q. Public ( 1010737 ) on Tuesday October 13, 2009 @06:11AM (#29729875)
    "Science these days has basically turned into a data-management problem," says Jimmy Lin.

    This is about the grossest misstatement of the issue that I could imagine. Science is not a data-management problem at all. But it does, and will, most certainly, depend on data management. They are two very different things, no matter how closely they must work together.
Exactly. These snappy one-liners are annoying and almost always inaccurate. I dabble in data mining, and while significant breakthroughs can be made by trawling through large amounts of mostly useless data, the most pertinent discoveries usually relate to just a few significant data features. More time and effort should be devoted to managing how much data gets produced and ensuring that what you do store is highly likely to be useful.
  • by ghostlibrary ( 450718 ) on Tuesday October 13, 2009 @06:12AM (#29729881) Homepage Journal

    I wrote up some notes from a NASA lunch meeting on this, titled (not too originally, I admit) 'The Petabyte Problem'. It's at []. It's not just a question of thinking on the 'Internet scale', but about massive data handling in general.

What makes it different from previous eras (where MB was big, where GB was big) is that, before, storage was expensive, yes, but bandwidth wasn't as much of a bottleneck, at least locally. You could store MBs or GBs on tape, ship it, and extract the data rapidly: bus and LAN speeds were high. Now, with PB, there's so much data that even if you ship a rack of TB drives and hook it up locally, you can't run a program on it in reasonable time. Particularly for browsing or inquiries.

    So we're having to rely much more on metadata or abstractions to sort out which data we can then process further.
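The metadata-first approach can be sketched like this (the catalog and its fields are invented for illustration):

```python
# A toy metadata catalog: decide *which* files to touch before moving bytes.
catalog = [
    {"file": "scan_001.dat", "night": "2009-10-01", "band": "r", "size_tb": 0.03},
    {"file": "scan_002.dat", "night": "2009-10-01", "band": "g", "size_tb": 0.03},
    {"file": "scan_003.dat", "night": "2009-10-02", "band": "r", "size_tb": 0.03},
]

def select(catalog, **criteria):
    # Filter on metadata alone; only the survivors are worth shipping
    # or processing further.
    return [rec["file"] for rec in catalog
            if all(rec.get(k) == v for k, v in criteria.items())]

print(select(catalog, band="r"))  # ['scan_001.dat', 'scan_003.dat']
```

The catalog is tiny compared to the data it describes, so the query is cheap even when the archive is petabytes.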

    • Agreed, more or less.

      If you pick a random starting point, say the mid/late 80's the rate of improvement for CPU speeds, bus speeds, network speeds, disk speeds and disk sizes were similar. Their doubling rates differences were in months not years or decades. Through the 90s and the last 10 years what worked in the late 80s continued to more or less work.

      Disk capacity has had the fastest doubling times while networks have had the slowest over the past two decades. The resulting difference between now and

    • by zrq ( 794138 )

"Computer, show me all ship-like objects, in any profile. Ah, there it is."

We are working on it: IVOA [].

  • by DavidR1991 ( 1047748 ) on Tuesday October 13, 2009 @06:19AM (#29729909) Homepage

    If you swap the focus from smaller size problems to the mega-scale problems, then you get a bunch of students who can only do mega-scale problems (reverse of the trend the article talks about)

    Here's the rub: It's easier to scale up than it is to scale down. Most big problems are made up of lots of little problems. Little problems are rarely made up of mega-scale problems...

    I think what they need to do is to keep the focus on the small/'regular' stuff, but also show how their knowledge applies to the "big stuff" (so they can 'see' problems from both ends) - not just focus on one or the other

    • by Cederic ( 9623 )

      Without disagreeing with you, I'd suggest that small scale problems have different answers to large scale ones.

      The obvious approach is thus to teach both.

Although there are a lot of petabyte-scale problems out there, as a proportion of the total problem space they are still minute. Most students won't need to work on them.

      Further to that, there's no point being able to address a large scale problem if the building blocks you're using (which individually need to deal with individual data points) aren't suffic

  • by sdiz ( 224607 )

    ... a director at IBM's Almaden Research Center

    He is just trying to sell some mainframe computer.

  • by SharpFang ( 651121 ) on Tuesday October 13, 2009 @06:44AM (#29730005) Homepage Journal

It was a very surprising experience, moving from small services where you get maybe 10 hits per minute, to a corporation that receives several thousand hits per second.

There was a layer of cache between each of 4 application layers (database, back-end, front-end and adserver), and whenever a generic cache wouldn't cut it, a custom one was applied. On my last project there, the dedicated caching system could reduce some 5000 hits per second to 1 database query per 5 seconds - way overengineered even for our needs, but it was a pleasure watching the backend compress several thousand requests into one, and the frontend split into pieces of "very strong cache, keep in browser cache for weeks", "strong caching, refresh once/15 min site-wide", "weak caching, refresh site-wide every 30s" and "no caching, per-visitor data", with the first being some 15K of Javascript, the second about 5K of generic content data, the third about 100 bytes of immediate reports and the last some 10 bytes of user prefs and choices.
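The tiered-TTL idea can be sketched with a minimal time-based cache (a toy, not the poster's actual system; each tier is just an instance with a different TTL):

```python
import time

class TTLCache:
    """Minimal time-based cache: one instance per caching tier."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get(self, key, compute):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]               # fresh: serve from cache
        value = compute()               # stale or missing: one real backend call
        self.store[key] = (now + self.ttl, value)
        return value

# Tiers roughly matching the comment: weeks, 15 min, 30 s.
static_js = TTLCache(14 * 24 * 3600)
site_wide = TTLCache(15 * 60)
reports   = TTLCache(30)
```

All requests arriving inside one TTL window collapse into a single `compute()` call, which is exactly the "thousands of requests into one" effect described above.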

  • 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer

That is SOOO true! I mean, I was brought up on my Commodore 64, and I have NO IDEA how to contemplate petabytes of data! (What does that EVEN MEAN?!?) I still don't see why ANYONE would need more than 64kB of memory.

  • 'Science these days has basically turned into a data-management problem,'

    The assumption here is that with 'size of data-set approaching infinity' the probability of finding a random result is approaching 1. Ph.D. students might like that.
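That intuition matches the standard multiple-comparisons arithmetic: with m independent tests at significance level alpha, the chance of at least one spurious "discovery" is 1 - (1 - alpha)^m, which climbs toward 1 as the number of things you test grows:

```python
# P(at least one false positive) across m independent tests at level alpha.
alpha = 0.05
for m in (1, 10, 100, 1000):
    p = 1 - (1 - alpha) ** m
    print(f"{m:5d} tests -> P(false positive) = {p:.3f}")
```

At 100 tests the probability is already above 0.99, which is why trawling a huge data set without correcting for multiple comparisons will always "find" something.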

A degree course is the first step, not the final result, in a worthwhile scientific education. You don't expect to teach every student every technique they might use in every job they could get. Most of them won't even go into research - so there is a lot of waste in teaching people skills that only a few will need. Far better to focus on the foundations (which could well include the basics of data analysis) rather than spending time on the ins and outs of products that are in use today - and will therefore be
  • Some have the attitude for juggling with exabytes. Since I was very young I've realized I never wanted to be human size. So I avoid the crowds and traffic jams. They just remind me of how small I am. Because of this longing in my heart I'm going to start the growing art. I'm going to grow now and never stop. Think like a mountain, grow to the top. Tall, I want to be tall. As big as a wall. And if I'm not tall, then I will crawl. With concentration, my size increased. And now I'm fourteen stories high, at le
Is there a single intro-to-programming book that uses long instead of int? Just as double has replaced float for almost all numerical calculations, we need long to replace int.
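The overflow being worried about is easy to demonstrate (Python's own ints are arbitrary-precision, so the 32-bit two's-complement wraparound of a C or Java `int` is simulated here; the page-view figure is invented):

```python
INT32_MAX = 2**31 - 1          # 2,147,483,647
page_views = 3_000_000_000     # a plausible big-site daily count

def to_int32(n):
    # Simulate 32-bit two's-complement wraparound.
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n

print(page_views > INT32_MAX)   # True: the count doesn't fit in an int
print(to_int32(page_views))     # -1294967296, the wrapped-around value
```

A 64-bit `long` tops out around 9.2 * 10^18, which comfortably covers any counter an "internet scale" system is likely to need.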
Part of the problem is that young students fresh out of high school have no pet datasets. For many, they're buying a new laptop for college and keeping, at most, their music. Chat logs, banking, browsing history; it hasn't occurred to them to keep these things. Hell, I doubt many CS students make backups of their own computers. I know I didn't.

    Without a personal dataset of interest to maintain and process, you'll find little demand from students for classes on large dataset computations. Unless they enjoy as

Working with a small firewalled service provider that is reasonably large in terms of IP allocation (over half a million addresses), I'm constantly amazed that none of the design engineers I encounter seem to envision the number of sessions a firewall has to cope with.

It's frustrating that we keep encountering firewalls with 10 Gbps+ claimed throughput that fall over at barely more than 100 Mbps due to resource exhaustion, and then the vendor engineers try to tell us that's because we aren't bonding the NICs

  • It's like this:
Learn to play all the campaigns in Age of Empires II, which have a population limit of 75.
    Repeat for a number of years until you are perfect and the most efficient.
    Then go play a network AOEII game with a pop cap of 200 and you will invariably lose because you can't get your head around it.
    The game is simple, yet hard to manipulate when scaled up and takes a lot more effort to win. And that's only changing one variable.

  • When we speak of "Science" in a general sense, it's about using the Scientific Method to pursue a goal or enhance our knowledge. This has nothing to do with the size of the data accumulated to perform the task. These days, all of us are learning to think at "Internet Scale." Join Facebook and "befriend" 200 million people. Enroll in LinkedIn and you have 40 million possible connections. National debts are measured in numbers with more zeros than ever used before to describe money. In other words, every fie
