
Getting Students To Think At Internet Scale

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information, so the next generation of computer scientists will have to learn to think at Internet scale, in terms of petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like that of the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."

  • by Anonymous Coward on Tuesday October 13, 2009 @04:34AM (#29729715)
    The article doesn't convince me there's any need to "think at internet scale." Whether processing 100MB or 100 petabytes the process would be the same.
  • A fantastic idea (Score:2, Interesting)

    by Anonymous Coward on Tuesday October 13, 2009 @04:47AM (#29729765)

    This is a great idea.
    Even in business we often hit problems with systems designed by people who just don't think about real-world data volumes. I work in the ERP vendor space (SAP, Oracle, PeopleSoft and so on), and their in-house systems aren't designed to simulate real-world data, so their performance is shocking when you load real throughput into them. So many times I have seen graduates think Microsoft systems can take enterprise volumes of data, and then be shocked when they build something that collapses under a few terabytes or so! I'm used to having to post millions of transactions a day, and there isn't an MS system in the world that deals with that. No offence to MS: we use Excel for reporting and drilldowns, and Access a lot, but understanding the limitations of the tools, and what they can really handle and scale to, is essential. So is understanding what large data volumes actually are these days!

    I know of a large bank that put in an ERP system using Intel and MS SQL Server (with LOTS of press). We were a bit shocked, actually, because that bank was larger than we were and we had mainframes struggling to cope with our transaction load.
    In fact I was hauled over the coals for the cost of our hardware, so I investigated. The Intel/MS solution failed so miserably that they quietly shut it down and moved back to their mainframe, with no press! It wasn't able to cope with the merest fraction of the load, and couldn't have. However, the people involved had no conception of what large meant (they thought that a faster processor was all you needed; it never occurred to them that you get something for all the extra money you pay for a mainframe!).

    I think this is a terrific idea, and not only at the scale of the whole internet: they should teach this so that students understand these concepts for any large corporation they may work for!

  • Indeed (Score:5, Interesting)

    by saisuman (1041662) on Tuesday October 13, 2009 @05:07AM (#29729861)
    I worked for one of the detectors at CERN, and I strongly agree with the notion of Science being a data management problem. We (intend to :-) pull a colossal amount of data from the detectors (about 40 TB/sec in case of the experiment I was working for). Unsurprisingly, all of it can't be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales. In short, there is a lot that happens with the data before it is even seen by a physicist. Having said that, I agree that very few people have a real appreciation and/or understanding of these kinds of systems and even fewer have the required depth of knowledge to build them. But this tends to be a highly specialized area, and I can't imagine it's easy to study it as a generic subject.
  • by SharpFang (651121) on Tuesday October 13, 2009 @05:44AM (#29730005) Homepage Journal

    It was a very surprising experience, moving from small services where you get maybe 10 hits per minute to a corporation that receives several thousand hits per second.

    There was a layer of cache between each of the 4 application layers (database, back-end, front-end and adserver), and whenever a generic cache wouldn't cut it, a custom one was applied. On my last project there, the dedicated caching system could reduce some 5000 hits per second to 1 database query per 5 seconds. That was way overengineered even for our needs, but it was a pleasure watching the backend compress several thousand requests into one, and the frontend split into pieces of "very strong cache, keep in browser cache for weeks", "strong caching, refresh once/15 min site-wide", "weak caching, refresh site-wide every 30s" and "no caching, per visitor data", with the first being some 15K of JavaScript, the second about 5K of generic content data, the third about 100 bytes of immediate reports and the last some 10 bytes of user prefs and choices.
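
    The "thousands of hits, one query per 5 seconds" effect described above can be sketched with a small time-to-live cache. This is only an illustration under stated assumptions, not the system the poster built: the class name `TTLCache`, the `simulate` helper, the key "front-page" and the fake clock are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Serve repeated reads from memory; hit the backend at most once per `ttl` seconds per key."""

    def __init__(self, loader: Callable[[str], Any], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader   # the expensive call, e.g. a database query (hypothetical)
        self._ttl = ttl
        self._clock = clock     # injectable so the behaviour can be checked without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.loads = 0          # number of times the backend was actually called

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]     # still fresh: answer from memory
        value = self._loader(key)
        self.loads += 1
        self._store[key] = (now, value)
        return value


def simulate(hits_per_second: int, seconds: int, ttl: float) -> int:
    """Replay evenly spaced hits on one key and return how many reached the backend."""
    fake_now = [0.0]
    cache = TTLCache(loader=lambda key: f"report for {key}", ttl=ttl,
                     clock=lambda: fake_now[0])
    for i in range(hits_per_second * seconds):
        fake_now[0] = i / hits_per_second
        cache.get("front-page")
    return cache.loads


if __name__ == "__main__":
    # 5000 hits/s for 10 s with a 5 s TTL: 50,000 requests, 2 backend loads.
    print(simulate(5000, 10, 5.0))
```

    The four frontend tiers in the comment would correspond to four different `ttl` values (weeks, 15 minutes, 30 seconds, and 0 for per-visitor data); a real deployment would also need locking so that concurrent misses do not all hit the backend at once.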

  • Re:The LSST? (Score:3, Interesting)

    by Shag (3737) on Tuesday October 13, 2009 @05:45AM (#29730011) Homepage

    What aallan said - although, 2015? I thought the big projects (LSST, EELT, TMT) were all setting a 2018 target now.

    I went to a talk a month and a half ago by LSST's lead camera scientist (Steve Kahn) and LSST is at this point very much vaporware (as in, they've got some of the money, and some of the parts, but are nowhere near having all the money or having it all built.) Even Pan-STARRS, which is only supposed to crank out 10TB a night, only has 1 of 4 planned scopes built (they're building a second), and has been having optical quality problems with that one. By the time kids born at the turn of the century are leaving high school, though, yes, we do expect things like these to be up and running.

    But at the risk of sounding like that one college that publishes a list every year of what the freshman class of that year does and doesn't know, kids born around the turn of the century (my daughter is one) don't have the "OMG a TB!" mentality that we grownups have. The smallest capacity hard-drive my daughter will probably remember was 5 gigs - and that was in an iPod. Things like 64-bit, gigahertz speeds, multiprocessing, fast ethernet, wifi, home broadband... always been there. DVD-R media has, to her knowledge, always been there. (I did once have to explain to her that CDs used to be the size of platters and made of black plastic, after she found some Queensrÿche vinyl.)

    She's ten now, and you can put a half-terabyte or more in a laptop, so while the idea of some big scientific project spitting out 50 or 60 laptops worth of data in a night is clearly a lot of data, it's not something that can't be envisioned.

  • by Anonymous Coward on Tuesday October 13, 2009 @06:44AM (#29730279)

    Whether processing 100MB or 100 petabytes the process would be the same.

    I disagree. From my perspective, as a research student in astronomy, I can set my desktop to search through 100MB of catalogued images looking for objects that meet a certain set of criteria, and expect it to finish overnight. If I find that I've made an error in setting the criteria, that's not such a big deal - I fix my algorithm, and get the results tomorrow.

    With a 100PB archive, like the next generation of telescopes is likely to produce, I can't do that. I need more computing power. (There's a cluster at my university which we astronomers can borrow for tasks like this - fortunately, most such problems are trivially parallelisable.) I need to make sure I get my criteria right the first time, or people will be annoyed that I've wasted a week's worth of supercomputer time. And if I can do anything to make my search more efficient, even if it takes me a few days, it's worth doing.
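
    The workflow described above (check the criteria on a small sample first, then fan the search out over independent chunks) can be sketched roughly as follows. Everything specific here is an assumption for illustration: the record fields `magnitude` and `size_arcsec`, the thresholds, the synthetic catalogue, and the thread pool, which merely stands in for whatever scheduler a real cluster provides.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List

Record = Dict[str, float]


def make_catalogue(n: int, seed: int = 0) -> List[Record]:
    """Build a synthetic catalogue; real data would come from the archive."""
    rng = random.Random(seed)
    return [{"magnitude": rng.uniform(10.0, 25.0),
             "size_arcsec": rng.uniform(0.5, 5.0)} for _ in range(n)]


def matches(record: Record) -> bool:
    """Hypothetical selection criteria: bright, compact objects."""
    return record["magnitude"] < 18.0 and record["size_arcsec"] < 2.0


def search_chunk(chunk: List[Record]) -> List[Record]:
    """Filter one chunk; chunks are independent, so this parallelises trivially."""
    return [r for r in chunk if matches(r)]


def chunked(records: List[Record], size: int) -> Iterator[List[Record]]:
    for start in range(0, len(records), size):
        yield records[start:start + size]


def dry_run(records: List[Record], sample_size: int = 100) -> float:
    """Cheap sanity check: fraction of a small sample that passes the criteria."""
    sample = records[:sample_size]
    if not sample:
        return 0.0
    return len(search_chunk(sample)) / len(sample)


def parallel_search(records: List[Record], chunk_size: int = 1000,
                    workers: int = 4) -> List[Record]:
    """Run search_chunk over all chunks concurrently, preserving catalogue order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(search_chunk, chunked(records, chunk_size)))
    return [r for part in parts for r in part]


if __name__ == "__main__":
    catalogue = make_catalogue(5000)
    print(f"sample hit rate: {dry_run(catalogue):.2%}")
    print(f"full search found {len(parallel_search(catalogue))} objects")
```

    A hit rate of 0% or 100% on the sample is a hint that the criteria are wrong before any cluster time is spent. For CPU-bound work in Python, a process pool or a proper batch system would replace the thread pool; the chunk-and-merge structure stays the same.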

    It's issues like this that make me wish I'd studied a bit of computer science in my undergraduate years - which is, incidentally, exactly what TFA is talking about.

  • by Enter the Shoggoth (1362079) on Tuesday October 13, 2009 @06:58AM (#29730339)

    I agree, and don't think it's anywhere near the science/CS-education bottleneck either. It's true that it can be useful to work with some non-trivial data even in relatively early education: sifting through a few thousand records for patterns, testing hypotheses on them, etc., can lead to a way of thinking about problems that is hard to get if you're working only with toy examples of 5 data points or something. But I think there's very little of core science education that needs to be done at "internet-scale". If we had a generation of students who solidly grasped the foundations of the scientific method, of computing, of statistics, of data-processing, etc., but their only flaw was that they were used to processing data on the order of a few megabytes, and needed to learn how to scale up, well, that'd be a good problem for us to have.

    Apart from very specific knowledge, like actually studying scaling properties of algorithms to very-large data sets, I don't see much core science education even benefiting from huge data sets. If your focus in a class isn't on scalability of algorithms, but on something else, is there any reason to make students deal with an unwieldy 30 TB of data? Even "real" scientists often do their exploratory work on a subset of the full data set.

    I disagree with your agreement :-)

    I suspect that what the article is getting at is that when you deal with very large sets of data you have to think about different algorithmic approaches rather than the cookie-cutter style of "problem solving" that most software engineering courses focus on.

    These kinds of problems require a very good understanding of not just the engineering side of things but also a comprehensive idea of statistical, numerical and analytical methods as well as an encyclopaedic knowledge of computability, complexity and information theory.

    Just think about how different the Lucene library or MapReduce are from the way most developers would have approached the problems that these tools address.
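
    For readers who have not seen the MapReduce style, here is a single-process word-count sketch of its three phases (map, shuffle, reduce). It is not Google's implementation or any library's API; the function names are made up for illustration, and a real system would run the map and reduce calls on many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word; each document can be mapped independently."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would do between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key; each key can be reduced independently."""
    return key, sum(values)


def word_count(docs: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data scale"]))
```

    The point of the structure is that neither `map_phase` nor `reduce_phase` shares state with other calls, which is what lets a framework spread them over many nodes.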

  • by Hal_Porter (817932) on Tuesday October 13, 2009 @07:03AM (#29730373)

    That's not true. The way you solve the problem changes radically depending on the amount of data you have. Consider

    100 KB - You could use the dumbest algorithm imaginable and the slowest processor and everything is fine.

    100 MB - Most embedded systems can happily manage it. A desktop system can easily, even in a rather inefficient language. Algorithms are important.

    100 GB - Big ass server - you'd definitely want to make sure you were using an efficient language and had an algorithm that scaled well, certainly to 2 processors and most likely to 4 processors. Probably should be 64-bit for efficiency.

    100 PB+ - You'd want a Google-like system with lots of nodes. Actually I think at this point the code would look nothing like the 100 MB case. I remember someone saying that Google is "just a hash table". Now I think that misses the point. Google has invented things like Map/Reduce and has custom file systems. They've also spent a lot of time trying to cut costs by studying the effects of temperature on failure rates.
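
    A back-of-the-envelope calculation makes the jump between these tiers concrete. The sketch below assumes a sequential read rate of 100 MB/s per node and decimal units; both are illustrative assumptions, not figures from the comment, and `scan_seconds` is a hypothetical helper.

```python
# Assumed sequential read throughput of one node, in bytes per second.
BYTES_PER_SECOND_PER_NODE = 100 * 10**6

SIZES = {
    "100 KB": 100 * 10**3,
    "100 MB": 100 * 10**6,
    "100 GB": 100 * 10**9,
    "100 PB": 100 * 10**15,
}


def scan_seconds(size_bytes: int, nodes: int = 1,
                 rate: int = BYTES_PER_SECOND_PER_NODE) -> float:
    """Time to read the data once, assuming it is split evenly across the nodes."""
    return size_bytes / (rate * nodes)


if __name__ == "__main__":
    for label, size in SIZES.items():
        print(f"{label}: {scan_seconds(size):,.3f} s on one node")
    days = scan_seconds(SIZES["100 PB"], nodes=10_000) / 86_400
    print(f"100 PB on 10,000 nodes: {days:.2f} days")
```

    Under these assumptions a single pass over 100 MB takes one second, 100 GB takes about 17 minutes, and 100 PB takes 10^9 seconds (roughly 32 years) on one node, or a little over a day on 10,000 nodes. That is why the 100 PB case forces a many-node design rather than a faster single machine.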

    Now I think these guys are spouting buzzwords. But if you want to process 100PB of data on
