Follow Slashdot stories on Twitter

Getting Students To Think At Internet Scale 98

Posted by kdawson on Tuesday October 13, 2009 @04:08AM from the peta-here-a-peta-there dept.

Hugh Pickens writes "The NY Times reports that researchers and workers in fields as diverse as biotechnology, astronomy, and computer science will soon find themselves overwhelmed with information — so the next generation of computer scientists will have to learn think in terms of Internet scale of petabytes of data. For the most part, university students have used rather modest computing systems to support their studies, but these machines fail to churn through enough data to really challenge and train young minds to ponder the mega-scale problems of tomorrow. 'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer, a director at IBM's Almaden Research Center. This year, the National Science Foundation funded 14 universities that want to teach their students how to grapple with big data questions. Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night. 'Science these days has basically turned into a data-management problem,' says Jimmy Lin, an associate professor at the University of Maryland."

This discussion has been archived. No new comments can be posted.

Getting Students To Think At Internet Scale

Load All Comments

Search 98 Comments Log In/Create an Account

Comments Filter:

Data management problem (Score:5, Insightful)

by razvan784 ( 1389375 ) writes: on Tuesday October 13, 2009 @04:19AM (#29729651)

Science has always been about extracting knowledge from thoughtfully-generated and -processed data. Managing enormous datasets is not science per se, it's computer engineering. It's useless to say 'hey I'm processing 30 TB' if you're processing them wrong. Scientific method and principles are what count, and they don't change.

Share
twitter facebook
- - Re: (Score:3, Insightful)
    
    by Trepidity ( 597 ) writes:
    
    I agree, and don't think it's anywhere near the science/CS-education bottleneck either. It's true that it can be useful to work with some non-trivial data even in relatively early education: sifting through a few thousand records for patterns, testing hypotheses on them, etc., can lead to a way of thinking about problems that is hard to get if you're working only toy examples of 5 data points or something. But I think there's very little of core science education that needs to be done at "internet-scale". I
    - Re: (Score:3, Interesting)
      
      by Enter the Shoggoth ( 1362079 ) writes:
      
      I agree, and don't think it's anywhere near the science/CS-education bottleneck either. It's true that it can be useful to work with some non-trivial data even in relatively early education: sifting through a few thousand records for patterns, testing hypotheses on them, etc., can lead to a way of thinking about problems that is hard to get if you're working only toy examples of 5 data points or something. But I think there's very little of core science education that needs to be done at "internet-scale". If we had a generation of students who solidly grasped the foundations of the scientific method, of computing, of statistics, of data-processing, etc., but their only flaw was that they were used to processing data on the orders of a few megabytes, and needed to learn how to scale up bigger--- well that'd be a good problem for us to have.
      Apart from very specific knowledge, like actually studying scaling properties of algorithms to very-large data sets, I don't see much core science education even benefiting from huge data sets. If your focus in a class isn't on scalability of algorithms, but on something else, is there any reason to make students deal with an unwieldy 30 TB of data? Even "real" scientists often do their exploratory work on a subset of the full data set.
      I disagree with your agreement :-)
      I suspect that what the article is getting at is that when you deal with very large sets of data you have to think about different algorithmic approaches rather than the cookie-cutter style of "problem solving" that most software engineering courses focus on.
      These kinds of problems require a very good understanding of not just the engineering side of things but also a comprehensive idea of statistical, numerical and analytical methods as well as an encyclopaedic knowledge o
      - Re: (Score:2)
        
        by StellarFury ( 1058280 ) writes:
        
        Clarification: the headline says "students" in science need to learn to think in internet-scale terms. This is clearly, clearly false, and bordering on stupid. There's no reason a chemist, biologist, or physicist needs internet-scale data sets if the systems they study are simply not that large.
        The summary says computer scientists, which I only partially buy. Again, you have to be working in a field that uses those data sets. If you aren't, then what does all your upscaling knowledge do for you? Diddly. Bas
        
        Re: (Score:3, Insightful)
        
        by Hognoxious ( 631665 ) writes:
        
        Surely a chemist should know about chemistry, a biologist about biology and so on.
        If either needs to do computation beyond his own capabilities, he needs to get a CS person to help him. That's what specialists do, they specialise.
        
        Re: (Score:2)
        
        by lewiscr ( 3314 ) writes:
        
        There's no reason a chemist, biologist, or physicist needs internet-scale data sets if the systems they study are simply not that large.
        Yes, if the systems are not that large. But I think all three of your examples are poorly chosen. You picked the three groups of scientists that are (as a field) producing huge data sets. Have you seen the amount of data generated in a single run of a small particle accelerator? Plasma containment simulations? Chemistry simulations (esp. where it pertains to biology)? I get this list from reading slashdot. I'm sure there are a lot more fields that I'm unaware of.
        Yes, not every physicist is working on
  - Re: (Score:2)
    
    by rikkards ( 98006 ) writes:
    
    No no, it's "learn think" not "think".
    WTF? This is the second day in a row where I have seen a similar typo like this in the summary.
  - Re:Data management problem (Score:5, Informative)
    
    by adamchou ( 993073 ) writes: on Tuesday October 13, 2009 @06:11AM (#29730109)
    
    thats absolutely not true. the process is vastly different when it comes to working with 100 MB or 10 petabytes. lets take databases for instance. if you have 100MB of data, you can just store the entire database on one server. when it comes to 100 PB of data, its even difficult to find the hardware capable of storing that much data. you need to start looking at distributed systems and distributed systems is such a broad field in itself.
    when i graduated in 2005, a lot of the techniques i was taught worked great for working with database systems that handled a few hundred thousand rows. then i got a job at an internet company that had tables with over 80 million rows. all that normalization stuff i learned in school had to be thrown out. times may have changed now, but when i was in school, not only did i not learn how to handle "internet scale" data sets, i was taught the wrong methods to handle large data sets.
    undergrad college students should at least get a basic intro to large data sets, if not have a class completely dedicated to learning on how to work with those data sets. school is supposed to prepare you for the work force. at least give the students the option to take a class that covers those topics if they want to go into those industries. i sure wish i had that option
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by MartinSchou ( 1360093 ) writes:
      
      then i got a job at an internet company that had tables with over 80 million rows. all that normalization stuff i learned in school had to be thrown out.
      Why is normalization useless just because you have 80,000,000 rows? I'm genuinely curious
      - Re: (Score:3, Informative)
        
        by autocracy ( 192714 ) writes:
        
        One example: I deal with healthcare claims. We keep everything normalized on insertion, but we also create some redundant, denormalized tables (data warehousing). Almost every query needs the same basic claim information, but I'm doing it in a query with one or two joins instead of 10.
        
        If something goes south with my manipulated tables, or I need a strange field, I still have my source data in a pure form. For a standard query, though, I can operate an order of magnitude faster by adding redundant tables tha
        
        Re: (Score:2, Insightful)
        
        by Anonymous Coward writes:
        
        I don't think you're going against the spirit of normalized tables. You've added a persistent cache which happens to be implemented in the database, that's all. Most high-end databases support what you're doing via materialized views (or materialized query tables, or summary tables, or whatever; the name varies). The RDBMS basically just writes the triggers for you, but provides the added benefit of using the MQTs for optimization somewhat like an index. Properly done, you can write your queries against
        
        Re: (Score:2)
        
        by DragonWriter ( 970822 ) writes:
        
        One example: I deal with healthcare claims. We keep everything normalized on insertion, but we also create some redundant, denormalized tables (data warehousing). Almost every query needs the same basic claim information, but I'm doing it in a query with one or two joins instead of 10.
        That sounds like just using a form of materialized views, which, while the implementation (in an RDBMS that doesn't implement them internally) may involve using triggers and denormalized "base" (in terms of the RDBMS, not the
        
        Re: (Score:2)
        
        by autocracy ( 192714 ) writes:
        
        I wasn't aiming to point out something outside. It doesn't have to be novel, or advanced. While Oracle implements materialized views, using MySQL I have to do it myself.
        
        Denormalization has never been novel. The point was just to give the parent of my comment an example of denormalization, and why one might do it.
        
        Re: (Score:2)
        
        by DragonWriter ( 970822 ) writes:
        
        I wasn't aiming to point out something outside. It doesn't have to be novel, or advanced.
        
        The post you were responding to was itself responding to someone else saying that large datasets required throwing out "all that normalization stuff" they learned in school.
        Using materialized views (whether canned or roll-your-own) in the way you describe (that is, with a normalized base schema to prevent the anomalies normalizations exists to prevent, and appropriate tables and triggers set up to present efficient-to-
      - Re: (Score:2)
        
        by adamchou ( 993073 ) writes:
        
        the most common reason i run into is because sometimes, a table becomes too big or receives too much traffic that a dedicated server is needed for it (i'm aware of the sharding option but thats much more difficult to implement and has limitations). so you split that table off from the rest of the tables. one table we had an issue with was our users table. now normally, you would split off some user data into other tables and do a join between the main user table and some of the secondary data on the other t
      - Re: (Score:2)
        
        by Alpha830RulZ ( 939527 ) writes:
        
        It's not accurate to say that normalization is useless - it's more accurate that normalization has costs that can become large, especially when you are sequentially processing an entire dataset.
        For example, consider the following
        Employee table
        Name| Position |salary
        Andrew| programmer|40,000
        Joe|tester|43000
        Jane|programmer|60000 --fucking reverse discrimination!
        If you were to normalize this, you'd factor out the position column into a positions table
        position_key| position
        1|pregrammer
        2|tester
        and change
    - Re: (Score:2, Insightful)
      
      by WarpedMind ( 151632 ) writes:
      
      I'm afraid you are limited by a short time horizon. I remember working and computing on systems where 100MB was just as difficult and expensive to deal with as 100PB is today. 2MB was the amount of mountable storage on small systems. Anything larger and you had to go to "big iron".
      Real work was done on those small systems and good scientific principles and methods were they key then and the key now.
      Just remember that the "laptop" 10 years from now will have over 8TB local SSD.
      I operate an archive for the un
      - Re: (Score:2)
        
        by adamchou ( 993073 ) writes:
        
        when i said finding the hardware to store 100PB is difficult, i didn't mean finding enough harddrives that we could hook up to a network. thats easy to do. but if you have a 100PB file, what do you do? thats where you need distributed file systems like hadoop to come in. then you have the issue with trying to process that much data. running a processing thread on one node will take forever for the processing to get done.
      - Re: (Score:1)
        
        by ResidentSourcerer ( 1011469 ) writes:
        
        If this is really a concern for teaching, then perhaps the problem should be scaled down virtually:
        Suppose that you implement virtual i386's with 4 MB ram and a 40 MB hard drive. Write a special driver for it to add a 80 ms delay for each disk access. Now instantiate 1000 copies of this VM on a machine or small cluster. This effectively turns a 3 GHz machine into a 3 MHz machine (Actually slower because of VM host overhead)
        Each machine above is 1/1000 of a real machine. Now instead of needing a 10 PB d
    - Re: (Score:2)
      
      by 2obvious4u ( 871996 ) writes:
      
      Well if Kryder's law holds then in 2042 we should have a 100PB iPod. Storing the data really isn't that hard anymore.
    - Re: (Score:2)
      
      by lewiscr ( 3314 ) writes:
      
      school is supposed to prepare you for the work force
      University (what TFS mentions) is not supposed to prepare you for the workforce. University is supposed to teach to you think. Once you know how to think, it's your job to figure out how to work. If you went to a university to prepare you for the workforce, you got swindled. Go to Technical or Vocational school, like DeVry or ITT Tech.
      That said, this genre of algorithms should be included in an algorithms class. At least introduce the concept, so that people that have learned how to think don't have to
  - Re: (Score:1, Interesting)
    
    by Anonymous Coward writes:
    
    Whether processing 100MB or 100 petabytes the process would be the same.
    I disagree. From my perspective, as a research student in astronomy, I can set my desktop to search through 100MB of catalogued images looking for objects that meet a certain set of criteria, and expect it to finish overnight. If I find that I've made an error in setting the criteria, that's not such a big deal - I fix my algorithm, and get the results tomorrow.
    With a 100PB archive, like the next generation of telescopes is likely to produce, I can't do that. I need more computing power. (There's a clus
    - Re: (Score:1)
      
      by mrrudge ( 1120279 ) writes:
      
      Do the search on a known subset of data ( including samples of everything you need to detect ) on your local machine overnight, iterate, perfect, throw it at the large set of data ?
      
      I did four and a half years of degree level computer science. Fun times ...
  - Re: (Score:3, Interesting)
    
    by Hal_Porter ( 817932 ) writes:
    
    That's not true. The way you solve the problem changes radically depending on the amount of data you have. Consider
    100 KB - You could use the dumbest algorithm imaginable and the slowest processor and everything is fine.
    100 MB - most embedded systems can happily manage it. A desktop system can easily, even in a rather inefficient language. Algorithms are important.
    100 GB - Big ass server - you'd definitely want to make sure you were using an efficient language and had an algorithm that scaled well, certainl
    - Re: (Score:1, Insightful)
      
      by Anonymous Coward writes:
      
      Google has invented things like Map/Reduce
      Yikes. Google has been good about applying existing parallel and distributed computing concepts into their engineering, but they didn't invent the CS fundamentals. Map-reduce constructs are a basic idiom of most functional programs and parallel programs (whether functional or not) in scientific computing. What Google may have invented was a way to finally teach such basics to the hipsters who otherwise think the CS literature starts with their own first programmin
    - Re: (Score:2)
      
      by markov_chain ( 202465 ) writes:
      
      Now I think these guys are spouting buzzwords. But if you want to process 100PB of data on
      Error: Out of comment memory on line 9. Aborting!
      - Re: (Score:1)
        
        by Hal_Porter ( 817932 ) writes:
        
        > Now I think these guys are spouting buzzwords. But if you want to process 100 PB of data on
        Yeah, a high profile work interrupt came in at that point and unfortunately scrambled my slashdot post composing process. What I meant to say was
        "Now I think these guys are spouting buzzwords. But if you want to process 100 PB of data on a Google like cluster of machines the way you do it is very different from 100 KB or even 100 MB on single processor machine.
        Of course small systems have their own challenges - I
- Re: (Score:2)
  
  by Hognoxious ( 631665 ) writes:
  
  Quite. Dr Snow didn't need squillobytes of data to discover the cause of cholera, just a few hundred cases, some keen observation and a bit of intuition.
- Re:Data management problem (Score:4, Insightful)
  
  by Interoperable ( 1651953 ) writes: on Tuesday October 13, 2009 @06:37AM (#29730247)
  
  Yeah no kidding. I don't know if maybe that quote ('Science these days has basically turned into a data-management problem') was taken out of context, but I'm surprised a professor would say something that ignorant. I recently did a Master's in physics and it certainly didn't involve huge quantities of data; I ended up transferring much of my data off a spectrum analyzer with a floppy drive. (When we lost the GPIB transfer script I thought it would take too long to learn the HP libraries to rewrite it. That was a mistake, after 4 hours of shoving floppies in the drive I sat down and wrote a script in 2 hours, ah well.)
  But the point is, a 400 data point trace may be exactly what you need to get the information your looking for. Just because we can collect and process huge quantities of data doesn't mean that all science requires you to do so, nor is simply handling the data the critical part of analyzing it.
  
  Parent Share
  twitter facebook
- Re: (Score:3, Insightful)
  
  by FlyingBishop ( 1293238 ) writes:
  
  It's also useless to say 'hey I'm analyzing this graph' if you're analyzing it wrong. I think you're missing the big picture. It's incredibly naive to think that the fundamental laws are simple enough to be grasped without massive datasets. It is possible, but all the data gathered thus far suggests that the fundamental laws of nature will not be found by someone staring at an equation on a whiteboard until it clicks. That is why Cern's data capacity is measured in terabytes, and they want to grow it as muc
- Re: (Score:1)
  
  by jawahar ( 541989 ) writes:
  
  Getting Students To Think At Internet Scale
  After reading the headline, I thought this is an extension to http://www.kegel.com/c10k.html [kegel.com]
everybody can (Score:1)

by Fotograf ( 1515543 ) writes:

everybody can capture ridiculous amount of data, do it smart and manage them is what makes a genius.
- Re: (Score:2)
  
  by CarpetShark ( 865376 ) writes:
  
  everybody can capture ridiculous amount of data, do it smart and manage them is what makes a genius.
  Go ahead and manage it, genius. The rest of us just use Azureus for that ;)
  - Re: (Score:1)
    
    by HNS-I ( 1119771 ) writes:
    
    Allow me to introduce to you mutorrent [utorrent.com], poor chap.
    The article mentions hadoop which is an open source version of google's map-reduce template(I think you can call it). This is great and all but it is a fairly static mechanism and hardly the end-all of distributed computing. Shouldn't university students be working on the next generation?
    - Re: (Score:2)
      
      by CarpetShark ( 865376 ) writes:
      
      I don't trust uTorrent -- it's militantly closed source, and I've heard that the developers have conflicting interests, although admittedly, I can't remember the details right now.
      http://www.utorrent.com/faq/#faq2 [utorrent.com]
      - Re: (Score:2)
        
        by Randle_Revar ( 229304 ) writes:
        
        rtorrent is where it is at
- Re: (Score:2)
  
  by CompMD ( 522020 ) writes:
  
  Managing large amounts of data was a problem for the chief engineer for a project I worked on. This guy had a PhD in Aerospace Engineering and lots of professional and academic honors. I was running a wind tunnel test that was capturing 8 24-bit signals at 10kHz and writing the data to a csv. Now, he bought good hardware, but refused to pay for decent analysis software, mainly because he didn't know any. So I had to write a program to break up the data into files small enough that Excel could open them,
- Re: (Score:2)
  
  by bsDaemon ( 87307 ) writes:
  
  oh... I thought this was going to be some cleaver advertisement for the 'R' programming language -- http://www.r-project.org/
The LSST? (Score:5, Informative)

by aallan ( 68633 ) writes: <alasdair@BLUEbabilim.co.uk minus berry> on Tuesday October 13, 2009 @04:44AM (#29729755) Homepage

Students are beginning to work with data sets like the Large Synoptic Survey Telescope, the largest public data set in the world. The telescope takes detailed images of large chunks of the sky and produces about 30 terabytes of data each night.
Err no it doesn't, and no they aren't. The telescope hasn't been built yet? First light isn't scheduled until late in 2015.
Al.

Share
twitter facebook
- Re: (Score:3, Funny)
  
  by Thanshin ( 1188877 ) writes:
  
  You clearly aren't prepared to think in a future frame of reference.
  That's the consequence of studying with equipment that existed at the moment you were working with it.
  Future generations won't have that problem, as they're already studying with equipment that will be paid for and released to the university several years after their graduation.
- Re: (Score:3, Interesting)
  
  by Shag ( 3737 ) writes:
  
  What aallan said - although, 2015? I thought the big projects (LSST, EELT, TMT) were all setting a 2018 target now.
  I went to a talk a month and a half ago by LSST's lead camera scientist (Steve Kahn) and LSST is at this point very much vaporware (as in, they've got some of the money, and some of the parts, but are nowhere near having all the money or having it all built.) Even Pan-STARRS, which is only supposed to crank out 10TB a night, only has 1 of 4 planned scopes built (they're building a second), and
- Re: (Score:2)
  
  by oneiros27 ( 46144 ) writes:
  
  That was my first thought in reading this, too.
  There *are* large data systems online now, even if they're not of the scope of LSST. The big difference is that the EOS-DIS (earth science) has funding to cover it stuff like building giant unified data centers (I think they pull 2TB/day ... per satellite), while the rest of us us in the "space sciences" are trying to figure out how to get enough bandwidth to serve our data, and using various distributed data systems (PDS, the VxOs, etc.). Once SDO finally la
A fantastic idea (Score:2, Interesting)

by Anonymous Coward writes:

This is a great idea
Even in business we often hit problems with systems that are designed by people that just dont think about real world data volumes. I work in the ERP vendor SPACE (SAP, ORACLE, PEOPLESOFT and so on) and their inhouse systems arent designed to simulate real world data and so their performance is shocking when you load real throughput into them. AND so many times have I seen graduates think Microsoft systems can take enterprise volumes of data - and are shocked when the
Students don't need to think at internet scale (Score:3, Insightful)

by Rosco P. Coltrane ( 209368 ) writes: on Tuesday October 13, 2009 @04:48AM (#29729777)

They just need to think. That's what they study for (ideally). Thinking people with open minds can tackle anything, including the "scale of the internet".
When I was in high school, I used a slide rule. When I entered university, I got me a calculator. Did maths or problem solving abilities change or improve because of the calculator? no. Student today can jolly well learn about networking on small LANs, or learn to manage small datasets on aging university computers, so long as what they learn is good, they'll be able to transpose their knowledge on a vaster scale, or invent the next Big Thing. I don't see the problem.

Share
twitter facebook
- Re: (Score:3, Insightful)
  
  by adamchou ( 993073 ) writes:
  
  A LOT of research has been put into improving algorithms for working on large scales. By not teaching our youth all that we have learned in school, they are just going to have to figure it out themselves an continue to reinvent the wheel. How are we supposed to advance if we don't put them in a situation to learn and apply our new found knowledge?
  - Re: (Score:2)
    
    by Alpha830RulZ ( 939527 ) writes:
    
    They can do it the same way that us geezers have had to do it, by figuring out that something is important and studying it on your own. Says the guy with grey hair and an accounting degree who is building a Hadoop based prototype to test replacing mainframe processing systems with a map-reduce approach.
- Re: (Score:3, Informative)
  
  by Strange Ranger ( 454494 ) writes:
  
  I don't see the problem.
  
  ^Maybe this illustrates the point?
  
  Really really big numbers can be hard for the human brain to get a grip on. But more to the point, operating at large scales presents problems unique to the scale. Think of baking cookies. Doing this in your kitchen is a familiar thing to most people. But the kitchen method doesn't translate well to an industrial scale. Keebler doesn't use a million gallon bowl and cranes with giant beaters on the end. They don't have ovens the size of a cru
  - Re: (Score:3, Funny)
    
    by vxvxvxvx ( 745287 ) writes:
    
    So when it comes to really really big numbers, we need to rely upon elfs in trees?
- Re: (Score:2)
  
  by Yvanhoe ( 564877 ) writes:
  
  Shhhh, let them start their One Supercomputer Per Child program. It can only be good.
  - Re: (Score:2)
    
    by jc42 ( 318812 ) writes:
    
    We might note that in 1970, a computer with the capacity of the OLPC XO would have been one of the biggest, fastest supercomputers in the world. And you couldn't even buy a computer terminal with a screen that had that resolution. Now it's a child's (educational) toy.
    The first computers I worked with had fewer bytes of memory+disk and a slower process than the "smartphone" in my pocket. (Which phone doesn't matter; it'd be true for all of them. ;-)
Why? (Score:1)

by benjamindees ( 441808 ) writes:

Add me to the list of people who think this is a solution in search of a problem.
Oh, who the hell am I kidding. I'm sure the problem they have in mind has something to do with spying on people.
Wrong (Score:3, Insightful)

by Hognoxious ( 631665 ) writes: on Tuesday October 13, 2009 @04:54AM (#29729819) Homepage Journal

Summary uses data and information as if they are synonyms. They are not.

Share
twitter facebook
- yes, but ... (Score:2)
  
  by oneiros27 ( 46144 ) writes:
  
  Because of the computing power to generate the higher level data products, some data systems are serving level 1 data (calibrated data), not the raw sensor recordings (level 0).
  Knowledge of the sensor's characteristics are thus encoded into the products being served, and this, from an Information Science standpoint, you could characterize the higher level data products as "Information", not "Data". ... see, I *did* actually read the first chapter of Donald Case's book [amazon.com]. (although, I proved that by criticizin
Indeed (Score:5, Interesting)

by saisuman ( 1041662 ) writes: on Tuesday October 13, 2009 @05:07AM (#29729861)

I worked for one of the detectors at CERN, and I strongly agree with the notion of Science being a data management problem. We (intend to :-) pull a colossal amount of data from the detectors (about 40 TB/sec in case of the experiment I was working for). Unsurprisingly, all of it can't be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales. In short, there is a lot that happens with the data before it is even seen by a physicist. Having said that, I agree that very few people have a real appreciation and/or understanding of these kinds of systems and even fewer have the required depth of knowledge to build them. But this tends to be a highly specialized area, and I can't imagine it's easy to study it as a generic subject.

Share
twitter facebook
- - Re: (Score:1)
    
    by saisuman ( 1041662 ) writes:
    
    Our experiment (LHCb, not to be confused with the collider, called the LHC) looks for specific types of particle collisions. The rate of the collisions determines the frequency of sampling the sensors, and the number of sensors (and the number of bits read out from each sensor) determines the size of each sample. But with a combination of some really fast electronics and a large cluster of general-purpose servers we end up getting this rate down to somewhere between 150 to 300 MB/sec. This is (designed to)
- Re: (Score:2)
  
  by BJ_Covert_Action ( 1499847 ) writes:
  
  Unsurprisingly, all of it can't be stored. There's a dedicated group of people whose only job is to make sure that only relevant information is extracted, and another small group whose only job is to make sure that all this information can be stored, accessed, and processed at large scales.
  I didn't know they needed perl coders at CERN. No wonder everyone is afraid of the LHC destroying the world...
  
  =P
  - Re: (Score:2)
    
    by Alpha830RulZ ( 939527 ) writes:
    
    Give them Python, and we'll be safe again.
- Re: (Score:2)
  
  by jtownatpunk.net ( 245670 ) writes:
  
  This is nothing new. I worked at a university back in the early 90s and the center for remote sensing and optics was pulling in more data every single day than most department servers could hold. Their setup was both amazing and frightening. Just a massive pile of machines with saturated SCSI controllers. One of their big projects was to build a 4tb array. But 9.6 gig drives were just trickling into the market at that time. You'd need over 400 of those just to provide 4tb of raw storage. Nevermind pa
Internet scale of petabytes of data... (Score:1)

by fatp ( 1171151 ) writes:

As an Internet user, I really can't imagine how I can download / upload petabytes of data, in my whole life.
- - Re: (Score:1)
    
    by troll8901 ( 1397145 ) * writes:
    
    This is 42 kilodays, or 114 years. You can do it, if you start young enough and live long enough.
    This sounds exactly like the savings plan offered by my local bank, with a massive 0.00001% interest rate compounded. Hooray to financial freedom within 114 years!
Huge Misstatement (Score:4, Insightful)

by Jane Q. Public ( 1010737 ) writes: on Tuesday October 13, 2009 @05:11AM (#29729875)

"Science these days has basically turned into a data-management problem," says Jimmy Lin.

This is about the grossest misstatement of the issue that I could imagine. Science is not a data-management problem at all. But it does, and will, most certainly, depend on data management. They are two very different things, no matter how closely they must work together.

Share
twitter facebook
- Re: (Score:2)
  
  by Shrike82 ( 1471633 ) writes:
  
  Exactly. These snappy one-liners are annoying and almost always innacurate. I dabble in Data Mining and while signifcant breakthroughs can be made by trawling through large amounts of mostly useless data, the most pertinent discoveries usually relate to just a few significant data features. More time and effort should be devoted to managing how much data gets produced and ensuring that what you do store is highly likely to be useful.
The Petabyte Problem (Score:5, Insightful)

by ghostlibrary ( 450718 ) writes: on Tuesday October 13, 2009 @05:12AM (#29729881) Homepage Journal

I wrote up some notes from a NASA lunch meeting on this, titled (not too originally, I admit) 'The Petabyte Problem'. It's at
http://www.scientificblogging.com/daytime_astronomer/petabyte_problem [scientificblogging.com]. It's not just a question of thinking on the 'Internet scale', but about massive data handling in general.
What makes it different from previous eras (where MB was big, where GB was big) is that, before, the storage was expensive, yes, but bandwidth wasn't as much of a trouble for transmitting, if even locally. You could store MBs or GBs on tape, ship it, and extract the data rapidly-- bus and LAN speeds were high. Now, with PB, there's so much data that even if you ship a rack of TB drives and hook it up locally, you can't run a program on it in reasonable time. Particularly for browsing or inquiries.
So we're having to rely much more on metadata or abstractions to sort out which data we can then process further.

Share
twitter facebook
- Re: (Score:2)
  
  by belthize ( 990217 ) writes:
  
  Agreed, more or less.
  If you pick a random starting point, say the mid/late 80's the rate of improvement for CPU speeds, bus speeds, network speeds, disk speeds and disk sizes were similar. Their doubling rates differences were in months not years or decades. Through the 90s and the last 10 years what worked in the late 80s continued to more or less work.
  Disk capacity has had the fastest doubling times while networks have had the slowest over the past two decades. The resulting difference between now and
- Re: (Score:2)
  
  by zrq ( 794138 ) writes:
  
  Computer, show me all ship-like objects, in any profile. Ah, there it is."
  
  We are working on it IVOA [ivoa.net].
Well... (Score:3)

by DavidR1991 ( 1047748 ) writes: on Tuesday October 13, 2009 @05:19AM (#29729909) Homepage

If you swap the focus from smaller size problems to the mega-scale problems, then you get a bunch of students who can only do mega-scale problems (reverse of the trend the article talks about)
Here's the rub: It's easier to scale up than it is to scale down. Most big problems are made up of lots of little problems. Little problems are rarely made up of mega-scale problems...
I think what they need to do is to keep the focus on the small/'regular' stuff, but also show how their knowledge applies to the "big stuff" (so they can 'see' problems from both ends) - not just focus on one or the other

Share
twitter facebook
- Re: (Score:2)
  
  by Cederic ( 9623 ) writes:
  
  Without disagreeing with you, I'd suggest that small scale problems have different answers to large scale ones.
  The obvious approach is thus to teach both.
  Although there are a lot of petabyte scale problems out there, as a proportion of the total problem space they are still minute. Most students wont need to work on them.
  Further to that, there's no point being able to address a large scale problem if the building blocks you're using (which individually need to deal with individual data points) aren't suffic
IBM (Score:1)

by sdiz ( 224607 ) writes:

... a director at IBM's Almaden Research Center
He is just trying to sell some mainframe computer.
Work at enterprise... (Score:3, Interesting)

by SharpFang ( 651121 ) writes: on Tuesday October 13, 2009 @05:44AM (#29730005) Homepage Journal

It was a very surprising experience, moving from small services where you get 10 hits per minute maybe, to a corporation that receives several thousands hits per second.
There was a layer of cache between each of 4 application layers (database, back-end, front-end and adserver), and whenever a generic cache wouldn't cut it, a custom one was applied. On my last project there, the dedicated caching system could reduce some 5000 hits per second to 1 database query per 5 seconds - way overengineered even for our needs but it was a pleasure watching the backend compressing several thousands requests into one, and the frontend split into pieces of "very strong cache, keep in browser cache for weeks", "strong caching, refresh once/15 min site-wide", "weak caching, refresh site-wide every 30s" and "no caching, per visitor data" with the first being some 15K of Javascript, the second about 5K of generic content data, the third about 100 bytes of immediate reports and the last some 10 bytes of user prefs and choices.

Share
twitter facebook
Sooo true! (Score:2)

by psnyder ( 1326089 ) writes:

'If they imprint on these small systems, that becomes their frame of reference and what they're always thinking about,' said Jim Spohrer
That is SOOO true! I mean, I was brought up on my Commodore 64, and I have NO IDEA how to to contemplate petabytes of data! (What does that EVEN MEAN?!?) I still don't see why ANYONE would need more than 64kB of memory.
Needle in the haystack ... (Score:2)

by foobsr ( 693224 ) writes:

'Science these days has basically turned into a data-management problem,'

The assumption here is that with 'size of data-set approaching infinity' the probability of finding a random result is approaching 1. Ph.D. students might like that.

CC.
data reduction is it's own discipline (Score:1, Troll)

by petes_PoV ( 912422 ) writes:

A degree course is the first step, not the final result in a worthwhile scientific education. You don't expect to teach every student every technique they might use in every job they could get. Most of them won't even go into research - so there is a lot of waste teaching people skills that only a few will need. Far better to focus on the foundations (which could well include the basics of data analysis), rather than spending time on the ins and outs of products that are in use today - and will therefore be
Since I was very young (Score:1)

by C0quette ( 1466487 ) writes:

Some have the attitude for juggling with exabytes. Since I was very young I've realized I never wanted to be human size. So I avoid the crowds and traffic jams. They just remind me of how small I am. Because of this longing in my heart I'm going to start the growing art. I'm going to grow now and never stop. Think like a mountain, grow to the top. Tall, I want to be tall. As big as a wall. And if I'm not tall, then I will crawl. With concentration, my size increased. And now I'm fourteen stories high, at le
Past Time to Stop Using int (Score:2)

by scruffy ( 29773 ) writes:

Is there a single intro to programming book that uses long in favor of int? Just like double has replaced float for almost all numerical calculations, we need long to replace int.
- - Re: (Score:2)
    
    by scruffy ( 29773 ) writes:
    
    We need to people to use the correct sized object for the item they're storing.
    "Premature optimization is the root of all evil."
    That quote aside, I agree with you, but also would claim that a long is correct-sized for more integers than int. Yes, it uses more space, but that is a reasonable tradeoff for safety.
    - Re: (Score:2)
      
      by cervo ( 626632 ) writes:
      
      I would claim on a 32 bit processor integer operations with 32 bits will be more efficient. I would also claim that on a 64 bit processor/operating the size of int should be 64. similar to the way on windows 3.1 int was 16 bits and windows 95 int was 32 bit. The forced upgrade for 32 bit apps.... soon we'll probably have another forced upgrade to 64 bit apps.
Datasets of interest (Score:2)

by xenocide2 ( 231786 ) writes:

Part of the problem is that young students fresh out of high school have no pet datasets. For many, they're buying a new laptop for college and keeping, at most, their music. Chat logs, banking, browsing history; it hasn't occurred to them to keep these things. Hell, I doubt few CS students make backups of their own computers. I know I didn't.
Without a personal dataset of interest to maintain and process, you'll find little demand from students for classes on large dataset computations. Unless they enjoy as
Bandwidth isn't the only issue with Internet Scale (Score:2)

by GrpA ( 691294 ) writes:

Working with a small firewalled service provider that is reasonably large in terms of IP Allocation (Over half a million addresses) I'm constantly amazed that none of the design engineers I encounter seem to envision the number of sessions a firewall has to cope with.
It's frustrating that we keep encountering firewalls with 10 Gbps + claimed throughput that fall over at barely more than 100 Mbps due to resource exhaustion and then the vendor engineers try to tell us that's because we aren't bonding the NICs
Hey I knew that! (Score:2)

by Whiteox ( 919863 ) writes:

It's like this:
Learn to play all the campaigns on Age of Empires II of which there is a population limit of 75.
Repeat for a number of years until you are perfect and the most efficient.
Then go play a network AOEII game with a pop cap of 200 and you will invariably lose because you can't get your head around it.
The game is simple, yet hard to manipulate when scaled up and takes a lot more effort to win. And that's only changing one variable.
Pedants should be Pedantic (Score:1)

by BonysGambit ( 1316469 ) writes:

When we speak of "Science" in a general sense, it's about using the Scientific Method to pursue a goal or enhance our knowledge. This has nothing to do with the size of the data accumulated to perform the task. These days, all of us are learning to think at "Internet Scale." Join Facebook and "befriend" 200 million people. Enroll in LinkedIn and you have 40 million possible connections. National debts are measured in numbers with more zeros than ever used before to describe money. In other words, every fie

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Data management problem (Score:5, Insightful)

Re: (Score:3, Insightful)

Re: (Score:3, Interesting)

Re: (Score:2)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:2)

Re:Data management problem (Score:5, Informative)

Re: (Score:2)

Re: (Score:3, Informative)

Re: (Score:2, Insightful)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2, Insightful)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: (Score:1, Interesting)

Re: (Score:1)

Re: (Score:3, Interesting)

Re: (Score:1, Insightful)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re:Data management problem (Score:4, Insightful)

Re: (Score:3, Insightful)

Re: (Score:1)

everybody can (Score:1)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

The LSST? (Score:5, Informative)

Re: (Score:3, Funny)

Re: (Score:3, Interesting)

Re: (Score:2)

A fantastic idea (Score:2, Interesting)

Students don't need to think at internet scale (Score:3, Insightful)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:3, Informative)

Re: (Score:3, Funny)

Re: (Score:2)

Re: (Score:2)

Why? (Score:1)

Wrong (Score:3, Insightful)

yes, but ... (Score:2)

Indeed (Score:5, Interesting)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Internet scale of petabytes of data... (Score:1)

Re: (Score:1)

Huge Misstatement (Score:4, Insightful)

Re: (Score:2)

The Petabyte Problem (Score:5, Insightful)

Re: (Score:2)

Re: (Score:2)

Well... (Score:3)

Re: (Score:2)

IBM (Score:1)

Work at enterprise... (Score:3, Interesting)

Sooo true! (Score:2)

Needle in the haystack ... (Score:2)

data reduction is it's own discipline (Score:1, Troll)

Since I was very young (Score:1)

Past Time to Stop Using int (Score:2)

Re: (Score:2)

Re: (Score:2)

Datasets of interest (Score:2)

Bandwidth isn't the only issue with Internet Scale (Score:2)

Hey I knew that! (Score:2)

Pedants should be Pedantic (Score:1)