Networking Science

CERN Collider To Trigger a Data Deluge 226

Posted by kdawson
from the things-that-go-bang dept.
slashthedot sends us to High Productivity Computing Wire for a look at the effort to beef up computing and communications infrastructure at a number of US universities in preparation for the data deluge anticipated later this year from two experiments coming online at CERN. The collider will smash protons together hoping to catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang. From the article: "The world's largest science experiment, a physics experiment designed to determine the nature of matter, will produce a mountain of data. And because the world's physicists cannot move to the mountain, an army of computer research scientists is preparing to move the mountain to the physicists... The CERN collider will begin producing data in November, and from the trillions of collisions of protons it will generate 15 petabytes of data per year... [This] would be the equivalent of all of the information in all of the university libraries in the United States seven times over. It would be the equivalent of 22 Internets, or more than 1,000 Libraries of Congress. And there is no search function."
This discussion has been archived. No new comments can be posted.

  • by chriss (26574) * <chriss@memomo.net> on Tuesday May 22, 2007 @05:27AM (#19218523) Homepage

    Okay, the Library of Congress has been estimated to contain about 10 Terabyte, so I buy the 1000 * LoC = 15 Petabyte. But archive.org alone expanded its storage capacity to 1 Petabyte in 2005, so the CERN is not going to generate anything near "22 Internet" (whatever that might be). This estimate [berkeley.edu] from 2002 calculates the size of the internet as about 530 Exabyte, 440 Exabyte of which are email, 157 Petabyte for the "surface web"

    • The real fundamental question is not about beginning of the universe, but something much much more important: Are they going to backup the data?
      On the other hand, I'm sure it will be available on some torrent soon.
      • by databyss (586137) on Tuesday May 22, 2007 @09:52AM (#19220555) Homepage Journal
        My quantum computer has been working on downloading the torrent for the past few weeks.
    • Re: (Score:2, Insightful)

      by vitya404 (959458)
      Have you read that article? Firstly, what you say exa, is peta really. But to my mind, the size of the internet is the data available through the internet. And my emails are not available through the web (hopefully). And while the data transmitted through the network is redundant and a huge part of it is worthless (e.g. my post), this experiment will give us an enormous amount of meaningful, and therefore valuable, data.
      • by chriss (26574) * <chriss@memomo.net> on Tuesday May 22, 2007 @05:55AM (#19218669) Homepage

        Firstly, what you say exa, is peta really.

        My bad, miscalculated, off by a factor of 1000.

      • Re: (Score:3, Interesting)

        by joto (134244)
        Meaningful and valuable to who? If I had to make the choice between using the bandwidth and storage space to store your post, or to store half a kilobyte of CERN sensor data, I would actually choose to store your post. And it's not because I find your post particularly valuable. It's because the CERN data is as meaningless to me as line-noise would be. For me even donkey bukkake with midgets is more meaningful, than random sensor data from CERN. Only when the scientists make discoveries from it that either
        • by mooingyak (720677)
          For me even donkey bukkake with midgets is more meaningful, than random sensor data from CERN.

          I think I'd rather the random sensor data, given those two options. It's kind of like staring at the wall in front of you when you're at a urinal. It's not that the wall is so interesting...
    • by Rix (54095)
      1 "internet" is being used as the amount of data transferred in a given period.
    • You assume the Archive uses all that capacity...

      Either way, the Archive also keeps old versions of the sites, meaning multiple copies of what is essentially the same site.
    • by Servo (9177)
      That's 22 times the legitimate content of the internet. Including porn it's only about .3 Internet.
    • by jamesh (87723) on Tuesday May 22, 2007 @06:49AM (#19218937)
      This is really bad news. By defining the amount of data in LoC's, they leave themselves open to a huge exploit... If the LoC ever includes this data, then there will be a recursive loop of definitions and the LoC will expand to fill the universe.

      Okay... maybe not, but if they ever did put this data in the LoC, the effort required to re-factor all the LoC based measurements would bankrupt the world. And the confusion that goes on while this re-factoring is happening will surely crash at least one probe into Mars, where the English have used the new LoC units and the Americans will have used the old LoC units.
      • by TapeCutter (624760) on Tuesday May 22, 2007 @07:14AM (#19219021) Journal
        It seems the metric LoC = 10TB. If that is so then an LoC is no longer based on a physical library but has rather been redefined based on a more basic unit of information, (ie: the byte). This sort of thing has happened before, the standard time unit (second) is no longer based on the earth's rotation, rather it is based on some esoteric (but very stable) feature of cesium atoms.

        IMHO: This is a GoodThing(TM), it could mean the LoC is well on its way to becoming an accepted SI unit. :)
        • by 1u3hr (530656)
          an LoC is no longer based on a physical library but has rather been redefined based on a more basic unit of information, (ie: the byte)...it could mean the LoC is well on its way to becoming an accepted SI unit

          Now we will have a whole other schism over whether the 10 TB is binary (10 x 2^40) or decimal (10 x 10^12), with SI purists demanding the binary be distinguished as 10 tebibytes.
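For anyone who wants to see how big the binary/decimal gap actually is, here is a minimal sketch. It assumes the 10 TB figure used in this thread for one LoC; the variable names are made up for illustration.

```python
# Two readings of "10 TB" for one LoC, as discussed above.
decimal_bytes = 10 * 10**12  # 10 terabytes (decimal, SI prefix)
binary_bytes = 10 * 2**40    # 10 tebibytes (binary prefix)

# Absolute and relative size of the disagreement.
difference = binary_bytes - decimal_bytes
relative = difference / decimal_bytes

print(f"decimal: {decimal_bytes} bytes")
print(f"binary:  {binary_bytes} bytes")
print(f"gap:     {difference} bytes ({relative:.2%})")
```

The gap comes out at roughly 10% of the decimal value, so the choice of convention is not a rounding error at this scale.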

          • The EU will go for decimal (metric), the US will go for binary (pocket calculator lobbyists), the rest of the world will put/omit a (US) behind quoted figures to inform/confuse the reader.

            It's far from perfect, but it's better than a recursively expanding unit of information gobbling up the universe.
    • If you mean by the size of the internet being how many bytes you can get to if you want, then that's only half the concern. Getting access to the bytes in a timely manner is another serious concern.
  • by DTemp (1086779) on Tuesday May 22, 2007 @05:27AM (#19218525)
    I hope they're planning on running their own fiber optic line across the Atlantic, or shipping a lot of hard drives, 'cause that's too much data to pass over the public internet.

    FYI 15 petabytes per year = 120 petabits per year = 120,000,000 gigabits per year

    120,000,000 gigabits per year / ~30,000,000 seconds per year = 4gbps of continuous transmission. They could run a fiber across the Atlantic that could handle 4gbps.
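The back-of-the-envelope figure above can be checked in a few lines. This is only a sketch: it assumes decimal petabytes (10^15 bytes) and a 365-day year instead of the rounded ~30,000,000 seconds, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s; the post rounds this to ~30,000,000

def average_gbps(petabytes_per_year: float) -> float:
    """Average sustained rate in gigabits per second for a yearly data volume."""
    bits_per_year = petabytes_per_year * 1e15 * 8  # decimal petabytes -> bits
    return bits_per_year / SECONDS_PER_YEAR / 1e9  # bits per second -> Gbps

print(f"{average_gbps(15):.2f} Gbps")
```

With the unrounded year the average lands a little under 4 Gbps, which agrees with the estimate in the post. Note that this is an average, not a peak rate.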
    • Neutrinos (Score:5, Funny)

      by MichaelSmith (789609) on Tuesday May 22, 2007 @05:46AM (#19218623) Homepage Journal

      I hope they're planning on running their own fiber optic line across the Atlantic

      You know with the right sort of particle accelerator you could send messages straight through the Earth and save a heap of latency.

    • So long as it's not needed right now pretty much any amount of data can be transmitted.
      • by MikShapi (681808) on Tuesday May 22, 2007 @07:44AM (#19219177) Journal
        That's a highly misleading figure (whatever figure you had in mind).

        When you add up the time, money, kit and effort that'd go into either burning that many optical disks or filling that many hard drives, then connecting them on the other end and reading them out, it becomes less attractive than fiber optics.

        On the other hand, if the 747 is crammed full of ultra-high-capacity hard-drives (say, the new Hitachi 1TB) in high-density racks that do not need unloading from the aircraft (it lands, it plugs into a power/multiple-10GbE-grid, offloads the data to a local ground facility, then goes out for the next run), you get something that'd possibly be competitive with fiber, as well as a possible business model avenue.

        You would, of course, need someone to be willing to pay the rough equivalent of .. say .. 500 economy airline tickets (shooting from the hip here, I tried compounding business/first-class costs).. to get that through. That's a lot of cash. Then again, at 1TB/drive, it's a LOT of data.

    • Re: (Score:2, Informative)

      by Anonymous Coward
      They could run a fiber across the Atlantic that could handle 4gbps.

      They have been getting sustained performance (with simulated data) of more than that for several years now. This is the sort of thing that Internet2 does well, when it's not on fire.
    • by Cyberax (705495)
      Actually, it's not that much data.

      Two hard drives can fit 1TB of data now (1TB hard drives are also available), so 15PB can fit on 'just' 30000 hard drives. A large number, but manageable.
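The drive count is easy to reproduce. A minimal sketch, assuming decimal units, 500 GB per drive (two drives per terabyte) and no redundancy or filesystem overhead; `drives_needed` is a made-up helper name.

```python
def drives_needed(total_bytes: int, drive_bytes: int) -> int:
    """Smallest number of drives whose combined capacity covers total_bytes."""
    return -(-total_bytes // drive_bytes)  # ceiling division on integers

YEARLY_DATA = 15 * 10**15  # 15 petabytes, decimal

print(drives_needed(YEARLY_DATA, 500 * 10**9))  # 500 GB drives
print(drives_needed(YEARLY_DATA, 10**12))       # 1 TB drives
```

This gives 30000 drives at 500 GB each, or 15000 if the 1 TB models are used. Any RAID or backup copies would multiply those numbers.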
    • Re: (Score:2, Interesting)

      by Anonymous Coward
      "They could run a fiber across the Atlantic that could handle 4gbps."

      The .eu academic networks have a lot more transatlantic bandwidth than that already. When I worked at JANET (the uk academic network) we were one hop from .us and had 10G transatlantic bandwidth (how much of that was on-demand I can't remember). Geant, the .eu research network interconnect, also has direct connections to the .us research networks. The bandwidth is in place and has been for some time. It's being updated right now as well.

      Che
      • Re: (Score:3, Interesting)

        by markov_chain (202465)
        If they could get 1GB/s sustained, it would take them... 173 days to transfer 15PB. I hope they have dark fiber to light up!
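The 173-day figure checks out. A small sketch, assuming decimal units (15 PB = 15 x 10^15 bytes, 1 GB/s = 10^9 bytes per second) and a perfectly sustained rate with no protocol overhead; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY

# 15 PB at 1 gigabyte per second, as in the comment above.
print(f"{transfer_days(15e15, 1e9):.1f} days")

# For comparison: the same volume over a 10 gigabit-per-second link
# (10e9 bits/s divided by 8 = 1.25e9 bytes/s).
print(f"{transfer_days(15e15, 10e9 / 8):.1f} days")
```

Since the 15 PB accumulates over a whole year, a link that can move it in about half a year is keeping up on average.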
        • Re: (Score:3, Informative)

          by kestasjk (933987)
          They're not going to run the particle accelerator for a day and then spend half a year transferring all the data generated, the lifetime of a particle accelerator is longer than 173 days.
          • Re: (Score:3, Funny)

            by kestasjk (933987)
            Oh wait, this is Slashdot.
            • Okay, so that's 15 petabytes *tapping on calculator* that's 3.4x10^29 bits.
            • Taking the maximum data rate from a given node as 3 gigabits per second, and taking into account the effect of bandwidth increases over time.. *tapping on calculator*
            • Okay, and taking the average mosquito lifetime as 20 days.. *tapping on calculator*
            • *breaks into a cold sweat*
            • Now, assuming mutations in mosquitos occur at a rate of 1 base pair per generation, *tap tap tap* and that our genes are diff
    • Is it really too much? The average torrent release of a popular TV show spreads to hundreds of users at an average of perhaps a megabit / second. University networks can probably handle that load without problem right now.
      • Re: (Score:2, Funny)

        by ender- (42944)

        Is it really too much? The average torrent release of a popular TV show spreads to hundreds of users at an average of perhaps a megabit / second. University networks can probably handle that load without problem right now.
        Um, no they can't, they're full to the brim with torrent traffic. :)

    • by bockelboy (824282) on Tuesday May 22, 2007 @07:59AM (#19219279)
      That's 4Gbps AVERAGE, meaning it's much below the peak rate. That's also the raw data stream, not accounting for site X in the US wanting to read reconstructed data from site Y in Europe.

      LHC-related experiments will eventually have 70 Gbps of private fibers across the Atlantic (Most NY -> Geneva, but at least 10Gbps NY -> Amsterdam), and at least 10 Gbps across the Pacific.

      For what it's worth, here are the current transfer rates for one LHC experiment [cmsdoc.cern.ch]. You'll notice that there's one site, Nebraska (my site), which averages 3.2 Gbps over the last day. That's a Tier 2 site - meaning it won't even receive the raw data, just reconstructed data.

      Our peak is designed to be 200TB / week (2.6Gbps averaged over a whole week). That's one site out of 30 Tier 2 sites and 7 Tier 1 sites (each Tier 1 should be about 4-times as big as a Tier 2).
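The conversion from a weekly volume to an average line rate can be sketched as follows. It assumes decimal terabytes (10^12 bytes); the function name is invented for this example.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 s

def weekly_tb_to_gbps(terabytes_per_week: float) -> float:
    """Average rate in gigabits per second for a given weekly data volume."""
    bits = terabytes_per_week * 1e12 * 8  # decimal terabytes -> bits
    return bits / SECONDS_PER_WEEK / 1e9  # bits per second -> Gbps

print(f"{weekly_tb_to_gbps(200):.2f} Gbps")
```

The result is about 2.65 Gbps, which matches the 2.6 Gbps quoted above once rounding is taken into account.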

      Of course, the network backbone work has been progressing for years. It's to the point where Abilene, the current I2 network, [iu.edu] rarely is at 50% capacity.

      The network part is easy; it's a function of buying the right equipment and hiring smart people. The extremely hard part is putting disk servers in place that can handle the load. When we went from OC-12 (622 Mbps) to OC-192 (~10Gbps), we had RAIDs crash because we wrote at 2Gbps on some servers for days at a time. Try building up such a system without the budget to buy high-end Fiber Channel equipment too!

      And yes, I am on a development team that works to provide data transfer services for the CMS experiment.
    • The NSA will have to scan the data for potential terrorist Tachyons hiding among the Bosons. That will slow things down a bit.
  • No Search Function (Score:5, Interesting)

    by tacocat (527354) <tallison1 AT twmi DOT rr DOT com> on Tuesday May 22, 2007 @05:27AM (#19218527)

    Google it?

    If Google is so awesome, maybe they can put their money where their mouth is and do something commendable. Of course, they'll probably have a hard time turning this data into marketing material.

    • Re: (Score:3, Informative)

      by gedhrel (241953)
      Well, there _is_ a search function, and that's what the tier-2 sites will be running. The data describes individual experiments (that is, individual collisions) and comes off LHC at a whacking rate. There's some front-end processing to throw away a lot of it before what's left gets sent to the tier-1 sites for further distribution.

      The data is suitable for high-throughput (ie, batch processing) and the idea is to keep copies of the experimental data in several places during processing. Interesting results g
    • Re: (Score:2, Interesting)

      by Raptoer (984438)
      The problem is less that there is no search function (with digital data all you're doing is matching one pattern to another), the problem is more that you don't know exactly what you are searching for!
      My guess is that they are looking for anomalies within the data that would indicate the presence of one of these subatomic particles. My guess furthermore is that once they get enough data analyzed they will be able to form a model to base a search function around.
      That or the summary lies (wouldn't be the firs
      • by scheme (19778) on Tuesday May 22, 2007 @08:20AM (#19219483)

        The problem is less that there is no search function (with digital data all you're doing is matching one pattern to another), the problem is more that you don't know exactly what you are searching for!
        My guess is that they are looking for anomalies within the data that would indicate the presence of one of these subatomic particles. My guess furthermore is that once they get enough data analyzed they will be able to form a model to base a search function around.
        That or the summary lies (wouldn't be the first time) and in fact they know exactly what they are searching for, and they have a search function, but of course someone has to look at the output of those functions to determine what impact they have on their model/ideas.

        For a lot of the physics, the researchers know what they are looking for. For example, with the Higgs boson, theories constrain the decay and production to certain channels that have characteristic signatures. So they would be looking for events that have a muon at a certain energy with a hadron jet with another given energy coming off x degrees away and so on. There have been Monte Carlo simulations and other calculations done to predict what the interesting events should look like using various different theories. Of course there may be interesting events that pop up that no one has predicted but everyone has a fairly good idea of what the expected events should look like.
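To make the idea of "searching" by signature concrete, here is a toy sketch of a cut-based event selection. Everything in it is an assumption for illustration: the `Event` fields, the thresholds and the angle window are invented, and real analyses use far richer event records and selection criteria.

```python
from dataclasses import dataclass

@dataclass
class Event:
    muon_energy_gev: float    # energy of the leading muon
    jet_energy_gev: float     # energy of the leading hadron jet
    opening_angle_deg: float  # angle between the muon and the jet

# Hypothetical cuts, chosen only for illustration.
MIN_MUON_GEV = 20.0
MIN_JET_GEV = 50.0
ANGLE_WINDOW_DEG = (150.0, 180.0)

def passes_selection(ev: Event) -> bool:
    """True if the event matches the (made-up) signature."""
    low, high = ANGLE_WINDOW_DEG
    return (ev.muon_energy_gev >= MIN_MUON_GEV
            and ev.jet_energy_gev >= MIN_JET_GEV
            and low <= ev.opening_angle_deg <= high)

events = [
    Event(35.0, 80.0, 170.0),  # satisfies all three cuts
    Event(5.0, 80.0, 170.0),   # muon energy below threshold
    Event(35.0, 80.0, 90.0),   # jet at the wrong angle
]
selected = [ev for ev in events if passes_selection(ev)]
print(f"kept {len(selected)} of {len(events)} events")
```

The point is only that the "search" is a filter over structured records with known fields, not a free-text lookup.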

    • by Benson Arizona (933024) on Tuesday May 22, 2007 @08:35AM (#19219641) Homepage
      Buy Higgs Boson now at e-bay.com

      Buy books about Bosons at Amazon.com
    • by oglueck (235089)
      I already see the "related" Google ads:

      * scintillators fix your mortgage
      * Viagra particles
      * free teen bosons
  • 60% (Score:5, Funny)

    by Alsee (515537) on Tuesday May 22, 2007 @05:34AM (#19218563) Homepage
    The CERN collider will begin producing data in November, and from the trillions of collisions of protons it will generate 15 petabytes of data per year... [This] would be the equivalent of all of the information in all of the university libraries in the United States seven times over. It would be the equivalent of 22 Internets, or more than 1,000 Libraries of Congress. And there is no search function.

    And 60% of it will be porn.

    -
  • Never mind the data (Score:5, Interesting)

    by simong (32944) on Tuesday May 22, 2007 @05:34AM (#19218565) Homepage
    What about the backups?
  • catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang

    I read of "fringe" scientists who warn that there could be potential catastrophic consequences to the coming generation of colliders. The answer to these warnings seems to be that cosmic rays of higher energy than our colliders can generate have been zipping around for billions of years - so if something "bad" could come of it, then it would have already happened.

    So, is the above quote simply a poster who doesn't know what he is talking about (someone more interested in a catchy phrase in an article th

    • From my (college-level) physics knowledge, the advantage of these colliders is that they come close to recreating the conditions which existed at the time of the beginning of the Universe (according to the Big Bang hypotheses). Whether or not these conditions allow certain never-before-seen particles to be observed is uncertain, but likely, since some kinds of particles (like mesons or bosons) have a tendency to disappear in less than a nanosecond (1*10^-9 seconds).

      On a related note, all the particle collid
      • by SamSim (630795) on Tuesday May 22, 2007 @06:03AM (#19218723) Homepage Journal

        all the particle colliders of the most recent generation (like the Tevatron at Fermilab or the Relativistic Heavy Ion collider in New York) have the capability (if certain theoretical models are accurate enough) to generate very tiny (around nine millimeters), but stable black holes (though the probability is extremely low)

        Well, yeah, but the probability is about the same as that of you generating a small black hole by clapping your hands together really hard.

        • by alexhs (877055)

          subatomic particles that are thought to have last been seen at the Big Bang.
          Mmh... Big-Bang... Black Holes...

          Reminds me of Commander Blood [wikipedia.org].

          Rendez-vous at the Big-Bang :)

          Has someone played it ?
      • by Excelcia (906188)
        I've read the different theories on what some people say "could" happen. Strangelets and micro-black-holes (where Hawking's theory of evaporation is wrong) tend to be the worst case scenarios. My question, though, is: are the coming generation of colliders capable of producing energies greater than is already seen in the rare high-energy cosmic rays? If not, then the types of particles or events created in our accelerators is unlikely to be any different than happens all over the universe when these cosm
      • have the capability (if certain theoretical models are accurate enough) to generate very tiny (around nine millimeters)

        9 millimeters? That's huge.

        You'd need a mass of about 6x10^24kg to get a Schwarzschild radius of 9mm.

        Microscopic (much smaller than a proton) black holes, yes but 9mm just doesn't sound credible unless you've got some very outlandish theories about black holes.

        (I've just been to read your link - a 9mm hole is what is left when the entire Earth is consumed by a microscopic black hole.)

        Tim.
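The mass quoted above follows from the Schwarzschild radius formula r = 2GM/c^2. A minimal sketch, using rounded textbook values for the constants and for the Earth's mass; the function names are made up for this example.

```python
G = 6.674e-11           # gravitational constant, m^3 kg^-1 s^-2 (rounded)
C = 2.998e8             # speed of light, m/s (rounded)
EARTH_MASS_KG = 5.972e24  # approximate mass of the Earth

def schwarzschild_radius(mass_kg: float) -> float:
    """Schwarzschild radius in metres for a given mass: r = 2GM / c^2."""
    return 2 * G * mass_kg / C**2

def schwarzschild_mass(radius_m: float) -> float:
    """Mass in kilograms whose Schwarzschild radius is radius_m."""
    return radius_m * C**2 / (2 * G)

print(f"mass for r = 9 mm:  {schwarzschild_mass(0.009):.2e} kg")
print(f"radius for Earth:   {schwarzschild_radius(EARTH_MASS_KG) * 1000:.2f} mm")
```

Both directions agree with the comment: a 9 mm radius needs roughly 6 x 10^24 kg, and an Earth-mass black hole comes out at just under 9 mm.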
      • by delt0r (999393)
        Your information is quite wrong. Nothing even close to mm dimensions will be created. The "stable" black holes that *might* form would last for less than 10^-15 seconds IIRC. Light can't even travel 1 mm in that time.
    • by Hatta (162192)
      The answer to these warnings seems to be that cosmic rays of higher energy than our colliders can generate have been zipping around for billions of years - so if something "bad" could come of it, then it would have already happened.

      What do you think the big bang was?
  • by UnHolier than ever (803328) <unholy_@hot m a i l . c om> on Tuesday May 22, 2007 @05:54AM (#19218663)
    Would that be 0.84 Internet per fortnight? Or 1 kiloLibrary per Congress session? How much in tubes?
    • by eMbry00s (952989)
      NO!

      You don't understand!! Argh, slashdot makes me so aggravated. Don't you understand that you can't just dump stuff on the tubes? It's not like a truck, you know.
    • Re: (Score:2, Funny)

      by AndyboyH (837116)

      How much in tubes?


      Too much, and that's why we should pay the good companies all our hard earned cash to drill giant tubes for all our torrents, MP3s, smut and VoIP calls. Or at least, wasn't that what they were arguing for? ;)
    • Re: (Score:3, Funny)

      by SharpFang (651121)
      The tube radius of 420 attoparsecs.

      OTOH owning the harddrives capable of holding this much data gives you about 730 kilometers of e-penis.
  • "The actual data analysis by physicists will take place at Tier-2 sites, so it's important that we can receive whatever data our physicists need," Würthwein says. "We will take data from CERN and push it across the worldwide networks to these seven places. They will receive it, analyze it, the whole gimbang. Once we have the data in all these places, a physicist will be able to submit jobs from their office computer, or even from a laptop in Starbucks."

    2007: CernNET becomes self aware.

  • by Laxator2 (973549) on Tuesday May 22, 2007 @06:14AM (#19218785)
    The main difference between the LHC data and the Internet is that all that 15 PB of data will come in a standard format, so a search is much easier to perform. In fact most of the search will consist of discarding non-interesting stuff while attempting to identify the very rare events that may show indications of new particles (Higgs for example). The Internet is a lot more diverse; the variety of information dwarfs the limited number of patterns LHC is looking for, so "no search available" for LHC data sounds more like "no search needed".
    • by imsabbel (611519)
      If they didn't discard the unwanted stuff, you would have to put 3 or 4 additional zeros to that number...
      • Do they have to keep track of every single decay particle? They could probably get rid of a lot of data no one is going to consult that way. On the other hand if you are going to spend that much money on CERN every bit of data coming out has a significant price on it.
      • Re: (Score:3, Informative)

        by vondo (303621)
        More like 6 or more extra zeros, actually. There seems to be a lot of confusion about this, so let me try to explain.

        Generally the data coming out of these experiments is filtered in two or more stages. It has to run in real time since the data volume is enormous. A detector like this can easily spew out several TB a second of raw data. The first layer of filtering will look at very small portions of the data and make very loose requirements on it, but can run very fast in dedicated electronics. This might
  • by $RANDOMLUSER (804576) on Tuesday May 22, 2007 @06:30AM (#19218879)

    "Like an exercise session getting you ready for the big game, we've been going to the physics gym," Hacker says
    Must. Erase. Image.
    Physics locker room.
  • by mindwhip (894744)
    FTA:"catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang."

    Who was at the Big Bang to see them then? I suspect that the numbers are a lot lower than the number of people that heard that tree fall in the woods and heard the sound of one hand clapping put together.

    • Re: (Score:2, Funny)

      by Dunbal (464142)
      and heard the sound of one hand clapping put together.

            Don't be daft. Everyone here at UU knows that the sound of one hand clapping is 'cl-'

  • by Glock27 (446276) on Tuesday May 22, 2007 @07:12AM (#19219011)
    The collider will smash protons together hoping to catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang.

    That line is some of the worst hyperbole ever. Here's why. First, there was (almost by definition) no one there to 'see' anything at the Big Bang. (Supernatural explanations aside, and this purports to be a science article.) Second, these subatomic particles are formed frequently in nature, as high-energy astronomy has found various natural particle accelerators that are FAR more powerful than anything we're likely to build on Earth.

    One hopes the author will do better next time.

  • It would be the equivalent of 22 Internets
    So our President was right about the "Internets" after all, he must have access to a few of those 22 Internets!
  • and also put some Library of Congress saucing on it.
  • it will generate 15 petabytes of data per year...

          Umm, question. Is this BEFORE or AFTER time stops?
  • Think for a moment (Score:3, Interesting)

    by kilodelta (843627) on Tuesday May 22, 2007 @08:53AM (#19219817) Homepage
    There are some other benefits to building such a huge network of high powered computers. And it's not the teleportation you thought, it's more copying of metadata and re-creating the original.

    Think about it, the only thing stopping us is the ability to store and transfer large amounts of data necessary to describe the precise makeup of a human being. I have a feeling this project will branch off into that area.
    • Re: (Score:3, Funny)

      by Control Group (105494) *
      kilodelta, I have someone I think you should meet. His name is Werner Heisenberg, and he's got some ideas that may interest you.
  • I know, it's a minor nit to pick.

    ...15 petabytes of data per year... [This] would be the equivalent of all of the information in all of the university libraries...

    I suspect that 15 petabytes of data will actually be equivalent to at most 2x the information in a number of standard model journal articles and texts. They just have to figure out the right compression kernel.
  • Sounds like the article was written by Senator Stevens. Nothing to fear, 22 emails can't possibly clog our tubes.
  • That's a lot of tubes.
  • Wonder if they'll hire or contract some Google engineers for a data mining effort. Personally I'd work for free to get a chance to mine that much data.

