Data Sorting World Record — 1 Terabyte, 1 Minute

An anonymous reader writes "Computer scientists from the University of California, San Diego have broken the 'terabyte barrier' — and a world record — by sorting more than a trillion bytes of data in 60 seconds. In the 2010 'Sort Benchmark' competition, a sort of 'World Cup of data sorting,' the UCSD team also tied a world record for fastest data sorting rate, sifting through one trillion data records in 172 minutes — and did so using just a quarter of the computing resources of the other record holder."
  • Re:1-byte records? (Score:5, Informative)

    by mike260 ( 224212 ) on Tuesday July 27, 2010 @11:45PM (#33053396)

    Ah, my mistake - the "trillion data records" refers to a different benchmark. In the 1TB benchmark, records are 100 bytes long.

  • Re:Only 52 nodes (Score:4, Informative)

    by rm999 ( 775449 ) on Tuesday July 27, 2010 @11:49PM (#33053426)

    I don't know much about them, but perhaps most supercomputers break this rule: "The hardware used should be commercially available (off-the-shelf), and unmodified (e.g. no processor over or under clocking)"

  • Re:Only 52 nodes (Score:3, Informative)

    by afidel ( 530433 ) on Wednesday July 28, 2010 @12:12AM (#33053526)
    Many of the systems on the TOP500 list are simple COTS clusters with perhaps an interesting interconnect. Remember the AppleSeed cluster that was ranked in the top 10 a few years ago; it was nothing more than a bunch of Apple desktops with an interesting gigabit mesh network. Regardless, the UCSD team hardly used optimal hardware: hex-core Xeons with 96GB of RAM per node and a better I/O subsystem could have crushed this result.
  • Re:Only 52 nodes (Score:5, Informative)

    by straponego ( 521991 ) on Wednesday July 28, 2010 @12:24AM (#33053572)
    No, most Top500 machines are composed of commercially available, unmodified parts. More than half use GigE interconnects, though InfiniBand has nearly caught up. I'm not sure if you'd count InfiniBand as COTS -- at least a couple of years ago, it was fairly involved to configure, and it's not cheap. But anybody who has the money can buy it.
  • by amirulbahr ( 1216502 ) on Wednesday July 28, 2010 @12:36AM (#33053620)
    When I read the summary I thought: what's the big deal if the 1 TB of data only contained two records of 0.5 TB each? Then I saw that kdawson wrote the summary. So I browsed over here [sortbenchmark.org] and saw that the impressive thing is that they sorted 1,000,000,000,000 records of 100 bytes each, with 10-byte keys. That is the 100 TB GraySort run behind the 172-minute record; the one-minute result covered 1 TB, i.e. 10 billion such records.
  • Re:1-byte records? (Score:1, Informative)

    by topcoder ( 1662257 ) on Wednesday July 28, 2010 @12:58AM (#33053684)

    Doesn't seem like that requires any real computation - you just go through the data maintaining a count of each of the 256 possible values (an embarrassingly parallel problem), then write it back out with 256 memsets (likewise trivially parallelisable).

    That technique is called bucket sort, BTW.

  • Re:1-byte records? (Score:2, Informative)

    by topcoder ( 1662257 ) on Wednesday July 28, 2010 @01:02AM (#33053700)
    Sorry, I made a mistake too; the right name is counting sort (http://en.wikipedia.org/wiki/Counting_sort [wikipedia.org]). (See the sketch below.)
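A minimal sketch of the counting-sort idea from the two comments above, assuming the "records" are single bytes held in one in-memory buffer. The function name counting_sort_bytes and the buffer size are illustrative, not from the thread or the benchmark.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Counting sort for 1-byte records: one pass tallies the 256
       possible values, then 256 memsets write the sorted output. */
    static void counting_sort_bytes(unsigned char *buf, size_t n)
    {
        size_t counts[256] = {0};
        unsigned char *out = buf;

        for (size_t i = 0; i < n; i++)      /* tally each value */
            counts[buf[i]]++;

        for (int v = 0; v < 256; v++) {     /* rewrite in sorted order */
            memset(out, v, counts[v]);
            out += counts[v];
        }
    }

    int main(void)
    {
        unsigned char data[1024];
        for (size_t i = 0; i < sizeof data; i++)
            data[i] = (unsigned char)rand();

        counting_sort_bytes(data, sizeof data);
        printf("first=%u last=%u\n", data[0], data[sizeof data - 1]);
        return 0;
    }

Both phases parallelize as the comment says: the tally is a per-chunk histogram merged by addition, and each memset writes an independent region.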
  • by Cold hard reality ( 1536175 ) on Wednesday July 28, 2010 @01:07AM (#33053716)
    A 256-byte bubble sort requires about 33,000 operations (256 * 255 / 2 = 32,640 comparisons). At 1 MHz, a run of about one second means every operation (compare + possible swap) took roughly 30 cycles. While somewhat slow, this is definitely much faster than what BASIC could have achieved. (See the sketch below.)
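A quick way to check that arithmetic, assuming the one-second run the comment implies: count the compare/possible-swap operations of a plain bubble sort over 256 random bytes. The array size and rand() fill are just for the illustration.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        enum { N = 256 };
        unsigned char a[N];
        long ops = 0;

        for (int i = 0; i < N; i++)
            a[i] = (unsigned char)rand();

        /* Classic bubble sort; ops counts each compare + possible swap. */
        for (int i = 0; i < N - 1; i++) {
            for (int j = 0; j < N - 1 - i; j++) {
                ops++;
                if (a[j] > a[j + 1]) {
                    unsigned char t = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = t;
                }
            }
        }

        /* N*(N-1)/2 = 32,640 ops; a one-second run at 1 MHz leaves
           about 1,000,000 / 32,640 ~= 30 cycles per operation. */
        printf("%ld ops, ~%.0f cycles/op at 1 MHz over 1 s\n",
               ops, 1e6 / ops);
        return 0;
    }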
  • by Carthag ( 643047 ) on Wednesday July 28, 2010 @06:27AM (#33054466) Homepage

    "Why is sorting 1TB so hard? I can just instruct my tape library to put the tapes in the library in alphabetical order in the slots... Y'know AA0000 AA0001 AA0002... moves a hell of a lot more than 1TB."

    That's not sorting 1TB. That's sorting n records, where n = the number of tapes.

  • by Lord Grey ( 463613 ) * on Wednesday July 28, 2010 @08:35AM (#33054976)

    A paper describing the test is here [sortbenchmark.org]. TritonSort is the sorting system itself; GraySort and MinuteSort are the benchmark categories it was entered in.

    As TFA states, this is more about balancing system resources than anything else. The actual sort method used was "a variant of radix sort" (page two of the linked PDF). Everything else was an exercise in how to break the data up into manageable chunks (850MB), distribute the chunks to workers, then merge the results. That was the real breakthrough, I think. But I wonder how much is truly a breakthrough and how much was just taking advantage of modern hardware, since one of the major changes in MinuteSort is simply avoiding a couple of disk writes by keeping everything in RAM (a feat possible because that much RAM is readily available to a single worker). (A single-node sketch of the radix pass follows this comment.)

    Regardless, this kind of work is very cool. If web framework developers (for example) paid greater attention to balancing data flow and work performed, we could probably get away with half the hardware footprint in our data centers. As it stands now, too many COTS -- and hence often-used -- frameworks are out of balance. They read too much data for the task, don't allow processing until it's all read in, etc. This causes implementors to add hardware to scale up.
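The comment above names "a variant of radix sort" as the core method. What follows is not the team's pipeline (TritonSort spreads the work across a 52-node cluster); it is a minimal single-node sketch of an LSD radix sort over fixed-width keys, assuming each 100-byte record carries its 10-byte key at the front. The function name radix_sort and the double-buffer layout are assumptions for the example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define REC_LEN 100  /* benchmark records are 100 bytes...      */
    #define KEY_LEN 10   /* ...with a 10-byte key assumed in front  */

    /* LSD radix sort over fixed-width byte keys: one stable
       counting-sort pass per key byte, least significant first. */
    static void radix_sort(unsigned char *recs, unsigned char *tmp, size_t n)
    {
        for (int b = KEY_LEN - 1; b >= 0; b--) {
            size_t counts[257] = {0};

            for (size_t i = 0; i < n; i++)        /* histogram of byte b */
                counts[recs[i * REC_LEN + b] + 1]++;
            for (int v = 0; v < 256; v++)         /* prefix sums -> offsets */
                counts[v + 1] += counts[v];
            for (size_t i = 0; i < n; i++) {      /* stable scatter */
                unsigned char *src = recs + i * REC_LEN;
                memcpy(tmp + counts[src[b]]++ * REC_LEN, src, REC_LEN);
            }
            memcpy(recs, tmp, n * REC_LEN);
        }
    }

    int main(void)
    {
        enum { N = 1000 };
        static unsigned char recs[N * REC_LEN], tmp[N * REC_LEN];
        for (size_t i = 0; i < sizeof recs; i++)
            recs[i] = (unsigned char)rand();

        radix_sort(recs, tmp, N);
        printf("first key byte after sort: %u\n", recs[0]);
        return 0;
    }

Each pass is a stable counting sort on one key byte, so ten passes order the full key; stability is what lets later, more significant passes preserve the ordering established by earlier ones.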
