Data Sorting World Record: 1 Terabyte, 1 Minute
An anonymous reader writes "Computer scientists from the University of California, San Diego have broken the 'terabyte barrier' — and a world record — when they sorted more than a trillion bytes of data in 60 seconds. During this 2010 'Sort Benchmark' competition, a sort of 'World Cup of data sorting,' the UCSD team also tied a world record for fastest data sorting rate, sifting through one trillion data records in 172 minutes — and did so using just a quarter of the computing resources of the other record holder."
I used to think it was great (Score:5, Interesting)
I had a 6502 system with BASIC in ROM and a machine code monitor. The idea is to copy a page (256 bytes) from the BASIC ROM to the video card address space. This puts random characters into one quarter of the screen. Then bubble sort the 256 bytes. It took about one second.
For extra difficulty, do it again with the full 1K of video. That's harder on the 6502 because you have to use vectors in RAM for the addresses, so reads and writes are a two-step operation, as is incrementing the address. You have to test for carry. But the result was spectacular.
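For anyone who wants to replay the experiment without the hardware, here is a minimal sketch in Python rather than 6502 machine code. The names `bubble_sort_page` and `increment_pointer` are made up for illustration, the "ROM page" is just random bytes, and the pointer helper only imitates the two-byte address bump described above.

```python
import random


def bubble_sort_page(page: bytearray) -> int:
    """Bubble sort a buffer of bytes in place; return the number of passes."""
    passes = 0
    n = len(page)
    swapped = True
    while swapped:
        swapped = False
        for i in range(n - 1):
            if page[i] > page[i + 1]:
                # Swap neighbours that are out of order.
                page[i], page[i + 1] = page[i + 1], page[i]
                swapped = True
        n -= 1  # the largest value of this pass is now in its final place
        passes += 1
    return passes


def increment_pointer(lo: int, hi: int) -> tuple[int, int]:
    """Bump a 16-bit address held as two bytes, as a RAM vector would be."""
    lo = (lo + 1) & 0xFF
    if lo == 0:
        # The low byte wrapped past 0xFF, so carry into the high byte.
        hi = (hi + 1) & 0xFF
    return lo, hi


# Stand-in for one 256-byte page copied from ROM into video memory.
rom_page = bytearray(random.Random(6502).randrange(256) for _ in range(256))
screen = bytearray(rom_page)
bubble_sort_page(screen)
```

With a 1K buffer the sort itself is unchanged in Python; the extra work on the real chip is all in walking addresses past a page boundary, which is what `increment_pointer` stands in for.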
Only 52 nodes (Score:5, Interesting)
You've got to be kidding me. Each node had only two quad-core processors and sixteen 500GB drives (big potential disk I/O per node), but this system doesn't even begin to scratch the very bottom of the Top 500 list.
I just can't imagine that if even the bottom rung of the Top 500 were slightly interested in this record, they wouldn't blow this team out of the water.
Re:Great to see sorting research advance (Score:2, Interesting)
I work in the OLAP realm. Trust me, it matters. Being able to run an ad hoc query across terabytes of data with near real-time results is the holy grail of what we do. The industry has known for a while that parallel computing is the way to go, but only recently has the technology become cheap enough to consider deploying on a large scale. (Though Oracle will still happily take millions from you for Exadata if you want the expensive solution.)
One other area... (Score:2, Interesting)
Come to think of it, one area where it also matters currently is mobile development. If you aren't considering memory or processor usage, you can quickly lead yourself into some really bad performance. Thinking hard about how to make use of what little you have really matters in that space too.
So only desktop or smallish backend development can generally remain unconcerned these days with algorithmic performance...
I had to work with large datasets in my previous life as a backend IT guy, but nothing at the levels you are talking about. Even then I thought carefully about how any given approach would affect performance.
Important Details.... (Score:5, Interesting)
I think this is cool, but.... how fast is it in a more practical situation?
source [sortbenchmark.org]
Pretty close to theoretical max (Score:4, Interesting)
Let's consider the 100TB-in-172-minutes result they also posted. 52 nodes with 16 spindles per node is 832 spindles total, or about 120GB of data per spindle. 120GB of data can be read in 20 minutes and transferred in another 15 to the target spindles (assuming a uniform distribution of keys). You can then break it down into 2GB chunks locally (again by key) as you reduce. Then you spend another hour and a half reading individual chunks, sorting them in memory, concatenating and writing.
Of course this only works well if the keys are uniformly distributed (which they often are) and if data is already on the spindles (which it often isn't).
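The back-of-the-envelope numbers above are easy to check. This is only a sketch of that arithmetic: it assumes decimal units (1 TB = 10^12 bytes) and takes the 20, 15 and 90 minute phase times straight from the estimate. The per-spindle read rate is derived from those figures, not a measured value, and `estimate` is a made-up name.

```python
NODES = 52
SPINDLES_PER_NODE = 16
TOTAL_BYTES = 100e12        # 100 TB, decimal units (assumption)
READ_MINUTES = 20           # phase times as given in the estimate
TRANSFER_MINUTES = 15
SORT_WRITE_MINUTES = 90     # "another hour and a half"
RECORD_MINUTES = 172        # reported time for the 100 TB run


def estimate() -> dict:
    """Recompute the per-spindle load and the total of the three phases."""
    spindles = NODES * SPINDLES_PER_NODE                  # 832
    gb_per_spindle = TOTAL_BYTES / spindles / 1e9         # roughly 120 GB
    # Sequential read rate implied by "120GB in 20 minutes", in MB/s.
    read_mb_per_s = gb_per_spindle * 1000 / (READ_MINUTES * 60)
    total_minutes = READ_MINUTES + TRANSFER_MINUTES + SORT_WRITE_MINUTES
    return {
        "spindles": spindles,
        "gb_per_spindle": gb_per_spindle,
        "read_mb_per_s": read_mb_per_s,
        "estimated_minutes": total_minutes,
        "fraction_of_record": total_minutes / RECORD_MINUTES,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:.2f}")
```

The implied read rate comes out at about 100 MB/s per spindle, and the three phases add up to 125 minutes against the 172 reported.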