Space Supercomputing Science IT

GPU Supercomputer Could Crunch Exabyte of Data Daily For Square Kilometer Array

An anonymous reader writes "Researchers on the Square Kilometer Array project to build the world's largest radio telescope believe that a GPU cluster could be suited to stitching together the more than an exabyte of data that will be gathered by the telescope each day after its completion in 2024. One of the project heads said that graphics cards could be cut out for the job because of their high I/O and core count, adding that a conventional CPU-based supercomputer doesn't have the necessary I/O bandwidth to do the work."
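For a sense of scale, the exabyte-per-day figure can be turned into a sustained data rate. This is only a back-of-the-envelope sketch: it assumes a decimal exabyte (10^18 bytes) and a perfectly even flow over 24 hours, neither of which the summary specifies.

```python
# Convert "an exabyte per day" into an average sustained data rate.
# Assumptions (not stated in the summary): decimal units, uniform flow.
EXABYTE = 10**18                 # bytes
TERABYTE = 10**12                # bytes
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_rate_tb_per_s(bytes_per_day: float) -> float:
    """Average rate in terabytes per second for a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY / TERABYTE

rate = sustained_rate_tb_per_s(EXABYTE)
print(f"{rate:.2f} TB/s sustained")
```

Under those assumptions the telescope would have to move on the order of ten terabytes every second, around the clock.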
This discussion has been archived. No new comments can be posted.

  • Not well explained (Score:5, Informative)

    by EyeSavant ( 725627 ) on Saturday August 04, 2012 @11:11AM (#40877639)

    I guess they did not get anyone that technical to write that article or the summary.

    For I/O I guess they mean memory bandwidth. GPUs have a LOT of memory bandwidth to their on-board memory; the problem is that they sit at the end of a PCIe bus from the CPU, and the CPU has to handle most of the bookkeeping (and the actual I/O, i.e. taking data from an external source).

    So what is important is the compute density, i.e. how much computation you do for each piece of data. Getting stuff into the GPU is slow and getting stuff out is slow, but doing stuff on the data is very, very fast (because you have so many compute units and so much memory bandwidth).
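The compute-density argument can be put into a toy timing model. This is a sketch, not a benchmark: the bandwidth and throughput constants below are illustrative assumptions, and the model ignores latency, overlap of transfers with compute, and the CPU's own memory bandwidth.

```python
# Toy model: when does offloading work to a GPU pay off?
# All hardware numbers are illustrative assumptions, not measurements.
PCIE_BYTES_PER_S = 8e9     # assumed host <-> GPU transfer bandwidth
GPU_FLOPS = 500e9          # assumed sustained GPU throughput
CPU_FLOPS = 100e9          # assumed sustained CPU throughput

def gpu_time(n_bytes: float, flops_per_byte: float) -> float:
    """Copy the data in, compute, then copy the same amount back out."""
    transfer = 2 * n_bytes / PCIE_BYTES_PER_S
    compute = n_bytes * flops_per_byte / GPU_FLOPS
    return transfer + compute

def cpu_time(n_bytes: float, flops_per_byte: float) -> float:
    """Data already sits in host memory, so only compute is counted."""
    return n_bytes * flops_per_byte / CPU_FLOPS

def speedup(flops_per_byte: float, n_bytes: float = 1e9) -> float:
    """CPU time divided by GPU time; values below 1 mean the GPU loses."""
    return cpu_time(n_bytes, flops_per_byte) / gpu_time(n_bytes, flops_per_byte)

for intensity in (1, 10, 100, 1000):
    print(f"{intensity:>5} flop/byte -> speedup {speedup(intensity):.2f}x")
```

With these made-up constants the GPU is slower than the CPU at low flop/byte, and the speedup only creeps toward the raw throughput ratio (5x here) once each byte gets a lot of work done on it.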

    That is also the way they are programmed, with the main code running on the CPU, and then the kernels getting launched on the GPU with explicit or implicit transfer of data from the CPU memory to the GPU memory and back again.

    I do have high hopes for stuff like Fusion ( http://en.wikipedia.org/wiki/AMD_Fusion [wikipedia.org] ), which gets rid of the PCIe bus and makes it a lot easier to get data to the GPU cores and back again.

    And if you are going to mention GPU machines, why not mention Titan? ( http://www.olcf.ornl.gov/computing-resources/titan/ [ornl.gov] )

  • Re:2024 (Score:5, Informative)

    by epiphani ( 254981 ) <epiphani@d a l .net> on Saturday August 04, 2012 @11:17AM (#40877675)

    And for good measure, now the actual paper:

    http://www.skatelescope.org/uploaded/31235_139_Memo_Ford.pdf [skatelescope.org]

    Funny thing, I was reading this last night.

  • GPU I/O bandwidth? (Score:3, Informative)

    by saratchandra ( 847748 ) on Saturday August 04, 2012 @01:27PM (#40878441)

    Give me a break.

    "a conventional CPU-based supercomputer doesn't have the necessary I/O bandwidth to do the work."

    I work in HPC and the trend is towards heterogeneous architectures (CPU + accelerators). Moore's law, power requirements and economics are dictating that trend. It's definitely a stretch to claim that you get better I/O bandwidth with GPUs. Even with PCIe Gen 3, the effective bandwidth you get per CPU core is greater than that of an 'equivalent' GPU core.

  • by Anonymous Coward on Saturday August 04, 2012 @03:56PM (#40879573)

    You are the one without a clue what you are talking about. Let's look at the fastest shipping devices from Intel and Nvidia.

    An Intel Sandy Bridge CPU (8 cores, 2.6GHz) has a peak compute of 166 DP GFLOPS and a peak memory bandwidth of 51.2GB/s.

    A GF110-based Tesla has a peak compute of 666 DP GFLOPS and a peak memory bandwidth of 177.4GB/s.

    The Tesla has 4 times the raw DP compute and 3.5x the raw memory bandwidth.
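The ratios follow directly from the quoted peak figures, and the arithmetic is easy to check. One assumption in this sketch is not from the comment: 8 double-precision flops per core per cycle (one 4-wide AVX add plus one 4-wide AVX multiply), which is what reproduces the quoted CPU peak.

```python
# Peak figures as quoted in the comment above.
cpu_gflops, cpu_bw = 166.0, 51.2      # Sandy Bridge: DP GFLOPS, GB/s
gpu_gflops, gpu_bw = 666.0, 177.4     # GF110 Tesla:  DP GFLOPS, GB/s

# Assumption: 8 DP flops per core per cycle (AVX add + AVX multiply).
derived_cpu_gflops = 8 * 2.6 * 8      # cores * GHz * flops per cycle

compute_ratio = gpu_gflops / cpu_gflops
bandwidth_ratio = gpu_bw / cpu_bw

print(f"derived CPU peak: {derived_cpu_gflops:.1f} GFLOPS")
print(f"compute ratio:    {compute_ratio:.2f}x")
print(f"bandwidth ratio:  {bandwidth_ratio:.2f}x")
```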

    Now this is the best case for the GPU. In reality, GPUs tend to have far lower efficiency (actual versus peak) for a few reasons. Firstly, they are harder to program and need to be driven by a CPU. Secondly, they require far higher parallelism, which can fall afoul of Amdahl's law. Thirdly, they have limited memory capacity and a relatively slow, high-latency PCIe connection to main memory, over which input data must be copied in and results copied back. Fourthly, the Sandy Bridge CPU has far greater capabilities to extract performance: it is aggressively out of order and has several levels of large, fast caches.
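The Amdahl's law point can be made concrete with a few lines. This is a generic sketch of the standard formula; the parallel fractions and the count of 512 execution units are illustrative choices, not figures from this thread.

```python
def amdahl_speedup(parallel_fraction: float, n_units: float) -> float:
    """Amdahl's law: overall speedup when only part of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

# Illustrative values only: three workloads on 512 parallel units.
for p in (0.50, 0.90, 0.99):
    print(f"parallel fraction {p:.2f} -> {amdahl_speedup(p, 512):.1f}x")
```

Even with hundreds of units, a workload that is 90% parallel cannot exceed a 10x speedup, because the serial remainder dominates.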

    Look at the numbers for the Top500 supercomputers on Linpack (which is very easy and incredibly parallel, i.e., a great case for GPUs): the top Xeon result achieves 91% efficiency, while the top NVIDIA result got 54.5%.

    So in a *real* workload when comparing a properly optimized CPU implementation with an optimized GPU implementation, you would be very lucky to see a 4x increase with the GPU. Very lucky indeed. Somewhere around 2-3x would be more typical. No matter how much you stick your head in the sand, you can't get away from the reality of these numbers.
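Scaling the quoted peak numbers by the quoted Linpack efficiencies gives a rough feel for where the "2-3x" estimate comes from. Treat this as a sketch: applying system-level Top500 efficiencies to single devices is a simplification made here, not something the figures themselves justify.

```python
# Peak DP GFLOPS and Linpack efficiencies as quoted in the comment above.
cpu_peak, cpu_eff = 166.0, 0.91
gpu_peak, gpu_eff = 666.0, 0.545

# Assumption: cluster-level efficiency carries over to a single device.
cpu_sustained = cpu_peak * cpu_eff
gpu_sustained = gpu_peak * gpu_eff
adjusted_ratio = gpu_sustained / cpu_sustained

print(f"CPU sustained: {cpu_sustained:.0f} GFLOPS")
print(f"GPU sustained: {gpu_sustained:.0f} GFLOPS")
print(f"GPU / CPU:     {adjusted_ratio:.2f}x")
```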

    Now there are some other cases where fixed function units on the GPU have been used to provide a larger speedup. That's all well and good, but it tends to be rather limited. It may be akin to comparing a load using the CPU's encryption or random number acceleration functions.

    Here is some further reading if you're interested.

    www.cs.utexas.edu/users/ckkim/papers/isca10_ckkim.pdf
    www.realworldtech.com/compute-efficiency-2012/
    top500.org

  • Re:Computations (Score:5, Informative)

    by GumphMaster ( 772693 ) on Saturday August 04, 2012 @11:18PM (#40882831)

    The SKA will have digitised signals coming from one or more receiving heads and radio receivers mounted on each of the 3000 radio telescopes that form the array. There's a massive amount of data that needs to be time-correlated to within a nanosecond or so (over transmission distances > 1000km), corrected for known system distortions, subject to beam forming, corrected for rotation and atmospheric effects, passed through Fourier analysis, analysed for polarisation, filtered, binned, summarised and stored in useful ways. Some of the tasks need to be done in real time, others can wait. Some of those tasks are heavy on the floating point work and easy to parallelise. Much can be done with dedicated hardware, but that is much less flexible over the longer term than a programmable device.
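As a small illustration of the time-correlation step, the sketch below estimates the delay between two noisy copies of the same signal by brute-force cross-correlation. Everything in it is hypothetical: the signal, noise level, delay and function name are invented for the demo, and a real correlator would more likely use FFT-based methods on dedicated or massively parallel hardware.

```python
import random

def best_lag(a, b, max_lag):
    """Return the shift of b relative to a that maximizes their correlation."""
    def score(lag):
        # Sum of products over the samples where both signals overlap.
        return sum(a[i] * b[i + lag]
                   for i in range(len(a))
                   if 0 <= i + lag < len(b))
    return max(range(-max_lag, max_lag + 1), key=score)

rng = random.Random(0)          # fixed seed so the demo is repeatable
n, true_delay = 2000, 7         # sample count and injected delay (samples)

# Hypothetical sky signal, seen by two antennas with independent noise.
sky = [rng.gauss(0.0, 1.0) for _ in range(n)]
antenna_a = [s + rng.gauss(0.0, 0.5) for s in sky]
antenna_b = [(sky[i - true_delay] if i >= true_delay else 0.0)
             + rng.gauss(0.0, 0.5) for i in range(n)]

estimated = best_lag(antenna_a, antenna_b, max_lag=20)
print(f"injected delay: {true_delay}, estimated delay: {estimated}")
```

Each candidate lag is scored independently of the others, which is the kind of floating-point-heavy, easily parallelised work the comment describes.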
