## Stanford Uses Million-Core Supercomputer To Model Supersonic Jet Noise

Posted by Soulskill

from the i-would-use-it-to-play-quake dept.

coondoggie writes

*"Stanford researchers said this week they had used a supercomputer with 1,572,864 compute cores to predict the noise generated by a supersonic jet engine. 'Computational fluid dynamics simulations test all aspects of a supercomputer. The waves propagating throughout the simulation require a carefully orchestrated balance between computation, memory and communication. Supercomputers like Sequoia divvy up the complex math into smaller parts so they can be computed simultaneously. The more cores you have, the faster and more complex the calculations can be. And yet, despite the additional computing horsepower, the difficulty of the calculations only becomes more challenging with more cores. At the one-million-core level, previously innocuous parts of the computer code can suddenly become bottlenecks.'"*
## Pfft. I can simulate supersonic jet noise just by (Score:5, Funny)

Pfft. I can simulate supersonic jet noise just by overclocking my Radeon 7970.

## Re: (Score:3, Funny)

Pfft is my simulation of jet noise

## Re: (Score:2)

I can get it by flipping the switch on my amp and running a pick down the low E string on my Ibanez. They seriously needed a million processors? Sounds like government had a hand in that one.

## Re: (Score:2)

My first tower desktop left dark dust-marks against the wall where the fans were. Told my parents that I forgot to turn the after-burners off after take-off.

## Beowulf? (Score:1)

I don't know. The word just popped into my head.

## Wake me when they reach 4444444 cores (Score:3)

everything is in the subject

## Re: (Score:2)

## Re: (Score:3)

Well, do they count CUDA cores as fully-fledged CPU cores?

## For those who can't afford that type of equipment (Score:2)

Fwoooooooooooosh. Fwoooooooooooooooooooooooooosh. KWEEEOW. Fwooooooooooooooosh.

## Re: (Score:2)

## Re: (Score:3)

Slashdotters don't have sex, and so they cannot have slashdaughters. Ergo, slashdaughters do not exist. QED.

## Re: (Score:3)

## Re: (Score:1)

Slashdotters don't have sex, and so they cannot have slashdaughters. Ergo, slashdaughters do not exist. QED.

Slash has had sex with many, many women over the years.

I'm sure he has at least a few Slashdaughters.

## Re: (Score:1)

Is there a system that can handle a 3,000-ship EVE Online battle with no lag?

## That many ... (Score:2)

## five-dimensionally connecting the cores (Score:5, Interesting)

But searching for "5-d torus interconnect" gets you nothing on Wikipedia. Here's the explanation of the 2-dimensional version: http://en.wikipedia.org/wiki/Torus_interconnect [wikipedia.org]

And the K computer by Fujitsu [wikipedia.org] at RIKEN uses a 6-d (six-dimensional) torus network. So how does the 5-d torus interconnect lead to 2**19 + 2**20 cores, or possibly 2**17 + 2**18 CPUs? I'm not seeing it clearly in my head. Off to a paper napkin to sketch it out!

Each core connects in five dimensions: going forward or back in each dimension gives 10 links from one core to its 10 five-dimensional neighbors at distance one. But the number of cores factors only into twos and a single three (1,572,864 = 3 * 2^19), so I'm not seeing the construct...
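The napkin arithmetic can be checked with a short sketch. `factorize` and `torus_neighbors` are hypothetical helpers written for illustration; the neighbor count assumes plain wraparound links of +1/-1 along each axis, with no claim about how BG/Q actually wires its ports.

```python
def factorize(n):
    """Prime factorization by trial division, returned as {prime: exponent}."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def torus_neighbors(coord, dims):
    """Distinct nodes one hop away: +1/-1 along each axis, with wraparound."""
    nbrs = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            c = list(coord)
            c[axis] = (c[axis] + step) % size
            nbrs.add(tuple(c))
    return nbrs


print(factorize(1_572_864))                             # {2: 19, 3: 1}
print(len(torus_neighbors((0,) * 5, (4, 4, 4, 4, 4))))  # 10
print(len(torus_neighbors((0,) * 5, (4, 4, 4, 4, 2))))  # 9
```

The last line shows a wrinkle: along a dimension of size 2, "forward" and "back" reach the same node, so a 4x4x4x4x2 torus has 9 distinct nearest neighbors per node rather than 10.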

## Re:five-dimensionally connecting the cores (Score:5, Informative)

See Hardware Section 8, BG/Q Networks

## Re: (Score:2)

## Re: (Score:2)

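The relevant picture shows a "Midplane, 512 nodes, 4x4x4x4x2 Torus". A five-dimensional torus of size 4 x 4 x 4 x 4 x 2 (512 nodes) is not itself divisible by three, so the factor of three has to come from the number of midplanes joined together: 1,572,864 cores / (512 nodes * 16 cores) = 192 midplanes = 3 * 64.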

## Re: (Score:2)

"Spaghetti code, I'd like you to meet the spaghetti interconnect."

## Re: (Score:1)

Well, a quick Google search gets us this:

https://asc.llnl.gov/computing_resources/sequoia/configuration.html [llnl.gov]

16 cores per CPU/chip (or, according to Wikipedia, 18, but one is used for the OS and one is kept as a spare).

Note also that the dimensions of the torus do not all have to be the same size, so the constraint is

Number of cores = 16*A*B*C*D*E

That assumes each node on the torus has 16 cores (it could potentially be a multiple of 16).

Anyway, according to

http://en.wikipedia.org/wiki/Blue_Gene [wikipedia.org]

the system is 96 racks.

Each rack
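Plugging the figures quoted in this thread (16 compute cores per node, 512-node midplanes of shape 4x4x4x4x2, 96 racks) into the constraint above gives a quick consistency check. This is a sketch only. The two-midplanes-per-rack packaging is an assumption that the comment does not state (it is the usual Blue Gene/Q layout), and `cores_for_torus` is a hypothetical helper.

```python
from math import prod

CORES_PER_NODE = 16                # compute cores per BG/Q chip (18 minus the OS core and the spare)
MIDPLANE_SHAPE = (4, 4, 4, 4, 2)   # 512-node midplane torus, per the linked BG/Q documentation
MIDPLANES_PER_RACK = 2             # assumption: two midplanes per rack
RACKS = 96


def cores_for_torus(shape, cores_per_node=CORES_PER_NODE):
    """Number of cores = cores_per_node * A*B*C*D*E for a torus of the given shape."""
    return cores_per_node * prod(shape)


midplane_cores = cores_for_torus(MIDPLANE_SHAPE)
midplanes = RACKS * MIDPLANES_PER_RACK
total_cores = midplanes * midplane_cores
print(midplane_cores, total_cores)  # 8192 1572864
```

Under these assumptions the factor of three comes from the 96 racks (192 midplanes = 3 * 64), not from the midplane torus itself.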

## Re: (Score:2)

I'm more interesting in how the headline writer got from "1,572,864 cores" to "million core".

## why round down "1,572,864 cores" to "million core" (Score:2)

I'm more interesting in how the headline writer got from "1,572,864 cores" to "million core".

Rounding down to the nearest million? ;>)

I think the achievement was surpassing the arbitrary limit of "one million cores" in a cluster or parallel environment, the same way that people like to celebrate milestones of 10^3 somethings, or multiples of {365, 365, 365, 366} added together in ratios of approximately 4 to 1. And yes, that does (or should) make you "more interesting"! (you said "I'm more interesti

## Re: (Score:2)

in ratios of approximately 4 to 1

Shouldn't that be in ratios of 3 to 1 approximately? Responding to myself to catch the error of leap year frequency!
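The self-correction can be checked with Python's standard `calendar.isleap`, which applies the Gregorian rules. The sketch counts leap years over one full 400-year Gregorian cycle, the span after which the pattern repeats. The choice of 2000 to 2399 is arbitrary; any 400 consecutive years give the same counts.

```python
import calendar

# One full Gregorian cycle: the leap-year pattern repeats every 400 years.
leap = sum(calendar.isleap(y) for y in range(2000, 2400))
common = 400 - leap
print(leap, common, round(common / leap, 2))  # 97 303 3.12
```

So common years outnumber leap years by a little over 3 to 1, because century years not divisible by 400 are skipped.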

## Re: (Score:2)

It's the same topology as the state space of a Rubik's cube.

A 1D torus is simply a ring. Imagine a simple ring made from eight points. Translate that ring to the side a bit, and spin it through 360 degrees in steps of 45 degrees. That gives you a 2D torus. Now once again move that torus off to one side, and spin it again with the same number of steps. That's a 3D torus. Another tasty way to visualise it would be a ring of donuts sitting on their sides, stacked top to bottom in a closed circle. Every node then has s
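The "spin the ring again" construction amounts to taking the Cartesian product of rings, one ring per dimension. Below is a minimal sketch with a hypothetical `ring_product_torus` helper, using the eight-point rings from the comment. Each node is linked to its successor along every axis, which covers both directions once the links are treated as undirected.

```python
from itertools import product


def ring_product_torus(points_per_ring, dimensions):
    """Build a torus as the Cartesian product of rings; return (nodes, undirected links)."""
    nodes = list(product(range(points_per_ring), repeat=dimensions))
    links = set()
    for node in nodes:
        for axis in range(dimensions):
            # Link each node to its successor around the ring along this axis.
            nxt = list(node)
            nxt[axis] = (nxt[axis] + 1) % points_per_ring
            links.add(frozenset((node, tuple(nxt))))
    return nodes, links


for d in (1, 2, 3):
    nodes, links = ring_product_torus(8, d)
    print(d, len(nodes), len(links))  # prints 1 8 8, then 2 64 128, then 3 512 1536
```

With rings of at least three points, a d-dimensional torus has d * n^d links, so every node ends up with 2d neighbors.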

## How many cores does it take to (Score:2)

simulate the Matrix?

## Re: (Score:3)

simulate the Matrix?

One. Actually, you could do it with rocks [xkcd.com].

## Wikipedia article out of date. (Score:2)

## Makes the sound 'boom!' (Score:1)

But what was the question?

## Re: (Score:1)

## Re: (Score:1)

Such simulations use double precision for accuracy. If you used only 32-bit floating point, you would get precision problems: the tiny differences between approximated values would be amplified at every time step. The goal of this project was to model turbulence and how it could be reduced by adding grooves to the engine exhausts. Turbulence is almost fractal in nature: the closer you look at any volume in space, the smaller the vortex tubes get, right down to atoms spinning round each other. Because there
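A toy illustration of single versus double precision, as a sketch only. It repeatedly adds 0.1 to a running total rather than running a CFD time step, and it emulates 32-bit floats by rounding through the standard `struct` module after every addition. `to_float32` and `accumulate` are hypothetical helpers.

```python
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, single=False):
    """Add `step` to a running total n times, optionally rounding to float32 after each add."""
    total = 0.0
    if single:
        step = to_float32(step)
    for _ in range(n):
        total = total + step
        if single:
            total = to_float32(total)
    return total


n = 1_000_000
exact = n / 10
err64 = abs(accumulate(0.1, n) - exact)
err32 = abs(accumulate(0.1, n, single=True) - exact)
print(err64, err32)
```

After a million additions the double-precision error stays microscopic, while the single-precision total drifts far from 100,000. The drift happens because every rounding error is carried forward into the next step.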

## More cores = interesting problems (Score:3)

You get some pretty interesting problems when you increase the number of cores in your computer.

A couple of years ago, we replaced a 4-core IBM P5 with a 32-core HP DL580. We tested it for a couple of months with just one or two users at a time. Then we took a day and tested with the entire company (roughly 250 users). Thank goodness we did that before putting it into production because, for some people, it was actually *slower* than the P5. It looked like it was going to be a disaster.

Fortunately, I had seen this problem before (on a Sequent Symmetry, of all things). I ran "strace" on the offending process, and sure enough, we had a lock-contention problem. We talked to our software vendor and, while it took a while for them to admit it was their problem (and it probably cost us several thousand dollars to have them fix it), they rewrote the code to use fewer locks. Problem solved.

## Re: (Score:2)

"Using fewer locks" often means "data integrity goes down the drain".

That race condition could never happen to us, right?

## Re: (Score:2)

Of course not. It's closed-source software.

Thankfully, they were only reading the file. Why they were locking it in the first place, I'll never know.
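If a read-only workload needs a file lock at all, a shared lock lets readers proceed side by side and only blocks writers. Below is a minimal, Unix-only sketch using Python's standard `fcntl.flock`. The `try_lock` helper is hypothetical, and nothing here describes what the vendor's software actually did.

```python
import fcntl
import tempfile


def try_lock(path, mode):
    """Open `path` read-only and try a non-blocking flock(); return the file, or None if it would block."""
    f = open(path, "rb")
    try:
        fcntl.flock(f, mode | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None


with tempfile.NamedTemporaryFile() as tmp:
    reader_a = try_lock(tmp.name, fcntl.LOCK_SH)  # shared lock
    reader_b = try_lock(tmp.name, fcntl.LOCK_SH)  # second shared lock: granted alongside the first
    writer = try_lock(tmp.name, fcntl.LOCK_EX)    # exclusive lock: refused while readers hold theirs
    results = (reader_a is not None, reader_b is not None, writer is None)
    for f in (reader_a, reader_b):
        if f is not None:
            f.close()                             # closing the file releases its flock
print(results)  # (True, True, True)
```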

## Re: (Score:2)

It's cooler. Look up "compute server".

## No cores needed? (Score:2)

I was able to calculate the noise from the jet *inside the cabin* without so much as a calculator...

## Strange implications? (Score:1)

"The waves propagating throughout the simulation require a carefully orchestrated balance between computation, memory and communication."

This statement seems to imply the outcome of the simulation depends somehow on the tuning of the system hardware. That has dire implications for whatever method they are using.

If a simulation becomes non-deterministic depending on how the hardware communicates, and gives different solutions to the same problem because of that, then I would say it is not a good approach to

## Re: (Score:2)

I think the way the sentence is constructed is slightly confusing. They are not talking about the simulated sound waves but about the computation waves. This type of code is not monolithic, but runs through various phases during computation (as in MapReduce, for instance). To remain efficient, you have to orchestrate the nodes so they stay in sync and avoid costly idle time waiting on locks.

Typically, some parts of CFD or other scientific simulations may include non-deterministic steps, e.g. mathematical optimization often b

## Re: (Score:2)

It means they have to balance the speed at which calculations are performed on chunks of data against the speed at which these chunks can be propagated to and from neighbors. Then, every now and again, they need to save checkpoints of the entire simulation so they don't lose months of calculations.
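The chunk-and-exchange pattern can be sketched on a toy problem. The example below uses one explicit diffusion step on a periodic 1-D grid. It is an assumed stand-in, not the Stanford solver, and `step_serial` and `step_decomposed` are hypothetical names. The grid is split into chunks, and each chunk receives one "ghost" cell from each neighbor before it updates.

```python
def step_serial(u, r):
    """One explicit diffusion step on a periodic 1-D grid."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2 * u[i] + u[(i + 1) % n]) for i in range(n)]


def step_decomposed(chunks, r):
    """Same step, but each chunk first receives one ghost cell from each neighboring chunk."""
    k = len(chunks)
    new_chunks = []
    for c, chunk in enumerate(chunks):
        left_ghost = chunks[c - 1][-1]        # "communication": last cell of the left neighbor
        right_ghost = chunks[(c + 1) % k][0]  # first cell of the right neighbor
        padded = [left_ghost] + chunk + [right_ghost]
        new_chunks.append([padded[i] + r * (padded[i - 1] - 2 * padded[i] + padded[i + 1])
                           for i in range(1, len(padded) - 1)])
    return new_chunks


u = [float(i % 7) for i in range(24)]
chunks = [u[i:i + 6] for i in range(0, 24, 6)]
for _ in range(50):
    u = step_serial(u, 0.25)
    chunks = step_decomposed(chunks, 0.25)
flat = [x for chunk in chunks for x in chunk]
print(flat == u)  # True
```

The decomposed result matches the serial one bit for bit, because every point sees the same arithmetic in the same order. The hardware balance affects how fast the answer arrives, not what it is, as long as each exchange finishes before the next update.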

## Physics is on their side. (Score:5, Insightful)

## Re: (Score:2)

Thank you for this info.

Do you have some examples of physics related to the other classes of equations?

Wikipedia [wikipedia.org] confirms this and also tells us that heat equations are parabolic, but there aren't many examples.

## Re: (Score:2)

I'm not extremely experienced with the details of solving these equations numerically in parallel, but generally the solution of an elliptic equation at a given grid point depend

## Re: (Score:2)

Wave-equation problems are propagation problems: the theory for solving the corresponding PDEs is well understood and can rely on explicit finite-difference or finite-element schemes that are known to be stable, for instance under the Courant–Friedrichs–Lewy condition (look this up if you want). The corresponding code "only" requires matrix additions and multiplications, which is why the Linpack benchmark is relevant for these problems.
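A minimal sketch of the CFL point. It assumes the standard explicit leapfrog (central-difference) scheme for the 1-D wave equation u_tt = c^2 u_xx on a periodic grid, and the grid size, bump width and step count are arbitrary choices. The function name is hypothetical. With Courant number C = c*dt/dx at or below 1 the solution stays bounded; above 1 it blows up.

```python
import math


def leapfrog_max_amplitude(courant, n=200, steps=100, width=5.0):
    """Explicit leapfrog scheme for u_tt = c^2 u_xx on a periodic grid.

    courant is C = c*dt/dx. Starts from a Gaussian bump at rest
    (approximated by u_prev = u) and returns max |u| after `steps` steps.
    """
    u = [math.exp(-0.5 * ((i - n / 2) / width) ** 2) for i in range(n)]
    u_prev = u[:]
    c2 = courant * courant
    for _ in range(steps):
        # u[i - 1] wraps to the last element when i == 0 (periodic boundary)
        u_next = [2 * u[i] - u_prev[i] + c2 * (u[(i + 1) % n] - 2 * u[i] + u[i - 1])
                  for i in range(n)]
        u_prev, u = u, u_next
    return max(abs(v) for v in u)


print(leapfrog_max_amplitude(0.5))  # stays of order 1
print(leapfrog_max_amplitude(1.5))  # grows by many orders of magnitude
```

Each step touches only nearest neighbors, which is why this class of problem partitions so naturally across many cores.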

On the other hand, the elliptic Laplace and Poisson equ

## Re:Physics is on their side. (Score:2)

On the other hand the elliptic Laplace and Poisson equations can only be solved by matrix inversion.

You can, but you don't have to. A common undergrad PDE computing task is to solve Laplace's equation using finite differencing. If at each step you set the value of the current point to the average of the 4 surrounding points from the previous step, then you converge to the solution.

All you're doing is solving Ax=b, so any style of solver will work just fine, including but not limited to direct sol
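The averaging scheme described above is usually called Jacobi iteration. Below is a minimal sketch on a small square grid. The grid size, tolerance and boundary values (top edge at 1, other edges at 0) are illustrative assumptions, and `jacobi_laplace` is a hypothetical name.

```python
def jacobi_laplace(n=11, tol=1e-10, max_iter=100_000):
    """Solve Laplace's equation on an n x n grid by repeated 4-point averaging (Jacobi iteration).

    Boundary: top edge held at 1, other edges at 0. Returns (grid, iterations used).
    """
    u = [[0.0] * n for _ in range(n)]
    for j in range(1, n - 1):
        u[0][j] = 1.0
    for it in range(1, max_iter + 1):
        new = [row[:] for row in u]
        change = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                avg = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
                change = max(change, abs(avg - u[i][j]))
                new[i][j] = avg
        u = new
        if change < tol:
            return u, it
    return u, max_iter


grid, iterations = jacobi_laplace()
print(iterations, grid[5][5])
```

As a sanity check, the center value should approach 1/4. If you superpose the four rotated versions of this problem, every boundary point is 1 and the solution is the constant 1. By symmetry, each of the four versions contributes a quarter of that at the center.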

## Re: (Score:2)

With 1.5 million cores, space could be partitioned into 64x64x384 cubical regions (from the piccy in the article, they are simulating a non-cubic region). Not having *
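A sketch of how such a 64x64x384 block decomposition might map ranks to regions. It assumes one block per core, x varying fastest, and a non-periodic domain, so edge blocks have fewer face neighbors. All function names are hypothetical.

```python
GRID = (64, 64, 384)  # blocks per axis, as proposed in the comment


def rank_to_block(rank, grid=GRID):
    """Map a linear rank to (x, y, z) block coordinates, x varying fastest."""
    x = rank % grid[0]
    y = (rank // grid[0]) % grid[1]
    z = rank // (grid[0] * grid[1])
    return x, y, z


def block_to_rank(block, grid=GRID):
    """Inverse of rank_to_block."""
    x, y, z = block
    return x + grid[0] * (y + grid[1] * z)


def face_neighbors(rank, grid=GRID):
    """Ranks sharing a face with this block (non-periodic: edge blocks have fewer)."""
    coords = rank_to_block(rank, grid)
    out = []
    for axis in range(3):
        for step in (-1, 1):
            c = list(coords)
            c[axis] += step
            if 0 <= c[axis] < grid[axis]:
                out.append(block_to_rank(tuple(c), grid))
    return out


total = GRID[0] * GRID[1] * GRID[2]
print(total)  # 1572864
```

An interior block exchanges faces with 6 neighbors, and a corner block with only 3.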

## Amdahl's law (Score:3)

At the one-million-core level, previously innocuous parts of the computer code can suddenly become bottlenecks.

When they say this, they mean it. To put this in perspective: with 1,572,864 cores, an application which is 99.9999% scalable will use LESS THAN HALF of the hardware! Over 60% of the hardware will be tied up waiting for that 0.0001% of serial code to execute.

This problem is explained by Amdahl's law [wikipedia.org], an important (yet depressing) observation that shows just how difficult writing an effective parallel algorithm actually is, even when you're only writing for 4 cores.
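The figures above follow directly from Amdahl's law, S = 1 / (s + (1 - s) / N). Here s is the serial fraction and N is the number of cores. Below is a minimal sketch using the numbers from the comment; `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: fixed problem size, and the serial part cannot be sped up."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


cores = 1_572_864
s = 1e-6  # 0.0001% serial, i.e. 99.9999% parallel
speedup = amdahl_speedup(s, cores)
efficiency = speedup / cores
print(round(speedup), f"{efficiency:.1%}")
```

The speedup comes out near 611,000, so only about 39% of the machine is doing useful work.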

## Re: (Score:2)

You can use Amdahl's law to go in the reverse direction, from the speedup seen to a figure for the "parallelisability" of the implementation of the solution to the problem, but that's just a meaningless number. You're more interested in the speedup you achieved; there's no need to bring Amdahl's law into discussions at all.

I'm not saying Amdahl's law is useless, I frequently have to bring it to the at
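Going "in the reverse direction" means solving Amdahl's law for s given a measured speedup S on N cores: s = (1/S - 1/N) / (1 - 1/N). This quantity is known as the Karp-Flatt metric, a named technique that the comment does not mention. The sketch below uses a synthetic "measurement" on 64 cores, and both helper names are hypothetical.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law speedup for a fixed problem size."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction: Amdahl's law solved for s."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


measured = amdahl_speedup(0.01, 64)  # pretend this speedup was measured on 64 cores
print(karp_flatt(measured, 64))      # recovers roughly 0.01
```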

## Re: (Score:3)

There's Gustafson's law [slashdot.org] exactly for this. Amdahl's law is not appropriate in this case; in fact, even the Wikipedia page on Amdahl's law mentions this. You are never going to use a computer with 1 million cores for something a 4-core CPU can already do in a manageable time. If the serial portion of the code stays small (let's suppose it's just reading the initial conditions from a text file), then you make sure you are applying the million-CPU machine to a large enough job.

People don't want
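Gustafson's scaled speedup is S = s + (1 - s) * N, where s is the serial fraction measured on the parallel run while the job grows with the machine. Below is a minimal side-by-side sketch with hypothetical helper names. Note that s means something slightly different in the two laws (a fraction of single-core time for Amdahl, a fraction of parallel run time for Gustafson), so plugging the same number into both is an illustration, not a strict comparison.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl: fixed problem size."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def gustafson_speedup(serial_fraction, cores):
    """Gustafson: the parallel part of the job grows with the machine."""
    return serial_fraction + (1.0 - serial_fraction) * cores


cores = 1_572_864
s = 1e-6
amdahl_eff = amdahl_speedup(s, cores) / cores
gustafson_eff = gustafson_speedup(s, cores) / cores
print(f"Amdahl: {amdahl_eff:.1%}  Gustafson: {gustafson_eff:.4%}")
```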