
Best CPU for Parallel Cluster?

January 5, 2009 4:49:40 PM

Hi All,

I need to build about 16 computers that will be used as a parallel computing cluster. I want to keep each box around $400 but also get the most bang for the buck. My draft config uses a Phenom 9950 (~$160 at Newegg), but I am considering moving to a Q6600 (~$190 at Newegg). Any thoughts on these processors (or others) and a decent < $80 mobo for them?

If I go 9950 then I will probably get a Foxconn A7GM-S 780G mobo (~$70) and G.Skill (2 x 2GB) DDR2 1066 RAM.

- Systems will run XP Pro 64-bit and Matlab Distributed Computing Server. Eventually I may move to Linux to reduce overhead.
- Systems will be running 24/7 for days at a time.
- I will be using all 4 cores simultaneously. The matlab parallel toolbox treats each core independently.
- Primary concern is overall speed of the entire parallel job (processing + other overhead). I am doing a bunch of single-precision calculations and looping. There are also a lot of memory reads. I am still trying to figure out all the bottlenecks.
- Power and heat are a bit of a concern since there will be 16 of these in the same room.
- I will probably not overclock since these guys will be running all the time.
- I tried GPU computing already with an Nvidia GTX 280, but CUDA gives me a headache; I'd rather build PCs.
- I don't have great benchmarks for my job, but a Phenom 9950 core seems to run about 20% faster than a Xeon E5310 core for my application.
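To make the setup concrete: the Matlab toolbox handing each core its own independent job is classic embarrassingly parallel work. Here is a minimal sketch in Python (not Matlab, and the fitness function is just a placeholder) of what "each core treated independently" means: workers evaluate their own packets with no communication between them.

```python
from multiprocessing import Pool

def fitness(candidate):
    # Placeholder workload: serial looping and accumulation, standing in
    # for the real single-precision Matlab job.
    total = 0.0
    for x in candidate:
        total += x * x
    return total

def evaluate_population(population, workers=4):
    # Each core gets independent work packets; no inter-worker communication,
    # which is why per-core throughput matters more than shared-cache behavior.
    with Pool(processes=workers) as pool:
        return pool.map(fitness, population)

if __name__ == "__main__":
    population = [[float(i), float(i + 1)] for i in range(8)]
    print(evaluate_population(population))
```

Nothing here depends on which CPU runs it; the point is only that the per-core jobs never touch each other's data.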

On a side note, Dell is running a sale on their T605 servers: dual Opteron 2350s for around $800 (no RAM). Any thoughts on going this route for a slower processor but the benefit of 8 cores/box, prebuilt with ECC RAM?

Thanks,
Mike


January 5, 2009 6:25:02 PM

Just making sure I'm on the same page...

16 computers = 16 boxes = 64 threads (or 128 if you go with dual socket boxes)

Now, assuming the above:

You need to find out what your memory requirements are.

If you are using a high amount of memory bandwidth, then it's probably better to go AMD.

*BUT*

If you are using a high amount of bandwidth, you need to look VERY closely at what switches you are going to use between the boxes. It would almost certainly be the case that your jobs would be limited by the switches, rather than what is within the boxes.




At such high CPU counts, you need to make sure of your scaling before making the decision on the T605s - going from 64 to 128 could well result in a slow-down rather than a speed-up. Try to find some equivalent codes and compare - I know it's your own Matlab code and that makes it difficult to find comparable stuff - but better searching than shooting blind!
January 5, 2009 6:44:13 PM

Tough to say, but here is one benchmark I found.
http://techreport.com/articles.x/13633/14

The Intel seems to have a small or large lead depending on the test.
Now, if you ran the chips at the same speed, the difference would be quite large.

(Note: You could OC the Q6600, stay within the power envelope of the Phenom 9950, and have a large performance gain; or you could run at stock and have large power savings vs. the Phenom.)

(Note 2: If you will not be OCing the CPU, I would recommend dropping the CPU multiplier and increasing the FSB so the RAM is running faster than 533 MHz. This should greatly enhance the memory performance.)
January 5, 2009 6:52:46 PM

Thanks for the reply.

- Yup, 16 computers = 16 boxes = 64 cores = 64 'threads' or 'jobs' in Matlab-ese

- I will be using an unmanaged 24 port gigabit switch. Seems like they run about $170 on Newegg

- We have been testing the various bottlenecks. We played with various job sizes, and it seems the network speed is not a big issue for us. Actually, the smaller the job size the better (runtime about 10 seconds). Since the different processors finish at slightly different times depending on their task, we gain more by sending smaller 'packets' and reducing processor idle time. This seems to suggest that the network isn't adding as much overhead as we thought. ... But we are still testing.

- Still trying to find some equivalent codes. The parallel world is new to me. The beowulf.org guys may have some thoughts.
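The smaller-packets-beat-bigger-packets effect described above is easy to reproduce with a greedy dispatch simulation (the task times below are made up for illustration, not measured on the cluster): with uneven large packets, fast workers sit idle waiting for the slowest one, while small packets keep everyone busy.

```python
import heapq

def makespan(task_times, num_workers):
    # Greedy dispatch: each worker that finishes immediately takes the next
    # task, mimicking a control server handing out packets on demand.
    workers = [0.0] * num_workers
    heapq.heapify(workers)
    for t in task_times:
        finish = heapq.heappop(workers)   # worker that frees up earliest
        heapq.heappush(workers, finish + t)
    return max(workers)                   # time when the last worker finishes

# Same 180 s of total work, packaged two ways, on 4 workers:
big = [40.0, 40.0, 40.0, 60.0]            # one uneven packet per worker
small = [10.0] * 12 + [5.0] * 12          # many small packets
print(makespan(big, 4), makespan(small, 4))
```

With the big packets the makespan is 60 s (three workers idle while one grinds through the 60 s job); with the small packets it drops to the ideal 45 s.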

Thanks!
January 5, 2009 6:59:30 PM

Thanks zenmaster.

Could I OC a Q6600 with stock cooling that was running 24/7? I guess I could spring for some better cooling if the performance increase was substantial.

I was trying to fit this into a microATX box (since there will be 16 in my office) but perhaps I need something a little more roomy.
January 5, 2009 7:17:47 PM

I'm not sure how much you could OC on stock cooling; I always buy 3rd party.

I actually run my Q6600 at stock speeds, but I have dropped the voltage well under the default because it runs 24/7 and is stored in a small, nearly unventilated cabinet.


If the room will be small, you may want to wait until you find a sale on a 45 nm quad. Fry's does have sales with prices under $200 quite often.

The Phenom 9950 uses quite a bit more power than the Q6600, which uses quite a bit more than the Q9300/Q9400. However, you could only afford the Q9xxx series if you found a good sale.

I suspect you could go from about 2.4 to 3.0 GHz and still stay within the Phenom's power numbers, but I will admit I'm pulling that a little bit out of you know where. The key may be to OC as far as you can without increasing the CPU voltage (don't leave it on Auto). You likely want to find a reasonable number that works on all of your systems.

But I suspect heat could be an issue depending on room size, so you may want to stick with stock coolers. You may not want to be throwing off the amount of heat that would necessitate aftermarket coolers. I would be quite surprised if you could not tweak up the processors slightly and still keep the voltage and/or power down.

Example - the Q6700 was also rated at 95 W and ran at 2.66 GHz.
I suspect you could easily tweak the Q6600 to run within that power envelope.


January 5, 2009 7:42:36 PM

tradingguy said:
Hi All,

If I go 9950 then I will probably get Foxconn A7GM-S 780G mobo (~70) and G Skill (2 x 2GB) DDR2 1066 RAM.

- Systems will run XP Pro 64-bit and Matlab Distributed Computing Server. Eventually I may move to Linux to reduce overhead.


Floating point on a 64-bit OS? Go AMD.

tradingguy said:

- Systems will be running 24/7 for days at a time.


Don't overclock. Spending $30 or $40 (per cooler) multiplied by 16 might not give you the edge you need, and that $640 of coolers (total) might instead give you room for 2 more boxes. By the way, with so many boxes I would think about clustering.

tradingguy said:

- I will be using all 4 cores simultaneously. The matlab parallel toolbox treats each core independently.


If the threads are not separated, I read somewhere that Intel shuts down the prefetchers between cores to save bandwidth. I can't be sure now. If so, AMD wins another bout here with its native quad-core design.

tradingguy said:

- Primary concern is overall speed of the entire parallel job (processing + other overhead) . I am doing a bunch single precision calcuations and looping. There are also a lot of memory reads. I am still trying to figure out all the bottlenecks.


Not integer calculations or vector work, so I'll go AMD once more. Loads of memory reads: go AMD also. I would suggest a 9850 or lower.

tradingguy said:

- Power and heat are a bit of a concern since there will be 16 of these in the same room.


Yup. I advise you to study the CPU choice well. I hardly consider the 9950 a good choice for the work you're doing; even the 9850 might be a bit much. I have a bit of experience clustering cheap boxes for various purposes. 16 boxes draw a lot of heat and a lot of power.

I would consider the 9850 BE, underclock it a bit, and undervolt it, until you have the performance/watt you want. And in the winter you can always overclock them. Keep the HT link untouched, though. If I'm not mistaken, it is the first AMD CPU with HT at 2 GHz; below that, they all run at 1.8 GHz or less. With the amount of small transfers involved, this will be very important.

tradingguy said:

- I will probably not overclock since these guys will be running all the time.


Good. There would be a lot of noise and heat otherwise.

tradingguy said:

- I tried GPU computing already with a nvidia GTX 280 but CUDA gives me a headeache, I rather build PCs.


Baby steps. OpenCL and SIMD programming are still in their infancy. In your case, I would also rather wait a few years while the development tools get ready and tested.

tradingguy said:

- I don't have great benchmarks for my job but a Phenom 9950 core seems to run about 20% faster than a Xenon E5310 core for my application.


Architectural advantages in this case, I guess.

tradingguy said:

On a side note, Dell is running a sale on their T605 servers. Dual Operton 2350's for around $800(no ram). Any thoughts on going this route for a slower processor but the benefit 8 cores/box, prebuilt with ECC ram?


If you don't have network problems (like you said in an earlier post), I would advise saving the money and staying out of MP servers. People use blade servers, for example, because rack space in a datacenter is expensive; if you have the space, don't settle for less. For your case I doubt an Opteron would improve anything. If I recall correctly, the big difference was virtualization microcode.

tradingguy said:

Thanks,
Mike


You're welcome,
Radnor.
January 5, 2009 8:23:52 PM

1) Single-precision is 32-bit, and so long as the toolbox supports SSE, the Intels will be faster - floating point or integer.
2) If the toolbox assigns each core its own worker packet, that means threads are very much separated with their own memory spaces.
3) If a Phenom 9950 is found to be 20% faster than Xeon E5310 and the predominant reason is CPU core scaling, then a K10 core at 2.6 GHz approximates a Kentsfield core at 2.0 GHz for this task.
4) The 9950 and 9850 do run much hotter than a Q6600 and may require more expensive boards and power supplies to remain stable.

45-nm is an option, depending on price. The argument would be, since performance per watt is drastically improved, you can overclock relative to 65-nm while keeping to the same power envelope. The main unknown is how much of an effect reduced cache has on this workload. There are some Yorkfields with 4M instead of 12M cache, priced around the Q6600.
January 5, 2009 8:50:23 PM

WR said:
1) Single-precision is 32-bit, and so long as the toolbox supports SSE, the Intels will be faster - floating point or integer.
2) If the toolbox assigns each core its own worker packet, that means threads are very much separated with their own memory spaces.
3) If a Phenom 9950 is found to be 20% faster than Xeon E5310 and the predominant reason is CPU core scaling, then a K10 core at 2.6 GHz approximates a Kentsfield core at 2.0 GHz for this task.
4) The 9950 and 9850 do run much hotter than a Q6600 and may require more expensive boards and power supplies to remain stable.

45-nm is an option, depending on price. The argument would be, since performance per watt is drastically improved, you can overclock relative to 65-nm while keeping to the same power envelope. The main unknown is how much of an effect reduced cache has on this workload. There are some Yorkfields with 4M instead of 12M cache, priced around the Q6600.


I advise you to start reading.

Link: http://www.insidehw.com/Editorials/Columns/AMD-K10-Arch...

For the work he is doing a K10 core is better.
January 5, 2009 8:58:58 PM

Thanks for all the great replies, I appreciate it.

A quick PSU question (I know nothing about PSUs). According to the PSU calculator at http://extreme.outervision.com/psucalculatorlite.jsp , my config only needs about 250 watts. Perhaps I will go with 300 or 350 watts for a little margin of error. Does that sound about right? I am trying to keep power draw as low as possible.
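For rough sanity-checking, the arithmetic here is just DC load divided by PSU efficiency to get wall draw, times the number of boxes for the whole-room figure. The 120 W sustained load and 80% efficiency below are assumptions for illustration, not measurements:

```python
def wall_draw(dc_load_watts, efficiency):
    # AC power pulled at the wall for a given DC load and PSU efficiency.
    return dc_load_watts / efficiency

def cluster_draw(boxes, dc_load_watts, efficiency=0.8):
    # Total wall power for the whole cluster - what the office circuit
    # and the room's cooling actually see.
    return boxes * wall_draw(dc_load_watts, efficiency)

print(round(wall_draw(120, 0.8)))     # one box: 120 W DC -> 150 W at the wall
print(round(cluster_draw(16, 120)))   # 16 boxes -> 2400 W total
```

Two kilowatts and change of continuous draw in one room is also a useful number for judging how much the heat matters.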
January 5, 2009 9:22:47 PM

Radnor,

Your link shows improvements of the K10 over K8 when it comes to SSE, and we see some of that through benchmarks. The problem is Conroe is still faster because K8 was so far behind. Did you not see that we already have a ballpark benchmark for the workload discussed?



As for the PSU, 300W is plenty; 250 should be enough as well. You'll likely be drawing 140-160W at the wall for extended periods (stock Q6600, IGP). I'd read reviews just to make sure the PSU model is not unusually prone to failure, then focus on efficiency and price. This is different from the usual advice here because most PSU requests are to power $1000+ systems.
January 5, 2009 9:41:07 PM

It seems you are pretty set on using CPUs for the calculations and a quad-core per box. But as an alternative, how about using GPUs, such as 4870s, to run the workload? I'm not sure if Matlab is intended for GPUs, but I think I remember something about them being supported, kinda like F@H.

Since it sounds like you want to stick with CPUs, have you thought of going with AMD's quad-socket (quad-quad, 16 cores per box) solution? The new Shanghai processors are expensive, but they have low thermals and are very good at running 24/7 at full load. They also have some serious memory bandwidth. Probably out of your price league, though.
January 5, 2009 9:41:22 PM

WR said:
Radnor,

Your link shows improvements of the K10 over K8 when it comes to SSE, and we see some of that through benchmarks. The problem is Conroe is still faster because K8 was so far behind. Did you not see that we already have a ballpark benchmark for the workload discussed?


Quoting InsideHardware:

Quote:

Regardless of K10 decoder’s inability to decode 4-5 commands per clock, as Core2’s one can do under specific circumstances, this will not be a limiting factor in program execution, because the average execution rate of these commands is usually under 3. On average, the K10 will decode an x86 instruction in fewer MOPs than a Core2, which, together with its 32-byte input rate, makes its decoder quite efficient.


Quote:

A cache architecture conceived like this is very opportune for multithread and multitasking execution; separate L2 and L3 cache improve intercommunication and reduce the possibility of “cache thrashing”, which is the only serious issue with the Core2 architecture. “Cache thrashing” in divided L2 cache can create a problem with cache access. Execution of the first process can significantly impair execution of another one if this happens. For example, if you let the computer do two separate tasks, the first one could affect the performance of the second one, and the total time needed to complete the two operations usually increases for about 20-30%. This happens because each of the processes uses one of the cores which try to access the L2 cache at the same time.


Yes, and the ballpark estimate reveals that. Penryn, or Conroe for that matter, doesn't gain much due to the prefetchers being off between cores, for the reason I state above and the reason I state below. Quoting InsideHardware:

Quote:

Unlike Intel’s Core2Quad, the K10 is monolithic in design, which means that a full-blown quad-core processor is placed on a single piece of silicon. Intel, on the other hand, uses the MCM (Multi Chip Module) packaging, which allows them to place two dual-core processors within the same LGA casing and connect them by the Front Side Bus, i.e. the processor bus. This type of connecting dual-core processors into a single quad has a drawback in the limited communication between different cores, as all traffic is done over the chipset. In order for L2 cache data from the first dual-core processor to reach the L2 cache of the second one, it has to pass through to the chipset and back.



Between cores, the prefetchers and other types of branch prediction are shut down to save bandwidth, which is vital for his work.
I have some experience with the endeavors he's trying. Trust me on this one, homie: a K10 architecture beats the crap out of a Yorkfield/Penryn, unless he overclocks everything to hell and back. But you'd better take overclocking out of the equation, for the sake of stability and error generation, and for the electric bill's sake. It is 16 machines.

WR said:

As for the PSU, 300W is plenty; 250 should be enough as well. You'll likely be drawing 140-160W at the wall for extended periods (stock Q6600, IGP). I'd read reviews just to make sure the PSU model is not unusually prone to failure, then focus on efficiency and price. This is different from the usual advice here because most PSU requests are to power $1000+ systems.


Here I agree. Get a nice 320 W PSU with 80% or 85% efficiency, and something that doesn't dramatically explode.
January 5, 2009 10:18:30 PM

Again Radnor, you're misinterpreting the workload.

His 16 PCs will be connected over gigabit Ethernet. Any bandwidth issues manifested in "cache thrashing" and FSB choking would be severe issues going across Ethernet, which is in the realm of 100 to 1000x slower.

The Matlab distributed server gives out individual packets with output going straight back to the control server. There is no interdependency whatsoever visible between work packets on a worker machine, and the server is probably unaware of the node setup, unlike HPC programmers. Native quad here means nothing, and prefetching data from a neighbor thread means nothing, too.

Did you even read the ballpark benchmark? A 9950 (2.6 GHz Phenom) is about 20% faster than a Xeon E5310 (1.6 GHz Kentsfield). This is not HPC. If it were HPC, we wouldn't be using a $170 gigE switch, but rather HyperTransport/Infiniband to connect sockets and nodes, respectively.
January 5, 2009 10:25:15 PM

You might use 2P or 4P Opterons with dual gigabit Ethernet; the HyperTransport link will reduce the bottleneck. Fewer servers = less bottleneck, cost, and power consumption.
January 5, 2009 10:36:25 PM

WR said:
Again Radnor, you're misinterpreting the workload.
His 16 PCs will be connected over gigabit Ethernet. Any bandwidth issues manifested in "cache thrashing" and FSB choking would be severe issues going across Ethernet, which is in the realm of 100 to 1000x slower.


It all depends on what type of calculation he is doing. I've seen chokes on things like structural analysis and point-cloud calculations, both applied to civil engineering. Sorry if the names aren't correct; they are literal translations of the Portuguese/Spanish terms. I saw that happening on a €5000 dual Xeon (4c+4c) machine with SCSI RAID 0+1.

WR said:

The Matlab distributed server gives out individual packets with output going straight back to the control server. There is no interdependency whatsoever visible between work packets on a worker machine, and the server is probably unaware of the node setup, unlike HPC programmers. Native quad here means nothing, and prefetching data from a neighbor thread means nothing, too.


This was my main doubt; glad you cleared it up. If there is no interdependency/interchange between threads, OK then.

WR said:

Did you even read the ballpark benchmark? A 9950 (2.6 GHz Phenom) is about 20% faster than a Xeon E5310 (1.6 GHz Kentsfield). This is not HPC. If it were HPC, we wouldn't be using a $170 gigE switch, but rather HyperTransport/Infiniband to connect sockets and nodes, respectively.


Server chips aren't more expensive just because of the sticker and warranty. But yes, the result sounded a bit too low. I don't have time to see what's failing, but I stand corrected; I doubt it's the CPU's fault. Again I must stress that, in my professional experience, much depends on what he is trying to calculate. Even a 10 Mbit switch might be enough depending on the data being fed.

I might recommend a Penryn, but check prices first. TDP-wise it should do pretty much OK.
January 6, 2009 12:48:18 AM

Thanks again for the great feedback.

- I apologize if the ballpark 20% CPU difference is off. This is a genetic programming application, and population members vary a lot. It has been hard to get a good number, especially after the first generation.

- The cluster may be used for many things, but right now it will be used for quite a simple fitness function of a genetic programming problem: a lot of serial looping and addition of single-precision values.

- I am sure our parallel code has all sorts of bottlenecks and problems. We are working to resolve the issues now. I am hoping to build a good cluster setup that will benefit from our code improvements on this project and be able to handle other types of projects as well.

- I think the AMD box will be about 10%+ cheaper based on rough Newegg estimates. From a performance standpoint I guess it comes down to whether I will get 10%+ more performance out of Intel at similar power consumption. I know it is tough to tell from my vague description but any final words will help. Any setup will be a vast improvement over what we have but I would like to start as strongly as possible. This cluster could expand to a few dozen nodes.
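For readers unfamiliar with the workload shape described above, here is a toy version of a GP-style fitness evaluation plus selection step. The mean-squared-error fitness is purely illustrative (the real fitness function is application-specific, and runs in single-precision Matlab rather than Python floats):

```python
def fitness(candidate, samples):
    # Toy fitness: mean squared error of a candidate constant against samples.
    # Shape matches the described job: serial looping over memory reads,
    # accumulating floating-point sums.
    err = 0.0
    for s in samples:
        err += (candidate - s) ** 2
    return err / len(samples)

def best_of(population, samples):
    # Selection step of a trivial GP-style loop: keep the fittest candidate
    # (lowest error). In the cluster, the fitness calls are what get farmed
    # out to the worker cores.
    return min(population, key=lambda c: fitness(c, samples))

samples = [1.0, 2.0, 3.0]
print(best_of([0.5, 2.0, 4.0], samples))  # -> 2.0
```

Since each `fitness` call is independent, this is exactly the kind of loop the distributed server can scatter across nodes.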

By the way, is a microATX case with one 80 mm fan OK from an airflow standpoint, or should I go bigger or add more fans? Footprint makes a big difference with 16 of these things.

Thanks,
Mike

January 6, 2009 5:54:28 AM

^ Yes, a rack would be good for you. Very hard to find mATX server boards, though.
January 6, 2009 7:26:49 AM

tradingguy said:
Thanks again for the great feedback.

- I apologize if the ballpark 20% cpu difference is off. This is a genetic programming application and population members vary a lot. It has been hard to get a good number especially after the first generation.


Do like we enthusiasts do: extended testing in every environment possible.

tradingguy said:

- The cluster may be used for many things but right now it will be used for a quite a simple fitness function of a genetic programming problem. A lot of serial looping and addition of single precisions.


You can use those functions in testing. But I advise you to try calculations you did in the past, and to cover as many fields and possible applications as you can. Read a benchmark of a gaming system and you'll see we do extended tests to determine what is best, only to conclude that something is best in some scenarios.

tradingguy said:

- I am sure our parallel code has all sorts of bottlenecks and problems. We are working to resolve the issues now. I am hoping to build a good cluster setup that will benefit from our code improvements on this project and be able to handle other types of projects as well.


Great to hear it.

tradingguy said:

- I think the AMD box will be about 10%+ cheaper based on rough Newegg estimates. From a performance standpoint I guess it comes down to whether I will get 10%+ more performance out of Intel at similar power consumption. I know it is tough to tell from my vague description but any final words will help. Any setup will be a vast improvement over what we have but I would like to start as strongly as possible. This cluster could expand to a few dozen nodes.

By the way, is a microATX case with 1 80mm fan ok from an airflow standpoint or should I go bigger or more fans. Footprint makes a big difference with 16 of these things.


One idea for benchmarking what you will do: knowing that you already have a 9950 available, try to get a Q6600 with a similar setup (HDD/RAM). Get a watt meter you can plug into the wall and start testing both boxes extensively with all the different types of calculations/functions you run. If the benchmark reveals Q6600 = 9950, you've got a bottleneck somewhere else, or in the code. I strongly advise you to test extensively and do your homework if you don't want bad surprises (performance-wise) down the road.
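The watt-meter comparison suggested above boils down to performance per watt: jobs per hour divided by sustained wall draw. A sketch with made-up readings (none of these numbers are real measurements of either chip):

```python
def perf_per_watt(jobs_completed, hours, avg_wall_watts):
    # Throughput per watt: (jobs/hour) / sustained wall power.
    return (jobs_completed / hours) / avg_wall_watts

# Hypothetical readings from a wall meter on each test box:
phenom = perf_per_watt(1200, 10, 160)   # 120 jobs/h at 160 W
q6600 = perf_per_watt(1150, 10, 130)    # 115 jobs/h at 130 W
print(round(phenom, 3), round(q6600, 3))
```

With numbers like these, the slightly slower box would still win on efficiency, which is what matters most when 16 of them run 24/7 in one room.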

From an airflow standpoint, more important than fans is cable management :)

January 8, 2009 12:36:57 AM

The Kill-a-Watt power meter is great. I have one.