Itanium 2, Xeon 5XXX or Opteron 2XX?

Wareva

Distinguished
Oct 3, 2006
8
0
18,510
Hi

I'm using a microcluster of 2 HP workstations (P4 3 GHz, 2 GB RAM) to run scientific software (ANSYS, NASTRAN, Fluent, Starcd, etc.), mainly CFD (Computational Fluid Dynamics) and finite element method.
The problem is that my system is getting too old and too slow; what was once acceptable is nowadays so damn slow (sometimes it takes up to 4 days to produce converged results).
I need to change my machine, so I was thinking of a 2x dual-core system with at least 8 GB RAM (and it has to be very upgradable). The new Xeons seem to be competing quite nicely with the Opteron processors, but I'm afraid Intel will change its socket in the near future whilst AMD will keep Socket F for a long time. As for Itanium, I don't really know the CPU, but I think it's not x86 native, is it?

Summing up, I need a very upgradable 2x dual-core workstation, but I don't know which CPU to choose. The budget is not a big issue, so if I can get some opinions, I would be grateful.

Thx and be cool
 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
Itanium is about as fast as an 80486SX for x86 software, forget about it.
A 2P Xeon 5xxx is a better choice than a 2P Opteron. Maybe you should wait for the quad-core Clovertown if you really need the best performance from a 2P server.
 

locust

Distinguished
Apr 6, 2005
14
0
18,510
I've been working with Fluent for a couple of years (I have no experience with the other software packages), so my advice is solely for Fluent simulations:

-You'll get the best bang for your buck by buying desktop computers and putting them in a cluster.
I am currently working on a cluster of 26 Pentium 4 computers (3.2 GHz, 2 GB RAM). For the amount of money we put into that cluster we might have bought one system with 4 Opteron processors and 8 GB (1.5 years ago), while the P4 cluster gives us more calculation power. I still have benchmarks lying around for:

-Pentium 4 computers
-dual Athlon machine
-dual Xeon (old ones, with I think single-channel DDR333 and 2 GHz processors)
-dual Xeon (3.4 GHz with dual-channel DDR)
-AMD 64
-Pentium D
-AMD X2

My conclusions were: the Pentiums were faster, I think because of their higher clock speed. The Xeons were crap because they were limited by their memory. The AMD machines all had lower time-steps/hour, and this seemed to be caused by their lower clock speed.
2 Pentium 4 computers in parallel lose around 20% efficiency per computer, so you get 1.6 times the calculation power.
Pentium D machines lose 10% efficiency when using 2 cores in parallel.
4 P4s = 2.5 times more time-steps/hour.
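
As a rough sketch of the arithmetic behind those numbers (the function names are made up for illustration, and I'm assuming "efficiency" means measured speedup divided by the number of machines or cores):

```python
def speedup(units, efficiency):
    """Effective speedup when `units` machines/cores each run at `efficiency` (0..1)."""
    return units * efficiency


def efficiency(units, measured_speedup):
    """Per-unit efficiency implied by a measured speedup."""
    return measured_speedup / units


# Figures quoted above (approximate benchmark results, not exact values).
p4_pair = speedup(2, 0.80)        # two P4s losing ~20% each
pentium_d = speedup(2, 0.90)      # one Pentium D losing ~10% per core
four_p4_eff = efficiency(4, 2.5)  # four P4s giving ~2.5x

print(f"2 x P4:      {p4_pair:.2f}x")
print(f"Pentium D:   {pentium_d:.2f}x")
print(f"4 x P4 eff.: {four_p4_eff:.1%}")
```

By these figures each P4 in a 4-machine run delivers only around 62% of what it does alone, so adding machines gives diminishing returns.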


If you contact me tomorrow I can look into this data again. I don't have data on Itaniums, but they are awfully expensive and I don't think I'd advise you to buy them. New Xeons or new Opterons... hmmm, what's your budget?

Some other remarks:
-We usually work on smaller domains and, if possible, in 2D, so 1 processor can calculate the job.
-When going to 4 and 6 processors I had stability issues, leading to Fluent freezing. To work around this I just let it save data and log in once a day to restart if needed. Usually I had a frozen case every 5-6 days, so it isn't that difficult.
-I work in a Windows environment because I'm not adequately skilled in Unix/Linux. Another OS might improve stability.
-Usually time-dependent simulations with around 1 GB RAM used per processor, species + reaction. Our velocity fields are usually solved in a couple of hours, while the species/reaction takes a couple of days, and sometimes even weeks.
-As I said earlier, I have no experience with other CFD packages; are they as easy to run in parallel as Fluent?

What is your budget? How many licences do you have? Parallel processing only?
I'm not sure, but I don't think a new rig will go THAT much faster; maybe if you buy another 2 P4s you'd get the same results? Or with your budget you buy 6 P4s, allowing you to have 2 clusters of 4 CPUs. Your simulation will probably go faster (3 days instead of 4?) AND you can run 2 simulations next to each other. A workstation with 2 dual cores might work faster (2.5 days instead of 4?) and will offer more stability, less power consumption, less space,... but I think in the end you'll get less calculated.

(edit) Another option: Pentium Ds. One of their cores offers the same performance as a P4, and they are cheaper. If you combine their 2 cores you'll lose around 10% of their calculation power (compared to 20% for P4s). They are cheap and you can equip every Pentium D with 4 GB RAM.
Upgrade? Just buy some new computers for your cluster.
 

Dante_Jose_Cuervo

Distinguished
May 9, 2006
867
0
18,990
If I were you I'd go dual 51XX; they're quite powerful for the money, and the true quad-core (45nm) upgrades aren't going to be available until Q3 2007 at the earliest. My suggestion: go with a dual 5140 or higher machine. You won't be disappointed.
 

aladar

Distinguished
Mar 29, 2005
31
0
18,530
As I don't know any of the applications I'm going to give a broad answer.

Itanic... oops, Itanium: terribly expensive, and awful with x86 code (it runs in emulation mode). Excels with native code (but support isn't that good).

Xeon series 5000: Pentium 4 derived. Not good. Being EOL'ed.

Xeon series 5100: (finally) a very good processor from Intel. All are dual core. Suffers from being a new "architecture" from Intel (e.g. the RAID bug: with RAID, CPU usage goes to 30-50%). They excel at integer calculations (usually very good at web serving, database and similar workloads). With quad-core I believe we'll start seeing the problems of the FSB again (now they have DIB, dual independent bus, which relieves the FSB problems), but being an FSB-dependent arch the memory bandwidth is always the same.

Opteron series 200: has single and dual core. Has been king for 3 years now. Very stable now, and with broad support (platforms/chipsets). They excel at floating-point calculations (usually very good at science calculations, encryption and similar workloads). With "direct connect" (NUMA platform) it hasn't any FSB thrashing problems, and being NUMA, memory bandwidth scales almost linearly. It's being EOL'ed.

Opteron series 2000: all are dual core. Continues the stable 200 series, and with the same arch has the same advantages.

Servers with 2 sockets aren't the same as a uniprocessor PC, even though there are some similarities.

If I were you, I would look at Xeon 5100 or Opteron 2000.

Depending on what you really need, if I were you I'd look at a single-CPU machine (they are usually cheaper).

As for dual sockets, right now I'm buying Opteron 200s (I take 4 months to validate a platform), and from what I've seen of Intel's 5100 and AMD's 2000 we'll continue to buy AMD (the RAID bug is really a very bad thing for me), but our 2S servers are for web serving, and we only serve HTTPS (SSL uses a lot more CPU than generating the page).
 

Dante_Jose_Cuervo

Distinguished
May 9, 2006
867
0
18,990
There's nothing wrong with them, but a true quad-core would get better performance from the lack of FSB load (even though it's not slowing down Kentsfield!). That, and it lowers power consumption due to the smaller process.
 

aladar

Distinguished
Mar 29, 2005
31
0
18,530
and the true quad-core (45nm)
What's wrong with the untrue quad-cores (65nm)?

A lot more FSB thrashing (all inter-core communication goes through the FSB to the chipset).
 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
and the true quad-core (45nm)
What's wrong with the untrue quad-cores (65nm)?

A lot more FSB thrashing (all inter-core communication goes through the FSB to the chipset).
1. What is FSB thrashing?
2. Have you seen how C2Q scales in performance compared to C2D for SMT software?
3. Do you know how the 2 "glued" C2Ds on the C2Q communicate?
 

Dante_Jose_Cuervo

Distinguished
May 9, 2006
867
0
18,990
To answer what you asked aladar:

FSB thrashing is the unnecessary communication across the FSB between cores or dies.

I think we have all seen how well it scales, but if the separate Conroes didn't have to communicate via the FSB the performance would increase a little (not as much as Presler to Conroe).

C2D cores aren't "glued"; they're on the same die, so the cores communicate via the shared L2 cache. On the C2Q you have 2 Conroes that are in the same package but separated into their own dies. Thus you have to use the FSB if one Conroe wants to communicate with the other. Now if one core wants to communicate with the core adjacent to it (on the same die), it just uses the L2 cache.
 

locust

Distinguished
Apr 6, 2005
14
0
18,510
Something else: when using a cluster in parallel you shouldn't see much difference from using Gb switches and network connections. Our cluster runs on 10/100 Mb lines/switches, and when monitoring traffic I hardly use 10-20% of the max bandwidth.
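
For a sense of scale, here is that utilisation converted into actual throughput. This is just arithmetic on the figures above; the helper name is hypothetical and I'm assuming the 100 Mbit/s side of the 10/100 links is what is being measured.

```python
def throughput_mbit(link_mbit, utilisation_percent):
    """Traffic in Mbit/s for a link running at the given utilisation."""
    return link_mbit * utilisation_percent / 100


LINK_MBIT = 100  # assumed: Fast Ethernet side of the 10/100 switches

for percent in (10, 20):
    mbit = throughput_mbit(LINK_MBIT, percent)
    print(f"{percent}% of {LINK_MBIT} Mbit/s = {mbit:.0f} Mbit/s = {mbit / 8:.2f} MB/s")
```

That is only a couple of megabytes per second, well below what even a 100 Mbit link can carry.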
 

aladar

Distinguished
Mar 29, 2005
31
0
18,530
1. What is FSB thrashing?
Complete usage of the FSB by cache coherency and inter-core communication (everything except memory<->CPU traffic).
2. Have you seen how C2Q scales in performance compared to C2D for SMT software?
I haven't seen C2Q. However, I've been looking at the Xeon 5100, and my code is massively multithreaded (sometimes 1000+ threads), and with these workloads the inter-core communication is usually very high (Xeon MP vs Xeon DP sometimes had lower performance, due to FSB thrashing).
3. Do you know how the 2 "glued" C2Ds on the C2Q communicate?
Intel says it is through the FSB. Since that is worse than direct communication, I believe Intel (remember Intel dual-core performance with 2+ sockets, before the Xeon 5100 or C2D?).
 

locust

Distinguished
Apr 6, 2005
14
0
18,510
I don't think FSB thrashing is an issue for Fluent (or, I think, for any other CFD software package); if it were, I'd see a much bigger performance difference between:
2 Pentium 4s in parallel (communicating over the network)
1 Pentium D using both its cores (using its FSB)

Fluent (and other CFD software should do the same) divides its calculation domain into 2 equal zones; each processor calculates its dedicated zone independently of the other (and uses its own memory to store it), and the only data that needs to go from one processor to the other is the interface between those 2 zones, which only makes up about 1% of the total.
The FSB is limiting, however, because the complete case has to be stored in memory: if the bandwidth to memory is too low (as is the case with older Xeons), your calculation will slow down because your processor can't read/write fast enough. It will not limit inter-processor data traffic.
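
A toy illustration of why the exchanged data is so small: on a structured box mesh cut in half, only one layer of cells touches the cut. This assumes a simple nx*ny*nz grid split along x, which is my simplification; Fluent's real partitioner works on unstructured meshes and will not give exactly these numbers.

```python
def interface_fraction(nx, ny, nz):
    """Fraction of all cells in one layer on the cut plane of a box split in two along x."""
    total_cells = nx * ny * nz
    interface_cells = ny * nz  # one layer of cells facing the other partition
    return interface_cells / total_cells


for n in (50, 100, 200):
    print(f"{n}^3 cells: {interface_fraction(n, n, n):.2%} on the interface")
```

A 100x100x100 case comes out at 1%, in line with the figure above, and bigger cases exchange proportionally less.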

Not sure if this is clear.


Fluent (not sure about the other packages Wareva mentioned) is single-threaded, or double-threaded if you start it in parallel processing...
I think this is also one of the reasons why Pentium 4s performed better than AMD 64s, despite the AMD 64 being a far superior processor.
 

aladar

Distinguished
Mar 29, 2005
31
0
18,530
Something else: when using a cluster in parallel you shouldn't see much difference from using Gb switches and network connections. Our cluster runs on 10/100 Mb lines/switches, and when monitoring traffic I hardly use 10-20% of the max bandwidth.
Clustering was designed for a huge number of nodes (at the uni I went to there was a cluster with 5k nodes), so the comms have to be highly optimized. Of course, more bandwidth and lower latency are always better. But clusters tend to send as little as possible.
 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
1. What is FSB thrashing?
Complete usage of the FSB by cache coherency and inter-core communication (everything except memory<->CPU traffic).
2. Have you seen how C2Q scales in performance compared to C2D for SMT software?
I haven't seen C2Q. However, I've been looking at the Xeon 5100, and my code is massively multithreaded (sometimes 1000+ threads), and with these workloads the inter-core communication is usually very high (Xeon MP vs Xeon DP sometimes had lower performance, due to FSB thrashing).
3. Do you know how the 2 "glued" C2Ds on the C2Q communicate?
Intel says it is through the FSB. Since that is worse than direct communication, I believe Intel (remember Intel dual-core performance with 2+ sockets, before the Xeon 5100 or C2D?).
If the software that Wareva uses scales well enough on 2P Netburst and compares well to 2P Opteron, do you think that the FSB will bottleneck the performance of C2Q? I don't think so.
I think that the native quad-core will bring performance improvements not because there will be no FSB thrashing, but because there will be "glued" octo-cores.
IMO, if he needs performance now, Clovertown is the best 2P choice for his purposes, with a reasonable price/performance ratio. If he needs more performance next year, he can replace the "glued" quad-cores with "glued" octo-cores.
 

locust

Distinguished
Apr 6, 2005
14
0
18,510
I work with Fluent and until now I have limited parallel processing to 6 nodes (I could go to 8 nodes), because of stability AND because of the limited number of licences. Wareva will not use 5000 nodes for his CFD calculations.
Fluent charges around 1000 Euro per licence each year. For every processor you need 1 licence (or you can use a parallel licence allowing you to use 8 cores on one single case, hence my 8-computer limitation).

I don't think he will need Gb connectivity for his cluster.
 

accord99

Distinguished
Jan 31, 2004
325
0
18,780
Xeon series 5100: (finally) a very good processor from Intel. All are dual core. Suffers from being a new "architecture" from Intel (e.g. the RAID bug: with RAID, CPU usage goes to 30-50%). They excel at integer calculations (usually very good at web serving, database and similar workloads). With quad-core I believe we'll start seeing the problems of the FSB again (now they have DIB, dual independent bus, which relieves the FSB problems), but being an FSB-dependent arch the memory bandwidth is always the same.
Has anybody, other than one article on The Inquirer, shown that there is such a RAID5 problem?

Opteron series 200: has single and dual core. Has been king for 3 years now. Very stable now, and with broad support (platforms/chipsets). They excel at floating-point calculations (usually very good at science calculations, encryption and similar workloads). With "direct connect" (NUMA platform) it hasn't any FSB thrashing problems, and being NUMA, memory bandwidth scales almost linearly. It's being EOL'ed.
It doesn't have the FSB thrashing problem, but it does have the HyperTransport thrashing problem, which starts showing up at 4S, and the problem increases quadratically as you increase sockets.
 

thelvyn

Distinguished
Jul 16, 2006
222
0
18,690
Who actually uses motherboard RAID on a server? I certainly would not, unless it is a very lightly loaded server.
Buy a real RAID card.

As far as HT thrashing goes, anything up to 8 is fine; after 8 it is a problem.

And this doesn't apply to a cluster anyway, which is what they have been talking about.

The people who only suggest Intel for a server tend to be fanboys.

Anyway, if you want 1 single system or if you want a cluster it doesn't really matter that much.

An Opteron or Woodcrest system will do nicely either way.
However, if you want 2 or more sockets you should definitely go with the AMD Opterons.

If you do go the Xeon route, be prepared to spend quite a bit more, and do make sure you get Woodcrest and not a Netburst POS.
 

accord99

Distinguished
Jan 31, 2004
325
0
18,780
As far as HT thrashing goes, anything up to 8 is fine; after 8 it is a problem.
You can't go past 8 sockets gluelessly, period. And it's not fine at 4S, since big cache and better logic already enable Xeon MP to outscale Opteron in important server tasks.

An Opteron or Woodcrest system will do nicely either way.
However, if you want 2 or more sockets you should definitely go with the AMD Opterons.

If you do go the Xeon route, be prepared to spend quite a bit more, and do make sure you get Woodcrest and not a Netburst POS.
Tulsa-based Xeon MP systems are the fastest x86 servers for commercial applications. And unlike Opteron, they scale to 32S.
 

Wareva

Distinguished
Oct 3, 2006
8
0
18,510
1. My budget is around 7500€ (9000$ US)

2. Clustering is not really an option (we already have a 50-computer cluster; room is an issue here and so is licensing), although I know it's the ideal solution for Starcd and Fluent.

3. This has to be a single machine, Windows based. It's for those applications that take too long on a personal computer but aren't valid candidates (yet) for the cluster (booking the cluster is a real problem).

4. From what I read, Xeon 51XX or dual-core Opterons are the answer. The software needs are, in order of importance: CPU/RAM bandwidth, CPU floating-point capability, amount/speed of RAM.

5. I need this machine to be up and running by late October, and I need to be able to upgrade it (CPUs and RAM) over a period of 3-4 years.

Thank you all for replying
 

aladar

Distinguished
Mar 29, 2005
31
0
18,530
You can't go past 8 sockets gluelessly, period. And it's not fine at 4S, since big cache and better logic already enable Xeon MP to outscale Opteron in important server tasks.

This sounds like typical fanboyism...

Tulsa-based Xeon MP systems are the fastest x86 servers for commercial applications. And unlike Opteron, they scale to 32S.

At 4S, Xeon can get to the heels of Opteron thanks to a large L3 cache (up to 16MB) and QIB (quad independent bus), and is really much more expensive.

For 4S or 8S, only Opteron is good.

I won't even mention temps/power usage.

1S or 2S depends on your typical workload (for int Intel is better, for fp AMD is better).

4S+ AMD, always!
 

aladar

Distinguished
Mar 29, 2005
31
0
18,530
1. My budget is around 7500€ (9000$ US)

2. Clustering is not really an option (we already have a 50-computer cluster; room is an issue here and so is licensing), although I know it's the ideal solution for Starcd and Fluent.

3. This has to be a single machine, Windows based. It's for those applications that take too long on a personal computer but aren't valid candidates (yet) for the cluster (booking the cluster is a real problem).

4. From what I read, Xeon 51XX or dual-core Opterons are the answer. The software needs are, in order of importance: CPU/RAM bandwidth, CPU floating-point capability, amount/speed of RAM.

5. I need this machine to be up and running by late October, and I need to be able to upgrade it (CPUs and RAM) over a period of 3-4 years.

Thank you all for replying

(contains references to brands/products)

What you describe is possible. And for the price you want, I'd suggest you look at Sun's X2200 M2 or X4200 M2, depending on what exactly you want.

PS: I have no interest in Sun, I'm just a happy customer. This is just my personal opinion.
 

aladar

Distinguished
Mar 29, 2005
31
0
18,530
Who actually uses motherboard RAID on a server? I certainly would not, unless it is a very lightly loaded server.
Buy a real RAID card.
I use Linux and Linux software RAID, especially after some problems with a RAID controller (a complete hardware solution). Long story short, even with an identical controller (different firmware) I wasn't able to recover the data (lost almost 1 hour because of backups - this was an online shop).

Since switching to Linux software RAID I haven't had any problems (with recovery, that is). The backup controller is any PC/server that supports the hard drive.

PS: I always use RAID 1, even with the controller (data security is really important).
 

drifter_888

Distinguished
Jul 7, 2006
126
0
18,680
I was reading this thread and I was just curious: how do you utilize or access the processing power of different machines? I know they use this for several industrial programs instead of using supercomputers, but I never heard of anyone doing it on a small scale. How is this done? I assume it can't be done with ordinary applications, or can it?
 

chenBrazil

Distinguished
Mar 17, 2006
136
0
18,680
As you are dealing with Nastran and Ansys, both Xeon 5100 (and up) and Opteron 2000 will be great choices.
The Opteron 200 series would also be a great choice if you can find a nice deal on them for their older socket...
You should consider the total system cost when making your plans.

Xeon will require more expensive memory and has more horsepower in general, but may also have stability issues, which an analysis with thousands of nodes in Ansys will not tolerate...
Opterons will work with cheaper memory and the system should be more stable, especially with the 200 series... performance will be lower... but you must check whether you can live with less horsepower by closing the other gaps in your system (Quadro cards, more memory, SCSI HD, etc.).
Maybe you can run your system with a SATA II HD, convert a 6800 Ultra VGA into a Quadro... so you can have a powerful system for reasonable money... of course depending on your company's demands...
Here we use our PCs for Windows-based Nastran and Catia analysis, and we are starting to use Ansys also. Our tests with the Opteron 165 helped us a lot in our experiments with moving from Unix servers and Silicon Graphics workstations to Windows-based workstations... so maybe you can do the same... (that's why my signature has so many 6800 Ultras, in order to save some money on Quadro cards)...
Of course... with a higher budget... you will be able to assemble systems like Apple did with Xeon (which by the way is a good reference for your new system if you choose Xeon, as Dell has a workstation for Windows similar to Apple's)... as a Brazilian company... we had to improvise in order to get our projects done... but we did, and Opteron helped us a lot with this... so for a limited budget... I would go for a maxed-out system with a good deal on the Opteron 200 series... (hope to get a Xeon to test soon...)
We are starting to assemble 280 Opteron workstation with Tyan MB and 8 GB of memory... so we made our choice... good luck with yours...