
AMD Opteron Server - Incredibly SLOW! What gives?


I want to give forewarning that this post will be long. This problem has stumped us for weeks and we could seriously use some help! Thanks in advance for reading.

I work for an engineering company and we perform finite element simulations. For the simulations in question we are using a commercial code called ABAQUS/Standard. We recently needed a new machine with a very large RAM capacity so that our analyses would fit in-core, and therefore went with a server. We purchased this machine earlier this month, and the original setup looked like this:

- SuperMicro 2U Server 2022G-URF
- SuperMicro H8DGU-F Motherboard
- 2x AMD Opteron 6134 Processors @ 2.3GHz
- 64GB Kingston DDR3 1333MHz ECC Registered RAM
- 2x Intel 320 120GB SSD drives
- Redhat Enterprise Server v6.2
- Linux Kernel 2.6.32

Right out of the box, we were shocked at how slowly this machine performed. The machines we've purchased over the past few years have mostly been quad-core desktop units and tend to do quite well with our simulations. The last server we bought was about 4 years ago and has roughly the following specs:

- 2x Intel Xeon X5355 @ 2.6GHz (no Turbo Boost)
- 16GB DDR 667 MHz RAM
- Standard SATA spinning disk drives

Much to our surprise, the 4 year old server was performing the analysis out-of-core, on old crappy RAM and processors, reading and writing to a spinning disk, and was doing it faster than our new machine! The new machine was running the analysis in-core, with slightly slower processors (but not really once the AMD Turbo Core kicked in), better RAM, and using an SSD drive. The software we use is fairly read/write intensive so the faster read/write was a big plus.

Over the subsequent two weeks, here are some of the things we've tried in hopes of speeding up this server:

1) We exchanged the Opteron 6134 processors (2x 8 cores @ 2.3GHz) for the 6272 processors (2x 16 cores @ 2.1GHz) and later for the 6220 (2x 8 cores @ 3.0GHz); none of these changes helped. In fact, going from the 2.3GHz to the 3.0GHz processors did very little to speed up our analyses.
2) Updated to the latest BIOS
3) Tried an array of Linux kernels from 2.6.32 to 2.6.39 and even the latest stable 3.2.1
4) We've had better luck in the past with SUSE Linux so we switched from Redhat to Suse Enterprise Server 11
5) Played with the BIOS settings to death, messing with just about every parameter but focusing primarily on the power-saving options.
6) Modified the automatic CPU scaling mode and parameters from within the OS (it uses powernow-k8). We tried 'performance mode', which leaves all of the CPUs at 3.0GHz all the time. In its default mode, it varies the CPU speeds between 1.4 and 3.0GHz to save power. I should also note that while running the analysis we typically use only one or two of the 16 cores, so AMD Turbo Core kicks in and ramps us up to nearly 3.6GHz at times.
7) We wanted to eliminate the possibility of a hardware issue. We swapped out the motherboard, RAM and even power supplies (I'll explain why we swapped the power supplies later) and none of that helped.
8) We created a RAID 0 array on our two SSDs to rule out I/O as a potential bottleneck. We clocked this array at nearly 700MB/sec and installed the OS on it, and that did not help either.
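
For reference, item 6 can be double-checked from userspace. The following is a minimal sketch, assuming a Linux system that exposes the standard cpufreq sysfs files; the function names (read_governors, cores_not_in) are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every core that exposes cpufreq.

    Assumes the usual Linux layout: <root>/cpuN/cpufreq/scaling_governor.
    Returns an empty dict if the files are not present.
    """
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def cores_not_in(governors, wanted="performance"):
    """List the cores whose governor differs from the wanted one."""
    return sorted(cpu for cpu, gov in governors.items() if gov != wanted)


if __name__ == "__main__":
    current = read_governors()
    for cpu, gov in current.items():
        print(f"{cpu}: {gov}")
    print("Cores not in performance mode:", cores_not_in(current))
```

If the second list is empty while the benchmark is running, the governor setting is at least being applied to every core, which narrows things down a little.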


While playing with the BIOS settings I noticed something quite strange. There is an option to control the maximum fan speed. The default is a 'Balanced' mode. I ramped up the fan speed to 'High Performance' and then later 'Full Speed'. I have a little 90 second benchmark problem that I've used to monitor system performance through all of these changes. With the fans in Balanced mode, the benchmark takes ~92 seconds. If I change the fans to High Performance mode, the benchmark takes ~94 seconds. If I change the fans to Full Speed mode the benchmark takes ~95 seconds. For some reason, increasing fan speed caused the analysis benchmark to slow down. As a result, we decided to try swapping out the power supplies. Oddly enough, we did this and rebooted the machine and the benchmark time reduced by about 11 seconds. It actually made the benchmark go faster in the short-term, however we've been running the full-scale analyses on this machine as well as the older server over the past few days and they are running very near the same speed, so this did not fix the overall issue.
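
To put the fan-mode numbers in perspective, here is a small sketch that converts the approximate timings quoted above into percentage differences. The timings are the rounded values from the post; the helper name slowdown_percent is just for illustration.

```python
# Approximate benchmark wall times quoted in the post, in seconds.
fan_mode_seconds = {
    "Balanced": 92.0,
    "High Performance": 94.0,
    "Full Speed": 95.0,
}


def slowdown_percent(baseline, measured):
    """Percentage by which `measured` is slower than `baseline`."""
    return (measured - baseline) / baseline * 100.0


baseline = fan_mode_seconds["Balanced"]
for mode, seconds in fan_mode_seconds.items():
    delta = slowdown_percent(baseline, seconds)
    print(f"{mode:>16}: {seconds:5.1f} s ({delta:+.2f}% vs Balanced)")
```

The differences work out to roughly 2 to 3 percent, so it may be worth repeating each setting several times to see whether the effect is larger than ordinary run-to-run variation.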

Does anybody have any ideas what in the world may be going on with this machine? It doesn't seem possible to be remotely comparable with that older server. Since the I/O is so much quicker on our new server, we are inclined to think that this is a processor issue. Is it possible that the architecture of these AMD Opteron processors is literally THAT poor for our type of work? I would greatly appreciate any input or thoughts. We have invested a lot of money in this machine and would really like to see some return on that at some point. As it is now, this machine is almost useless for our needs!

Thanks in advance for the help.


Message edited by jwgolf on 01-30-2012 at 07:32:17 AM
Reply to jwgolf

I'm at work and just briefly looked over the program.

Here are my thoughts and questions.

Is there per-core licensing?
Check your reseller.

Can this be GPU accelerated?
If it can be, forget about CPUs and slap a bunch of GPUs in it.

Does the hardware pass their certification?
There could be hardware-specific extensions.

How many simultaneous threads can it support?
My 3D CAD is out of date and can only support 1 at the moment.

How is the cooling? Are the chipset and CPU receiving adequate cooling?
If not, the system will bog down as it tries to save itself.

They seem to mention something about Intel's compiler; if so, there may be Intel-specific extensions being used.

Reply to rapidtransit

rapidtransit wrote :

I'm at work and just briefly looked over the program.

Here are my thoughts and questions.

Is there per-core licensing?
Check your reseller

Yes there is, and it gets pretty expensive pretty fast as you go to multiple cores. We are stuck with 1 core per analysis at the moment, see below for why that is.


Can this be GPU accelerated?
If it can be, forget about CPUs and slap a bunch of GPUs in it.

Yes it can; we are doing this on another machine with the Tesla M2090 and seeing great results so far. The software we're using only supports 1 GPU at the moment. Unfortunately, the problem we're running right now on this machine is not solver-intensive, and therefore the GPU only accelerates about 25% of the total computation time.

Does the hardware pass their certification?
There could be hardware-specific extensions.

I don't know about this. Is it possible that certain hardware types are known to be poorly compatible with certain software? I'll look more closely at the software developer's website.

How many simultaneous threads can it support?
My 3D CAD is out of date and can only support 1 at the moment.

It can support multiple threads; however, we have a user subroutine that is essential to running the analysis and is not configured for multiple threads yet.

How is the cooling? Are the chipset and CPU receiving adequate cooling?
If not, the system will bog down as it tries to save itself.

The cooling is extremely good. I've never felt the CPU heatsinks get even remotely warm to the touch.

They seem to mention something about Intel's compiler; if so, there may be Intel-specific extensions being used.

We're using Intel compilers for Fortran and C++. If I recall correctly (I'll have to check when I get to work later this morning), the Intel compilers we're using are version 10.1 and a slightly older version.




Thanks a bunch for the reply! See my responses embedded above.

Message edited by jwgolf on 01-30-2012 at 04:10:22 PM
Reply to jwgolf

jwgolf wrote :

Bump. Any thoughts anyone?



Don't...
Bump posts
http://www.tomshardware.com/forum/283384-33-read-first

------------------------------ http://img545.imageshack.us/img545/3995/bl11.gif
Reply to Mousemonkey

jwgolf wrote :

Thanks a bunch for the reply! See my responses embedded above.



I think I have an idea why you are seeing what you are seeing. You are running a single-threaded program compiled with Intel compilers. Intel's compilers, especially the ones made before the big European antitrust trial, do not like AMD's CPUs very much and give them markedly un-optimized code paths, which is one reason a much older Intel CPU can keep up with newer AMD CPUs. The Socket G34 Opterons, especially the newer Opteron 6200s, are made for very heavily multithreaded tasks. The 6100s are pretty low-clocked, whereas the 6200s have much higher clock speeds (especially with Turbo CORE kicking in) but don't have the greatest single-threaded performance. So, they are not all that great for single-threaded tasks.
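
As a rough illustration of the point about vendor-specific code paths: a program can look at either the CPU vendor string or the instruction-set flags the CPU actually reports. The sketch below parses Linux /proc/cpuinfo text to show both. It is only an assumption-laden illustration (the function names are invented here, and it says nothing about how any particular compiler's runtime actually decides); note that /proc/cpuinfo reports SSE3 under the flag name "pni".

```python
import os


def parse_first_cpu(cpuinfo_text):
    """Parse the first processor block of /proc/cpuinfo into a dict."""
    info = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def summarize(info, wanted=("sse2", "pni", "ssse3", "sse4_1", "sse4_2", "avx")):
    """Report the vendor string and which of the wanted flags are present."""
    flags = set(info.get("flags", "").split())
    return {
        "vendor": info.get("vendor_id", "unknown"),
        "supported": [flag for flag in wanted if flag in flags],
    }


if __name__ == "__main__":
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as handle:
            print(summarize(parse_first_cpu(handle.read())))
    else:
        print("/proc/cpuinfo not available on this system")
```

A CPU reporting "AuthenticAMD" as vendor can still list the same SIMD flags as an Intel part, so code that branches on the vendor string rather than on the flags may end up on a slower path than the hardware requires.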

So, here are your options from how I see it.

1. Optimize your program for your highly-multithreaded hardware. You would need to rewrite your subroutine to support multithreading and increase your per-core licensing as well. I'd also compile the program with the Open64 compiler instead of Intel's compilers to get good performance on AMD CPUs. On the upside, once you do this, your performance should certainly go up as the multithreading should improve throughput a lot.

2. Sell the Opteron gear and get yourself something like a single-socket LGA2011 X79 board with a Core i7-3820 or 3930K. These can support 8 DIMMs per CPU so you can put the same 64 GB on that setup as you have in the dual G34 Opteron setup. This will have markedly more single-threaded performance in your current single-threaded program/Intel compiler environment than a pair of G34 Opterons.

------------------------------ Workstation: 2x Opteron 6128, ASUS KGPE-D16, 8x2 GB PC3-10600U ECC
File server: 2x Xeon 5150, MSi MS-91A1, 2x2 GB PC2-5300R FB-DIMMS
HTPC: 2x Xeon LV 2.00 Sossaman, TYAN i7520SD, 2x512 MB PC2-5300R
Reply to MU_Engineer

Comparatively, AMD CPUs are slow at maths. Two similarly performing CPUs, like the Intel E6400 and the AMD 6000 X2, show a huge difference in counting iterations of pi, for instance...
At stock I think the differences are something like this: for 1 million calcs, the Intel takes 17 seconds while the AMD does the same task in nearly 29 seconds. That's a huge difference on a very small (in computational terms) set of numbers. This isn't a bash at AMD, but it does show that in the right circumstances, where a program is set up for certain hardware, Intel can really stretch out a lead over AMD.
Over the last 4 or so years the performance gap has grown wider even though the time gap has shortened some. Most new i5s will do sub-10 seconds while AMD parts are still in the high teens or early 20s. If you move over to Opterons and Xeons the gap is really pronounced, due in some part to the bigger and faster cache in the Intel parts, as well as higher transistor counts and Intel's innate ability to do more per clock cycle.
AMD parts are great for cheap servers to run games and web servers on, but if you're doing high-end math, Intel is a full generation ahead.
If you had gone for the new generation of Intel Xeons you would have saved some money by the looks of it.

There is one interesting point in rapidtransit's post: moving over to CUDA. Four midrange GPUs will outperform any CPU config when programmed correctly. You can literally get supercomputer performance with the right GPU setup in a desktop-size case (something to consider on your next upgrade).

Like the post above says: either do a complete rewrite of your application, or get it updated to get the best out of what you currently have, or swap platform. It's a shame that overclocking isn't really an option, as servers tend to produce a lot of heat anyway without adding more by running at faster clock speeds.
Dielectric liquid could be used to cool the server, but that really isn't a cost-effective option when overclocking.

Nah, I think I would go for option 2: sell the Opterons and jump platform.


Message edited by HEXiT on 01-30-2012 at 05:53:07 PM
------------------------------ |i7 920 D0@3.6 |CNPS10 flex |ex58-ud5 |6Gig 1333 C7 |HD 5870
|G11 keyboard|X-fi Xtreme |G930 |3x1tb SpinPoint f3's raid 0 |thermaltake tp 850watt |antec 902
|Rat 7 contagion |Razer Destructor mat |bamboo pen'N'touch | 3d pro stick |nitro wheel |360 pad
Reply to HEXiT

HEXiT wrote :

Comparatively, AMD CPUs are slow at maths. Two similarly performing CPUs, like the Intel E6400 and the AMD 6000 X2, show a huge difference in counting iterations of pi, for instance...
At stock I think the differences are something like this: for 1 million calcs, the Intel takes 17 seconds while the AMD does the same task in nearly 29 seconds. That's a huge difference on a very small (in computational terms) set of numbers. This isn't a bash at AMD, but it does show that in the right circumstances, where a program is set up for certain hardware, Intel can really stretch out a lead over AMD.
Over the last 4 or so years the performance gap has grown wider even though the time gap has shortened some. Most new i5s will do sub-10 seconds while AMD parts are still in the high teens or early 20s. If you move over to Opterons and Xeons the gap is really pronounced, due in some part to the bigger and faster cache in the Intel parts, as well as higher transistor counts and Intel's innate ability to do more per clock cycle.
AMD parts are great for cheap servers to run games and web servers on, but if you're doing high-end math, Intel is a full generation ahead.
If you had gone for the new generation of Intel Xeons you would have saved some money by the looks of it.



SuperPi is the program you are talking about here and is an absolutely horrendous program for evaluating CPU performance. The only people who really use it are goofballs who overclock their CPUs to well beyond their stability limits and want something that takes a very short time to run (so the CPU doesn't overheat or hang) and doesn't stress the system very much (so the system doesn't hang.)

Quote :

There is one interesting point in rapidtransit's post: moving over to CUDA. Four midrange GPUs will outperform any CPU config when programmed correctly. You can literally get supercomputer performance with the right GPU setup in a desktop-size case (something to consider on your next upgrade).



If you read his reply, he is using a Tesla M2090 (GPGPU). His algorithm doesn't use the GPU heavily because >75% of it cannot run on the GPU and requires the much more flexible CPU to perform the calculations. GPUs are great at a few certain operations...and stink horribly at others. They are really very specialized tools.
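
The limit being described here is Amdahl's law. A minimal sketch, taking the roughly 25% GPU-accelerable share from jwgolf's reply as the input (the function name and the example speedup factors are illustrative assumptions, not measurements):

```python
def overall_speedup(accelerated_fraction, accelerator_speedup):
    """Amdahl's law: whole-job speedup when only part of it gets faster.

    accelerated_fraction: share of the original run time that benefits (0..1)
    accelerator_speedup:  how many times faster that share becomes (> 0)
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / accelerator_speedup)


if __name__ == "__main__":
    # Hypothetical GPU speedups applied to the ~25% share from the thread.
    for factor in (2.0, 10.0, float("inf")):
        print(f"GPU part {factor}x faster -> job {overall_speedup(0.25, factor):.2f}x faster")
```

Even with an infinitely fast GPU, a job where only 25% of the time is accelerable tops out at about 1.33x, which is why the CPU's single-threaded speed still dominates for this workload.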

Quote :

Like the post above says: either do a complete rewrite of your application, or get it updated to get the best out of what you currently have, or swap platform. It's a shame that overclocking isn't really an option, as servers tend to produce a lot of heat anyway without adding more by running at faster clock speeds.
Dielectric liquid could be used to cool the server, but that really isn't a cost-effective option when overclocking.



Overclocking is a HORRIBLE idea in his case. Yes, it can get work done faster. But you risk the work being incorrect, and it sounds like he is doing production work. Production work isn't the same as playing games; with real work, you risk ruining your entire output with math errors due to overclocking.

Reply to MU_Engineer

That's why I said overclocking wasn't a good idea, or did you miss that part? Also, overclocking should not lead to errors if it is done within reason and is stable. After all, if overclocking led to errors it wouldn't be long before we all had corrupted HDDs full of bad data...

Yes, I know SuperPi doesn't stress the CPU, but it is math intensive, running a small piece of code repeatedly, and it does show off the FPU's ability to do math quickly, which is the real reason why the program was written in the first place. It's also single-threaded, so it will show the potential on a per-core basis, so it's not completely useless...

And the GPU does have its limitations, most imposed by PCIe bandwidth, but it can produce extreme performance in the right circumstances.


Message edited by HEXiT on 01-30-2012 at 09:07:21 PM
Reply to HEXiT

Everyone beat me to the punch but to sum it up...

I would look at using a consumer-grade Intel processor, a high-grade one with solid caps etc., and cautiously overclocking just a bit; don't go nuts. Try running the same test at stock and overclocked speeds a couple of times to make sure you are getting the right results, and run SuperPi for 24 hours to check stability.
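
One way to do the "same test, two clock settings" comparison is to dump a set of result values from each run and diff them numerically. A minimal sketch, assuming the results can be exported as plain lists of floats (how to get them out of the solver is not covered here, and the tolerance is an arbitrary placeholder):

```python
import math


def mismatched_indices(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return the positions where two result vectors disagree.

    reference: values from the run at stock clocks
    candidate: values from the run being checked (e.g. overclocked)
    """
    if len(reference) != len(candidate):
        raise ValueError("runs produced a different number of values")
    return [
        index
        for index, (ref, cand) in enumerate(zip(reference, candidate))
        if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


if __name__ == "__main__":
    stock_run = [1.0, 2.5, -3.75]        # made-up example values
    overclocked_run = [1.0, 2.5, -3.76]  # made-up example values
    print("Mismatches at:", mismatched_indices(stock_run, overclocked_run))
```

An empty list means the two runs agree within the chosen tolerance; any index in the list is a value worth inspecting before trusting the faster configuration.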

Also, while you are at it, check the memory usage sometime; just because you have more than enough memory doesn't mean the system will always address all of it. If it is not, look at an I/O-intensive PCIe storage solution like a RevoDrive.

Also, in all honesty, I would never run RAID with an AMD SB. I've had over 5 fail for no apparent reason. These were all RAID 1 too...

Reply to rapidtransit

You don't run SuperPi to check stability in most cases. You can use it to do quick and dirty tests of RAM stability, but that's all...
You're best off using a burn test and Prime95 for stability testing as a general rule.
SuperPi just tells you how fast your PC is able to do simple math. For anything else it's pretty much useless.
The reason it works for a quick and dirty memory test on 32M is that it uses checksums to make sure the result is correct. If you get mismatched results it means there is an error somewhere. If it locks up the system you can only be 90% sure it's a RAM problem, as the RAM is about the only thing being stressed, unlike Prime95, which will tell you what's unstable, depending on which test you use.


not being funny, just clearing that up...

Reply to HEXiT

HEXiT wrote :

You don't run SuperPi to check stability in most cases. You can use it to do quick and dirty tests of RAM stability, but that's all...
You're best off using a burn test and Prime95 for stability testing as a general rule.
SuperPi just tells you how fast your PC is able to do simple math. For anything else it's pretty much useless.
The reason it works for a quick and dirty memory test on 32M is that it uses checksums to make sure the result is correct. If you get mismatched results it means there is an error somewhere. If it locks up the system you can only be 90% sure it's a RAM problem, as the RAM is about the only thing being stressed, unlike Prime95, which will tell you what's unstable, depending on which test you use.


not being funny, just clearing that up...


I've been out of the loop for a while; I confused the two programs.

Reply to rapidtransit