AMD Opteron Server - Incredibly SLOW! What gives?

jwgolf

Distinguished
Jan 29, 2012
Fair warning: this post will be long. This problem has stumped us for weeks and we could seriously use some help! Thanks in advance for reading.

I work for an engineering company and we perform finite element simulations. For the simulations in question we are using a commercial code called ABAQUS/Standard (see here). We recently needed a new machine with a very large RAM capacity so our analyses would fit in-core, and therefore went with a server. We purchased this machine earlier this month, and the original setup looked like this:

- SuperMicro 2U Server 2022G-URF (specs)
- SuperMicro H8DGU-F Motherboard (specs)
- 2x AMD Opteron 6134 Processors @ 2.3GHz (specs)
- 64GB Kingston DDR3 1333MHz ECC Registered RAM
- 2x Intel 320 120GB SSD drives
- Red Hat Enterprise Linux Server 6.2
- Linux Kernel 2.6.32

Right out of the box, we were shocked at how slow this machine performed. The machines we've purchased over the past few years have mostly been quad-core desktop units and tend to do quite well with our simulations. The last server we bought was about 4 years ago and has roughly the following specs:

- 2x Intel Xeon X5355 @ 2.6GHz (no Turbo Boost)
- 16GB DDR 667 MHz RAM
- Standard SATA spinning disk drives

Much to our surprise, the 4-year-old server was performing the analysis out-of-core, on old, crappy RAM and processors, reading and writing to a spinning disk, and was doing it faster than our new machine! The new machine was running the analysis in-core, with slightly slower processors (but not really once AMD Turbo CORE kicked in), better RAM, and an SSD drive. The software we use is fairly read/write intensive, so the faster reads and writes were a big plus.

Over the subsequent two weeks, here are some of the things we've tried in hopes of speeding up this server:

1) We exchanged the Opteron 6134 processors (2x 8 cores @ 2.3GHz) for 6272 processors (2x 16 cores @ 2.1GHz) and later for 6220s (2x 8 cores @ 3.0GHz), and none of these changes helped. In fact, going from the 2.3GHz to the 3.0GHz processors did very little at all to speed up our analyses.
2) Updated to the latest BIOS
3) Tried an array of Linux kernels from 2.6.32 to 2.6.39 and even the latest stable 3.2.1
4) We've had better luck in the past with SUSE Linux, so we switched from Red Hat to SUSE Linux Enterprise Server 11
5) Played with the BIOS settings to death, messing with just about every parameter but focusing primarily on the power-saving options.
6) Modified the automatic CPU scaling mode and parameters from within the OS (it uses powernow-k8). We tried 'performance' mode, where it leaves all of the CPUs at 3.0GHz all the time. If you leave it on its standard default mode, it fluctuates the CPU speeds between 1.4 and 3.0GHz to save power. Also, I should note that while running the analysis we will typically only use one or two out of the 16 cores, so AMD Turbo CORE kicks in and ramps us up to nearly 3.6GHz at times. (A small sketch for verifying the governor and live clocks from sysfs follows after this list.)
7) We wanted to eliminate the possibility of a hardware issue. We swapped out the motherboard, RAM and even power supplies (I'll explain why we swapped the power supplies later) and none of that helped.
8) We created a RAID 0 array on our two SSDs to rule out I/O as a potential bottleneck. We clocked this array at nearly 700MB/sec and had the OS installed on it, and that did not help either.
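For anyone who wants to verify point 6 on their own box, here is a minimal C sketch that reads the active governor and the clock the kernel reports for each core while a job runs. It assumes the standard Linux cpufreq sysfs layout (which powernow-k8 on these kernels uses); it only reads state and changes nothing:

/* Print the cpufreq governor and current frequency for each core.
 * Assumes the standard Linux cpufreq sysfs paths.
 * Compile: gcc -std=gnu99 -O2 -o govcheck govcheck.c */
#include <stdio.h>

int main(void)
{
    char path[128], buf[64];
    FILE *f;

    for (int cpu = 0; ; cpu++) {
        /* Active scaling governor, e.g. "ondemand" or "performance" */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        f = fopen(path, "r");
        if (!f)
            break;  /* no more CPUs */
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%d governor: %s", cpu, buf);
        fclose(f);

        /* Frequency the kernel believes the core runs at, in kHz */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        f = fopen(path, "r");
        if (f) {
            if (fgets(buf, sizeof(buf), f))
                printf("cpu%d freq (kHz): %s", cpu, buf);
            fclose(f);
        }
    }
    return 0;
}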


While playing with the BIOS settings I noticed something quite strange. There is an option to control the maximum fan speed; the default is a 'Balanced' mode. I ramped the fan speed up to 'High Performance' and later 'Full Speed'. I have a little 90-second benchmark problem that I've used to monitor system performance through all of these changes. With the fans in Balanced mode, the benchmark takes ~92 seconds. In High Performance mode it takes ~94 seconds, and in Full Speed mode ~95 seconds. For some reason, increasing fan speed caused the benchmark to slow down. As a result, we decided to try swapping out the power supplies. Oddly enough, after doing that and rebooting the machine, the benchmark time dropped by about 11 seconds. It actually made the benchmark faster in the short term; however, we've been running the full-scale analyses on this machine as well as the older server over the past few days, and they are running at very nearly the same speed, so this did not fix the overall issue.

Does anybody have any ideas what in the world may be going on with this machine? It shouldn't even be remotely comparable to that older server. Since the I/O is so much quicker on our new server, we are inclined to think this is a processor issue. Is it possible that the architecture of these AMD Opteron processors is literally THAT poor for our type of work? I would greatly appreciate any input or thoughts. We have invested a lot of money in this machine and would really like to see some return on that at some point. As it is now, this machine is almost useless for our needs!

Thanks in advance for the help.
 

rapidtransit

Distinguished
Jan 30, 2012
I'm at work and just briefly looked over the program.

Here are my thoughts and questions.

Is there per-core licensing?
Check with your reseller.

Can this be GPU-accelerated?
If it can be, forget about CPUs and slap a bunch of GPUs in it.

Does the hardware pass their certification?
There could be hardware-specific extensions.

How many simultaneous threads can it support?
My 3D CAD is out of date and can only support 1 at the moment.

How is the cooling? Is the chipset and CPU receiving adequate cooling?
This will bog it down as it tries to save itself.

They seem to mention something about Intel's compiler; if so, there may be Intel-specific extensions being used.
 

jwgolf

Distinguished
Jan 29, 2012



Thanks a bunch for the reply! See my responses embedded above.
 


I think I have an idea why you are seeing what you are seeing. You are running a single-threaded program compiled with Intel compilers. Intel's compilers, especially the ones made before the big European antitrust trial, do not like AMD's CPUs very much and give them markedly un-optimized code paths, which is one reason a much older Intel CPU can keep up with newer AMD CPUs. The Socket G34 Opterons, especially the newer Opteron 6200s, are made for very heavily multithreaded tasks. The 6100s are pretty low-clocked, whereas the 6200s have much higher clock speeds (especially with Turbo CORE kicking in) but don't have the greatest single-threaded performance. So, they are not all that great for single-threaded tasks.
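To make that concrete, dispatch on the CPU vendor string looks conceptually like the C sketch below. This is only an illustration of the mechanism, not Intel's actual runtime code; a fair dispatcher would test the feature bits (e.g. the SSE2 bit in CPUID leaf 1) on any vendor:

/* Illustration of CPUID vendor-string dispatch. NOT Intel's real code.
 * Compile on x86 with gcc: gcc -O2 -o dispatch dispatch.c */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction */

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    char vendor[13];

    /* Leaf 0 returns the 12-byte vendor ID in EBX, EDX, ECX. */
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';

    /* Gating the fast path on the vendor string, as described above,
     * sends a perfectly SSE-capable AMD chip down the slow path. */
    if (strcmp(vendor, "GenuineIntel") == 0)
        puts("optimized SSE code path");
    else
        puts("generic fallback path, even though SSE is present");

    return 0;
}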

So, here are your options from how I see it.

1. Optimize your program for your highly multithreaded hardware. You would need to rewrite your subroutine to support multithreading and increase your per-core licensing as well. I'd also compile the program with the Open64 compiler instead of Intel's compilers to get good performance on AMD CPUs. On the upside, once you do this, your performance should certainly go up, as the multithreading should improve throughput a lot. (A minimal sketch of the kind of change involved follows after option 2.)

2. Sell the Opteron gear and get yourself something like a single-socket LGA2011 X79 board with a Core i7-3820 or 3930K. These can support 8 DIMMs per CPU so you can put the same 64 GB on that setup as you have in the dual G34 Opteron setup. This will have markedly more single-threaded performance in your current single-threaded program/Intel compiler environment than a pair of G34 Opterons.
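To give a feel for what option 1 involves, below is a minimal OpenMP sketch in C of parallelizing an independent loop. It is purely illustrative: a real ABAQUS user subroutine is Fortran and has to follow the solver's own threading and licensing rules, but the shape of the change is the same:

/* Minimal OpenMP sketch: split an independent element-style loop across
 * all cores, the kind of workload G34 Opterons are built for.
 * Compile: gcc -std=gnu99 -O2 -fopenmp -o elemloop elemloop.c */
#include <stdio.h>
#include <stdlib.h>

#define N_ELEMENTS 1000000

int main(void)
{
    double *result = malloc(N_ELEMENTS * sizeof(double));
    if (!result)
        return 1;

    /* Each iteration is independent, so OpenMP can split the range
     * statically across every available core. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N_ELEMENTS; i++) {
        double x = (double)i;
        result[i] = x * x + 2.0 * x + 1.0;  /* stand-in for element work */
    }

    printf("result[42] = %f\n", result[42]);
    free(result);
    return 0;
}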
 
Comparatively, AMD are slow at maths. Two similarly performing CPUs like the Intel E6400 and AMD 6000+ X2 show a huge difference in computing iterations of pi, for instance...
At stock, I think the differences are something like this: for 1 million calcs, the Intel takes 17 seconds while the AMD does the same task in nearly 29 seconds. That's a huge difference on a very small (in computational terms) set of numbers. This isn't a bash at AMD, but it does show that in the right circumstances, where a program is set up for certain hardware, the Intel can really stretch out a lead over AMD.
Over the last 4 or so years the performance gap has grown wider, even though the time gap has shortened some. Most new i5s will do sub-10 seconds while AMDs are still in the high teens/early 20s. If you move over to Opterons and Xeons the gap is really pronounced, due in some part to the bigger and faster cache in the Intel parts, as well as higher transistor counts and Intel's innate ability to do more per clock cycle.
AMD are great for cheap servers to run games and web servers on, but if you're doing high-end math, Intel is a full gen ahead.
If you had gone for the new gen of Intel Xeons, you would have saved some money by the looks of it.

There is one interesting point in rapidtransit's post: moving over to CUDA. Four midrange GPUs will outperform any CPU config when programmed correctly. You can literally get supercomputer performance with the right GPU setup in a desktop-size case (something to consider for your next upgrade).

Like the above says: either do a complete rewrite of your application (or get it updated) to get the best out of what you currently have, or swap platforms. It's a shame that overclocking isn't really an option, as servers tend to produce a lot of heat anyway without adding more by running at faster clock speeds.
Dielectric liquid could be used to cool the server, but that really isn't a cost-effective option for overclocking.

Nah, I think I would go for option 2: sell the Opteron and jump platforms.


 


SuperPi is the program you are talking about here and is an absolutely horrendous program for evaluating CPU performance. The only people who really use it are goofballs who overclock their CPUs to well beyond their stability limits and want something that takes a very short time to run (so the CPU doesn't overheat or hang) and doesn't stress the system very much (so the system doesn't hang.)

There is one interesting point in rapidtransit's post: moving over to CUDA. Four midrange GPUs will outperform any CPU config when programmed correctly. You can literally get supercomputer performance with the right GPU setup in a desktop-size case (something to consider for your next upgrade).

If you read his reply, he is using a Tesla M2090 (GPGPU). His algorithm doesn't use the GPU heavily because >75% of it cannot run on the GPU and requires the much more flexible CPU to perform the calculations. GPUs are great at a few certain operations...and stink horribly at others. They are really very specialized tools.

Like the above says: either do a complete rewrite of your application (or get it updated) to get the best out of what you currently have, or swap platforms. It's a shame that overclocking isn't really an option, as servers tend to produce a lot of heat anyway without adding more by running at faster clock speeds.
Dielectric liquid could be used to cool the server, but that really isn't a cost-effective option for overclocking.

Overclocking is a HORRIBLE idea in his case. Yes, it can get work done faster, but you risk the work being incorrect, and it sounds like he is doing production work. Production work isn't the same as playing games; you risk ruining your entire output with math errors when you overclock on real work.
 
That's why I said overclocking wasn't a good idea, or did you miss that part... Also, overclocking should not lead to errors if done within reason and kept stable. After all, if overclocking led to errors, it wouldn't be long before we all had corrupted HDDs full of bad data...

Yes, I know SuperPi doesn't stress the CPU, but it is math-intensive, repeating small code over and over, and it does show off the FPU's ability to do math quickly, which is the real reason the program was written in the first place. It's also single-threaded, so it will show the potential on a per-core basis... so not completely useless...

And the GPU does have its limitations, most imposed by PCIe bandwidth, but it can produce extreme performance in the right circumstances.
 

rapidtransit

Distinguished
Jan 30, 2012
Everyone beat me to the punch but to sum it up...

I would look at using a consumer-grade Intel processor, a high-grade one with solid caps etc., and cautiously overclock just a bit; don't go nuts. Try running the same test at regular clocks and overclocked a couple of times to make sure you are getting the right results, and run SuperPi for 24 hours to check stability.

Also, while you are at it, check the memory usage sometimes; just because you have more than enough memory doesn't mean the system will always address all of it. If it is not, look at an I/O-intensive PCIe storage solution like a RevoDrive.

Also, in all honesty, I would never run RAID with an AMD southbridge; I've had over 5 arrays fail for no apparent reason. These were all RAID 1 too...
 
You don't run SuperPi to check stability in most cases. You can use it for a quick-and-dirty test of RAM stability, but that's all...
You're best off using BurnInTest and Prime95 for stability testing as a general rule.
SuperPi just tells you how fast your PC can do simple math; for anything else it's pretty much useless.
The reason it works as a quick-and-dirty memory test on 32M is that it uses checksums to make sure the result is correct. If you get mismatched results, it means there is an error somewhere. If it locks up the system, you can only be 90% sure it's a RAM problem, as the RAM is about the only thing being stressed, unlike Prime95, which will tell you what's unstable, depending on which test you use.


Not being funny, just clearing that up...
 

rapidtransit

Distinguished
Jan 30, 2012

I've been out of the loop for a while; I confused the two programs.
 

zeppelin_2010

Honorable
Jul 13, 2012
There are plenty of Opteron systems that run fine, so there should be a solution to this.

Finite element simulations can access memory in a "messy" manner and cases typically do not fit in cache, so you could be looking at a memory issue here.

According to the H8DGU-F manual, a maximum of 64GB of unregistered ECC RAM is supported, so it matches your need for RAM. Did you try swapping the registered ECC RAM for unregistered, giving the memory controller direct access to the RAM? (You would only need registered memory for >64GB of RAM.)

Also, did you use a 16x4GB memory configuration or an 8x8GB one? In other words, did you try filling all memory slots with registered ECC memory, instead of only half of them?

Did you ask AMD?
 

directql

Honorable
Mar 30, 2012
Hi,

I can confirm what you're saying. I think there is some kind of performance issue. My company bought a bunch of SuperMicro servers running AMD 6200s and we're seeing performance 4x lower than my Intel Core i3 at home. These servers were purchased for virtualization. For example, when testing network I/O performance between the KVM host and a VM, I got 5 Gbps receive and 1 Gbps transmit. With the i3 (the least expensive) and an identical setup, we got 19 Gbps both ways. Something MUST be wrong! There are also other people reporting similarly terrible performance issues. One thing I see in common is that it's all SuperMicro hardware.
 

zeppelin_2010

Honorable
Jul 13, 2012
Hi directql (and others),

Thanks for the reply. I'll add some information that might help.

The reason I ask about (un)registered memory is that I privately own an Opteron 6200 server, with a SuperMicro motherboard, that runs molecular dynamics simulations (LAMMPS) with very similar performance to my Phenom X6 1090T system, which is somewhat slower than Intel but definitely not by a factor of 4. Such codes have quite random and intensive memory access.
I know of another system (Opteron 6100), also with SuperMicro motherboards, that runs the same code about a factor of 1.5 slower. The primary difference between the two is that my system uses unregistered memory and the other uses registered. So I'd put my bets on that, although of course there could be a faulty batch of SuperMicro motherboards on the market.
By the way, in my experience Opterons perform comparatively well when the system is under high load (single-thread performance hardly slows down), I assume because of the (usually) high memory bandwidth.

The initial post by jwgolf mentions swapping the 6100-type processors for 6200s, and replacing the motherboard (with the same type, presumably), but not swapping the type of memory.

The specifications of my own (fine) system are as follows:

SuperMicro H8SGL-F motherboard
AMD Opteron 6212 processor
Kingston KVR1333D3E9S/4G memory, either 4 or 8 DIMMs (so either 16GB or 32GB in total, giving the same performance)

There are quite a few Opteron systems in the Top500, so I can't imagine the problem is with the processor itself; otherwise nobody would select it.
 

zeppelin_2010

Honorable
Jul 13, 2012
For those who want to test their memory bandwidth, there is a small open source tool here:
http://alasir.com/software/ramspeed/

On my (apparently fine) Opteron 6212 system specified above, with 8 dimms, I get:

ramsmp -b 3 -m 128
RAMspeed/SMP (Linux) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

8Gb per pass mode, 2 processes

INTEGER Copy: 9808.80 MB/s
INTEGER Scale: 14063.76 MB/s
INTEGER Add: 11760.55 MB/s
INTEGER Triad: 12084.78 MB/s
---
INTEGER AVERAGE: 11929.47 MB/s


ramsmp -b 6 -m 128
RAMspeed/SMP (Linux) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

8Gb per pass mode, 2 processes

FL-POINT Copy: 13575.68 MB/s
FL-POINT Scale: 13906.56 MB/s
FL-POINT Add: 13084.84 MB/s
FL-POINT Triad: 13176.16 MB/s
---
FL-POINT AVERAGE: 13435.81 MB/s

Whether the source of a bandwidth problem is with the memory modules themselves or with the motherboard I cannot tell.
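For reference, the Triad figures above correspond to a kernel of roughly this shape. A minimal STREAM-style sketch in C (not RAMspeed's actual code, and single-threaded, so expect lower numbers than ramsmp with 2 processes):

/* Minimal STREAM-style triad: a[i] = b[i] + s*c[i], timed over 512MB
 * arrays. Compile: gcc -std=gnu99 -O2 -o triad triad.c
 * (add -lrt for clock_gettime on older glibc) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)  /* 64M doubles = 512 MB per array */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c)
        return 1;

    /* Initialize everything first so page faults stay out of the timing. */
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];           /* the triad kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* Three arrays of N doubles are touched: two reads plus one write. */
    printf("triad: %.1f MB/s\n", 3.0 * N * sizeof(double) / sec / 1e6);

    free(a); free(b); free(c);
    return 0;
}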
 
The source of the OP's problem was already discussed by MU. It involved the compiler used, specifically the math libraries linked into their code. Intel's math libraries are widely distributed, and even if you use a non-Intel compiler, many people still link to their libraries for functions. Intel's compiler itself, especially prior to the big settlement, will generate two sets of code: one is SSE-optimized, the other is plain-Jane generic i386 code. On an Intel CPU it will run the code in SSE mode, utilizing all the advancements, especially in the area of math and number crunching. On a non-Intel CPU it will run using 1985-era i386 code and ignore any advancements the CPU might have.

As you can imagine, this results in incredibly poor performance on any non-Intel CPU, so poor that people think the CPUs are at fault without realizing it's the code. Eventually a lawsuit settled the issue, and modern Intel compilers don't cripple non-Intel CPUs nearly as much as they used to (SSE2 is used instead of i386 for non-Intel CPUs).
 

zeppelin_2010

Honorable
Jul 13, 2012
Since the OP didn't confirm having found a solution (or confirm which compiler was used, after checking), I considered the case unsolved. But you're most likely right about the original poster.
I've seen a very similar problem in one of the two separate systems described above, using the gcc v4 compiler (and no Intel MKL) and the same code in both cases. So I posted this experience in the hope it would be useful.
But again, you're most probably right about the original poster.
 

directql

Honorable
Mar 30, 2012
Sorry, I just noticed the updates to this thread. I found a few problems, mainly related to the BIOS configuration, and also found a performance regression in qemu-kvm 1.0.1 and up.

AMD's recommendations:

http://developer.amd.com/Assets/51803A_OpteronLinuxTuningGuide_SCREEN.pdf

I also used various performance testers along with the numactl command to force CPU affinity. All of that put together improved performance a lot, but it still seems like something is missing. My little Core i3 still performs better.
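For anyone repeating those numactl experiments, the same binding can be done inside a program via libnuma. A minimal sketch (assuming the libnuma development package is installed; link with -lnuma) that mirrors numactl --cpunodebind=0 --membind=0:

/* Pin execution and allocation to NUMA node 0 so a memory-bound job
 * doesn't pay cross-node HyperTransport penalties.
 * Compile: gcc -std=gnu99 -O2 -o numapin numapin.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    printf("NUMA nodes: %d\n", numa_max_node() + 1);

    /* Run only on node 0's cores... */
    if (numa_run_on_node(0) != 0) {
        perror("numa_run_on_node");
        return 1;
    }
    /* ...and prefer node 0's memory for future allocations. */
    numa_set_preferred(0);

    /* Demonstration: place a 64MB buffer explicitly on node 0. */
    size_t sz = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(sz, 0);
    if (!buf)
        return 1;
    printf("allocated %zu bytes on node 0\n", sz);
    numa_free(buf, sz);
    return 0;
}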
 

parkke

Honorable
Mar 28, 2013


I had a similar experience on a new Opteron build for Abaqus simulations. I ran NovaBench on the Opteron machine and on the previous i7 machine, in addition to the same simple Abaqus case on both. What was interesting to me was that the Opteron took 1.7 times as long to run the same Abaqus case, and the NovaBench results showed the same 1.7 ratio when comparing integer operations per second. The Opteron was faster at floating point, though.
 

Douglas Miles

Honorable
Aug 17, 2013
Right out of the box, I am shocked how slow this SuperMicro machine performs! The machines we've purchased over the past few years have mostly been dual-core desktop units and tend to do quite well with my simulations. This year I bought a SuperMicro H8QGi+-F (link: http://www.supermicro.com/Aplus/motherboard/Opteron6000/SR56x0/H8QGi_-F.cfm)

- 4x AMD Opteron 6344 (each 12 cores @ 2.8GHz, up to 3.2GHz with Turbo CORE)
- 128GB RAM (8x 16GB DDR3 @ 1600MHz)
- RAID 0 of 4x 128GB SSDs
- 1TB SSHD (hybrid)

Much to our surprise, my 6-year-old Intel E8400 server was performing the analysis while swapping RAM, on old, crappy RAM and processors, reading and writing to a spinning disk, and was doing it faster [at times] than my new machine! The new machine was running the analysis in RAM, with slightly slower processors (but not really once AMD Turbo CORE kicked in), better RAM, and SSD drives, even using a RAMDISK for the I/O work files it has to process.

Over the subsequent two months, here are some of the things we've tried in hopes of speeding up this server:

1) Played with the BIOS settings to death, messing with just about every parameter, not just the power-saving options.

2) Modified the automatic CPU scaling mode and parameters from within the OS (it uses powernow-k8). We tried 'performance' mode, where it leaves all of the CPUs at 3.2GHz all the time. If you leave it on its standard default mode, it fluctuates the CPU speeds between 1.6 and 3.2GHz to save power. Also, I should note that while running the analysis I will typically only use 5 or 8 out of the 48 cores, so AMD Turbo CORE kicks in and ramps us up to nearly 3.2GHz at all times.

3) Reduced RAM to only two sticks in each of the 4 banks.

4) Wanted to eliminate the possibility of a hardware issue; swapped out the CPUs and even the power supplies.

As mentioned, I tried RAM drive programs (after all, I have 28GB of RAM to spare!). I ran some random read/write benchmarks on the RAM drives and discovered they performed, at times, at the same speed as the RAID 0. I expected them to be at least faster, since the SSD drives have the extra overhead of a trip through the same RAM the RAM disks already live in.


Does anybody have any ideas what in the world may be going on with this machine? It shouldn't even be remotely comparable to that older server. Since the I/O is so much quicker on the new server, I am inclined to think this is a processor issue. Is it possible that the architecture of these AMD Opteron processors is literally THAT poor for my type of work? I would greatly appreciate any input or thoughts. I've invested a lot of money in this machine and would really like to see some return on that at some point. As it is now, this machine is almost useless for my needs!

I design semantic web calculators in Lisp that usually need at least 30GB to bootstrap an empty ontology (that means no instance data, just an empty database schema). All of the calculations I do involve pointer equality, never floating point or even summation (beyond incrementing a counter, lol). Oh, and I sometimes compute a 128-bit hash key (used later for set joining) by XOR-ing hundreds of values at the ends of pointer locations. What I need most is to jump around 100GB randomly. (I should note that when comparing to the older machine, I can make my problems run entirely in RAM with only 8GB of memory; when comparing the E8400 and the SuperMicro 6344, I am comparing the 8GB version.)


I too am very impressed with the I/O you and I have. In fact, this is the first time I've invested so much in SSDs and even power supplies (I have 4 of them and full-speed fans; I also thought I was starving something). I wanted to think it was the processors. When I start my app, I do have the best load times I have ever seen. And when I run my app, I often have at least 5 out of 48 cores pinned at 25%. I am working on code to pin all 48 cores to get my calculations completed faster. My guess is that most of the time (when not calculating) the CPU is being starved for RAM access.

I think the RAM bus on our machine might be poorly designed. A straight memory copy (simply copying 1GB of RAM from one memory location to another) runs at only 1/3 the speed of an E8400 with 800MHz RAM. Or an easier test: install a RAM drive and copy some files around on it, then compare your SuperMicro machine to your older machines. My SuperMicro can't even outperform my P4 with 2GB of 266MHz RAM when flipping files on a RAM disk!
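If anyone wants to repeat that copy test, here is a minimal C sketch of the same idea: one warm-up pass to fault the pages in, then one timed 1GB memcpy:

/* Time a 1GB memory-to-memory copy and report MB/s. The first copy is
 * a warm-up so page faults don't pollute the timed pass.
 * Compile: gcc -std=gnu99 -O2 -o copytest copytest.c
 * (add -lrt for clock_gettime on older glibc) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SZ (1024UL * 1024 * 1024)  /* 1 GB */

int main(void)
{
    char *src = malloc(SZ), *dst = malloc(SZ);
    if (!src || !dst)
        return 1;

    memset(src, 1, SZ);      /* touch every source page */
    memcpy(dst, src, SZ);    /* warm-up pass faults in dst */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, SZ);    /* timed pass */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("copy: %.1f MB/s\n", SZ / sec / 1e6);

    free(src);
    free(dst);
    return 0;
}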

So... this SuperMicro with 48 cores and 128GB of RAM has been my desktop machine for browsing the web and checking email as well as running software design tools. It's slow as hell but can take the abuse of hundreds of processes, whereas the other machines mentioned (the E8400s and the P4s) can barely run Eclipse, let alone 3 instances of 3 different IDEs at once. When I need straight calculation speed, I go to the machines I built with E8400s and DDR2. I KNOW the processor is not the issue. It is the raw speed at which the CPU can copy memory and randomly access all 128GB (in one bank with only 12 cores).

I really need to figure out how to build a machine that is more useful for our purposes!

Actually, my code spends more time being debugged than in simulation. So I really needed something with no less than 128GB. And I suppose, for now, the run-time speed of my app is not as important as the debug-time speed.


I am sure SuperMicro is aware of the issue and working hard on resolving it, since 100% of their customers are probably blown away by how slow (to the point of near uselessness) their G34-series motherboards are. I feel I've been mugged for $600 (I spent $900 on the board, but it is worth at least $300 in value: 32 RAM slots, all the cores, and a RAID controller; it's a very awesome board). Maybe we all just accidentally got a bad batch?

THOUGH: I dropped a CPU into one of the sockets and dented it after about 5 months of re-testing with various levels of CPUs and recompiling apps, just to avoid letting this board become my mom's (granny's) next kick-arse Adobe Photoshop/MS Word/internet/YouTube/Facebook (all at once, even!) machine. Since I still can't accept that the performance can be so bad, I must have run over 10 different speeds and styles of G34 AMD CPUs, from the top and bottom ends, so the CPU pins have gotten beat up to where 1 out of 8 NUMA nodes doesn't even work. In fact, CPU2 (of CPU1-CPU4), yes, the one which has the *only* access to the 2nd PCIe 16x slot, won't run. So now I am stuck with only one GTS 250 (only two monitors now) to display my 9 IDEs at once. (I've since sold most of my cores and only run CPU1 and CPU4 populated, since I am convinced the board leaves too many drawbacks to be worth more than $2k in CPU investment. Also, secretly, I think they might have impressed themselves with the ability to put two dual-socket motherboards together in one 1U to make a quad, so I don't want to exacerbate SM's problem by using both dual boards in the same machine.) I'd feel guilty trying to get SuperMicro to take it back due to the previous damage mentioned :( I bought a mule expecting a horse, and other similar clichés. Heh, I'd gladly pay $900 (or more) again if someone could promise me a G34 board that held 2TB of RAM and could access RAM faster than a 7-year-old laptop with 666MHz RAM, OR as fast as a laptop with the same-speed Opterons!