Maxing out Xeon memory bus, need architecture advice

April 10, 2012 11:03:01 AM

We're running a C# quant library. Our old server had 2x Xeon X5670s, giving us 12 physical cores. We've just upgraded to 4x Xeon X7550s, giving us 32 physical cores. However, when running the library with 32 threads the memory bus appears to be getting absolutely hammered, and this is affecting performance.

We need to find out whether our application can scale on hardware. From a software perspective the performance bottleneck is locking on objects due to threading, and it's been established that we cannot rewrite the software any other way. Any performance improvement can only come from hardware.

I am a little unsure what kind of hardware upgrade could help. Is this just a case of faster memory? Is there a better Intel socket? Would the latest Sandy Bridge socket help? Is our CPU setup fine, but we need specialist RAM? I have overclocked uniprocessor systems, but I had the impression that server memory cannot be overclocked. I have been looking at QPI and memory bus speed, and the X7550 doesn't appear to be lacking in either, although I did hear Intel seriously improved their memory architecture with Sandy Bridge!


The specs of the above CPUs are as follows:

X7550
2.0 GHz
8 cores
8 x 256 KB L2 cache
18 MB L3 cache
4 x 6.4 GT/s QPI
4 x DDR3-1333 memory channels

X5670
2.93 GHz
6 cores
6 x 256 KB L2 cache
12 MB L3 cache
2 x 6.4 GT/s QPI
3 x DDR3-1333 memory channels
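
As a rough sanity check on the figures above, the nominal peak memory bandwidth per socket can be worked out from the channel count and the DDR3 transfer rate. This is only a minimal sketch: it assumes each channel is 64 bits (8 bytes) wide and actually runs at the listed DDR3-1333 rate, the function name is made up for illustration, and sustained bandwidth in practice will be lower than these paper numbers.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Nominal peak bandwidth of one socket in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

# Channel counts, core counts and the DDR3-1333 rate come from the spec lists above.
# The 8-byte channel width is an assumption (standard 64-bit DDR3 channel).
sockets = {
    "X7550": {"channels": 4, "cores": 8},
    "X5670": {"channels": 3, "cores": 6},
}

for name, spec in sockets.items():
    per_socket = peak_bandwidth_gbs(spec["channels"], 1333)
    per_core = per_socket / spec["cores"]
    print(f"{name}: {per_socket:.1f} GB/s per socket, {per_core:.2f} GB/s per core")
```

On these nominal numbers both chips come out at roughly the same bandwidth per core, which fits with the X7550 not looking short of memory bandwidth on paper.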
April 10, 2012 7:17:04 PM

As long as you have the memory configured correctly, e.g. installed in sets of 16 DIMMs (4 per socket) and no more than 32 DIMMs in total, then there is very little extra you can do. Also check the DIMM speed, as the 32GB DIMMs are typically a little slower than the 8GB ones due to bus speed/timings.
April 11, 2012 9:23:55 AM

hollett said:
As long as you have the memory configured correctly, e.g. installed in sets of 16 DIMMs (4 per socket) and no more than 32 DIMMs in total, then there is very little extra you can do. Also check the DIMM speed, as the 32GB DIMMs are typically a little slower than the 8GB ones due to bus speed/timings.


Thanks for this.

8GB DIMMs could well be a way to make it faster, then!

Is there a date for the new Ivy Bridge server release?
April 12, 2012 2:46:43 AM

It sounds like your application has hit a scaling limit somewhere between 12 and 32 threads. There is a lot of bus bandwidth in four X7550s, since each socket has its own memory controller and the total bandwidth increases in proportion to the number of CPUs. Either you have a program that simply doesn't scale well past somewhere between 12 and probably 20-24 cores, or your program doesn't use NUMA very well and is hammering the memory bus because it doesn't have good thread/memory locality. You could try tweaking the NUMA node interleaving (turning it on or off), or pinning threads to particular cores to improve thread/memory locality and cut memory bus traffic. On Linux you would do that with something like numactl or taskset; on Windows you would set processor affinity instead. If none of that works, you are probably best served by getting the hardware with the absolute fastest 16 or so cores you can get (probably two Xeon E5-2687Ws), because your performance will only scale with per-core performance, not with the number of cores.
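
To put a rough number on that kind of scaling limit, Amdahl's law is one simple model. This is a sketch under stated assumptions, not a measurement: the 0.95 parallel fraction is invented for illustration (the real figure for the quant library is unknown), and the model ignores lock contention and memory traffic, which can make extra threads actively slower rather than just unhelpful.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Speedup over one thread when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)

# 0.95 is a made-up parallel fraction, not a measurement of the actual library.
parallel_fraction = 0.95
for threads in (12, 16, 24, 32):
    print(f"{threads:2d} threads: {amdahl_speedup(parallel_fraction, threads):.2f}x")
```

With that assumed fraction, going from 12 to 32 threads only moves the modelled speedup from about 7.7x to about 12.5x, and the X7550's lower clock (2.0 GHz against 2.93 GHz) would eat into even that. The serial part runs at single-core speed, which is why faster cores can matter more than extra ones.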
April 12, 2012 9:24:19 AM

Hey, thanks for your reply. I concluded the same as you regarding alternative CPU. I presume they will bring out 4-socket Sandy/Ivy bridge CPUs?

Are there any specific profiling tools which would be good for checking the hardware whilst the code is running (the code has been profiled) to see how the memory/cache is performing?

April 12, 2012 12:56:53 PM

faa88 said:
Hey, thanks for your reply. I concluded the same as you regarding alternative CPU. I presume they will bring out 4-socket Sandy/Ivy bridge CPUs?


There are four-socket Sandy Bridges, the E5-46xx series. They are the "lower end" 4P units, as they have only two QPI links and so require a two-hop square bus topology rather than the one-hop "X" topology that the four-QPI-link E7s (or Opteron 6000s) use. There will also be Sandy Bridge-based E7 four-socket units, but for now the E7s are Westmere-based and sit in LGA1567 rather than LGA2011.
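
The hop counts for the two layouts can be checked with a small graph sketch. The socket numbering and adjacency lists below are illustrative assumptions about a generic four-socket board, not taken from any specific system.

```python
from collections import deque

def max_hops(links):
    """Worst-case number of link hops between any two sockets (graph diameter)."""
    worst = 0
    for start in links:
        # Breadth-first search from `start` gives the hop count to every socket.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in links[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        worst = max(worst, max(dist.values()))
    return worst

# Square: each socket spends its two links on its two neighbours.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
# Fully connected ("X"): each socket links directly to the other three.
fully_connected = {s: [t for t in range(4) if t != s] for s in range(4)}

print("square:", max_hops(square))
print("fully connected:", max_hops(fully_connected))
```

In the square layout, memory attached to the diagonally opposite socket is two hops away, so remote accesses there cost more latency than on a fully connected board.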

Quote:
Are there any specific profiling tools which would be good for checking the hardware whilst the code is running (the code has been profiled) to see how the memory/cache is performing?


There are, but I can't seem to remember the names of any of these applications off the top of my head.
April 12, 2012 1:02:19 PM

MU_Engineer said:
There are, but I can't seem to remember the names of any of these applications off the top of my head.


Thanks.

I discovered valgrind/cachegrind, but it's for C/C++ on Linux, not C# on Windows.