Threaded Application Performance on a Multi-core System

Last response: in CPUs
May 23, 2011 7:40:49 PM

I wrote a multi-threaded, CPU-intensive, memory-bound application and found to my surprise that running 2 threads on my dual-core system was 20% slower than running 1 thread. I made sure that the problem was not locks around critical regions. So the only explanation I can think of is that there is not enough memory bandwidth to support 2 threads and that the threads are fighting over the memory bus. This is on a Mac PowerBook Pro, so I guess that is a reasonable result.

I would like to know whether anyone has tried running a multi-threaded, memory-bound application on an Intel/AMD machine, and what kind of performance they saw as they increased the number of threads.
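
One rough way to test the memory-bandwidth theory on any machine is to time a pure streaming read with different thread counts. The sketch below is only an illustration: `timed_parallel_sum` and `sum_slice` are hypothetical helper names, not part of the simulator, and the sketch assumes a C++11 compiler with `std::thread`.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Sums one contiguous slice of the shared buffer. Each thread gets its own
// slice and its own output slot, so no locks are involved.
static void sum_slice(const std::vector<std::uint64_t>& data,
                      std::size_t begin, std::size_t end,
                      std::uint64_t& out) {
    std::uint64_t acc = 0;
    for (std::size_t i = begin; i < end; ++i) acc += data[i];
    out = acc;
}

// Hypothetical helper: streams through `data` with `thread_count` threads
// and returns the elapsed wall-clock time in seconds. The grand total is
// written to `total` so the compiler cannot discard the work.
double timed_parallel_sum(const std::vector<std::uint64_t>& data,
                          unsigned thread_count, std::uint64_t& total) {
    assert(thread_count > 0);
    std::vector<std::uint64_t> partial(thread_count, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / thread_count;

    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = t * chunk;
        // The last thread also takes any leftover elements.
        const std::size_t end =
            (t + 1 == thread_count) ? data.size() : begin + chunk;
        workers.emplace_back(sum_slice, std::cref(data), begin, end,
                             std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();

    total = 0;
    for (std::uint64_t p : partial) total += p;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling this with a buffer much larger than the last-level cache, for 1, 2, 4, ... threads, and comparing the times gives a first impression. If the time barely drops as threads are added, memory bandwidth is a plausible limit, although a test this simple cannot rule out other causes.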

I considered getting an Opteron-based system with 4 CPUs and 32 cores, but I am not sure that I would really be able to run 32 threads on a system like that.
May 23, 2011 11:47:31 PM

What library did you use? What language? Since you said you wrote it on your Mac, was it C, C++, Objective-C or something else?
What kind of application was it? And are you sure that there were no locks? If you tried to access the same variable in your class, or you had an endless loop that keeps checking, that could be the reason for the slower performance...
I am a programmer and I write multi-threaded software all the time. Adding more threads gets you roughly logarithmic performance increases at best (2 threads about double, 4 threads about 3 times the performance, 8 threads about 4 times, 16 threads about 5 times, and so on; put simply as a formula, performance_gain=(ld(thread_count)+1-thread_management(thread_count))*(ld(core_count)+1) ). The more threaded your application is, the more complex your locking and synchronization mechanisms need to be. In heavily threaded apps they are heavy on resources too, but they are mostly encapsulated in the libraries, so you don't have much control over them.
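
To make the rule of thumb above concrete, here is a minimal sketch that evaluates the formula as written, reading ld as the base-2 logarithm. The post does not define thread_management(thread_count), so the sketch takes that overhead as a plain number supplied by the caller, which is an assumption.

```cpp
#include <cassert>
#include <cmath>

// ld(x) is the base-2 logarithm, as used in the post's formula.
static double ld(double x) { return std::log2(x); }

// Evaluates the rule of thumb from the post. `thread_management` stands in
// for the post's unspecified thread_management(thread_count) overhead term;
// passing it in as a plain number is an assumption of this sketch.
double performance_gain(unsigned thread_count, unsigned core_count,
                        double thread_management) {
    assert(thread_count > 0 && core_count > 0);
    return (ld(thread_count) + 1.0 - thread_management) *
           (ld(core_count) + 1.0);
}
```

With zero overhead and core_count set to 1 (so the second factor is 1), thread counts of 2, 4, 8 and 16 give 2, 3, 4 and 5, which matches the figures quoted in the post. This is the poster's own heuristic, not a measured or standard scaling law.
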
AMD or Intel makes no difference, unless you use an optimised compiler (but that is not a good idea if you want to run your program on both machines).

As for getting an Opteron system with 32 cores, that is overkill. I highly doubt that you can make an application so highly optimised that it uses even half of those cores. Systems like that are for running numerous applications at the same time, never a single application.
Only a beginner would ask a question like that. Just open Task Manager->Processes and display the thread count. As you will see, your current setup is probably running a few hundred threads, or maybe over a thousand... (For example, the Windows System process alone will always run over 100 threads on any system.)
May 24, 2011 6:02:20 AM

The application was originally written in C, and I have modified it so that each thread is a class and the whole thing is in C++. This allows me to have 1-1000 or more threads based on a command line switch. I am using Apple's GNU C++ compiler and the OS is Apple's variant of UNIX.

The application is very close to a gate-level simulator, with each thread simulating 1 vector in parallel. This makes the threads pretty much identical, and each one accesses all of the binary netlist (chip description) for every vector. There is very little IO and memory allocation done after the startup is completed. It is possible that the code could be found in the cache frequently, but highly unlikely that the data would be found there.

I inserted counters to find out how often it was waiting for locks and found a very small number. That is why I am positive the threads are not waiting for access to locked shared data.
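
The post does not show how the counters were implemented. One possible way to count lock waits, sketched below, is a wrapper that tries the lock first and records a miss before blocking. `CountingMutex` and `run_contention_demo` are hypothetical names, and the counters are atomics shared by all threads so that the totals cover the whole process.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: counts every acquisition and every acquisition
// that could not be taken immediately. try_lock is allowed to fail
// spuriously, so `contended` is an upper-bound estimate.
class CountingMutex {
public:
    void lock() {
        if (!mutex_.try_lock()) {  // lock was not immediately available
            contended_.fetch_add(1, std::memory_order_relaxed);
            mutex_.lock();         // now block until it is free
        }
        acquisitions_.fetch_add(1, std::memory_order_relaxed);
    }
    void unlock() { mutex_.unlock(); }

    unsigned long acquisitions() const { return acquisitions_.load(); }
    unsigned long contended() const { return contended_.load(); }

private:
    std::mutex mutex_;
    std::atomic<unsigned long> acquisitions_{0};
    std::atomic<unsigned long> contended_{0};
};

// Runs `thread_count` threads that each take the lock `iterations` times
// around a tiny critical section, and returns the final shared value.
long run_contention_demo(CountingMutex& m, unsigned thread_count,
                         int iterations) {
    assert(thread_count > 0 && iterations >= 0);
    long shared_value = 0;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        workers.emplace_back([&m, &shared_value, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<CountingMutex> guard(m);
                ++shared_value;
            }
        });
    }
    for (auto& w : workers) w.join();
    return shared_value;
}
```

Comparing `contended()` with `acquisitions()` after a run gives the fraction of lock operations that had to wait. A small ratio would support the conclusion that locking is not the bottleneck.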

I checked to ensure that the application wasn't swapping and saw that both cores were kept pegged at 100% utilization. I also did not hear a lot of disk accesses while it was running.
May 24, 2011 8:28:05 AM

I worked on a Mac, but for multithreading I used the Boost library. It has better multithreading management, and you can use the same code on Linux and Windows as well.

For gate-level simulation, isn't it easier to use VHDL and its tools? I have some experience with that. To get the best results you should try a much smaller thread count, meaning you should simulate a minimum number of vectors (fewer threads) with the most accurate timing model.
From a programmer's point of view, you have too many threads checking whether there are resources available for them to work on. A simple sleep(n) when a thread is not working will free up a lot of processor time and stop the thread from constantly checking whether it can start working.
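
The suggestion above is sleep(n). A common alternative, which is not what the post proposes, is to block idle threads on a condition variable so that they wake only when work arrives. The sketch below assumes the threads really do wait for work, which the thread does not confirm; `WorkQueue` and `consume_all` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical work queue: a consumer blocks inside pop() until an item
// arrives or the queue is closed, instead of polling in a loop.
class WorkQueue {
public:
    void push(int item) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            assert(!closed_);
            items_.push(item);
        }
        ready_.notify_one();
    }

    void close() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Returns false once the queue is closed and fully drained.
    bool pop(int& item) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        item = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> items_;
    bool closed_ = false;
};

// Pushes 1..count to a single worker thread and returns the sum it computed.
long consume_all(int count) {
    WorkQueue queue;
    long sum = 0;
    std::thread worker([&queue, &sum] {
        int item;
        while (queue.pop(item)) sum += item;
    });
    for (int i = 1; i <= count; ++i) queue.push(i);
    queue.close();
    worker.join();
    return sum;
}
```

While the worker is blocked in `pop()` it uses no CPU time, so an idle thread does not show up as 100% utilization the way a polling loop does.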

"It is possible that the code could be found in the cache frequently, but highly unlikely that the data would be found there."

Yes, that could be the case, as you are probably checking constantly. So try the above: if no data is found, just put the thread to sleep.

"I inserted counters to find out how often it was waiting for locks and found a very small number. That is why I am positive the threads are not waiting for access to locked shared data."

The most common mistake with counters is that they only count in one thread unless they are made a pointer or a global counter, so are you sure you are not counting in only one thread? If you are, you will see that it becomes a very large number once you take the number of threads into account.

"I checked to ensure that the application wasn't swapping and saw that both cores were kept pegged at 100% utilization. I also did not hear a lot of disk accesses while it was running."

No surprise there with the utilization, that is common, but as said before, check that there are not too many checks. As for disk access, you said that there is almost no IO after startup is completed, so there will be disk IO only if your program uses a large amount of RAM (over 50% of installed RAM) or needs to save data to a file. Also take into account that some other process or program in the background could be accessing the disk.