A glimpse at the difficulties of multithreading

August 16, 2008 5:32:06 PM

In its article A Chip Too Far (playing on the WWII movie title A Bridge Too Far), Fortune explains some of the difficulties of multithreading.



Hmmm......where have I heard this before?

Michael Copeland, senior writer, Fortune
August 14, 2008: 5:47 AM EDT

Quote:
The change was set in motion four years ago when Intel (INTC, Fortune 500) and others reached a point where they could no longer make single processors go faster. So they began placing multiple processors (or "cores") on a single chip instead.

That design, however, dramatically raises the level of difficulty for software developers. If they want to get the full oomph out of multicore chips, their applications need to break tasks apart into chunks for each core to work on, a process known as parallel computing. (Programs running on supercomputers have employed this technique for years to simulate atomic explosions or model complex aerodynamics.)

But programming in parallel is simply too complex for the average code writer, who has been trained in a very linear fashion. In conceptual terms, traditional coding could be compared to a woman being pregnant for nine months and producing a baby. Parallel programming might take nine women, have each of them be pregnant for a month, and somehow produce a baby.



Quote:
Even the creators of multicore chips admit they're causing trouble. "I accept that parallel computing is a big problem," says Sean Maloney, executive vice president for sales and marketing at Intel. He says the company is hiring "a lot" of software people to tackle the challenge.



So much for the kiddies' 'all new games will be quad-core optimized' theories.
August 16, 2008 5:41:57 PM

The most interesting thing is that some games are multithreaded but do not take full advantage of what they have available.

A good example is Valve's Source engine. It has a console command you can enter (MQM) that allows it to take advantage of more than just one core, or possibly the full dual/quad. It boosts the FPS in games a lot, but it is not 100% stable.

Valve has stated they are working on it for their next game, Left 4 Dead, but that's yet to be seen.
August 16, 2008 5:43:24 PM

This is what got me laughing when all those reports came out saying Intel was investing $20 million in it. Admirable as it is, it isn't enough for the upcoming needs. Good to see the phrase "a lot" in the quote.
August 16, 2008 5:48:50 PM

Well, I hope this will stem some of the 'unicorn wishes' that every new app or game coming out will be quad-optimized.

Honestly, you would think the fact that dual core has been around for years, yet most programs are still single threaded, would get that point across, but it appears the 'quantity is better than quality' mentality can overwhelm logic.
August 16, 2008 5:52:15 PM

Similar to GHz, regardless of IPC. People generally either don't have a clue, or don't actually look into it deeply enough.
August 16, 2008 6:07:12 PM

You think that's unnerving? I read an interview with the president of Intel a year or so ago. He said that their next step after multiple cores is applying a magnetic field to the computer. This forces all the electrons to the outer edges of the mobo and the other chips (graphics card, etc.). By pushing them to the edges you reduce all the paths of the electrons, making them straight. He said that this could produce a desktop-size PC that ran at 10 GHz or more!

This is no bull; I read it in Discover. Your discussion reminded me of it. I haven't tried researching it in a while, but come on, this was out of the mouth of the guy who started Intel. It also showed a graph of the increase in processor speed then and what was coming up (dual core). I wish I could find that issue to see if that graph's predictions of the future came true. Anyway, thought you peeps might find that interesting. Peace.
August 16, 2008 6:15:04 PM

JAYDEEJOHN said:
Similar to GHz, regardless of IPC. People generally either don't have a clue, or don't actually look into it deeply enough.


True that.

Though, as Endyen likes to point out, and correctly so, IPC is an antiquated term. It's just that there's no easier way to describe the differences in work accomplished per cycle, and "IPC" is so easy to use to get a point across, even if it's no longer technically accurate.
August 16, 2008 6:15:42 PM

Carmack said, give me one CPU that delivers 10 GHz over a quad that does 4 GHz, or something similar. It's true; the need for speed can't be satisfied very easily in an OoO world. That's why graphics need to take a spot in this solution, as graphics cards run in parallel. There are apps out there, and plenty, where these cards will help. The CPU and the GPU need to find a cozy coexistence until something like the above post comes true. Making the electrons straight would increase speed, reduce friction, and allow for better throughput. Making chips for this would be ideal. HKMG might even be overkill if such a thing were practical today.
August 16, 2008 6:18:54 PM

Theory is great. Applying theory is something else. Reminds me of the 10GHz P4s we never saw.
August 16, 2008 6:39:05 PM

Hmmm..

Question.

Then why does my system (XP Home & Vista 64 & Linux) support up to 32 cores on a single socket?

I mean.. so I have a quad system that will never shine as bright as my old P4 Northy and my dad's Athlon XP 2100+? :cry: 
August 16, 2008 6:41:22 PM

As an old SMP dude, we've been hearing about the next killer parallel apps being 'right around the corner' for about ... 10 or more years now? - lol

It started heating up when NT popped and you could get 2P Slot 1 mobos relatively cheap (compared to the Pentium Pro). Started getting really hot with Win2k, and smokin' with XP Pro. Caught fire with dual-core CPUs and then quads ...

But in consumer land it's pretty much been a big yawn. There really isn't any demand for parallel multithreading on Mom & Pop's desktop. I can't blame the software developers for it, really. And the game developers really have to tread lightly into the brave new multicore world.

I know it's frustrating for the gamers, but I wouldn't count on any quantum leaps in the ever-expanding short term. Most likely any perceivable gains will come more from 'load balancing' across cores than from concurrent multiple threads.

Let's face a little reality check here. A game developer would be committing market suicide if their latest version ran only on dual- and/or quad-core CPUs, or at substantially reduced capacity on a single core. There is still a stout single-core market segment that's willing to pay $40-$50 for a PC game but is not about to pay $1,000+ for a new computer just to play games.

They will buy a $300-$400 game console before they do that ...



August 16, 2008 7:27:06 PM


OK, so if multithreading is so far away for the average programmer, then why aren't the development tools - i.e. Havok - shipping with multithreaded capabilities thrown in?


For example, one core working on physics, another on collision, another on user interaction, and the other on game variables.

Surely this would make the game run faster, as the processors work on the task together - or am I missing the point of multithreaded applications?
August 16, 2008 7:33:35 PM

There are already multithreaded games out that severely underperform with one core. Gaming is the most cutting-edge area, so multi isn't a question of whether to use it for sales; it's more one of the cost of doing so.
August 16, 2008 7:34:27 PM

Grimmy said:
Hmmm..

Question.

Then why does my system (XP Home & Vista 64 & Linux) support up to 32 cores on a single socket?

I mean.. so I have a quad system that will never shine as bright as my old P4 Northy and my dad's Athlon XP 2100+? :cry: 


In short, that's not necessarily the case - not because of the core count, but because of the processor generation. While you might pump the P4 to a higher frequency than the quad (depending which P4 you have), the odds that you can get it fast enough to 'do more work' than a single core on the quad are low.

The long answer:


Interesting territory that delves into, of all things, licensing and packaging as well as logical processors. Remember the old debates about the costs of XP and Vista in relation to multicore because of the socket issue? And the one about XP Home supporting only a single core?

First, how does MS define the number of CPUs? By core count, socket, or logical count?

Gotta look at the M$ legal stuff for that:

http://www.microsoft.com/licensing/highlights/multicore.mspx

Quote:
Licensing on a per-processor rather than a per-core basis ensures that customers will not face additional software licensing requirements or incur additional licensing fees when they choose to adopt multicore processor technology. Customers who use software from vendors that license by individual core, as other software vendors currently do, may face increased software costs when they upgrade to multicore processor systems. Multicore processor systems licensed on a per-processor basis will also help make this new enterprise computing technology affordable to midsize and small business customers.


Now that we know how M$ 'counts' CPUs (by socket), how do they support multicore?

Here is a good chart, taken from Paul Thurrott's "SuperSite for Windows":

http://www.winsupersite.com/showcase/winvista_editions_final.asp





Now, according to this, there is no limit on the number of cores Vista will support per socket. Unfortunately, it doesn't address logical CPUs. The number of physical CPUs has nothing to do (well, not strictly true) with the number of logical devices/'CPUs'. If I recall correctly, Vista is limited to 32 logical devices. I don't have a link to prove that, and I very well may be wrong.

The big thing is, just because the OS supports multicore doesn't mean it makes use of all those cores. It's like a gas tank. You may have a 30-gallon tank in your car, but that doesn't mean it has 30 gallons of gas in it at the moment, only that it can hold up to 30.




August 16, 2008 7:38:52 PM

Game dev is costly; that's the main thing holding it back. Most devs struggle using a multi approach, and even those that are good at it, or acceptable, still run into cost/overhead scenarios.

Hell, in gaming they've ruined many a game by diminishing the story line while keeping the eye candy in, only because of cost. In an FPS game, all they need is explosions and blood, guts and glory to sell. Multi, story line, and even eye candy are many times not even a consideration, let alone a must. Most games are ported from console, which is a multi solution to begin with, so doing it for gaming comes down mainly to costs.
August 16, 2008 7:53:16 PM

As a programmer, I can tell you multithreading is hard; it's often just not worth it. Don't forget that two threads can run on one core or on two; they are just two lines of execution, which may never meet.

The problem is memory addressing and debugging - one thread can't safely touch another's memory. Having said that, for games you only really need to multithread the engine once, and then just reuse it.

On top of that, multicore/multithreaded systems are still great, as you can run many different apps at the same time.
August 16, 2008 7:58:18 PM

Multi-threading an application seems easy: make one core do this and the other do that ... In fact it's pretty damn complex, since they all have to be synchronized; you cannot render a particle until you know where in space it should be, and things like that. Also you have to make sure threads don't each lock a resource the other needs (called dead-locks). Finally, multi-threaded applications are hell to debug, since you have to deal with what's called "race conditions": some bugs will only show up in very weird circumstances (e.g. when you open an application while performing video encoding) just because some threads become slower and don't answer in time ... looping back to my first challenge: synchronization.
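
To make the dead-lock point concrete, here is a minimal modern-C++ sketch (the lock names are illustrative, not from any real engine) of two threads taking the same two locks in opposite order; run it a few times and it will usually hang:

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex physics_lock;   // guards physics state
std::mutex render_lock;    // guards render state

void physics_thread() {
    std::lock_guard<std::mutex> a(physics_lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // widen the race window
    std::lock_guard<std::mutex> b(render_lock);   // blocks if the renderer holds it
    std::cout << "physics updated\n";
}

void render_thread() {
    std::lock_guard<std::mutex> a(render_lock);   // opposite locking order: the bug
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> b(physics_lock);
    std::cout << "frame rendered\n";
}

int main() {
    std::thread t1(physics_thread), t2(render_thread);
    t1.join();
    t2.join();   // with the sleeps above, the joins usually never return
}

The usual cure is to agree on one global lock order (or acquire both with std::lock). The race-condition complaint is the same code minus the sleeps: the hang only shows up when the timing happens to line up.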
August 16, 2008 8:00:48 PM

What I meant by 'shine as bright'... usually getting a single core's CPU usage to 100% is easier, not that it can perform better. :lol: 

Also, I've mentioned it before: my Vista seems to treat the multi-cores differently, well, at least the quad. Kinda like there's a different threshold on the usage. Even Linux acts the same, where one core has to be fully loaded before another core takes on a task. Never did take the time to figure out if there's a way to assign a core to a specific task on Linux.

For example, Super_PI on XP, when run, will only load one core to 100%. My Vista acts differently; the usage varies across cores 0-3. The only way to make it act like XP is to set Super_PI's affinity to one core.

But it's a shame if they can't figure out a way to make multi-core CPUs better, game-wise. Although if you have a 10 GHz CPU, I can't imagine how much copper you'd need to keep it cool. :lol: 
August 16, 2008 8:01:41 PM

Hellboy said:
OK, so if multithreading is so far away for the average programmer, then why aren't the development tools - i.e. Havok - shipping with multithreaded capabilities thrown in?


For example, one core working on physics, another on collision, another on user interaction, and the other on game variables.

Surely this would make the game run faster, as the processors work on the task together - or am I missing the point of multithreaded applications?



Which is interesting in how it affects the dynamics of marketing, or 'marketeering'.

Take for example the Ageia Physics processor

http://www.tomshardware.com/news/ageia-physx,2490.html
Sounds great, right? But wait!!! Why not just use the other CPU cores for that (as you suggest), rather than spending extra money on the Ageia product?

Well, in this later Ageia physics processor article by THG, they showed that the requirement to have an Ageia physics processor to run CellFactor could be bypassed, and CellFactor could be run without the Ageia card.

http://www.tomshardware.com/reviews/ageias-physx-failing,1285-5.html


Quote:
We took several runs of each test to come to the averaged frames per second results shown in the chart below. The raw scores were all within a frame or two of each other, with each card.


http://img.tomshardware.com/us/2006/07/14/is_ageias_physx_failing/chart-cellfactor.gif

Quote:
It is clear from the reported scores that there isn't much difference in the performance with and without the card. Of course this is better than having the scores with the card enabled lower than those with the card disabled, as was the case in the Ghost Recon Advanced Warfighter results in the last article. But still, this is not good news for Ageia. The Shader Model 2.0 Radeon X850XT was obviously rendering the scenes using a lower code path, but the results were the same. No matter what card we put into that test platform, we came up with the same 2-3 frames per second difference.


So, if THG's work is to be trusted, the Ageia physics processor is basically a waste of money. How much more of a waste would it be if the functions it performs were threaded into software to be run on the CPU, or a GPU? Then the Ageia physics processor would be incontestably worthless, putting that company out of business.

The point of this? I'm with you. Thread what you can to the CPU cores. Unfortunately, some stuff just can't be put on the CPU cores.....but it could be put on the GPU, which complicates the problems of multithreaded development even further, and it's hard enough as it is...

Quote:
2.1.....Multithreading

Another factor to consider concerning software when deciding which CPU to purchase is multithreading. Today, the majority of new desktop CPUs available are multicore, meaning they have more than one core, or processor, on the die. More cores can improve processing speed; however, the vast majority of software currently available is single threaded….meaning it can only run on one core. As such, with most software, more cores will not offer a significant advantage.

There are 3 basic types of software multithreading: coarse, fine and hybrid.

Coarse multithreading is an instance where a program is specifically written to use multiple cores. It is the most effective type of multithreading, but it is also the most difficult and time consuming to write. It is also the most limited in terms of core scaling. If you have a program written for 2 cores, going from a dual-core CPU to a quad-core CPU will not provide an appreciable increase in performance. It will add some small increase, as background applications can be run on a third core, but even though you have doubled the cores, it will not 'double' the performance. Conversely, if you are using coarse-threaded software optimized for 4 cores, running it on a dual core will limit the speed of the program to ½ its maximum potential.

Fine multithreading is multithreading that uses loops. Anywhere a repetitive process occurs within a program which does not have to call for data or provide data during the process, the loop can be assigned to its own core. The number of cores this type of multithreading can use is limited by the number of independent loops within the software, so this software will scale with the number of cores better than coarse multithreading, but it is not as efficient as coarse multithreading.

Hybrid multithreading is a combination of coarse and fine multithreading.


....not to mention that if you thread everything you can to the CPU, you deny the possibility of selling extra products like the Ageia physics processor.

It all comes down to time and money. Minimize the time to get the app/game to market, maximize the profit.
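
To illustrate the 'fine' (loop-level) threading the quoted guide describes, here is a minimal modern-C++ sketch, assuming the iterations are truly independent (the function names are made up):

#include <cstddef>
#include <thread>
#include <vector>

// Each chunk touches a disjoint range, so the threads never interact.
void scale_chunk(std::vector<float>& data, std::size_t begin, std::size_t end, float k) {
    for (std::size_t i = begin; i < end; ++i)
        data[i] *= k;
}

// Split one independent loop across N cores ("fine" multithreading).
void scale_parallel(std::vector<float>& data, float k, unsigned cores) {
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / cores;
    for (unsigned c = 0; c < cores; ++c) {
        std::size_t begin = c * chunk;
        std::size_t end = (c + 1 == cores) ? data.size() : begin + chunk;
        workers.emplace_back(scale_chunk, std::ref(data), begin, end, k);
    }
    for (auto& w : workers)
        w.join();
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    scale_parallel(data, 2.0f, 4);   // e.g. one chunk per core on a quad
}

As the quote says, this scales only as far as the program actually contains independent loops; everything else stays serial.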
August 16, 2008 8:06:26 PM

Grimmy said:
What I meant by 'shine as bright'... usually getting a single core's CPU usage to 100% is easier, not that it can perform better. :lol: 

Also, I've mentioned it before: my Vista seems to treat the multi-cores differently, well, at least the quad. Kinda like there's a different threshold on the usage. Even Linux acts the same, where one core has to be fully loaded before another core takes on a task. Never did take the time to figure out if there's a way to assign a core to a specific task on Linux.

For example, Super_PI on XP, when run, will only load one core to 100%. My Vista acts differently; the usage varies across cores 0-3. The only way to make it act like XP is to set Super_PI's affinity to one core.

But it's a shame if they can't figure out a way to make multi-core CPUs better, game-wise. Although if you have a 10 GHz CPU, I can't imagine how much copper you'd need to keep it cool. :lol: 


Well, if Super Pi runs similarly to Prime, you have to run one instance for every core. The embedded 'tasking' apps in TAT are the same. I played with that stuff about 3 years ago on an E6600 (yes, E, not Q). They wouldn't share the load of a single instance, but they would spread the individual instances themselves across the cores.
August 16, 2008 8:21:05 PM

turpit said:
Well, if Super Pi runs similarly to Prime, you have to run one instance for every core. The embedded 'tasking' apps in TAT are the same. I played with that stuff about 3 years ago on an E6600 (yes, E, not Q). They wouldn't share the load of a single instance, but they would spread the individual instances themselves across the cores.


I just brought my XP system up and ran Prime with one worker thread on my dual core. One core remained at 100%.
Super PI is a single-threaded app, so again, on XP, one core is loaded, even though the affinity is set to use any core.

Now on Vista, using Prime to run one worker thread, it spread the load over all 4 cores at around 30-40% each. It's almost the same deal with Super PI, since it's set to use any core. So to me, the threshold on Vista doesn't matter as much as on my XP or Linux system.

So the OS does have an effect on the loads for the cores.
August 16, 2008 8:31:32 PM

Zenthar said:
Multi-threading an application seems easy: make one core do this and the other do that ... In fact it's pretty damn complex, since they all have to be synchronized; you cannot render a particle until you know where in space it should be, and things like that.


I've been writing multi-threaded applications since the late 80s; while it can be complex, there are various methods of dramatically reducing that complexity.

The simplest is to use message queues between threads, with one thread producing messages while the other thread processes them (e.g. on one side might be your physics thread(s) determining where objects are; on the other side might be the render thread(s) actually drawing those objects on the screen). That eliminates the synchronisation problems, since the only place where you have to synchronise is in the queue code; the downside is that if the system is poorly designed, the overhead of the message passing can be greater than the benefit of the extra thread.
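
For concreteness, here is a minimal modern-C++ sketch of that producer/consumer queue (the message struct and thread roles are illustrative, not from any real engine):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct ObjectPos { int id; float x, y, z; };   // one "message"

class MessageQueue {
    std::queue<ObjectPos> q;
    std::mutex m;                  // the only lock in the whole design
    std::condition_variable cv;
public:
    void push(const ObjectPos& p) {
        { std::lock_guard<std::mutex> lk(m); q.push(p); }
        cv.notify_one();
    }
    ObjectPos pop() {              // blocks until a message arrives
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        ObjectPos p = q.front();
        q.pop();
        return p;
    }
};

int main() {
    MessageQueue mq;
    std::thread physics([&] {      // producer: decides where objects are
        for (int i = 0; i < 100; ++i)
            mq.push({i, float(i), 0.0f, 0.0f});
    });
    std::thread render([&] {       // consumer: "draws" whatever arrives
        for (int i = 0; i < 100; ++i) {
            ObjectPos p = mq.pop();
            (void)p;               // a real renderer would draw p here
        }
    });
    physics.join();
    render.join();
}

All the synchronisation lives in push() and pop(); the physics and render code never touch a lock directly, which is exactly the property described above.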

So yes, if you keep trying to write traditional single-threaded code and stick multiple threads into it, that's a huge pain in the ass; but that's like complaining that coding in assembler is a huge pain in the ass compared to C++... you're using the wrong tools for the job.
August 16, 2008 9:04:31 PM

turpit said:
Which is interesting in how it affects the dynamics of marketing, or 'marketeering' ... It all comes down to time and money. Minimize the time to get the app/game to market, maximize the profit.



On Medal of Honor: Airborne, it installs an Ageia physics card driver whether or not an Ageia card is fitted...

It's a fun game, btw.

Now, I recommended this game to a mate of mine who is into wargames and such, and who had a P4 3.2 GHz and a 7800GS AGP card, and it ran like an absolute dog... It wasn't even playable, unlike BF2 and BF2142, which run like a dream on his system...

I put this down to the fact that the physics will not run well on a single-core system... The 7800GS, although slow compared to some, still cuts the mustard...

Anyway, he's got himself a PS3 and the game for it, but it's shabby compared to my PC version - I guess down to the physics "emulation", as I am sure it's not programmed into the PS3 version..

On my C2D 6700 with an 8800GTX it runs unbelievably smooth with no slowdown, so to me it would seem that the physics is running either on the second core or on the guts of the 8800GTX..

So this goes further to prove that a dedicated Ageia physics card is unnecessary if the system is high-spec enough.. I mean, how many THG frequenters boast about having an Ageia card...

It's about as much use as a $200 network card..

There was talk of an Indiana Jones game with physics built into the game's engine - but in light of LucasArts going the way of many of the others, with the titles passed on to either EA or Activision (I can't remember which), it might not even see the light of day..

What would do well for Ageia is to set the standard and license the code in a game-creation tool for multicore processors.. 'Ageia Physics Compliant', for example, like Nvidia did with TWIMTBP. This may be the re-invention of the wheel we are looking for, while the rest of the game-publishing community works out how to get the most out of a Q6600 - let alone a Skulltrail system (who in their right mind would buy one of these when Nehalem (oops, i7) will outrun them?) - to give us an enhanced experience over the consoles which are dominating our release schedules..

Unfortunately, I hate to say it, but there is not that much difference between consoles and PC nowadays - yeah, the graphics are better on PC, but the games play almost the same..... Some games, like car racing, soccer and other sports games, are better than on PC due to the control pads and steering wheels, which never caught on in the PC market as they should have done..

On a second note, I do honestly believe that THG work very hard at giving us benchmarks that are honest, non-biased (some who I would like to mention I'm sure would disagree), true and believable, compared to some other unreliable sources which I am sure we all could mention...

Although I'm not a chip engineer, or even an electronics engineer, I have been in this game for nearly 24 years now as a profession, diagnosing faults and building PCs, and I have seen things come and go. But the PC is at a bit of a standstill right now, which I never thought I would see, and it needs some clever tools and software to move to the next level and take advantage of what we've got; otherwise it's at a stalemate...

Unfortunately Vista, which promised everything, has failed miserably in becoming the next big thing - more like a little jump, still doing the same thing but looking prettier, with a few complications thrown in (UAC, anyone?)

Just because your new girlfriend is better looking doesn't mean her cooking's any better!



August 16, 2008 9:10:48 PM

x264 scales near-linearly with the number of cores and MHz :) 
August 16, 2008 9:11:06 PM

Grimmy said:
I just brought my XP system up and ran Prime with one worker thread on my dual core. One core remained at 100%.
Super PI is a single-threaded app, so again, on XP, one core is loaded, even though the affinity is set to use any core.

Now on Vista, using Prime to run one worker thread, it spread the load over all 4 cores at around 30-40% each. It's almost the same deal with Super PI, since it's set to use any core. So to me, the threshold on Vista doesn't matter as much as on my XP or Linux system.

So the OS does have an effect on the loads for the cores.



Well, I dunno what to tell you. Vista may handle things differently, though I doubt it. I suspect you're doing something errr....off key, or you're looking at the overall CPU usage, which is the summation of both cores, and not the individual loading. You have to look at the CPU usage history to get individual core loading....at least in XP.

I'm not home, but I just ran Prime on my laptop, and without further chatter, here it is:



Note there was no notable load sharing. Core 1 was loaded, core 2 was running everything else, until I started running the second instance.
August 16, 2008 9:15:02 PM

I took the time to do my screen caps:

XP Prime 1 worker thread:



Vista Prime 1 worker thread:



I also used Task Manager to show that all cores are selected.

Edit:

So in order for me to make Vista run like XP, I have to un-select cores 1-2-3, to have only core 0 at 100%.

Does that make what I'm saying clearer?
August 16, 2008 9:20:24 PM

Well, on the XP run only core one is loaded; in the Vista run it does look like it's distributing the load amongst the cores---but it shouldn't. Try running 3 and 4 instances in Vista and see what it does to the loading. Make sure you're running the FPU-intensive test and not loading the memory.
August 16, 2008 9:28:28 PM

Heh.. that was the FPU stress test.

I don't know why you won't take my word for it.

If I run 4 worker threads.. all 4 cores will be at 100%

Running 2-3 worker threads, it will act the same; it will spread the load across the other cores. I'll do another screen cap with a history.
August 16, 2008 9:45:38 PM

Grimmy said:
Heh.. that was the FPU stress test.

I don't know why you won't take my word for it.

If I run 4 worker threads.. all 4 cores will be at 100%.

Running 2-3 worker threads, it will act the same; it will spread the load across the other cores. I'll do another screen cap with a history.
Oh come on, you know I rarely take anyone's word for anything unless they have the proof to back it up :kaola: 

But your results do indicate that Vista distributes the load....which I find difficult to believe, because it is Vista and it is M$. But the thing is - and this is why I singled out the FPU test - with the memory games Vista plays......I dunno...I just can't see Vista having some form of inherent fine multithreading....it's M$.
August 16, 2008 9:49:13 PM

Quote:
The obvious issue with multithreading is not exactly the difficulty. The major problem with multithreading is that you can't charge more for the same SW. So if you sell games for $49 and it costs X dollars to make a game single-threaded but 2X dollars to make it multithreaded, the game makers make half the money.


Why would a game cost 2x as much to write with multithreading? Maybe if you have a gang of low-skilled code monkeys who don't know what they're doing, but there's no good reason why a multithreaded game engine should add tens of millions of dollars to your development costs.

Quote:
Games are really the only thing that will need multi-threading on the desktop, but people want a 10GHz CPU for $100 (talk about devaluing something), so they're not going to pick up the extra costs for the more complex development cycle.


Video compression, 3D rendering, video playback, etc, etc, etc. Anything CPU-intensive will benefit from multi-threading; it's just that most software most people run these days isn't CPU-intensive.
August 16, 2008 9:49:17 PM

Grimmy said:


So in order for me to make Vista run like XP, I have to un-select cores 1-2-3, to have only core 0 at 100%.

Does that make what I'm saying clearer?

LOL. In order to make Vista run like XP you'd have to add enough bloatware to consume 5% more processor resources and 25% more memory, incur a 5~10% performance hit, and take up 10 more gigs of HDD space.
August 16, 2008 10:03:09 PM

turpit said:
LOL. In order to make Vista run like XP you'd have to add enough bloatware to consume 5% more processor resources and 25% more memory, incur a 5~10% performance hit, and take up 10 more gigs of HDD space.


Had a feeling you'd say that. I did have the DreamScene wallpaper running in the first cap I showed, and that does take up CPU resources.

Okay, now I took the time to make sure my idle was low (it jumps around 0-4%). Also I used another program to draw a history. It's small, but you can still see how its histogram works:

Vista - Idle - before 1 worker thread set:



Vista - load - 1 worker thread:



Vista - Idle - before 2 worker threads set:



Vista - load - 2 worker threads:



Does that draw a better picture of how the load balancing is different?
August 16, 2008 10:08:41 PM

no
August 16, 2008 10:12:38 PM

:cry: 

Edit:

Whelp... believe what you want. I'm not trying to pull anyone's leg on this. Just trying to show what I see between my 2 systems, which don't contain bloatware. :lol: 

Whelp.. I need to get some sleep. :sleep: 
August 16, 2008 10:58:16 PM

For some applications multithreading is nearly impossible, and some just eat it up :) 

Its always going to be that way.

Multithreading in software seems to me to require at least 3 threads: 2 threads doing work and 1 thread doing traffic control or acting as a memory-management thread.

Although I still often wonder why cores cannot be organized more like a bucket brigade for software: each core carries the bucket a little way. I can see this being great where the source and destination are only in one place, but when you move the target/source around, one bucket would work best....
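
That bucket-brigade idea is essentially a software pipeline. Here is a rough modern-C++ sketch under that assumption (the stage names are invented): each core runs one stage and hands its bucket down a blocking queue to the next.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class Channel {                    // small blocking queue linking two stages
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void send(T v) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(v)); }
        cv.notify_one();
    }
    T recv() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        T v = std::move(q.front());
        q.pop();
        return v;
    }
};

int main() {
    Channel<int> wash_to_rinse, rinse_to_dry;   // the two hand-offs
    std::thread wash([&] {         // stage 1: fills the bucket
        for (int b = 0; b < 10; ++b) wash_to_rinse.send(b);
    });
    std::thread rinse([&] {        // stage 2: carries it a little way
        for (int b = 0; b < 10; ++b) rinse_to_dry.send(wash_to_rinse.recv() + 100);
    });
    std::thread dry([&] {          // stage 3: the destination
        for (int b = 0; b < 10; ++b) (void)rinse_to_dry.recv();
    });
    wash.join(); rinse.join(); dry.join();
}

Throughput scales with the number of stages only while each stage does similar work, which matches the poster's caveat about the source and destination moving around.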
August 16, 2008 11:40:40 PM

turpit, it may be hard to believe, but Vista does seem to load all the cores very well when doing tasks. Heck, I play TF2, and I see core 0 at about 30-50% load, and while playing in intense battles the other 3 cores tend to jump between 10-20% load.

Maybe Vista does have the stuff M$ says it does. You never know....
August 17, 2008 12:27:01 AM

Grimmy said:
:cry: 

Edit:

Whelp... believe what you want. I'm not trying to pull anyone's leg on this. Just trying to show what I see between my 2 systems, which don't contain bloatware. :lol: 

Whelp.. I need to get some sleep. :sleep: 


np.. based on what you showed, I'm going to have to dig around for a little something..we'll see what I find. See, it still doesn't add up. Prime should load all of your cores to 100%. That it's only loading to 1/4 total usage still indicates that it's behaving normally, but the question is...how would it "know" what the capacity is...it's confusing.

EDIT--------

BTW, I don't think you're trying to pull anyone's leg...
August 17, 2008 2:43:07 AM

jimmysmitty said:
turpit, it may be hard to believe, but Vista does seem to load all the cores very well when doing tasks. Heck, I play TF2, and I see core 0 at about 30-50% load, and while playing in intense battles the other 3 cores tend to jump between 10-20% load.

Maybe Vista does have the stuff M$ says it does. You never know....


Loading cores per app is easy. Trying to externally 'share' the load of a single app is something different.
August 17, 2008 2:57:08 AM

Quote:
That would be based on a HW scheduling technique across cores. It would make missed branches even worse for some things. That would only work in cases where there's no shared data, but that rarely happens. Though I do remember reading about an initiative that allowed shared data without locks which would solve the deadlock problem.



Yeah, I think transcoding video is about the only app I can think of that can make use of almost unlimited cores (you can break a movie up into frames and assign them to a core for processing without corrupting what another core is working on) - see the sketch at the end of this post.

Edit: Hmm, maybe mirroring the data in RAM so each CPU has its own complete data set for some programs?

This is an interesting topic to say the least :) 

Edit 2: I think different techniques would be needed for different problems (reinventing the wheel here, I am sure, lol, but it's a good mental exercise, and you never know when you may stumble onto something new).

Branch prediction across CPUs is a cool idea, to tell the truth; it might have problems with some applications, but others would work really well (easily predicted things like video).
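
Here is a minimal sketch of that frame-splitting idea (process_frame is a stand-in for real codec work, and the sizes are arbitrary): each worker owns a disjoint stride of frames, so no core ever touches another's data.

#include <cstddef>
#include <thread>
#include <vector>

struct Frame { std::vector<unsigned char> pixels; };

void process_frame(Frame& f) {     // stand-in for a real encode step
    for (auto& p : f.pixels)
        p = static_cast<unsigned char>(255 - p);
}

int main() {
    std::vector<Frame> movie(240, Frame{std::vector<unsigned char>(640 * 480)});
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;     // fall back if the core count is unknown
    std::vector<std::thread> workers;
    for (unsigned c = 0; c < cores; ++c)
        workers.emplace_back([&movie, c, cores] {
            // strided ownership: worker c gets frames c, c+cores, c+2*cores, ...
            for (std::size_t i = c; i < movie.size(); i += cores)
                process_frame(movie[i]);
        });
    for (auto& w : workers)
        w.join();
}

Because no frame is shared, there are no locks at all in the hot path; this is why transcoders scale nearly linearly with core count, as noted earlier in the thread.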
August 17, 2008 3:56:42 AM

turpit said:
Loading cores per app is easy. Trying to externally 'share' the load of a single app is something different.


That's what I mean. Source itself is not fully multithreaded, even though certain things are, such as the physics and particle systems. When I play TF2 or any Source-based game I have nothing else running in the background, and all of my cores are active. Yes, only the main core, core 0, has the most load, but the rest are still getting a bit of a load, and it confuses me when there is nothing else running.

I think a great example of this is the low FPS people get in TF2 even with a C2D @ 3GHz+ and an 8800+ GPU. It's a weird occurrence, but it does happen to some people. Yet I can take my Q6600 @ 3GHz and my HD2900 Pro 1GB and play at 1920x1080 on my 40" TV with everything maxed, including 16x AA and 8x AF, and still get a smooth 60FPS.

IDK. I am just saying it could be a possibility, but we won't know unless we do some intensive research.
August 17, 2008 5:36:25 AM

Grimmy said:
Whelp... believe what you want. I'm not trying to pull anyone's leg on this. ...

----NEW ENTRY-----

Grimmy,

I've been doing some digging, but haven't found anything yet.

Here's the thing, and it's difficult to explain textually....

If Vista is doing what you believe it's doing, then running 1 instance of prime95, single threaded, you should be seeing 100% load on all four cores.

If Vista is load-distributing the actual single-thread routine, then immediately after you start Prime, as the load on the primary core builds, some of it will be shunted to the other cores, leaving some of the primary core's resources free. But because Prime will consume all available resources, those resources 'freed' in the primary core by distributing the load to the other cores should still be consumed, and as the load builds it should continue to be distributed, until the resources of all 4 cores are completely consumed.

In short, given the way you think Vista is handling apps, what you should see is a near-instantaneous load cascade, with all four cores sequentially building to full capacity. UNLESS there is a limit to the resources prime95 will consume. Now, I have never heard of a limit to the FPU calcs Prime will run, but I have heard of memory limitations, which is why I noted to run FPU only. So, if there is an FPU limit, then what you have shown should prove what you believe to be true. But then, if there were an FPU limit, we should have known about it before now. That 'limit' would have shown, for example, as a 50% distribution for an E6600 running Prime in Vista. It should also show as something less than ~25% for, say, a Q9550...maybe 17-19% across all four cores (a loose guesstimate).

Now, assuming there is no limit to the FPU calcs in Prime, then what you've shown indicates to me that something fishy is going on somewhere, either with Prime or with Vista, and this is the confusing part, because if Vista is load sharing as you believe, it shouldn't be limiting Prime to 25%/core.

Guess where I think the fish stink is coming from. ;)  And no, it's not you, you know better than that. I don't think you're pulling anyone's leg, but I think M$ may be pulling all our legs. It wouldn't be the first time, and it certainly won't be the last.
August 17, 2008 1:38:39 PM

turpit said:
If Vista is doing what you believe it's doing, then running 1 instance of prime95, single threaded, you should be seeing 100% load on all four cores. ... Guess where I think the fish stink is coming from. ;) ...


Well.. first off, I think my MB died. After last night, it simply won't POST anymore. I think the old NV 650i just gave out on me.

Now, about the Prime95: it's not the older version, it's the updated version, 25.6.1.0, which you can find in the OC guide on the quad.

In order to load all cores, you assign worker threads, as I stated. On my first screen cap on XP, you should notice only one core loaded, since I used 1 worker thread. Now if I use 2 worker threads, the dual core will be at full CPU usage. The same goes for my quad: when I assign 4 worker threads, all 4 cores will be fully loaded.

So... when I use 1 worker thread on Vista, it will not load just one core, UNLESS I assign one core using the affinity setting, which I could assign to any core (0/1/2/3). Super PI is a single-threaded app. Now, on XP, like I explained (running Super PI 1MB), it loads only one core, but on Vista it doesn't do that, unless, again, I use the affinity setting.

I can tell you that when I run the 1MB iteration on Super PI without setting the affinity, it will take longer (~.500 ms). So it is switching between different cores, kinda like 'hot potato': the process is being passed between cores over time, or keeping the potato in the air, so to speak.

But... atm I can't do any more tests, unless someone here with a quad can do some tests for turpit.

Looks like I'll be looking around for a P35 or P45 chipset this week.

Edit:

Forgot, from my OC testing.. here's Prime95 loading all 4 cores with 4 worker threads:
http://members.cox.net/fade.2.black/p6n/Q6600-final-3200ghz.jpg
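
For anyone who wants to script what Task Manager's affinity setting does, here is a small Windows-only sketch using the Win32 affinity call; the spin loop is just there to make the load visible, and the hard-coded mask is an assumption for the demo.

#include <windows.h>
#include <stdio.h>

DWORD WINAPI spin(LPVOID unused) {
    (void)unused;
    volatile unsigned long long x = 0;
    for (;;) ++x;                  // busy loop so the load shows in Task Manager
    return 0;                      // never reached
}

int main(void) {
    HANDLE t = CreateThread(NULL, 0, spin, NULL, 0, NULL);
    // Mask bit 0 set => the thread may run only on core 0. Without this call
    // the scheduler is free to migrate the thread between cores ("hot potato").
    if (SetThreadAffinityMask(t, 1) == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
    Sleep(10000);                  // watch core 0 peg while the others idle
    TerminateThread(t, 0);         // acceptable in a throwaway demo only
    CloseHandle(t);
    return 0;
}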
August 18, 2008 5:43:13 AM

Grimmy said:
Well.. first off, I think my MB died. ... Looks like I'll be looking around for a P35 or P45 chipset this week. ...



Grimmy,

I understand how prime works. I used the same version to run the demos on my laptop the other day.

What I'm trying to explain is that the program for the prime calculation is linear. You can run multiple threads, but those are only individual instances of the same calculation.

How you seem to be seeing this is that the Prime FPU calculation is finite - that is, that there is a limit to the number of loops it will generate, like a game. Again, if this were so, we would have seen that limit a long time ago, on any core count, single, dual or quad. But Prime consumes all the resources it can if you set it to.

Vista can't 'load share' a program of its own accord...it can only do what the program tells it to do. If the program is written in such a manner that it will allow the OS to distribute the load (multithreaded) AND the OS is capable of such tasking, then that can be accomplished. Regardless, Vista itself can't 'break apart' a program to distribute operations among cores; it can only distribute what the program 'tells' it can be distributed. In the case of Prime, were Vista running like you think it is, shunting calculation loops of its own accord, then all 4 cores should load to 100% on a single thread. We both know this is not happening....so that's neither the question nor what's confusing.

What's confusing is how Vista is presenting the information on the loading...essentially, it looks like Vista is lying, which frankly would not be surprising.


Here's the math behind the prime calcs, from the Mersenne site itself.

http://www.mersenne.org/math.htm

From the Great Internet Mersenne Prime Search
Quote:
The next step is to eliminate exponents by finding a small factor. There are very efficient algorithms for determining if a number divides 2^P-1. For example, let's see if 47 divides 2^23-1. Convert the exponent 23 to binary: you get 10111. Starting with 1, repeatedly square, remove the top bit of the exponent and, if it is 1, multiply the squared value by 2, then compute the remainder upon division by 47.

                Remove    Optional
Square          top bit   mul by 2         mod 47
------------    -------   --------------   ------
1*1 = 1         1  0111   1*2 = 2             2
2*2 = 4         0   111   no                  4
4*4 = 16        1    11   16*2 = 32          32
32*32 = 1024    1     1   1024*2 = 2048      27
27*27 = 729     1         729*2 = 1458        1

Thus, 2^23 = 1 mod 47. Subtract 1 from both sides: 2^23-1 = 0 mod 47. Since we've shown that 47 is a factor, 2^23-1 is not prime.

One very nice property of Mersenne numbers is that any factor q of 2^P-1 must be of the form 2kp+1. Furthermore, q must be 1 or 7 mod 8. A proof is available. Finally, an efficient program can take advantage of the fact that any potential factor q must be prime.

The GIMPS factoring code creates a modified sieve of Eratosthenes with each bit representing a potential 2kp+1 factor. The sieve then eliminates any potential factors that are divisible by prime numbers below 40,000 or so. Also, bits representing potential factors of 3 or 5 mod 8 are cleared. This process eliminates roughly 95% of potential factors. The remaining potential factors are tested using the powering algorithm above.

Now the only question remaining is how much trial factoring should be done? The answer depends on three variables: the cost of factoring, the chance of finding a factor, and the cost of a primality test. The formula used is:

factoring_cost < chance_of_finding_factor * 2 * primality_test_cost

That is, the time spent factoring must be less than the expected time saved. If a factor is found we can avoid running both the first-time and double-check primality tests.
Looking at past factoring data we see that the chance of finding a factor between 2^X and 2^(X+1) is about 1/X. The factoring cost and primality test costs are computed by timing the program. At present, the program trial factors to these limits:

Exponents     Trial
up to         factored to
----------    -----------
 3,960,000    2^60
 5,160,000    2^61
 6,515,000    2^62
 8,250,000    2^63
13,380,000    2^64
23,390,000    2^65
29,690,000    2^66
37,800,000    2^67
47,450,000    2^68
58,520,000    2^69
75,670,000    2^70
96,830,000    2^71



--------------------------------------------------------------------------------

P-1 Factoring

--------------------------------------------------------------------------------

There is another factoring method that GIMPS uses to find factors and thereby avoid costly primality tests. This method is called Pollard's (P-1) method. If q is a factor of a number, then the P-1 method will find the factor q if q-1 is highly composite - that is, it has nothing but small factors.

This method, when adapted to Mersenne numbers, is even more effective. Remember that the factor q is of the form 2kp+1. It is easy to modify the P-1 method such that it will find the factor q whenever k is highly composite.

The P-1 method is quite simple. In stage 1 we pick a bound B1. P-1 factoring will find the factor q as long as all factors of k are less than B1 (k is called B1-smooth). We compute E - the product of all primes less than B1. Then we compute x = 3^(E*2*P). Finally, we check the GCD(x-1, 2^P-1) to see if a factor was found.

There is an enhancement to Pollard's algorithm called stage 2 that uses a second bound, B2. Stage 2 will find the factor q if k has just one factor between B1 and B2 and all remaining factors are below B1. This stage uses lots of memory.

GIMPS has used this method to find some impressive factors. For example:

2^2944999-1 is divisible by 314584703073057080643101377.
314584703073057080643101377 is 2 * 53409984701702289312 * 2944999 + 1.
The value k, 53409984701702289312, is very smooth:
53409984701702289312 = 2^5 * 3 * 19 * 947 * 7187 * 62297 * 69061


So how does GIMPS intelligently choose B1 and B2? We use a variation of the formula used in trial factoring. We must maximize:

chance_of_finding_factor * 2 * primality_test_cost - factoring_cost

The chance of finding a factor and the factoring cost both vary with different B1 and B2 values. Dickman's function (see Knuth's Art of Computer Programming vol 2) is used to determine the probability of finding a factor, that is k is B1-smooth or B1-smooth with just one factor between B1 and B2. The program tries many values of B1 and if there is sufficient available memory several values of B2, selecting the B1 and B2 values that maximize the formula above.


--------------------------------------------------------------------------------

Lucas-Lehmer testing

--------------------------------------------------------------------------------

The Lucas-Lehmer primality test is remarkably simple. It states that for P > 2, 2^P-1 is prime if and only if S(P-2) is zero in this sequence: S(0) = 4, S(N) = (S(N-1)^2 - 2) mod (2^P-1). For example, to prove 2^7 - 1 is prime:
S(0) = 4
S(1) = (4 * 4 - 2) mod 127 = 14
S(2) = (14 * 14 - 2) mod 127 = 67
S(3) = (67 * 67 - 2) mod 127 = 42
S(4) = (42 * 42 - 2) mod 127 = 111
S(5) = (111 * 111 - 2) mod 127 = 0


To implement the Lucas-Lehmer test efficiently, one must find the fastest way to square huge numbers modulo 2^P-1. Since the late 1960's the fastest algorithm for squaring large numbers is to split the large number into pieces forming a large array, then perform a Fast Fourier Transform (FFT), a squaring, and an Inverse Fast Fourier Transform (IFFT). See the "How Fast Can We Multiply?" section in Knuth's Art of Computer Programming vol. 2. In a January, 1994 Mathematics of Computation article by Richard Crandall and Barry Fagin titled "Discrete Weighted Transforms and Large-Integer Arithmetic", the concept of using an irrational base FFT was introduced. This improvement more than doubled the speed of the squaring by allowing us to use a smaller FFT, and it performs the mod 2^P-1 step for free. Although GIMPS uses a floating point FFT for reasons specific to the Intel Pentium architecture, Peter Montgomery showed that an all-integer weighted transform can also be used.

As mentioned in the last paragraph, GIMPS uses floating point FFTs written in highly pipelined, cache friendly assembly language. Since floating point computations are inexact, after every iteration the floating point values are rounded back to integers. The discrepancy between the proper integer result and the actual floating point result is called the convolution error. If the convolution error ever exceeds 0.5 then the rounding step will produce incorrect results - meaning a larger FFT should have been used. One of GIMPS' error checks is to make sure the maximum convolution error does not exceed 0.4. Unfortunately, this error check is fairly expensive and is not done on every squaring. There is another error check that is fairly cheap. One property of FFT squaring is that:

(sum of the input FFT values)^2 = (sum of the output IFFT values)

Since we are using floating point numbers we must change the "equals sign" above to "approximately equals". If the two values differ by a substantial amount, then you get a SUMINP != SUMOUT error as described in the readme.txt file. If the sum of the input FFT values is an illegal floating point value such as infinity, then you get an ILLEGAL SUMOUT error. Unfortunately, these error checks cannot catch all errors, which brings us to our next section.
What are the chances that the Lucas-Lehmer test will find a new Mersenne prime number? A simple approach is to repeatedly apply the observation that the chance of finding a factor between 2X and 2X+1 is about 1/x. For example, you are testing 210000139-1 for which trial factoring has proved there are no factors less than 264. The chance that it is prime is the chance of no 65-bit factor * chance of no 66 bit factor * ... * chance of no 5000070 bit factor. That is:

64/65 * 65/66 * ... * 5000069/5000070

This simplifies to 64 / 5000070, or 1 in 78126. This simple approach isn't quite right: it would give a formula of how_far_factored divided by (exponent divided by 2). However, more rigorous work has shown the formula to be (how_far_factored - 1) / (exponent times Euler's constant (0.577...)). In this case, 1 in 91623. Even these more rigorous formulas are unproven.
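
Both estimates are easy to check (Python):

gamma = 0.5772156649                  # Euler's constant
p, bits = 10000139, 64
print(round((p / 2) / bits))          # 78126 - the simple estimate
print(round(p * gamma / (bits - 1)))  # 91623 - the refined estimate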


--------------------------------------------------------------------------------

Double-checking

--------------------------------------------------------------------------------

To verify that a first-time Lucas-Lehmer primality test was performed without error, GIMPS runs the primality test a second time. During each test, the low order 64 bits of the final S(P-2) value, called a residue, are printed. If these match, then GIMPS declares the exponent properly double-checked. If they do not match, then the primality test is run again until a match finally occurs. The double-check, which takes just as long as a first-time test, is usually done about 2 years after the first-time test. GIMPS assigns double-checks to slower PCs because the exponents are smaller than first-time tests, resulting in work units that can complete in a reasonable time on a slower PC.

GIMPS double-checking goes a bit further to guard against programming errors. Prior to starting the Lucas-Lehmer test, the S(0) value is left-shifted by a random amount. Each squaring just doubles how much we have shifted the S value. Note that the mod 2^P-1 step merely rotates the P-th bits and above to the least significant bits, so there is no loss of information. Why do we go to this trouble? If there were a bug in the FFT code, then the shifting of the S values ensures that the FFTs in the first primality test are dealing with completely different data than the FFTs in the second primality test. It would be nearly impossible for a programming bug to produce the same final 64-bit residues.
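
A minimal Python sketch of the shifted test (my own illustration, not Prime95's code). The key bookkeeping facts are that 2^P = 1 (mod 2^P-1), so shift amounts can be tracked mod P, and that subtracting 2 from a value shifted by 2k means subtracting 2^(2k+1):

def ll_residue(p, shift=0):
    # Lucas-Lehmer with S(0) left-shifted; returns the 64-bit residue
    m = (1 << p) - 1
    k = shift % p                  # shifts work mod p: 2^p = 1 (mod m)
    s = (4 << k) % m               # S(0) * 2^k
    for _ in range(p - 2):
        # squaring doubles the shift; subtracting 2 at shift 2k
        # means subtracting 2^(2k+1)
        s = (s * s - (1 << ((2 * k + 1) % p))) % m
        k = (2 * k) % p
    s = (s << ((p - k) % p)) % m   # rotate the shift back out
    return s & 0xFFFFFFFFFFFFFFFF  # low 64 bits: the residue

# identical residues regardless of the random starting shift
print(ll_residue(101, 0) == ll_residue(101, 23) == ll_residue(101, 57))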

Historically, the error rate for a Lucas-Lehmer test where no serious errors were reported during the run is around 1.5%. For Lucas-Lehmer tests where an error was reported, the error rate is in the neighborhood of 50%. For the record, I don't count the "ILLEGAL SUMOUT" error as a serious error.


If Vista were to load share of its own accord, one thread's worth of the above should load every core to max.
August 18, 2008 9:31:55 AM

I believe the way to deal with parallel programming is a much more modular design, not just in width, but depth - where a common information dataset is frequently updated for use in all threads.

Basically, a larger number of shallower threads, that allow greater crossflow of information.



For instance, a racing game.

Your car is affected by a number of different parameters, such as: your tyres, your suspension, steering angle, throttle setting/gear, brake setting, lateral g, longitudinal g, roll centre, centre of gravity, aerodynamic centre.


Fundamentally, the user controls just three of these: steering, throttle and brake. Thus, a control variable thread can be set up that updates these properties dependent on controller input.

There are also fixed variables, like centre of gravity.


After which, separate threads could be run for each independent variable, using the other independent variable values from the previous clock cycle (over a very short time cycle, the step variation tends to zero, so it is valid to assume the previous values carry over - and there is a natural inertia anyway).

With the threads concentrating on one aspect of the car only, they are shallow, allowing the frequent updates of information.


That is the physics aspect broken up into several threads.


Then for AI, each car can be run on a separate thread, again, with crossflow of info so they are racing the other AI cars.



Parallel programming is not easy, but in reality, much of what happens in games is parallel in nature - one thing ripples out to affect the rest.

Shortening the lag from initial perturbation to the change cascading down to affect everything is the key. I think shallower threads allow that.
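
Something like this rough Python sketch - the parameter names and update rules are just invented for illustration, and a real engine would use a worker pool rather than spawning threads every tick:

import threading

state = {"steering": 0.0, "throttle": 0.0, "speed": 0.0, "lateral_g": 0.0}

def update_speed(prev):
    # invented physics: throttle adds speed
    return prev["speed"] + prev["throttle"] * 0.1

def update_lateral_g(prev):
    # invented physics: lateral g follows steering at the current speed
    return prev["steering"] * prev["speed"] * 0.5

WORKERS = {"speed": update_speed, "lateral_g": update_lateral_g}

def tick(state):
    prev = dict(state)                 # frozen snapshot of the last cycle
    results = {}
    def worker(name, fn):
        results[name] = fn(prev)       # each thread reads only `prev`
    threads = [threading.Thread(target=worker, args=item)
               for item in WORKERS.items()]
    for t in threads: t.start()
    for t in threads: t.join()         # one short "clock cycle"
    state.update(results)              # publish the new shared dataset

state.update(steering=0.3, throttle=1.0)
for _ in range(3):
    tick(state)
print(state)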
August 18, 2008 2:04:57 PM

turpit said:
Grimmy,

I understand how prime works. I used the same version to run the demos on my laptop the other day.

If Vista were to load share of its own accord, one thread's worth of the above should load every core to max.

Edited to make the quote shorter.


Heh.. I'm okay at math, but that's quite a bit for me to take in. So... the only thing I understand is that you think Vista is lying. I suppose that could be true, but it seems when I watch the temps, the corresponding load does cause the same core's temp to go up. And using the affinity can put the entire load on whatever core is chosen.

Just wish I had my quad system up and running. I'll be getting my MB replacement this week, and it's my vacation. Perhaps I could do some more tests once I get it back up and running.

If you can think of any other test to try, I'll try to do it with a screen cap.

I'm surprised I'm the only one seeing this. Can anyone else with a quad on Vista contribute some info or your experience?
August 18, 2008 4:04:04 PM

turpit said:
Well, I hope this will stem some of the 'unicorn wishes' that every new app or game coming out will be quad optimized.

Honestly, you would think the fact that dual core has been around for years yet most programs are still single threaded would get that point accross, but it appears the 'quantity is better than quality' mentality can overwhelm logic.


You're right. Because something is difficult to do means it will never happen, so we should all just throw in the towel and forget about multicore. And move back into caves while we're at it.
August 20, 2008 3:41:44 AM

Grimmy said:
Heh.. I'm okay at math, but that's quite a bit for me to take in. So... the only thing I understand is that you think Vista is lying. I suppose that could be true, but it seems when I watch the temps, the corresponding load does cause the same core's temp to go up. And using the affinity can put the entire load on whatever core is chosen.

Just wish I had my quad system up and running. I'll be getting my MB replacement this week, and it's my vacation. Perhaps I could do some more tests once I get it back up and running.

If you can think of any other test to try, I'll try to do it with a screen cap.

I'm surprised I'm the only one seeing this. Can anyone else with a quad on Vista contribute some info or your experience?


Grimmy,

If the primary prime (no pun intended) calc is finely threaded, it would allow Vista to shunt those loops, which would explain the distribution, but Vista should not be able to 'cap' the number to 1/4 usage per core.....something weird is going on....I think I have to ask someone to run some tests on this, though I doubt they will.
August 20, 2008 3:55:07 AM

I get my MB today, which will prolly take me 3 to 6 hours to at least get up and running.

I might try to do a mini movie to show ya what the histogram on task manager is doing.

I know what you're saying, that is weird. That was my first thought when I discovered it. I did have a mini movie uploaded, but the site limited it since it was a.. umm.. lil too big? :lol:

Edit:

<--needs to go to work. I'll talk to ya guys later.
August 20, 2008 4:00:35 AM

snarfies1 said:
You're right. Because something is difficult to do means it will never happen, so we should all just throw in the towel and forget about multicore. And move back into caves while we're at it.


That wasn't my intent, and I apologize if it came off that way.

My meaning was this: Every day, here in the forum, anytime someone asks the question "quad or dual core?", invariably there are several people who will answer "get the quad, all new games are going to be written for quads".

This is simply untrue, and will always be untrue. It is the result of the "more is better" mentality...making statements without knowing the facts. Eventually many games will be optimized for multicore, but not all. The simple little games will never need multicore. The more in-depth games will see benefits from multicore. But as so many others stated in the thread, writing multithreaded code is more difficult, time consuming and costly. If, for a given game, a single thread will provide the same results as multithreading, there is no point to expending the extra resources.

What this means for the time being is that rather than just running out and buying as many cores as you can, you should look at the apps/games you want to run and see what they will benefit the most from. If someone wants to play FSX, or render heavy 3D, or transcode, a slower quad will benefit them more (depending how slow) than a fast dual. Conversely, if they want to play Doom 3, RTCW, solitaire, surf the web or run a word processor, they will get more benefit from the faster dual. Contrary to all the "everything is going to be quad" kiddies' 'unicorn wishes', by the time there is sufficient tri- or quad-threaded software to undeniably offset the advantages of clock speed, every CPU now on the market will be obsolete....thus negating the 'future proofing' argument.
August 20, 2008 4:04:42 AM

Grimmy said:
I get my MB today, which will prolly take me 3 to 6 hours to at least get up and running.

I might try to do a mini movie to show ya what the histogram on task manager is doing.

I know what you're saying, that is weird. That was my first thought when I discovered it. I did have a mini movie uploaded, but the site limited it since it was a.. umm.. lil too big? :lol:

Edit:

<--needs to go to work. I'll talk to ya guys later.


Grimmy,

No need, I got the picture from your screen shots. Honestly, how or why Vista could or would do this is beyond me. I only know it shouldn't happen. I need to converse with the heavy guns of THG about this.....with all the crap M$ did to Vista to improve the performance, or rather to minimize the loss in performance relative to XP, I can't help but wonder if this is some bizarre side effect...or just some coincidental 'collision' of prime and Vista....or something else.