AMD CPU speculation... and expert conjecture - Page 60
gamerk316 said:
From your wiki link:
Quote:
The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given pipeline stage in a given cycle. In temporal multithreading the number is one, while in simultaneous multithreading the number is greater than one.
Yes, you misread that; it means exactly what it says, in the simplest terms. TMT will only execute one thread at a time, but it will use context switches to change the active thread throughout the clock cycle. SMT means that you can have more than one thread running concurrently by using a virtual core or another set of resources.
This is the difference between a physical core using context switches and a virtual core running an operation using partial resources from a physical core. One is a round-robin-type system working on one thread at a time but hitting several in the same clock cycle; the other is working on 2 threads simultaneously, but one at an accelerated rate.
I think you just misread that, or didn't read far enough into it...either way.
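For what it's worth, the distinction being argued here can be sketched as a toy model (my own illustration, not either vendor's actual pipeline): a TMT core issues from only one thread in a given cycle, while an SMT core can fill its issue slots from several threads at once.

```python
def cycles_needed(instr_per_thread, n_threads, width, thread_ilp, smt):
    """Cycles to drain n_threads equal instruction queues.

    width:      machine issue width (slots per cycle)
    thread_ilp: max instructions one thread can issue per cycle
    smt=False:  TMT -- only one thread may issue in a given cycle
    smt=True:   SMT -- several threads may share the cycle's slots
    """
    remaining = [instr_per_thread] * n_threads
    cycles, turn = 0, 0
    while any(remaining):
        cycles += 1
        if smt:
            budget = width
            for i in range(n_threads):
                take = min(remaining[i], thread_ilp, budget)
                remaining[i] -= take
                budget -= take
        else:
            while remaining[turn % n_threads] == 0:  # skip finished threads
                turn += 1
            i = turn % n_threads
            remaining[i] -= min(remaining[i], thread_ilp, width)
            turn += 1
    return cycles

# Two threads of 8 instructions; a 4-wide machine, but each thread alone
# only finds 2 independent instructions per cycle (limited ILP):
print(cycles_needed(8, 2, width=4, thread_ilp=2, smt=False))  # TMT: 8 cycles
print(cycles_needed(8, 2, width=4, thread_ilp=2, smt=True))   # SMT: 4 cycles
```

With two threads that each only find 2 independent instructions per cycle on a 4-wide machine, the SMT variant finishes in half the cycles; that otherwise-unused issue width is exactly what SMT exists to recover.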
Quote:
Secondly, AMD uses a CMT scheme, not TMT. Essentially, they duplicate most of the resources of a CPU core, except the scheduler and some FP units.
http://www.behardware.com/articles/833-2/amd-bulldozer-...
One BD Module is a "core" with SMT in the classical sense. But since there's two separate register contexts, Task Manager and the OS see "two" cores per BD module. Same concept as Intel HTT, where a HTT core is shown on Task Manager.
Erm... you mean CMP, I am guessing? CMT is a poorly executed TV music channel so far as I know. As for CMP, of course the architecture is CMP (chip-level multiprocessing); that is true of any multicore CPU. Intel chips utilize CMP as well as SMT... do you honestly think that AMD can't utilize both a CMP architecture and a TMT process? At one point there was discussion that BD would have SMT + TMT, and it would have CMP by default because of the architecture design, just like Intel has CMP and SMT.
Remember the discussion earlier? "Multiprocessing != Multithreading!" This is the distinction you are alluding to, you're just mixing the 2 together now.
Quote:
But those resources (the second integer scheduler, for instance) can't be used on a single thread, so they do nothing but increase power draw. Resources are useless if not used. Hence why BD does so poorly in single-threaded apps, as almost HALF the resources in a BD module go unused.
Secondly, switching threads is VERY computationally expensive. You want to avoid undergoing a context switch whenever possible. During a context switch, the CPU is doing NOTHING. Finally, context switches have been around since, I don't know, the first 8-bit CPUs? It's not like it's an AMD-exclusive feature here...
http://en.wikipedia.org/wiki/Context_switch
Aha... but now you see how the inefficiency in the AMD architecture becomes magnified by misprediction errors and cache clears from improper execution, huh? This isn't something new, or unique to AMD; Intel went with SMT, and AMD went with TMT. However, you also have to consider the AMD business standpoint. Designing a completely new architecture from the ground up to use SMT is a daunting task, to be sure. So they took the cheaper, albeit lazier, way out and decided to use 15-year-old, less efficient technology relying simply on raw horsepower, where they have a 2-to-1 advantage in sheer brute force.
Now, that didn't work out so well with BD because the architecture was far less efficient than they initially suspected... so PD was rolled out pretty quickly to stem the tide. Now they have the time to sit down and do Steamroller correctly so that it turns out to be what it should. The issue is, when you use TMT, you have to make the hardware extremely efficient or you lose time and resources... as you well pointed out above and below. Which means that every percentage point gained in efficiency is basically worth twice as much in terms of performance.
If they had gone SMT like Intel, they would have released BD later, but it likely would have been far simpler to tweak and adjust, because SMT is far more forgiving on the hardware's internal logic, so mispredictions and logical errors cost far less.
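The cost argument both sides are making can be put into back-of-envelope numbers (the figures below are made up for illustration, not measurements):

```python
# Useful fraction of CPU time when each thread runs for `quantum_cycles`
# before being switched out, and a switch costs `switch_cycles` of dead time.

def useful_fraction(quantum_cycles, switch_cycles):
    return quantum_cycles / (quantum_cycles + switch_cycles)

# Rare, OS-driven switches: long quanta amortize the cost well.
print(useful_fraction(1_000_000, 1_000))  # ~0.999
# Switching every few hundred cycles in hardware: the cost starts to dominate.
print(useful_fraction(300, 1_000))        # ~0.23
```

The point is just that switch cost is amortized over the run length between switches, which is why OS-scheduled quanta (millions of cycles) hurt far less than switching every few hundred cycles.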
Quote:
Firstly, again, context switches are expensive to execute and should be avoided at all costs. If you want a different thread to run, it should be handled by the OS scheduler. Having to context switch in hardware basically means the pipeline stalled (for whatever reason), necessitating the entire state be saved, a new thread loaded in, its state restored, and continuing with that thread. It keeps the CPU going, but it's MUCH cheaper to run one thread until the OS schedules a different thread to run.
Yes, but if you're not set up for SMT, TMT is the only other way to multithread... CMP is not a legitimate way to carry a full load because you're limited to 1 thread per core. No realistic engineer is going to bet on purely 8 threads and nothing more. You always have to have a plan B. TMT was an easy implementation... and they underestimated how much it would be used, in my own estimation... hence PD arriving so quickly on the heels of BD.
The switches are inefficient, but you notice how much cache is in BD/PD? They did that for a reason... to accelerate the ability to store the threads so the cores can come back to them. That's why it's relatively large compared to the previous generation. It's all to accommodate the low-level multithreading design of TMT and try to get as much out of it as they can.
Quote:
AMD gains in multithreaded apps mainly due to having a more powerful SMT implementation (~80% performance for CMT, ~15% performance for HTT) and a significantly faster base clock (3.4 versus 3.8). Even then, the poor per-core performance keeps the BD arch from trouncing i7's, despite more cores at a higher speed.
You're partially right... greater CMP and higher clock speed amount to more "brute force"... but SMT and architectural efficiency catch Intel back up, where TMT holds AMD back because the hardware is not designed tightly enough to eliminate as many errors as it could have. SR will fix an enormous portion of those issues.
Quote:
No you don't, because no more than 8 threads can execute at any given time. So no division of resources is necessary.
Seriously, this isn't that hard to grasp. To run a thread, you have to manipulate data. To manipulate data, you have to load the data into CPU registers. 8 sets of registers means you can only run 8 threads at a time. This is Computer Architecture 101 here, people...
So, my question to you is... why are there 4 register pipelines per core in PD and 8 in SR? If they are not executing multiple threads per clock cycle, why would they need more pipelines? It seems like an awful waste of engineering effort to design something with no purpose. Also, in SR, they're increasing the register queue from 8 to 16 places.
That's a lot of engineering for a system that isn't multithreading...don't you think?
They are using TMT.
Quote:
1: again, AMD uses a CMT scheme, discussed above.
AMD and Intel both have CMP CPUs... it's the architecture they used to design them by making them multicore. It has nothing to do with the price of goats in Africa when we're discussing multithreading.
This is what you're referring to: http://en.wikipedia.org/wiki/Multi-core_(computing)
Quote:
2: Fixing a branch predictor will help their worst-case performance, but won't do a damn thing to improve best-case PD numbers.
I disagree; there will be places where the hardware was bottlenecking itself, and the SR improvements will make a huge difference.
Quote:
Not true. HTT needs special coding, due to the lack of anything other than the extra register context, which limits what a HTT core can do. But that's not a limitation of SMT.
Fair enough, I overgeneralized lumping SMT and HTT together... point noted. But it doesn't change what I said about HTT being true... as you clearly concede.
Quote:
Simple example: an Intel core and a HTT core share the ALU. So only one core can handle math operations at a time. If you have two instructions that both require the ALU, guess what? One has to wait. Hence why HTT typically doesn't add much extra processing power. [That being said, register-register arithmetic could theoretically scale to 100%. I still occasionally use bit-shifts in my code for exactly this reason.]
Yes, that is where HTT has a weakness versus more cores... I have pointed this out to some on several occasions.
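The bit-shift aside in the quote is easy to demonstrate: shifting left by k multiplies by 2**k, and shifting right by k floor-divides a non-negative integer by 2**k, using only cheap shifter logic instead of a multiplier.

```python
# Shifts as cheap power-of-two arithmetic. Compilers do this rewrite
# automatically for constant power-of-two multiplies and divides.

x = 37
assert x << 3 == x * 8    # left shift by 3 == multiply by 2**3
assert x >> 2 == x // 4   # right shift by 2 == floor-divide by 2**2
print(x << 3, x >> 2)     # 296 9
```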
Quote:
As I pointed above, AMD uses CMP. I fixed it for you...
Again, nothing to do with multithreading... that's multiprocessing.
Quote:
Secondly, you also have to take into account drivers, configs, and the like when talking about Linux in general. I mean, comparing CCC to Nouveau when doing an NVIDIA-to-AMD comparison on Linux would hardly be fair...
Yes, my statement was broad and generalized; honestly, outside of Red Hat/Fedora, Debian and Ubuntu, I haven't really played with too many other Linux distributions out there. So I cannot comment on the ones that I have not messed with, but the ones I have used were significantly faster. So I will concede your point about being too broad, perhaps... there are literally thousands of different Linux versions out there.
-
Reply to 8350rocks
Lenovo might have the first Richland desktops.
http://shop.lenovo.com/gb/en/desktops/essential/h-serie...
http://shop.lenovo.com/gb/en/desktops/essential/h-serie...
-
Reply to Cazalan
Cazalan said:
Lenovo might have the first Richland desktops. http://shop.lenovo.com/gb/en/desktops/essential/h-serie...
Nice find! I'm curious to see how far people will be able to push these next Gen APUs! Although, I'm sure Kaveri will make Richland look like child's play. lol
-
Reply to griptwister
lilcinw
April 26, 2013 6:07:54 PM
@8350
I think I understand where you are coming up with your '44 threads' theory (even though I am fairly certain that there is no real world case where that level of efficiency is achieved and 'engineered maximums' are only relevant in a programmers nirvana where The Code is handed down from on high by an ascended Linus Torvalds and the unicorns poop rainbows and pee energy drinks).
What I don't understand is a statement you made a while back that SR is supposed to include additional integer resources. The way I understand the current PD arch is that each integer cluster has 2 ALUs and 2 AGUs which is duplicated in each module and has a shared floating point unit consisting of 2 128-bit FMAC units and 2 MMX units. AMD has stated that they have removed one of the MMX units from the SR FPU which will yield 4[2(2xAGU + 2xALU) + 2FMAC +1MMX] = 44 'pipelines' whereas the current PD has 48 'pipelines' (4 additional MMX).
Wouldn't that mean that SR is being nerfed instead of buffed?
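The counting in the post above checks out arithmetically (treating the rumored unit counts as given, not confirmed specs): per module, two integer clusters of (2 ALU + 2 AGU) plus a shared FPU.

```python
# 'Pipeline' counting per the post above: per module, two integer clusters
# of (2 ALU + 2 AGU) plus a shared FPU. Piledriver's FPU: 2 FMAC + 2 MMX;
# the Steamroller rumor drops one MMX unit.

def pipes_per_chip(modules, fmac, mmx):
    per_module = 2 * (2 + 2) + fmac + mmx  # integer clusters + shared FPU
    return modules * per_module

pd = pipes_per_chip(4, fmac=2, mmx=2)  # Piledriver:  4 * 12 = 48
sr = pipes_per_chip(4, fmac=2, mmx=1)  # SR (rumor):  4 * 11 = 44
print(pd, sr)  # 48 44
```

So on these numbers SR does lose 4 MMX 'pipes' per chip, which is the 'nerf' being asked about.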
-
Reply to lilcinw
lilcinw
April 26, 2013 6:10:18 PM
Cazalan said:
Lenovo might have the first Richland desktops. http://shop.lenovo.com/gb/en/desktops/essential/h-serie...
Ooohhh... What is the AMD 740? Is that a Richland based Athlon? I wonder if that will ever be available outside of OEM channels.
Edit: Never mind, it is the Trinity that Tom's has been trying to get its hands on for months for a review and that never seems to be available.
-
Reply to lilcinw
JAYDEEJOHN said:
Let's not turn this into a he said, she said. Sometimes we say things and need to learn, no big deal.
No one's stupid here, or drunk.
We come to learn and share. Step it up, you learned ones; it's you who are failing.
Well, let's not go that far.
Again, going with gamerk's programming knowledge, I'm gonna have to side with him. Also, 44 IPC or whatever isn't going to improve much of anything, since most programs don't even have that kind of ILP. Also, what do you guys think TMT is? I Googled it and I get "TMT provides tools and services to rapidly and easily develop software that is fully parallelized and scales to perform to industry standards on "ManyCore CPU" architectures".
-
Reply to jdwii
griptwister said:
Cazalan said:
Lenovo might have the first Richland desktops. http://shop.lenovo.com/gb/en/desktops/essential/h-serie...
Nice find! I'm curious to see how far people will be able to push these next Gen APUs! Although, I'm sure Kaveri will make Richland look like child's play. lol
Notice a couple K models were listed. That's unusual to find overclockable models from a mainstream vendor.
-
Reply to Cazalan
^ TMT is ™Truegenius
i.e, TradeMarkTruegenius
@8350rock
do you mean that
option 1) we can execute more than 1 thread on 1 core concurrently, i.e., all 44 threads will be under execution at every instant of time instead of in a waiting state
or you mean
option 2) all 44 threads will be loaded in memory but will undergo execution 1 thread at a time, i.e., the scheduler is doing the job to make us feel that all threads are under execution
for example
let's take a single core of any cpu without ht/smt
and if we run 1 thread of 7zip and 1 thread of winrar
will they compete for cpu time, or will they get full cpu time since the cpu can run multiple threads?
if the latter is your answer, then why do we get a performance hit of 50% on both applications when running both at the same time?
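The 7zip/winrar question above has a simple answer under option 2: a fair scheduler on one core gives each of two CPU-bound jobs about half the core, so each takes about twice as long as it would alone. A toy fair-share model (illustrative only, not any real OS scheduler):

```python
# One core, fair round-robin timesharing among CPU-bound jobs.

def finish_times(job_cycles, core_cycles_per_tick=1):
    """Fair-share one core among jobs; return each job's finish tick."""
    remaining = list(map(float, job_cycles))
    t = 0
    finished = [None] * len(job_cycles)
    while any(r > 0 for r in remaining):
        active = [i for i, r in enumerate(remaining) if r > 0]
        share = core_cycles_per_tick / len(active)  # equal slice each tick
        t += 1
        for i in active:
            remaining[i] = max(0.0, remaining[i] - share)
            if remaining[i] == 0.0 and finished[i] is None:
                finished[i] = t
    return finished

print(finish_times([100]))       # alone: finishes at tick 100
print(finish_times([100, 100]))  # together: each sees ~50% of the core
```

Each job sees roughly half the core, hence the ~50% hit on both applications, exactly as observed.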
-
Reply to truegenius
lilcinw
April 26, 2013 8:22:47 PM
For clarity's sake here is the 'nerf' I am referring to:
Bulldozer's 10 'pipes' per module: [slide]
Piledriver's 12 with added(?) MMX units: [slide]
Steamroller's 11: [slide]
Wikipedia mentions the MMX units in the original Bulldozer architecture so maybe they just weren't exciting enough to make the marketing materials at the time (it was all about AVX/FMA IIRC).
Regardless it is reduced from 48 'pipelines' per chip to 44.
-
Reply to lilcinw
mayankleoboy1
April 26, 2013 9:02:56 PM
8350rocks said:
Well, then provide your source for your information. I have provided tons of sources already.
I have researched, and as near as I can tell, it depends entirely on the distribution. The newest releases of Ubuntu (10.04+) should support HT, but you have to enable it via the "acpi=" command line; however, other versions of Linux do not necessarily support it (I can only assume, based on whichever kernel version they chose to use).
Just point me to the source/sources which says Ubuntu has poor support of HTT, and linux programs have poorer/no support of HTT, unless paid by Intel to do so, and you have to specifically edit the boot settings to use HTT.
And i apologize for the bad language. My provocation was unneeded and uncalled for.
-
Reply to mayankleoboy1
lilcinw said:
Wikipedia mentions the MMX units in the original Bulldozer architecture so maybe they just weren't exciting enough to make the marketing materials at the time (it was all about AVX/FMA IIRC). Regardless it is reduced from 48 'pipelines' per chip to 44.
Bulldozer had MMX as well, but it's rather outdated now that SSE is being used more heavily.
Here's a more detailed Bulldozer slide.

-
Reply to Cazalan
Cazalan said:
griptwister said:
Cazalan said:
Lenovo might have the first Richland desktops. http://shop.lenovo.com/gb/en/desktops/essential/h-serie...
Nice find! I'm curious to see how far people will be able to push these next Gen APUs! Although, I'm sure Kaveri will make Richland look like child's play. lol
Notice a couple K models were listed. That's unusual to find overclockable models from a mainstream vendor.
I noticed that too. But I highly doubt the OEM Models will allow OCing. It'll probably have some sort of restriction if it does.
Also, it's nice to see everyone playing nicely in the sandbox full of Einsteins we call Tom's Hardware.
-
Reply to griptwister
juanrga said:
It is not really possible to isolate cpu/apu power consumption. And hwmonitor does not really measure power consumption, for measuring it you need a power meter.
The A10-5800k is one of the recommended chips for SFFs
http://www.anandtech.com/show/6490/holiday-2012-small-f...
it is possible to isolate apu power consumption. you need a multimeter to measure voltage and current readings. hwmonitor reads from the apus'/cpus' internal sensors, before other parts' power use gets into the mix. even if you don't trust them, you should be able to trust amd at least - their setting tdp @ 100w for the 5800k and 65w for the 5700 means that in load scenarios those apus will use that amount of power respectively, which i've already shown. the smaller the enclosure gets, the more heat per unit volume you have to dissipate. and it's easier to dissipate 65w of heat with a stock cooler - keeps noise lower.
at does recommend the 5800k for sff, but you should see in which kind of scenarios. their conditions for using the 5800k are ones where the 5800k is not loaded; they don't mention anything on gaming use except performance. they recommend pc parts but they don't actually build the thing and test its temps. i am discussing the load scenario. if you don't load the cpu or apu, even a hypothetical amd fx8300 (95w) with a hypothetical mini itx motherboard (e.g. 880g chipset) can run inside a sff pc. heck, if you don't put any load, a pentium or a core i3 will do. in reality, more people use amd's e-350/450 apus or sb/ivb pentiums and core i3s inside their sff pcs - but that's a different discussion.
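The multimeter method described above is just Ohm's-law arithmetic: power into the CPU's 12 V EPS connector is voltage times the summed current of its wires. The numbers below are invented examples, not measurements.

```python
# Power delivered on a supply rail: P = V * I, summed over the wires
# you clamp/measure. Example figures are made up for illustration.

def rail_power(volts, amps_per_wire):
    return volts * sum(amps_per_wire)

# e.g. four 12 V wires carrying 2.1 A each -> ~100.8 W into the CPU VRM
print(rail_power(12.0, [2.1, 2.1, 2.1, 2.1]))
```

Note this measures power into the VRM, so the CPU itself gets a bit less after conversion losses.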
-
Reply to de5_Roy
8350, I believe you're confusing x86 threads with the actual micro-ops that get processed internally. In x86 CPU land, 1 register stack = 1 thread, period, end of story.
Now the problem is that not all instructions are equal; some take longer than others or have more complex dependencies. By implementing redundant CPU resources we can effectively process multiple instructions per thread at once, though only one thread gets ownership of that CPU context (register stack) at any one point in time. When you're looking at the BD/PD CPU design, you're seeing the internal resources that process micro-ops, not x86 machine code, so your "44" instructions are micro-op instructions, not x86 binary. Very large difference between the two. Also, CPUs have what's known as a register file: a location in internal CPU memory that allows multiple register stacks to exist. It's used for rapid context switches, to process instructions from an additional thread while the previous thread is stalled waiting on some external I/O event. All this put together gives the illusion of a single CPU element processing multiple threads.
Basically, go read up on and learn what superscalar architecture is.
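The superscalar point above, that one thread owns the register context while the core issues several *independent* micro-ops per cycle, can be sketched as a toy in-order scheduler (a big simplification of real hardware: no renaming, no out-of-order execution):

```python
# Toy in-order superscalar issue: up to `width` ops per cycle, but an op
# cannot issue until the ops producing its inputs issued in an earlier cycle.

def schedule(ops, width):
    """ops: list of (name, inputs) in program order.
    Returns {name: cycle_issued}, cycles numbered from 1."""
    done_cycle = {}
    cycle, i = 0, 0
    while i < len(ops):
        cycle += 1
        issued = 0
        while i < len(ops) and issued < width:
            name, inputs = ops[i]
            if all(done_cycle.get(d, 0) < cycle for d in inputs):
                done_cycle[name] = cycle
                i += 1
                issued += 1
            else:
                break  # in-order: stall until inputs are ready
    return done_cycle

# a and b are independent -> same cycle; c needs both -> the next cycle
prog = [("a", []), ("b", []), ("c", ["a", "b"])]
print(schedule(prog, width=2))  # {'a': 1, 'b': 1, 'c': 2}
```

Three micro-ops finish in two cycles on a 2-wide core, yet only one architectural thread ever existed; that is the per-thread parallelism being described.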
-
Reply to palladin9479
de5_Roy said:
juanrga said:
It is not really possible to isolate cpu/apu power consumption. And hwmonitor does not really measure power consumption, for measuring it you need a power meter.
The A10-5800k is one of the recommended chips for SFFs
http://www.anandtech.com/show/6490/holiday-2012-small-f...
it is possible to isolate apu power consumption. you need a multimeter to measure voltage and current readings. hwmonitor reads from the apus'/cpus' internal sensors, before other parts' power use get into the mix. even if you don't trust them, you should be able to trust amd at least - their setting tdp @100 for 5800k and 65w for 5700 means that in load scenarios those apus will use that amount of power respectively, which, i've already shown. the smaller the enclosure gets, heat dissipation per unit volume increases. and it's easier to dissipate 65w of heat with stock cooler - keeps noise lower.
at does recommend 5800k for sff, but you should see in which kind of scenarios. their conditions for using 5800k is that where the 5800k is not loaded, they don't mention anything on gaming use except performance. they recommend pc parts but they don't actually build the thing and test its temps. i am discussing load scenario. if you don't load the cpu or apu even a hypothetical amd fx8300(95w) with a hypothetical mini itx motherboard(e.g. 880g chipset) can run inside a sff pc. heck, if you don't put any load, a pentium or a core i3 will do. in reality, more people use amd's e-350/450 apus or sb/ivb pentiums and core i3 inside their sff pcs - but that's a different discussion.
The 5600K, 5700 and 5800K are essentially the same CPU with different settings. If you were to set the 5800K at the same settings as the 5700, then they would both have identical power draw, same with the 5600K. The reason they list a 100W TDP is because those two CPUs will both attempt to self-overclock (turbo), and that's the limit they're designed around. If there were a price difference between the 5700 and 5800K, then I'd recommend the 5700, but as it stands they're the same price ($129 USD). So really, just buy the 5800K and use its multiplier setting to clock it at a lower multiplier if you need less TDP. With how amazing the 5800K is, I'm really wanting to see what else AMD can build for the SFF world.
-
Reply to palladin9479
palladin9479 said:
The 5600K, 5700 and 5800K are essentially the same CPU with different settings. If you were to set the 5800K at the same settings as the 5700 then they would both use identical power draw, same with the 5600K. The reason they list a 100W TDP is because those two CPUs will both attempt to self-overclock (turbo) and that's the limit their designed around. If there was a price difference between the 5700 and 5800K then I'd recommend the 5700 but as it stands their the same price ($129 USD). So really just buy the 5800K and use it's multiplier setting to clock it at a lower multiplier if you need less TDP. With how amazing the 5800K is, I'm really wanting to see what else AMD can build for the SFF world.
i agree. the main reason the 5700 isn't recommended more often is the price similarity to the 5800k. amd's skus that have lower tdp but sorta similar specs (same amount of shaders, in this case) seem to have higher prices. that's why i said 'palatable' instead of 'must have'.
then again, if you're gonna downclock, what's the point of getting an unlocked apu (disregarding the price for now)? i know how it sounds... but i hear things like this on a regular basis - 'if you're gonna buy fx/k, why shouldn't you oc?' etc. with the 5800k, you're already hitting 100w on load in an sff build. so oc isn't an option until better cooling is introduced - which often costs more, considering the conditions. that's... until you consider that the 5700 and 5800k cost about the same.
edit: newegg says there's a $1 difference, so... inb4 3rd party microcorrection!
-
Reply to de5_Roy
juanrga
April 27, 2013 11:32:10 AM
palladin9479 said:
de5_Roy said:
juanrga said:
It is not really possible to isolate cpu/apu power consumption. And hwmonitor does not really measure power consumption, for measuring it you need a power meter.
The A10-5800k is one of the recommended chips for SFFs
http://www.anandtech.com/show/6490/holiday-2012-small-f...
it is possible to isolate apu power consumption. you need a multimeter to measure voltage and current readings. hwmonitor reads from the apus'/cpus' internal sensors, before other parts' power use get into the mix. even if you don't trust them, you should be able to trust amd at least - their setting tdp @100 for 5800k and 65w for 5700 means that in load scenarios those apus will use that amount of power respectively, which, i've already shown. the smaller the enclosure gets, heat dissipation per unit volume increases. and it's easier to dissipate 65w of heat with stock cooler - keeps noise lower.
at does recommend 5800k for sff, but you should see in which kind of scenarios. their conditions for using 5800k is that where the 5800k is not loaded, they don't mention anything on gaming use except performance. they recommend pc parts but they don't actually build the thing and test its temps. i am discussing load scenario. if you don't load the cpu or apu even a hypothetical amd fx8300(95w) with a hypothetical mini itx motherboard(e.g. 880g chipset) can run inside a sff pc. heck, if you don't put any load, a pentium or a core i3 will do. in reality, more people use amd's e-350/450 apus or sb/ivb pentiums and core i3 inside their sff pcs - but that's a different discussion.
The 5600K, 5700 and 5800K are essentially the same CPU with different settings. If you were to set the 5800K at the same settings as the 5700 then they would both use identical power draw, same with the 5600K. The reason they list a 100W TDP is because those two CPUs will both attempt to self-overclock (turbo) and that's the limit their designed around. If there was a price difference between the 5700 and 5800K then I'd recommend the 5700 but as it stands their the same price ($129 USD). So really just buy the 5800K and use it's multiplier setting to clock it at a lower multiplier if you need less TDP. With how amazing the 5800K is, I'm really wanting to see what else AMD can build for the SFF world.
I already explained this to him and gave him power consumption deltas under load. If he does not OC, then the 5800k will have slightly higher power consumption coming from the slightly higher clocks and turbos, the rest of the chip being the same.
Regarding future AMD products, what about the Kabini SOCs already here?
-
Reply to juanrga
So, after a lot of reading, CMT is basically a slightly reworked TMT. What they did was allow each module to split an additional thread between the 2 cores in the module. So, essentially, in TMT each core operates on 1 thread at a time but uses context switches to multithread; in CMT, each module does the same, but the 2 cores in the module can pick up an extra thread (1) between the 2 of them.
@Palladin, I am familiar with superscalars.
So, I have found the hard numbers after looking through loads of Bulldozer information... there are 3 register stacks per core for BD/PD (this was not changed in PD)... so that's a total maximum of 24 threads that can be changed between context switches on the cores, and because of CMT, you gain 1 additional per module (still unclear on how exactly this works; I could not find a technical schematic or logic flowchart for CMT)... for a total of 28.
Now, for SR, they have not said how many register stacks there are per core, and they have said they are increasing register file size and efficiency and decreasing the memory size of a single thread in the register file. So, without any hard schematics or information, it's hard to say what this will end up looking like in terms of performance increase.
-
Reply to 8350rocks
^^ i am sorry, but i am really confused here. it's bad enough that i have limited know-how about cpu architecture but bulldozer's modules processing 3 threads was the last straw. i need some understandable explanations. from what i know, bd modules can process 2x integer instructions(operations?) and 1x 256bit(or 2x 128bit) floating point instructions at the same time. aren't (software) threads assigned by the os scheduler? i thought register stacks only process instructions or micro instructions. what constitutes a 'hardware thread' if that's a real thing?
-
Reply to de5_Roy
blackkstar
April 28, 2013 10:57:02 AM
de5_Roy said:
^^ i am sorry, but i am really confused here. it's bad enough that i have limited know-how about cpu architecture but bulldozer's modules processing 3 threads was the last straw. i need some understandable explanations. from what i know, bd modules can process 2x integer instructions and 1x 256bit(or 2x 128bit) floating point instructions at the same time. aren't (software) threads assigned by the os scheduler? i thought register stacks only process instructions or micro instructions. what constitutes a 'hardware thread' if that's a real thing?
I think he is talking about context switching.
SPARC does this as well. On SPARC there are register windows (4, I believe, though the count varies by implementation), each with 32 registers. So program A uses one window of 32 registers and program B uses another window of 32 registers.
Now two programs can keep their state in the CPU's registers, and the CPU can switch between them without having to go out to cache or higher-level memory.
I am guessing x86 CPUs do something similar so that they don't have to hit cache every single time they change threads.
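A toy simulation of that kind of fast context switching (temporal multithreading hiding stalls) might look like this. Purely illustrative, not a model of any real chip: 'I' marks a ready instruction, 'S' a stall such as a cache miss, and each cycle at most one thread issues while stalled threads wait in parallel:

```python
def run_smt(streams):
    """Cycles needed to drain the given instruction streams.
    Each cycle: the first thread whose next op is ready ('I') issues;
    every other thread whose next op is a stall ('S') resolves that
    stall in the background. Stalls of one thread overlap with useful
    work from another -- the 'bubble filling' described above."""
    pcs = [0] * len(streams)
    cycles = 0

    def head(t):
        return streams[t][pcs[t]] if pcs[t] < len(streams[t]) else None

    while any(pcs[t] < len(streams[t]) for t in range(len(streams))):
        cycles += 1
        issuer = next((t for t in range(len(streams)) if head(t) == 'I'), None)
        if issuer is not None:
            pcs[issuer] += 1           # one instruction issues this cycle
        for t in range(len(streams)):
            if t != issuer and head(t) == 'S':
                pcs[t] += 1            # stall resolves while another issues

    return cycles

# One thread full of stalls takes 4 cycles; adding a second,
# complementary thread still takes 4 cycles total -- the second
# thread ran "for free" in the first one's bubbles.
print(run_smt(["ISIS"]), run_smt(["ISIS", "SISI"]))  # 4 4
```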
-
Reply to blackkstar
blackkstar said:
[quoted above]
Basically this.
Each core can work on as many as 3 threads per clock cycle...but it will only work on them 1 at a time; the core uses context switching to rotate among the 3 threads within a cycle. The interesting part is how it accommodates the extra thread per module...I will have to find more technical sources on CMT to determine what's going on there on the hardware side...
-
Reply to 8350rocks
hcl123
April 28, 2013 5:53:02 PM
I think you are confusing work queues with multithreading.
Those are special registers that hold the starting address of each thread, and that is the only sense in which there are "so many threads at the same time".
The SPARC *block-multithreading* model is a different story, because the TLB (translation lookaside buffer) can be warmed up, and the L1, which may contain pre-decoded, pre-fetched instructions, is also warmed up for several threads (current versions hold 8 threads per core) -- but the pipeline only executes one thread at a time, dictated by its internal PC (program counter) logic, not by the OS scheduler. Basically it takes advantage of the many pipeline bubbles that always exist in common code to perform fast internal context switches... In the end *the pipeline* only executes one thread at a time, but it can be quite efficient: even a lone thread leaves plenty of bubbles from one clock cycle to the next, and if you have 2 or more threads on 1 core targeting those bubbles, the original thread executes in almost exactly the same time frame as if it were alone, yet the core gives "the illusion" of executing more than one thread at a time.
SMT (simultaneous multithreading, a.k.a. HTT in Intel lingo) and AMD CMT (cluster multithreading) are evolutions of this logic in that 2 threads really are executed at the same time by the pipeline (with advantages and some drawbacks). The difference between the two is that CMT has dedicated hardware organized almost like co-processors (clusters) -- the FlexFPU actually *IS* a co-processor, meaning it can track the progress of a thread semi-independently. So 3 threads at the same time per module (I DON'T KNOW) could be possible, but only for a fraction of a split second during an OS-dictated context switch, while the FPU finishes executing/writing back a few instructions left over from a previous thread. In practice it should be said that each module only executes 2 threads at a time, because the pipelines must be flushed (registers/caches written back to memory for consistency) on those OS context switches.
IMHO AMD CMT is clearly superior -- not so much because of the extra dedicated resources for 2 threads (the integer cores of so much polemic), which can always boost 2 threads at the same time, but because AMD applies multithreading logic to its whole pipeline: it is divided into thread <domains> in the block-multithreading sense explained above... it is *Vertical Multithreading*...
In BD and PD the logic is one cycle per thread in each domain of the pipeline: one domain handles one thread's instruction on one cycle, then switches to the other thread the next cycle, and so on (very similar to Cray's interleaved-multithreading exercises). In Steamroller the logic changes to 2 cycles per thread, making it a true vertical block-multithreading scheme.
In this context, there aren't exactly 2 decoders in Steamroller... it behaves as if there were, but that is an illusion. As revealed in an RWT thread (rumor or not, I don't know), SR will have the same 4 decode pipes, naturally considerably beefed up but still 4, and *the way I see it* the difference is that it will have 2 dedicated decode-domain input buffers and 2 dedicated output buffers that share the 4 decode pipes in an SMT fashion (decoding from 2 threads simultaneously, like the FlexFPU and the L2, but with vertical logic) plus block multithreading. That can make those same 4 pipes tremendously more effective, courtesy of the vertical multithreading scheme.
This is just to show how versatile and superior this "vertical multithreading" is... each group of pipeline stages in a *domain bordered by input/output buffers* is almost a "vertical co-processor" in itself (well, a bit exaggerated, lol)... and how much easier it becomes to replace or improve those <domains> without having to redesign the whole chip, unlike the traditional synchronous pipelines of Intel cores. The BD uarch has an asynchronous/semi-synchronous pipeline: it could much more easily gain efficiency by letting parts run ahead... much more easily change the resources/characteristics of each domain... much more easily build each module with 3 or 4 integer thread cores/clusters or a number of *heterogeneous cores/clusters*... it could even, without much trouble, put ARM AArch64 integer cores where the x86 cores are now, lol...
Yes, many opinions paint the BD uarch as a failure (propaganda is rampant in all competitive, lucrative businesses), but IMHO it is not a failure; its *POTENTIAL* is clearly superior to anything Intel has... this asynchronousness and modularity *potential* is very, very difficult to get right (no wonder it reportedly took 8 years to finish the first iteration), but it could provide an accelerated path for improvements in successive iterations.
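The per-cycle interleaving described above (one decode cycle per thread on BD/PD, two consecutive cycles per thread on SR) can be sketched as a round-robin schedule. The policy parameters follow this post's claims, not AMD documentation:

```python
def decode_schedule(threads, cycles, cycles_per_thread):
    """Which thread owns the shared decode stage on each cycle, under a
    round-robin policy that grants `cycles_per_thread` consecutive
    cycles before switching. cycles_per_thread=1 models the BD/PD
    claim above; 2 models the Steamroller claim."""
    return [threads[(c // cycles_per_thread) % len(threads)]
            for c in range(cycles)]

# BD/PD per the post: alternate every cycle (interleaved multithreading)
print(decode_schedule(['T0', 'T1'], 8, 1))
# ['T0', 'T1', 'T0', 'T1', 'T0', 'T1', 'T0', 'T1']

# SR per the post: two cycles per thread (block multithreading)
print(decode_schedule(['T0', 'T1'], 8, 2))
# ['T0', 'T0', 'T1', 'T1', 'T0', 'T0', 'T1', 'T1']
```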
-
Reply to hcl123
mayankleoboy1
April 28, 2013 6:15:27 PM
@hcl123 [Too much to quote]
More or less right.
The reason Intel goes with the far weaker HTT is that, from a COST perspective, it's REALLY cheap to implement (just an extra register stack) for decent gains in some tasks (register-register math). In short, its performance increase is greater than the die space it eats up.
CMT, by contrast, is almost a second core in itself, just missing a few dedicated pieces of hardware. Much more powerful, but it eats a LOT more of the die as a result (since it's almost like adding a second core).
-
Reply to gamerk316
what i don't like the most in the amd module (bd arch) is the performance loss from both cores during full use of a module
but intel htt does not seem to lose any performance on the main thread
i.e.
amd = 80%+80% = 160% of a dual core (losing performance on both cores)
intel = 100%+20% = 120% of a dual core (less efficient but still not losing performance on the main thread)
note: the percentages are used to show efficiency, not to compare one arch's performance to another's
and since current games do heavy processing on a few threads only, 'x' heavy with 'x' light threads makes more sense
and regarding die size, l3 reduction makes more sense to me
thats why i don't like bd arch
-
Reply to truegenius
Quote:
amd = 80%+80% = 160% of dual core (loosing performance on both cores)
intel = 100%+20% = 120% of dual core (less efficient but still not loosing performance on main thread)
Except that is NOT how that works; it would be 60+60 for the Intel, as you have two demanding threads both contending for CPU time. There is no preferred / virtual or whatever BS people want to call a "core" in HTT. Both register stacks are treated the same from the OS perspective and are task-switched accordingly. The difference is Intel worked with MS to ensure the NT kernel recognizes the "HTT" feature and is careful about scheduling multiple threads simultaneously on a single core. You can actually see this in high-demand integer workloads when comparing the FX-8xxx to the i7. The SB/IB uArch is four cores with three ALUs each; HTT just lets them utilize those three ALUs more efficiently. The BD/PD uArch is eight cores with two ALUs each and a shared scheduler/decoder and L2 cache.
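The 60+60 vs 80+80 figures traded in this exchange can be reproduced with a crude contention model. The ALU counts match the posts above (3 per SB/IB core, 2 per BD integer core); the per-thread demand of 2.5 ops/cycle is an illustrative assumption chosen precisely because it yields those percentages, not a measured value:

```python
# Crude utilization model behind the "60+60 vs 80+80" numbers above.
# ALU counts are from the thread; demand=2.5 is an illustrative
# assumption picked to reproduce the quoted percentages.

def per_thread_throughput(alus, threads, demand=2.5):
    """Each thread could issue `demand` ALU ops/cycle on ideal
    hardware; threads sharing `alus` ALUs split them evenly when
    oversubscribed. Returns per-thread throughput relative to ideal."""
    share = min(demand, alus / threads)
    return share / demand

smt = per_thread_throughput(alus=3, threads=2)  # two threads, one HTT core
cmt = per_thread_throughput(alus=2, threads=1)  # dedicated BD integer core
print(f"SMT per-thread: {smt:.0%}, CMT per-thread: {cmt:.0%}")
# SMT per-thread: 60%, CMT per-thread: 80%
```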
-
Reply to palladin9479
hcl123 said:
[quoted above]
Well, I think you added a lot of benefit to the conversation about CMT. Could you point me to your technical source about it? Just curious, I would love to read it.
So, based on what you're telling me, in CMT they no longer use context switches mid-cycle as in the old days (TMT)...unless dictated by a blocked thread, FPU execution, etc. Instead, they dedicate a full cycle per thread instruction. Given that cycles occur far faster now than they did with the fastest single-core CPUs, I could see that having a similar perceived effect for the end user.
You mentioned that in SR they intend to increase this to 2 cycles per thread instruction with an increase to 4 (shortened) pipelines (I knew about the additional pipeline and the reduced latency, but could not find any more information about this)...though how are they going to accommodate the additional pipeline (beyond the additional decoder in the front end for efficiency)? I've read vague descriptions of increased register files and reworked internal register processes, etc.
Additionally, I would really like to know: is this essentially going to be 4 threads per 4 cycles per module (2 per module with context switches after 2 cycles), or is the internal logic changing?
I understand the potential; I have been saying that all along. We are on the same page there, though I would really love to see the technical source for the architecture changes if you can find it offhand. Obviously my google-fu has failed me somewhere, as all I can find is AMD press releases with vague mentions and a brief mention of CLMT on Wikipedia (server clusters). I've evidently been out of the loop for a little while and would love to catch up on AMD's tech. Honestly, the internal information I have reviewed from AMD doesn't really get down to this level; it's more of a "technical spec sheet" plus BD/PD logic flowcharts...nothing on SR.
-
Reply to 8350rocks
60+60 can happen, as we also hear that 'htt is causing performance loss'
but it seems like it does not happen too often
i will need to run some tests on a system with htt capability to check whether it's 100+20 or 60+60 for the majority of apps
but currently i only have a 1090t
(and other classmates have core2duos
and some have i3s (as per my recommendation
), which means i have the most powerful processor in my group
)
if a 4th gen i3 can overclock well then i may sell my 1090t
and will test htt and the performance loss.
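The test described above can be sketched like this: time one CPU-bound worker alone, then two in parallel, and compare the per-worker slowdown. This is a portable skeleton; a real HTT measurement would also pin the pair to sibling logical cores (e.g. with os.sched_setaffinity on Linux), which is omitted here:

```python
# Sketch of an HTT scaling test: ~1.0x slowdown means the two workers
# got two real cores' worth of throughput; well above 1.0x means they
# contended for shared resources (the 60+60 case on one HTT core).
import time
from multiprocessing import Process

def burn(n=300_000):
    """Fixed CPU-bound workload."""
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

def timed_run(workers):
    """Wall-clock seconds to run `workers` copies of burn() in parallel."""
    procs = [Process(target=burn) for _ in range(workers)]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    solo = timed_run(1)
    pair = timed_run(2)
    print(f"per-worker slowdown: {pair / solo:.2f}x")
```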
-
Reply to truegenius
truegenius said:
[quoted above]
Not sure if you've heard, but Haswell is the last generation of i3 CPUs from Intel...
Broadwell will be all quad+ core on the desktop; in laptop/mobile they will still have a few dual-core i5s with HTT, but nothing branded i3.
http://www.semiaccurate.com/forums/archive/index.php/t-...
Also, skylake will supposedly be all BGA with the exception of the LGA2011 socket equivalent at that time.
-
Reply to 8350rocks
mayankleoboy1
April 28, 2013 9:12:08 PM
palladin9479 said:
Except that is NOT how that works, it would be 60+60 for the Intel as you have two demanding threads both contending for CPU time. There is no preferred / virtual or whatever BS people want to use "core" in HTT. Both register stacks are treated the same from the OS perspective and are task switched accordingly. The difference is Intel worked with MS to ensure the NT kernel recognizes the "HTT" feature and is careful about scheduling multiple threads simultaneously on a single core.
IIRC, in HTT, the second "core" has only enough registers to save the thread context, and cannot do any processing on its own. The real and the virtual core have only a single set of ALUs between them. So the second thread is executed only when the first one is I/O-blocked or instruction-blocked. Am I correct?
-
Reply to mayankleoboy1
mayankleoboy1 said:
In some heavily threaded software, a 3770K gives ~1.6-1.8 times the perf of a 3570K. AFAIK, at least since the introduction of the SB architecture, enabling HTT causes no performance degradation in applications.
Actually, I saw a review where HTT in Crysis 3 caused FPS to drop by about 8-10 frames; it was a European site that did the review...if I run across it again I will post the link.
-
Reply to 8350rocks
hcl123 said:
Spoiler
I think you are confusing work queues with multithreading.
Those are special registers that contain the address of the beginning of each thread, and that is how much its those so many threads at *the same time*.
In SPARC *block-multithreading* model is a different story because the TLB (translation lookaside buffer) can be warmed up, and also the L1 that may contain pre-decoded pre-fetched instructions are also warmed up, of those so many threads(current versions are 8 thread per core), but the pipeline only executes one thread at a time, dictated by its internal PC (process/instruction counter) logic not by the OS (operating system) scheduler. Basically it takes advantages of the so many pipeline bubbles that always exist in common code to perform internal fast context switches... ... in the end *the pipeline* only executes one thread at a time, but it can be quite efficient because even there was only 1 thread and only one to execute, those are always full of bubbles from some clock cycles to others, and if you have 2 or more threads on 1 core targeting those bubbles, the original thread can execute almost exactly in the same time frame as if it where alone, yet gives "the illusion" that is executing more than one thread at a time.
SMT (simultaneous multithreading, a.k.a. HTT in Intel lingo) and AMD CMT (cluster multithreading) are evolutions of this logic, in that 2 threads really are executed at the same time by the pipeline (with advantages and some drawbacks). The difference between the two is that CMT has dedicated hardware almost in the manner of co-processors (clusters); actually the FlexFPU *IS* a co-processor, meaning it can track the progress of a thread semi-independently. So 3 threads at the same time per module (I DON'T KNOW) could be possible, but only for a fraction of a split second upon an OS-dictated context switch, while the FPU finishes executing/writing back a few instructions left over from a previous thread. In practice and in reason it should be said that each module only executes 2 threads at a time, because the pipelines must be flushed (any register/cache must be written back to memory to provide consistency) upon those OS context switches.
IMHO AMD CMT is clearly superior, not so much because it has additional dedicated resources for 2 threads (the integer cores of so much polemic), which could/should always boost 2 threads at the same time... but more because AMD uses multithreading logic throughout its pipeline, i.e. it is divided into thread <domains> in the sense explained above about block-multithreading... it is *Vertical Multithreading*...
In BD and PD the logic is one cycle per thread on each domain of the pipeline; that is, one domain deals with one thread's instruction on one cycle, then the next cycle it switches to the other thread, and so on (very similar to the interleaved multithreading exercises of Cray). In Steamroller the logic is changed to 2 cycles per thread, making it a true vertical block-multithreading scheme.
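Whether or not BD/PD/SR actually work this way (the post above is itself speculative), the difference between switching threads every cycle and every two cycles is easy to picture:

```python
# Cycle-by-cycle thread assignment for a round-robin pipeline domain.
# switch_every=1 matches the BD/PD behaviour claimed above;
# switch_every=2 matches the claimed Steamroller change.
# Purely illustrative; not derived from any AMD documentation.

def interleave(num_threads, switch_every, cycles):
    """Return which thread owns each cycle of a pipeline domain."""
    return [(c // switch_every) % num_threads for c in range(cycles)]

print(interleave(2, 1, 8))  # [0, 1, 0, 1, 0, 1, 0, 1]  fine-grained
print(interleave(2, 2, 8))  # [0, 0, 1, 1, 0, 0, 1, 1]  block (2 cycles)
```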
In this context, there aren't exactly 2 decoders on Steamroller... as described above it behaves as if there were, but that is an illusion. As revealed in an RWT thread (rumor or not, I don't know), SR will have the same 4 decode pipes, naturally considerably beefed up but still 4, and *the way I see the difference* is that it will have 2 dedicated decode-domain input buffers and 2 dedicated output buffers that share the 4 decode pipes in an SMT fashion (simultaneous multithreading, that is, decoding from 2 threads simultaneously like the FlexFPU and the L2, but on a vertical logic) plus a block-multithreading fashion. It can be tremendously more effective for those same 4 pipes, courtesy of the vertical multithreading scheme.
This is just to show how versatile and superior this "vertical multithreading" is... as if each group of pipeline stages in a *domain bordered by input/output buffers* were in itself a "vertical co-processor" (well, a bit exaggerated, lol)... and how much easier it will be to replace or improve those <domains> without having to re-design a whole chip, as in the traditional/synchronous pipelines of Intel cores. The AMD BD uarch is an asynchronous/semi-synchronous pipeline: it could much more easily gain efficiency by making parts run ahead... could much more easily change the resources/characteristics of each domain... could much more easily give each module 3 or 4 integer thread cores/clusters or a number of *heterogeneous cores/clusters*... could even, without much difficulty, put ARM AArch64 integer cores where the x86 cores are now, lol...
Yes, many opinions label the BD uarch a failure (propaganda is rampant in all competitive, lucrative businesses), but IMHO it is not a failure; its *POTENTIAL* is clearly superior to anything Intel has... this asynchronicity and modularity *potential* is very, very difficult to get right (no wonder it reportedly took 8 years to finish the first iteration), but it could provide an accelerated path for improvements in successive iterations.
i like how it was clearer and more understandable (even though a bit of that went over my head). so in the end, it's more about thread execution, less about how many threads are in the queue. the module does process only two threads per cycle..... but it can process instructions from more than 2 threads...?
the arch design approach looks like it needs quite a bit of die space (seemingly unlike hyperthreading). does more space/resources needed mean more transistors -> more power use? so if amd doesn't provide effective, dynamic power management circuitry, the cpu may end up a powerhog (well, one of the many factors), yes?
personally i don't think anyone criticized bd uarch (well, i never have). that's something amd fanboys bring up every time fx cpus are being criticized. for example,"u makin fun of bd(8150)? how dare you make fun of bd (arch) (notice the addition and direction switch). it so much better than anything intel has and yet you buy $150 i7 cpus overpriced to $350(true, but nothing to do with 8150 or bd uarch lol)." iirc it was fx8150 (hyped as core i7 killer by amd fanboys quoting amd promo slides) that usually takes majority of the criticism.
gamerk316 said:
Spoiler
@hcl123 [Too much to quote
]
More or less right.
The reason Intel goes with the far weaker HTT is that, from a COST perspective, it's REALLY cheap to implement (just an extra register stack) for decent gains in some tasks (register-register math). In short, its performance increase is greater than the die space it eats up.
CMT, by contrast, is almost a second core in itself, just missing a few dedicated pieces of hardware. Much more powerful, but it eats a LOT more of the die as a result (since it's almost like adding a second core).
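gamerk316's trade-off can be put into a toy arithmetic model (the ALU counts echo the numbers discussed elsewhere in this thread; the model itself is a deliberate oversimplification, ignoring out-of-order effects entirely):

```python
# Toy throughput model of SMT vs CMT. SMT: two threads contend for one
# shared pool of ALUs. CMT: each thread gets its own smaller dedicated
# pool, at a higher die-area cost. Not real silicon figures.

def throughput(demand_per_thread, alus, threads, shared):
    """Ops/cycle achieved. 'demand_per_thread' = ALUs a thread could
    use each cycle; 'alus' = pool size (total if shared, else per
    thread)."""
    if shared:
        return min(demand_per_thread * threads, alus)
    return min(demand_per_thread, alus) * threads

# Two threads that could each keep 2 ALUs busy per cycle:
smt = throughput(2, alus=3, threads=2, shared=True)   # 3: contention caps it
cmt = throughput(2, alus=2, threads=2, shared=False)  # 4: dedicated units
print(smt, cmt)
```

The same model also shows HTT's upside: a single thread on the SMT design still sees all 3 ALUs, while a CMT thread is capped at its own 2.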
so, core i7 prices are pretty much bs. we're paying over $100 for so little. all it takes is for intel to fire its laser at the cpu and voila, it's a core i5, $100 cheaper because it doesn't have hyperthreading! who knows, maybe intel doesn't even laser off htt, just uses some kind of software trickery to disable it. in any case,
truegenius said:
Spoiler
60+60 can happen, as we also hear that 'htt is causing performance loss'
but it seems like it doesn't happen too often
i will need to do some tests on a system with htt capability to check if it's 100+20 or 60+60 for the majority of apps
but currently i only have a 1090t
(and other classmates have core2duos
and some have i3s (as per my recommendation
), which means i have the most powerful processor in my group
)
if the 4th gen i3 can overclock well then i may sell my 1090t
and will test htt and performance loss.
we can only dream of an unlocked core i3. short of buying a core i5 655k, there won't be a new unlocked core i3. something like that would cannibalize everything higher than core i3, especially in non-us markets. it'd be a downgrade from your 1090t anyway (2c vs 6c, while games are starting to use 4 cores).
-
Reply to de5_Roy
mayankleoboy1 said:
palladin9479 said:
Except that is NOT how that works, it would be 60+60 for the Intel as you have two demanding threads both contending for CPU time. There is no preferred / virtual or whatever BS people want to use "core" in HTT. Both register stacks are treated the same from the OS perspective and are task switched accordingly. The difference is Intel worked with MS to ensure the NT kernel recognizes the "HTT" feature and is careful about scheduling multiple threads simultaneously on a single core.
IIRC, in HTT, the second core has only enough registers to save the thread context, and not do any processing. The real and the virtual core only have a single ALU between them. So the second thread is executed only if the first one is I/O-blocked or instruction-blocked. Am I correct?
Since we're talking Intel HT, there is no second core. There is one core that just happens to have two exposed x86 register stacks (EAX, EBX, ECX, EDX, etc.). That one core has three internal integer units (amongst other things) that process micro-ops. Externally the decoder takes two x86 macro-instructions, decodes them into a set of micro-instructions, then schedules them onto internal resources. Those three ALUs and the MMU (ld/st unit) must process instructions from both targets, so we're talking 1.5 ALUs per target (thread). Now it's not divided exactly that way: the front end does code analysis and, using out-of-order execution, determines the most efficient sequence in which to process those micro-instructions (not x86). One stream of binary code typically won't use more than one ALU at a time, and rarely more than two. Intel capitalized on this fact, and thus HT was born. It works as long as the OS does not schedule two threads onto the same core when other cores are idle; essentially the OS must be "HT-aware," which most modern OSes are. That's why occasionally a program will do worse with HT enabled: its code multi-tasks well and actually uses most of the processor's resources, and resource contention then causes a loss of performance within the context of a single task.
In BD/PD's case each module fully dedicates two integer units and one MMU per exposed register stack. The above code that could use three ALUs instead has to use two, though it's guaranteed full access to those two. There is, however, some risk in sharing the branch predictor, instruction decode, and L2 cache system.
-
Reply to palladin9479
Quote:
so, core i7 prices are pretty much bs. we're paying over $100 for a so little. all it takes for intel to fire its laser at the cpu and voila, it's a core i5, $100 cheaper because it doesn't have hyperthreading! who knows, may be intel doesn't even laser off htt, just uses some kind of software trickery to disable it. in any case,
CPU microcode. They used to use things like read-only registers that were programmed at the factory, but with "upgradable CPUs" being a thing, I surmise it's just a bit of microcode that tells the CPU what model it is and what features are supported. VIA's CPUs were known to have their flags register open to being changed through software; it's how we can trick any piece of software into thinking it's running on an Intel CPU.
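Whatever the disabling mechanism, the OS only ever sees the resulting feature flags. On Linux they are listed in /proc/cpuinfo; a minimal sketch of checking one (the sample line below is made up, but 'ht' is the real x86 flag name):

```python
# Whether a feature was fused off or hidden by microcode, it simply
# won't appear in the flags the CPU reports. On Linux those flags end
# up in /proc/cpuinfo. Minimal sketch; the sample line is fabricated.

def has_flag(cpuinfo_text, flag):
    """Check one CPUID-derived feature flag in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return flag in line.split(":", 1)[1].split()
    return False

sample = "flags\t\t: fpu vme de pse tsc msr pae ht sse sse2"
print(has_flag(sample, "ht"))    # True
print(has_flag(sample, "avx2"))  # False

# Real use on a Linux box:
# has_flag(open("/proc/cpuinfo").read(), "ht")
```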
-
Reply to palladin9479
antiglobal said:
So, when are Steamroller CPUs supposed to come out?
gamerk316 said:
antiglobal said:
So, when are Steamroller CPUs supposed to come out?
/thread
that's right, the day excavator thread starts.
amd's cpu roadmap said they'd release a new cpu lineup every year with 10-15% better perf/watt, which put sr cpus in 2013. they were supposed to come out in october this year, keeping in line with bd and pd releases but now it looks like q1 2014.
-
Reply to de5_Roy
palladin9479 said:
Nice summary, I would add that when I, or others, refer to it as a "virtual core", it's because it only possesses a minimum of resources. It's really like an "internal trick" using resources of the same core internally but with an extra ALU and a few other pieces. It's a low die space way to gain a little extra "oomph" under the right circumstances, but, not under all circumstances.
As a side note: it's a bit misleading for some to look at something like a Ubuntu interface showing "siblings" as 8 on an i7-3770K with 4 cores and HTT enabled...though Linux is smart enough to load the "real" cores first (as most OSes are these days). Generally they will add complementary work to an HTT "virtual core" so that the same resources are being used toward one larger thread goal.
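The "siblings" count comes from the Linux topology files under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list (that path is the real sysfs location; the example layout below is a hypothetical 4C/8T chip like the 3770K). A small sketch of how a tool might turn those sibling lists into a physical-core count:

```python
# Linux reports each logical CPU's HT siblings in
# /sys/devices/system/cpu/cpu<N>/topology/thread_siblings_list,
# using a comma/range format like "0,4" or "2-3". Parsing those lists
# is how tools tell 4 physical cores from the 8 "siblings" shown.

def parse_cpu_list(text):
    """Expand a kernel CPU list string like '0,4' or '0-3' to a set."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus

def physical_cores(sibling_lists):
    """Count distinct cores given each logical CPU's siblings string."""
    return len({frozenset(parse_cpu_list(s)) for s in sibling_lists})

# Hypothetical 3770K-style layout: 8 logical CPUs, pairs (0,4)..(3,7).
layout = ["0,4", "1,5", "2,6", "3,7", "0,4", "1,5", "2,6", "3,7"]
print(physical_cores(layout))  # 4
```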
-
Reply to 8350rocks
8350rocks said:
As a side note: It's a bit misleading for some that look at something like a Ubuntu interface showing "siblings" being 8 on an i7-3770k with 4 cores with HTT enabled...though Linux is smart enough to load the "real" cores first (as most OS suites are these days). Generally they will add complimentary functions to a HTT "virtual core" so that the same resources are being used toward one larger thread goal.
Remember that from an OS perspective, all it knows are register stacks. The OS doesn't (and shouldn't) care about the internal layout of the CPU architecture, just how many register stacks can be loaded. Hence why most OSes show 8 cores for a 3770k.
That being said, if not loaded properly, HTT can kill performance (see: early Pentium 4). So schedulers typically look at the HTT bit when loading cores, and try to avoid using the HTT core if at all possible. Personally, I think AMD could have saved itself a bit of grief with BD if they had simply done the same thing, rather than have MSFT change their scheduler via a patch to do basically the same exact thing.
-
Reply to gamerk316
mayankleoboy1
April 29, 2013 7:11:34 AM
palladin9479 said:
According to CharliD at S|A, modern Intel i3s and i5s have the additional features of the chip fused off during fabrication, so microcode won't unlock a processor anymore.
It's very silly that Intel fabricates a complete chip, then intentionally fuses off some features so that it can sell the crippled chip at a lower price point.
-
Reply to mayankleoboy1
mayankleoboy1
April 29, 2013 7:14:20 AM
BTW, S|A is comparing the OpenCL perf of a 3770K and an A10-5800K.
result: it's not what you think .....
http://semiaccurate.com/assets/uploads/2013/04/Intel-Op...
-
Reply to mayankleoboy1
mayankleoboy1 said:
BTW, S|A comparing OpenCL perf of a 3770K and a A10-5800K .result : Its not what you think .....

http://semiaccurate.com/assets/uploads/2013/04/Intel-Op...
the !@#$ is wrong with the world today... everything is going upside down.... i was aware of hd4k's luxmark advantage but these...
i think there are several reasons for this happening.
1. drivers.
2. drivers.
3. it looks like the benches were hyperthreading aware, in the last cpu+gpu bench, the core i7 gave some serious thrashing to poor li'l trinity.
4. l3 cache
5. this one is frequently denied(by amd fanboys) - cpu bottleneck. i saw opencl benches with a llano where the cpu seemed to be holding the igpu back, the same thing seems to be happening here.
if the bench is fully igpu-bound, the 7660d will have advantage. if there's a cpu component, the hd4000 will close the gap.
then again, this bench means very little for one simple reason - price. i'd like to see these benches re-run with core i3 3225... or even a core i5 3570k. when you realize that you get 8% faster overall performance with 3770k for $200 more, the apu looks better.
edit: radeon hd 7730 revealed? are we seeing the future 'dual graphics' candidates? would be nice. ^_^
http://www.techpowerup.com/183314/AMD-Readies-Radeon-HD...
imo entry level gcn 2.0 cards would be better for kaveri.
-
Reply to de5_Roy
JAYDEEJOHN said:
Of course the cpu is holding the igpu back, which is just more low-hanging fruit for AMD, since Intel can't squeeze out much more here, AMD can, all the while improving their igpu/gpus as well, and I've a feeling the driver team is getting more attention nowadays as well
i think the rumored 6 core kaveri has the potential to be the overall opencl performance leader. even if it doesn't turn out like that, amd will still have the price advantage.... as long as they don't price it against core i5.
-
Reply to de5_Roy
de5_Roy said:
3. it looks like the benches were hyperthreading aware, in the last cpu+gpu bench, the core i7 gave some serious thrashing to poor li'l trinity.
5. this one is frequently denied(by amd fanboys) - cpu bottleneck. i saw opencl benches with a llano where the cpu seemed to be holding the igpu back, the same thing seems to be happening here.
Basically that's mostly #3; the i7-3770K simply has more raw muscle...and there is something to be said for that. But as you point out further down, 8% more performance is not worth $200.00. I would hold out for Richland first (10-15% gain over the A10-5800K).
Quote:
if the bench is fully igpu-bound, the 7660d will have advantage. if there's a cpu component, the hd4000 will close the gap.
then again, this bench means very little for one simple reason - price. i'd like to see these benches re-run with core i3 3225... or even a core i5 3570k. when you realize that you get 8% faster overall performance with 3770k for $200 more, the apu looks better.
100% agree...the test should really be run between 2 products in at least a similar market segment.
Quote:
edit: radeon hd 7730 revealed? are we seeing the future 'dual graphics' candidates? would be nice. ^_^
http://www.techpowerup.com/183314/AMD-Readies-Radeon-HD...
imo entry level gcn 2.0 cards would be better for kaveri.
Catalyst v13.4 drivers already support A10-5800k + HD 7750 CF. They will also be getting some great driver updates from Kudori in the very near future.
-
Reply to 8350rocks
and hd6000 wasn't that good in opencl