There are some actual engineer-level definitions for processing units. A processor must contain at least three separate elements to be defined as a distinct processing core: an instruction control unit, an instruction execution unit, and an input/output unit. You can have a lot more, but those are the three core requirements to be considered a separate processing element, aka a "core". This was a very handy description until superscalar CPUs came along and abstracted the internal processing elements from the machine code coming in, so it gets tweaked a little bit. AMD's BD design is definitely two cores per module: each core has a control unit (scheduler), four execution units (2 ALU + 2 AGU), and an I/O unit to get data in and out. The front-end scheduler that is shared is the external scheduler, not the internal one. The FPU is actually a separate processor entirely; it's bolted on and uses the same I/O, but otherwise has its own registers and scheduler.
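To make the layout concrete, here's a rough sketch in C of how a BD module is organized. Purely illustrative: the names and grouping are my own shorthand, not AMD's.

```c
/* Rough sketch of a Bulldozer module's organization.
 * Illustrative only: names and structure are my own shorthand. */

typedef struct {
    int alu[2];              /* 2 integer ALUs */
    int agu[2];              /* 2 address generation units */
} ExecUnits;

typedef struct {
    const char *scheduler;   /* per-core (internal) integer scheduler */
    ExecUnits   exec;        /* 4 execution units: 2 ALU + 2 AGU */
    const char *load_store;  /* per-core I/O: gets data in and out */
} IntegerCore;

typedef struct {
    const char *front_end;   /* shared fetch/decode + external scheduler */
    IntegerCore core[2];     /* two full integer cores per module */
    const char *fpu;         /* shared FPU: its own registers and
                                scheduler, riding on the cores' I/O */
} BDModule;
```

Note that each IntegerCore carries all three required elements (control, execution, I/O), which is the argument for counting a module as two cores.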
Anyhow, the primary difference between AMD's and Intel's designs lies in Intel's caching technology, which is often mistaken for the memory controller, though the IMC also helps Intel quite a bit. I would say a good 90%+ of the "memory benchmarks" people run are actually cache benchmarks, since CPUs don't read directly from memory. When you do a memory read/write, it's the cache unit that performs the operation in local cache, then reports back to the CPU that it's finished while asynchronously writing the result back to main memory. The only time you predictably hit main memory is when your working set is much larger than the cache.
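You can see this for yourself with a pointer-chasing loop: average access time stays flat while the working set fits in a cache level, then jumps each time you spill to the next level and finally to DRAM. A minimal sketch (the random cycle is there to defeat the prefetcher; exact sizes and numbers will vary by chip):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chasing latency test: walk a randomly-permuted cycle so each
 * load depends on the previous one and the prefetcher can't guess the
 * pattern. When the array outgrows a cache level, ns/access jumps. */
int main(void)
{
    /* Working sets from well inside L1 to well past a typical L3. */
    for (size_t bytes = 16 * 1024; bytes <= 64 * 1024 * 1024; bytes *= 2) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;

        /* Build a random single-cycle permutation (Sattolo's algorithm). */
        for (size_t i = 0; i < n; i++)
            next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i;
            size_t tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }

        /* Chase pointers; every load is serially dependent. */
        size_t steps = 50 * 1000 * 1000;
        size_t idx = 0;
        clock_t t0 = clock();
        for (size_t s = 0; s < steps; s++)
            idx = next[idx];
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Print idx so the loop can't be optimized away. */
        printf("%8zu KiB: %6.2f ns/access (idx=%zu)\n",
               bytes / 1024, secs * 1e9 / steps, idx);
        free(next);
    }
    return 0;
}
```

Run it and you'll get a staircase, not a flat line, which is exactly why calling this a "memory benchmark" is misleading until the working set clears the last cache level.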
The differences between their cache technologies are pretty radical. Intel's is an inclusive cache: the contents of each level are also held in the level above it, so the L1 contents are kept in L2 and L3. That makes cache lookups really fast when working with small, repetitive datasets, and it leverages Intel's strength: a superior branch predictor and prefetch unit. They are really good at ensuring the instructions and data you need are there before your code requests them. AMD uses an exclusive cache design: the contents of each level are separate from the rest. When L1D fills up, the excess is moved to L2; when that fills up, it's moved to L3; when that's full, it gets dumped. AMD's predictor and prefetcher are significantly less advanced than Intel's, so to compensate they try to maximize the total size of the cache, and thus the probability that something you need is already in it. Their cache is much slower than Intel's, and BD's L2/L3 is even slower than Phenom II's due to the modular design. The BD CPU often stalled waiting on cache returns.
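The capacity trade-off is easy to put numbers on. With an inclusive design, effective capacity is just the last-level cache, since everything below it is duplicated there; with an exclusive design, the levels add up. A back-of-the-envelope sketch, using the commonly quoted sizes for a Sandy Bridge quad-core and a 4-module BD chip (treat those figures as assumptions):

```c
#include <stdio.h>

/* Napkin math on effective cache capacity, in KiB.
 * Inclusive: lower levels are duplicated in the LLC, so effective
 * capacity ~= LLC size. Exclusive: levels hold distinct lines, so
 * effective capacity ~= sum of all levels.
 * Sizes below are the commonly quoted figures (assumptions). */
int main(void)
{
    /* Sandy Bridge quad-core (inclusive): 4 x 32 KiB L1D,
       4 x 256 KiB L2, 8 MiB shared L3. */
    int sb_l3 = 8192;

    /* Bulldozer 4-module (exclusive-ish): 8 x 16 KiB L1D,
       4 x 2 MiB L2, 8 MiB shared L3. */
    int bd_l1 = 8 * 16, bd_l2 = 4 * 2048, bd_l3 = 8192;

    printf("Sandy Bridge inclusive: %d KiB effective (L3 only; L1/L2 duplicated)\n",
           sb_l3);
    printf("Bulldozer   exclusive:  %d KiB effective (L1+L2+L3 all distinct)\n",
           bd_l1 + bd_l2 + bd_l3);
    return 0;
}
```

Roughly 8 MiB effective versus 16+ MiB effective, which is the "maximize the size of the cache" compensation in action; AMD trades lookup speed for raw capacity.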
There are some other uArch differences. Intel uses 3 general-purpose ALUs per core; AMD uses 2 ALUs and 2 AGUs per core. AMD has four 256-bit FPUs per chip (one per module) that can act as eight 128-bit FPUs, while Intel has four 256-bit FPUs that can each process 2~3 instructions at once. It boils down to AMD making some design decisions more suited to server environments while trying to lower design costs by making their chips extremely easy to modify/customize. That last one is no small matter: typically you would spend thousands of engineering hours post-design doing timing tweaks and optimizing a chip's microcode, which results in a tightly timed, efficient CPU. AMD mostly skipped that step and used an automated design process; the result is BD's rather loose timings. AMD is continuing to work on those timings, which is why PD and SR have had fairly large efficiency (what you guys call IPC) improvements. Going modular had the benefit of letting them customize the chip for vendor-specific sales; it's no coincidence that the XBONE and PS4 both ended up designing around AMD.
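To put the FPU comparison in rough numbers, here's the lane arithmetic implied by the figures above. Big caveat: this ignores issue limits, FMA versus separate add/mul pipes, and scheduling, so it's napkin math, not a benchmark:

```c
#include <stdio.h>

/* SIMD lanes of single-precision (32-bit) math available per cycle,
 * taken straight from the unit counts above. Ignores issue width,
 * FMA vs add/mul split, and scheduling -- napkin math only. */
int main(void)
{
    int sp_bits = 32;

    /* BD chip: eight 128-bit FPU pipes (four 256-bit units split in half). */
    int bd_lanes = 8 * (128 / sp_bits);

    /* Intel quad-core: four 256-bit FPUs. */
    int intel_lanes = 4 * (256 / sp_bits);

    printf("BD    SP lanes/cycle: %d\n", bd_lanes);    /* 8 * 4 = 32 */
    printf("Intel SP lanes/cycle: %d\n", intel_lanes); /* 4 * 8 = 32 */
    return 0;
}
```

Same raw width on paper. The practical difference is that AMD splits it into more independent 128-bit pipes, which suits lots of narrow server threads, while Intel has fewer, wider units fed by a stronger front end.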