Intel to Share Next-Gen Poulson CPU Details

Last response: in News comments
Anonymous
January 26, 2011 4:02:57 PM

Is this a server processor or the next step after Sandy Bridge?
Score
0
January 26, 2011 4:07:55 PM

What they won't tell you is this processor will set you back $1500 O.O
Score
-1
January 26, 2011 4:39:57 PM

50MB of cache.

32KB × 8 (L1) = 256KB
256KB × 8 (L2) = 2MB
So the remaining ~48MB is L3 cache?? That's freaking huge! These definitely wouldn't be for desktop use. These are server chips.
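The arithmetic above can be sanity-checked with a quick sketch (the 8-core count and per-core cache sizes are the poster's assumptions, not confirmed Poulson specs; the exact remainder comes out just under 48MB):

```python
# Toy check of the cache arithmetic in the post above.
# Core count and per-core sizes are the poster's assumptions.
KB, MB = 1024, 1024 * 1024
cores = 8
l1_total = 32 * KB * cores        # 32KB L1 per core
l2_total = 256 * KB * cores       # 256KB L2 per core
total_cache = 50 * MB             # headline figure from the article
l3_total = total_cache - l1_total - l2_total
print(l1_total // KB, "KB L1,", l2_total // MB, "MB L2,", l3_total / MB, "MB L3")
# -> 256 KB L1, 2 MB L2, 47.75 MB L3
```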
Score
2
January 26, 2011 4:41:43 PM

BulkZerker: What they won't tell you is this processor will set you back $1500 O.O


...which is a bargain for server applications ;) 
Score
2
January 26, 2011 4:49:07 PM

BulkZerker: What they won't tell you is this processor will set you back $1500 O.O


I'm sure it will be several times $1500.
Score
2
January 26, 2011 4:49:50 PM

timothyburgher@gmailcom: Is this a server processor or the next step after Sandy Bridge?


No, it is not the next step after Sandy Bridge; the Sandy Bridge-E series is, and those will be very fast. This CPU is a different line entirely.
Score
0
January 26, 2011 5:00:54 PM

Ivy Bridge is after Sandy Bridge. Stupid a$$ names if you ask me.
Score
-1
Anonymous
January 26, 2011 5:06:49 PM

Itanium processors cost much more than $1500. And Itanium L1/L2 caches are much larger than 256KB/2MB. What, no one here knows or remembers what Itanium is?
Score
2
January 26, 2011 5:40:11 PM

LoL, Itanium is for nothing but special datacenters. I wouldn't even put this in the same class as servers; this thing runs supercomputer nodes and things of that nature. It requires a special operating system that runs the Itanium instruction set. I think they made a Server 2003 Itanium edition at one point, since it's not an x86 processor. The old ones would set you back 3500 dollars, and I don't know much about these new ones. When I left the eval lab at Intel back in 2003 we had tons of these things.
Score
0
January 26, 2011 5:51:31 PM

gfdsgfds: Itanium processors cost much more than $1500. And Itanium L1/L2 caches are much larger than 256KB/2MB. What, no one here knows or remembers what Itanium is?

Very big, very expensive and unique chips. Thought x86 was bloated? Itanium uses an EPIC instruction set (explicitly parallel instruction computing) to attempt to achieve a much higher instructions per clock ratio.

I have absolutely no use for it, but damn, I want one.
Score
2
January 26, 2011 5:52:12 PM

Geez, this is really old news. I saw this same article at least 3 months back on another site. Tom's is getting good at being late to the news. In this case, really late.
Score
-1
January 26, 2011 5:52:21 PM

For those of you commenting on the price:
Itanium is not a CPU any one person would buy. This is a mainframe replacement, meant to run an entire data warehouse, so stop treating the price as if Intel is trying to take advantage.
You think the price is high? Spec out a high-end Sun mainframe or a Cray, because that is what Itanium is meant for.
This is not for Crysis.
Score
1
January 26, 2011 6:01:10 PM

Holy christ, that's huge in many respects. I'd like to see the blade this thing goes into!
Score
0
January 26, 2011 6:07:44 PM

jdamon113: For those of you commenting on the price. Itanium is not a CPU any one person would buy. [...] This is not for Crysis.

Crysis on the cloud, perhaps!
That's what this thing is designed for. And I'm pretty sure Zuckerberg would buy one himself. Future evil dictators have to have a way to manage their covert ops and manipulation of information somehow! All pun intended.
Bad example, bad citing; just trying to keep my mind off the fact that George Soros makes Dick Cheney look like a two-bit player at a poker game.
Score
1
January 26, 2011 6:17:02 PM

Quote:
The on-die cache seems to a bit smaller than the 54 MB that Intel discussed in the past.


Should insert a 'be' after the 'to' and before the 'a'
Score
0
January 26, 2011 7:20:50 PM

Will it max out Crysis2? :-P
Just kidding. :)  Huge die though...
Score
0
January 26, 2011 7:54:07 PM

SSDs are breaking the Gb/s mark now.
If you go up to a 256MB cache, you could probably get away with something like 20Gb/s to a super-fast SSD, and the RAM disappears.
Score
0
January 26, 2011 10:19:31 PM

This article completely misses the most important thing about this release. The current Itanium is 6-issue (two bundles of three instructions each); Poulson will be a 12-issue processor, which could mean even greater IPC than is currently possible. I'm curious whether they will dynamically allocate bundles based on workload: if you have one thread, give it all the resources; if you have several, give each thread a bundle; or somewhere in between.

It will be interesting to see if Intel can finally get performance superiority over the horrible x86 instruction set processors. They've always been behind in manufacturing technology, and then they did weird things like make the L1 cache accessible in only one clock cycle (severely limiting clock speed), but with manufacturing parity, and finally two clock cycle access to L1 cache, this processor had better be able to beat processors crippled by an obsolete, difficult and inefficient instruction set.

Otherwise, they should have just gone with RISC, instead of VLIW.
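The dynamic bundle-allocation idea floated above can be sketched as a toy: a 12-issue core has four bundles of three slots each, handed out round-robin across however many threads are active. This is purely illustrative speculation, not how Poulson is documented to work:

```python
# Toy sketch: distribute a core's instruction bundles across active threads.
# A 12-issue core = 4 bundles of 3 slots (per the post above). Hypothetical.
def allocate_bundles(threads, bundles=4):
    """Return {thread_id: bundle_count} for a round-robin split."""
    if threads == 0:
        return {}
    share = {t: 0 for t in range(threads)}
    for b in range(bundles):
        share[b % threads] += 1   # hand bundles out one at a time
    return share

print(allocate_bundles(1))  # one thread gets all four bundles: {0: 4}
print(allocate_bundles(4))  # four threads get one bundle each
```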
Score
-1
January 27, 2011 12:59:36 AM

Itanium was very bad as a server CPU. It had its own special *Intel* branded instruction set that you had to compile your OS / drivers / applications for. It used microcode emulation to run x86 instructions, so it appeared to be x86 compatible but was horribly inefficient at running x86 code. Because of this, Itanium sucked for anything that needed x86 instructions (90+% of the commodity server market), which left the "special use" systems that run an OS / apps built specifically for a particular CPU architecture.

Itanium's competition isn't AMD / Pentium; it's things like Sun SPARC and IBM POWER. And this is an arena where Intel gets spanked pretty badly. On one hand you have the SUN SPARC T3 CPU, which is,
Quote:
"A 16-core SPARC SoC processor enables up to 512 threads in a 4-way glueless system to maximize throughput. The 6MB L2 cache of 461GB/s and the 308-pin SerDes I/O of 2.4Tb/s support the required bandwidth. Six clock and four voltage domains, as well as power management and circuit techniques, optimize performance, power, variability and yield trade-offs across the 377mm2 die"


In reality this means a single T3 CPU can process 32 integer (2 integer units per core), 16 floating-point (1 FPU per core), and 16 memory (1 MMU per core) operations per cycle. Each core has eight sets of register stacks, allowing each core to run eight unique threads. Each CPU has four DDR3 memory channels to its own dedicated memory, two 10Gb Ethernet ports, and its own set of I/O circuitry. Each core has built-in crypto circuitry for accelerating encryption and hashing. A single server would have four of these CPUs inside it along with 128GB ~ 1TB of memory, depending on configuration. The only downside is that each CPU is clocked at 1.67Ghz, so single-thread performance is rather low compared to its IBM POWER counterpart. These SPARC CPUs are designed for databases and massively parallel servers; when you need to service thousands of users while processing hundreds of transactions per second, you use a SPARC.

http://en.wikipedia.org/wiki/SPARC_T3
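The throughput figures quoted above follow directly from the per-core numbers, as a back-of-envelope check shows (figures are the poster's, taken at face value):

```python
# Back-of-envelope check of the T3 numbers above: 16 cores, with
# 2 integer units, 1 FPU, 1 MMU and 8 hardware threads per core.
cores = 16
int_ops = cores * 2                    # 32 integer ops per cycle
fp_ops = cores * 1                     # 16 FP ops per cycle
threads_per_chip = cores * 8           # 128 hardware threads per chip
threads_4way = threads_per_chip * 4    # 512 threads in a 4-way glueless system
print(int_ops, fp_ops, threads_per_chip, threads_4way)  # 32 16 128 512
```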

Their main competitor is IBM and its POWER CPU, namely the POWER7.

Quote:
POWER7 has these specifications:[5][6]

* 45 nm SOI process, 567 mm2
* 1.2 billion transistors
* 3.0 – 4.25 GHz clock speed
* max 4 chips per quad-chip module
  * 4, 6 or 8 cores per chip
    * 4 SMT threads per core (available in AIX 6.1 TL05 (released in April 2010) and above)
    * 12 execution units per core:
      * 2 fixed-point units
      * 2 load/store units
      * 4 double-precision floating-point units
      * 1 vector unit supporting VSX
      * 1 decimal floating-point unit
      * 1 branch unit
      * 1 condition register unit
  * 32+32 kB L1 instruction and data cache (per core)[7]
  * 256 kB L2 cache (per core)
  * 4 MB L3 cache per core, with up to 32MB supported. The cache is implemented in eDRAM, which does not require as many transistors per cell as standard SRAM,[4] so it allows for a larger cache in the same die area.


What this means in practice is that you get four threads per core, with a maximum of four simultaneous instructions executed per core. Note that IBM POWER / AIX instructions differ from SPARC instructions, so the two are very hard to compare. POWER focuses more on getting a single task done as fast as possible, where SPARC focuses on getting as many tasks done at once as possible. POWERs are clocked at 3 to 3.8GHz per CPU (they can shut down cores to boost speed to 4.25GHz) and are many times bigger than a SPARC CPU, which often leads to unfair CPU-vs-CPU comparisons. Better comparisons have been done with system-vs-system competitions, and they each win at different things (T3 at web serving / database work, POWER at financial calculations / simulations).
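The same kind of sanity pass works on the POWER7 spec list: the "up to 32MB" L3 and the thread counts follow from the per-core figures quoted there.

```python
# Arithmetic implied by the POWER7 spec list quoted above.
cores_max = 8                # max cores per chip
smt = 4                      # SMT threads per core
chips_per_module = 4         # quad-chip module
threads_per_chip = cores_max * smt                        # 32
threads_per_module = threads_per_chip * chips_per_module  # 128
l3_max_mb = 4 * cores_max                                 # 4 MB/core -> 32 MB
print(threads_per_chip, threads_per_module, l3_max_mb)    # 32 128 32
```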

These are the beasts that Itanium must compete against, not home gaming rigs and the low-to-medium server market. Everyone rejected Itanium originally because of its horrible x86 performance; the commodity market doesn't want to recompile / redevelop its entire software base for a single CPU architecture.
Score
2
January 27, 2011 1:26:57 AM

OK, some pricing info. I'm very familiar with purchasing Sun systems, so I'll list the default quote off their site for a single system.

https://shop.sun.com/store/product/578414b2-d884-11de-9...
Config #3,
$177,057.00 Each,
4x SUN Sparc T3 CPU,
512 GB (64 x 8 GB DIMMs) Memory,
Internal Storage: 600 GB (2 x 300 GB 10000 rpm 2.5-Inch SAS Disks),
Max Internal Storage: 2.4TB (8 x 300GB 10000 rpm 2.5-Inch SAS Disks),
Ethernet: 4 x 1 Gb 10/100/1000 MBs Integrated Ethernet Ports. Option Slot for 8 x 10 GbE XAUI Ports, 16 PCIE express module slots
Power: 4 PSU's @ 12.6 A @ 200 V AC
Space: 5RU, 8 systems per industry standard rack.

You need to purchase the 10GbE adapter separately, the circuitry already exists inside the CPU but you need the physical connector to be either copper or fiber, your choice. And while the system itself is 177 grand a pop, the specialized software this is most likely running will be twice that price.

I can't get a quote on an IBM Power 755 without contacting a sales agent, I figure it will be similar to the above SPARC range. Bonus points to the IBM for being very Linux friendly.
Score
2
January 27, 2011 3:11:16 AM

Quote:
Is this a server processor or the next step after Sandy Bridge?


It's a server processor for a specific area, mainly high-end 64-bit. It uses IA-64, which was developed by Intel and released back in 2001 as their successor to x86. It didn't fare well, though, because it could only emulate x86 code, so it was slower than contemporary x86 CPUs, while being much faster at native 64-bit work. Instead we are stuck with x86-64 because of AMD, and while I understand why, it also stuck us with the inferior x86.

Still, the masses are hard to switch.

BulkZerker said:
What they won't tell you is this processor will set you back $1500 O.O


Probably a bit more than that. It's only sold in specific areas, and current high-end 4c/8t Itanium CPUs cost $3838.00 each in 1ku quantities. But as I said before, in the area it plays in, it is a beast and hard to beat.

Still, it's nothing we will ever see, but the technology behind it is impressive.
Score
0
January 27, 2011 5:15:38 AM

Actually, x86-64 (AMD64) was a brilliant idea from AMD. Instead of trying to push a new 64-bit instruction set, they just extended the existing x86 instruction set. Don't lament it as some poor idea; there have been "64-bit" instruction sets out for years, and the commodity market didn't pick up on them for a very good reason. UltraSPARC is a good example (sorry, I'm mostly a SUN guy): it's a venerable 64-bit RISC with amazing performance. There was also PPC, which Apple used for years as a commodity platform. This isn't even an "OMFG evil Microsoft" fault, because MS made a version of NT for the DEC Alpha, an extremely high-performance 64-bit RISC CPU. It didn't sell well, DEC eventually got bought out, and MS dropped support for it and focused purely on its x86 software platform. When Intel released Itanium, MS got behind them and built an NT 5.0 (Windows 2000) kernel for it; it supported 64-bit and everything. Application performance on it was horrendous unless the vendor recoded their application specifically for Itanium. Very few of them did, and Itanium languished; some consider it dead.

AMD's push to create AMD64 was good because it allowed the existing industry to adopt and grow gradually, rather than being forced over all at once in a gatekeeper scenario (Intel was very stingy with Itanium licenses to HW manufacturers). You ~need~ competition at all levels to keep people honest and for the industry as a whole to progress. AMD licensed their 64-bit technology to Intel; would Intel have done the same if they had created the 64-bit extension? (No, they didn't with Itanium.) How long have AMD64 CPUs been available? Application developers are just now including 64-bit binaries inside their programs; how long until their code is 64-bit exclusive? It takes application makers years, if not a decade or more, to migrate architectures.

So please, do not blame AMD for the current state of the commodity market. If anything you should be praising them; they are responsible for launching us out of the NT x86 world and bringing mainstream 64-bit computing to the home user. Intel took their shot with Itanium and lost.

If you don't like x86 or AMD64, then use SPARCv9 or PPC. I personally run a SunBlade 2000 with dual UltraSPARC IIIi @ 1.2GHz, 8GB memory, a (SUN) 146GB FC-AL 10K RPM disk + 76GB FC-AL 10K RPM, and an XVR-1200 graphics adapter. The OS is Solaris 10 with OpenGL support and a bunch of my own stuff running. Next to it I have my x86-64 machine that I use for gaming.
Score
1
January 27, 2011 6:16:00 AM

I use Itanium/OpenVMS daily and it gets owned by Xeons/Linux in about any algorithm. The Itanium compiler is an EPIC failure and no amount of HW tinkering is going to change that.
Score
1
January 27, 2011 6:26:51 AM

Here is another problem with VLIW and the Itanium architecture. The binary encoding of instructions is static, and the HW is incapable of executing them out of order. The original Itanium had six execution units, so binaries compiled for that CPU can execute up to six instructions in one pass. But if the user later upgrades to a CPU with eight or twelve issue slots, the binary can still only use six and the other units are permanently stalled. You would have to go back and recompile ~everything~ to take advantage of the 12-issue model. And if in the future they introduced a 16- or 24-unit model, you'd have to do all that recompiling over again.

Now let's reverse it: let's say MS compiles W2K8 for the *newer* Itanium with 12 issue slots. All the application makers do this too; your entire software base does. Guess what happens if you try to run those binaries on the older 6-issue hardware? They will not execute properly, if at all. They are statically encoded to send up to 12 instructions to a 6-instruction system. It will work fine until the code tries to send a 7th simultaneous instruction, at which point you get an exception that causes a non-maskable interrupt (NMI); most likely the system crashes. Avoiding this would require the compiler to emit binary code for multiple variants of the CPU and have the binary check whether it should execute 6-, 12-, 16- or 24-issue code.

This is all because VLIW architectures are perfect when the software is written directly for a very specific, known target, like the cores used in DSPs or GPUs. It's absolutely a bad idea in a general-purpose CPU, which can be upgraded or switched out, comes in multiple flavors, and may be expected to execute any random piece of code at any time.
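The width-mismatch failure mode described above can be modeled as a toy: a machine with N issue slots receives statically packed bundles and faults on any bundle wider than N. Whether real IA-64 hardware actually behaves this way is disputed later in the thread; this only illustrates the claim as stated.

```python
# Toy model of the claimed failure mode: statically packed bundles
# hitting hardware with fewer issue slots. Illustrative only.
class IllegalBundle(Exception):
    pass

def execute(bundles, issue_width):
    """Run a list of bundles (each a list of instructions); fault on overwide ones."""
    executed = 0
    for bundle in bundles:
        if len(bundle) > issue_width:
            raise IllegalBundle(f"{len(bundle)}-wide bundle on "
                                f"{issue_width}-issue hardware")
        executed += len(bundle)
    return executed

print(execute([["add", "ld", "st"]] * 2, issue_width=6))  # fits: prints 6
try:
    execute([["i"] * 12], issue_width=6)  # 12-wide bundle on 6-issue hardware
except IllegalBundle as e:
    print("fault:", e)
```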
Score
-1
January 27, 2011 1:25:56 PM

palladin9479: Here is another problem with VLIW and Itanium architecture. The binaries encoding of instructions are static and the HW is incapable of executing them out of order. [...]


Sadly, you don't know what you're talking about. The compiler optimizes instruction packing for the hardware, but the packing does not dictate how the hardware executes. You'd just be sending four packets instead of two, and they would execute fine on older hardware, although possibly with a small performance penalty since the code isn't optimized for it.

The same is true for x86. Remember the Pentium 4 and how code had to be optimized for it to execute well?
Score
-1
January 27, 2011 7:30:35 PM

mayankleoboy1: I do want one, but I really have no use for it


These are CPUs for systems like IBM built back in the 70's and 80's, where the global market in number of sales (systems, not CPUs; they'll sell a few thousand CPUs in each system install) was maybe Five... six.

You don't sell a kidney to get one, you mortgage a small European country.

I too am wondering if this will be as epic a failure as the previous Itaniums, or if Intel has truly learned their lesson and reworked the microarchitecture.

The cooling for these systems was typically a "brick" module, liquid cooled, using some materials I forget (barium?) which, if it leaked, was a big hazmat incident.

A buddy of mine from SGI used to play with these. He was a MIPS fan mostly.
Score
0
January 28, 2011 1:47:29 AM

Quote:
Sadly, you don't know what you're talking about. The hardware uses the compiler to optimize instruction packing, but it does not execute the instructions. You'd just be sending four packets instead of two, and they would execute fine on older hardware, although possibly with a small performance penalty since it's not optimized for it.

The same is true for x86. Remember the Pentium 4 and how code had to be optimized for it to execute well?


Not even close. The HW doesn't use a compiler to do a damn thing; the compile operation happens long before the code gets near the HW. VLIW architectures have no instruction-reordering / branch-prediction units; instead the compiler is supposed to do all that work and render it into the binary. The idea is that the compiler has vastly more time and resources to determine the proper ordering and branching of instructions than a HW branch-prediction unit would have, and thus should be more accurate. The compiler analyzes the code, determines the code path and which instructions can execute in parallel, then encodes everything into binary form. The maximum number of simultaneous instructions is based on the target architecture: if the target can process 6 instructions, the compiler encodes a maximum of 6 simultaneous instructions; if the target can process 12, it encodes a maximum of 12. This is what ILP is all about: encoding multiple instructions into a single command and sending it to the CPU to be processed at once.

On a typical superscalar CPU there is instruction-ordering and branch-prediction hardware that analyzes incoming code and determines which instructions can be processed simultaneously and which branches will be taken. This HW doesn't exist inside a VLIW processor.

In theory VLIW should be faster, and it is if the software is encoded explicitly for VLIW, doesn't change, and doesn't have much random branching. The breakdown happens the moment your architecture changes, since older binaries cannot take advantage of it, and in real-world scenarios where the context of the operation cannot be predicted at compile time. A compiler has no way to know what external inputs the program will receive and thus cannot possibly predict which way those dependent branches will go. A HW branch-prediction unit does have knowledge of the runtime context and can make reasonably accurate predictions when the dependent instruction references something in memory.

Ex,

Something like this would be easy for a VLIW (pseudo ASM):
STORE AX, 0 (set AX register to 0)
ADD AX, 1, 2 (add 1 and 2 and store the result in AX)
ADD AX, 4 (add 4 to the AX register)
CMP AX, 7 (is AX equal to 7?)
JE AX_EQUAL_TRUE (jump to label AX_EQUAL_TRUE if equal)
JMP AX_NOT_EQUAL (else jump to AX_NOT_EQUAL)

:AX_EQUAL_TRUE
more code

:AX_NOT_EQUAL
more code

A compiler could easily see what AX will be and build the binary so that AX_EQUAL_TRUE is executed at the same time AX is being computed, knowing it would jump there anyway. But if we modify the code such that,

STORE AX, 0 (set AX register to 0)
ADD AX, 1, [C800:007F] (add 1 and whatever is at address C800:007F and store in AX)
ADD AX, 4 (add 4 to the AX register)
CMP AX, 7 (is AX equal to 7?)
JE AX_EQUAL_TRUE (jump to label AX_EQUAL_TRUE if equal)
JMP AX_NOT_EQUAL (else jump to AX_NOT_EQUAL)

:AX_EQUAL_TRUE
more code

:AX_NOT_EQUAL
more code

In this example a memory address is being referenced for the value added to AX. If a previous, predictable operation wrote a value to this address, fine; if not, and the value is unknown at compile time, then the compiler has no idea whether the branch will be taken. It has to make a WAG (wild a$$ guess) and hope it was right. If during execution the guess turns out wrong, the CPU stalls as the instruction pipeline is flushed, the branch is taken, and execution reloads from there. Hopefully the information is in L1 cache so the stall won't be long; if it's not, then we must go to L2 cache and things start to look much worse for our little program.

The issue hinges on whether the compiler can make the proper decision without knowing the context the program will run under. If so, great; if not, a HW branch-prediction engine is better. The industry heavyweights (SUN / IBM) have already shown that VLIW does not work well in a real production environment; there is no magic secret sauce inside the compiler that makes the impossible possible. And yes, if a VLIW binary is encoded with 7 simultaneous instructions and tries to execute on a 6-issue system, the CPU will trigger an exception and an NMI; it's an illegal instruction. VLIW CPUs cannot reorder instructions; they must execute what they're given, and if they can't execute an instruction they must discard it and let the program know.
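The compile-time scheduling described above can be sketched in miniature: pack instructions into fixed-width bundles, starting a new bundle whenever an instruction depends on a register written earlier in the current bundle. This is a simplified illustration of the idea, not real IA-64 compiler behavior:

```python
# Minimal sketch of static (compile-time) bundle packing for a VLIW:
# independent instructions share a bundle; a dependency forces a new one.
def pack(instrs, width=3):
    """instrs: list of (dest_register, source_registers) tuples."""
    bundles, current, written = [], [], set()
    for dest, srcs in instrs:
        # flush the bundle if it is full or this instruction reads a
        # register written earlier in the same bundle
        if len(current) == width or written & set(srcs):
            bundles.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

# r1, r2 are independent; r3 needs both; r4 needs r3 -> three bundles
prog = [("r1", ()), ("r2", ()), ("r3", ("r1", "r2")), ("r4", ("r3",))]
print(len(pack(prog)))  # prints 3
```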
Score
1
January 28, 2011 9:16:59 PM

Wow, these people know a lot.
Thanks for sharing.
Score
1