Intel's 1B Transistor chip by 2005

July 16, 2004 6:21:00 PM

Looks like Intel will be releasing this chip two years earlier than 2007. Plus dual core.
July 16, 2004 6:53:51 PM

link?

<font color=blue>My dick is so big, that my dick has a dick. And my dicks' dick is bigger than yours.</font color=blue>
July 18, 2004 10:39:48 PM

I heard 2005 for the Montecito, which will be based on the 90-nm process and feature up to half-a-billion transistors. So yeah, 1 billion by 2007 sounds like a cakewalk.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
July 19, 2004 1:18:47 AM

lets see if they can put those transistors to good use lol
July 19, 2004 2:38:48 AM

Most of it’s cache though; that’s the nice thing about IA-64, no need for a critical change in the architecture. 128 64-bit general purpose registers, 128 82-bit floating point registers, 64 1-bit predicate registers, 8 64-bit branch registers, 8 64-bit kernel registers. That’s all there ever needs to be for IA-64 (could add more, but what’s the point).

This is directly due to the fact that IA-64's instruction scheduling and resource utilization are determined by the compiler at compile time and are not negotiated or altered by the processor at runtime.
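
To make the compile-time scheduling idea concrete, here is a minimal sketch in Python. It is a toy, not how any real IA-64 compiler works: the instruction names, the register names, and the group width of 3 are all assumptions for illustration. It only shows the principle that the grouping of independent instructions is decided before the program runs, so the "hardware" would just execute the groups as given.

```python
# Toy compile-time scheduler: pack instructions into issue groups ahead of time.
# Each instruction is (name, destination, sources). All names are hypothetical.
from typing import List, Set, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]

def schedule(program: List[Instr], width: int = 3) -> List[List[str]]:
    """Greedy in-order grouping: an instruction joins the current group only if
    it neither reads nor writes a register written earlier in the same group."""
    groups: List[List[str]] = []
    current: List[str] = []
    written: Set[str] = set()
    for name, dest, srcs in program:
        depends = any(s in written for s in srcs) or dest in written
        if depends or len(current) == width:
            groups.append(current)          # close the group, start a new one
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        groups.append(current)
    return groups

program = [
    ("add1", "r1", ("r10", "r11")),
    ("add2", "r2", ("r12", "r13")),
    ("mul1", "r3", ("r1", "r2")),   # needs add1 and add2
    ("ld1",  "r4", ("r14",)),
    ("add3", "r5", ("r3", "r4")),   # needs mul1 and ld1
]
print(schedule(program))
```

The grouping is fixed once `schedule` returns, which is the trade-off under discussion: nothing at run time can regroup the instructions if, say, the load turns out to be slow.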

This is one of IA-64's greatest strengths, and it is also one of IA-64's greatest weaknesses.

But the logic that the Itanium CPU’s use is far more advanced than that of IA-32 architecture that is in the P4’s and A64’s, and that’s where IA-64 in the instruction sense will do very very well. Good logic and good compilers make CPU’s scream. Hence when an 800MHz IA-64 can outperform a 10+GHz Pentium 4, if such a creature existed under ideal conditions.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
July 19, 2004 4:34:17 AM

well you know what, for the money and the segment itanium is in, it better produce, it really does need to make a statement.
July 19, 2004 11:14:25 AM

Looking at its SpecFP score, it does. The 1.6 GHz 6MB L3 cache Madison scores almost double the fastest P4EE on SpecFP.

Although the current crown belongs to IBM's Power5 with an amazing ~2700 peak in SpecFP, Montecito should narrow the gap or even surpass the Power5.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
July 19, 2004 3:03:25 PM

Montecito has already been taped out, has it not?

Anyway, the pathetic amount of 1.7 billion transistors is indeed mostly due to cache. With 1.7 billion transistors, you could fit in there something like 8-16 Dothan cores or so... Which is admittedly much harder to do than cache, but would be much more interesting anyway!
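
As a rough sanity check on that 8-16 estimate, here is a back-of-the-envelope sketch. The Dothan figure of roughly 140 million transistors (which includes its 2 MB L2 cache) is an assumption brought in from outside this thread, and the calculation ignores interconnect, shared cache and everything else a real multi-core die would need.

```python
# Back-of-the-envelope estimate only.
# ASSUMPTION (not from this thread): a Dothan die is roughly 140 million
# transistors, L2 cache included.
MONTECITO_TRANSISTORS = 1.7e9
DOTHAN_TRANSISTORS = 140e6

def cores_that_fit(budget: float, per_core: float) -> int:
    """How many whole copies of a core fit in a given transistor budget."""
    return int(budget // per_core)

print(cores_that_fit(MONTECITO_TRANSISTORS, DOTHAN_TRANSISTORS))
```

Under that assumption the result falls inside the 8-16 range quoted above.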

<i><font color=red>You never change the existing reality by fighting it. Instead, create a new model that makes the old one obsolete</font color=red> - Buckminster Fuller </i>
July 19, 2004 5:43:48 PM

yeah and you'd think that would be enough, but you still don't see mass adoption of it, you only see some quiet dropping of support from a few vendors.
July 20, 2004 2:51:32 AM

Itanium is becoming pretty popular in the high-end areas. It's killed off Alpha, PA-RISC, MIPS and pretty much Sparc. The only thing left is IBM and Power. I'm not sure how likely it is to gain "mass acceptance", by which I assume you mean consumer desktop, but it was never designed to.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
July 20, 2004 5:35:32 PM

>Hence when an 800MHz IA-64 can outperform a 10+GHz Pentium
>4, if such a creature existed under ideal conditions.

For every benchmark you post that achieves that, I will post one where a 300 MHz G3 based Apple iMac outperforms a 4 way Itanium 1.5/6M.

BTW, static scheduling is IMHO really not Itanium's biggest strength; if anything, it could be one of the main weaknesses of current implementations. Don't be too surprised if Tukwila or some other upcoming IPF variant does away with it and implements OoO. Its biggest strength is its simplicity/size (of the core) and its massive FP resources (as well as unearthly amounts of cache that would even make a VIA C3 fly).

BTW2, did you see the Power5 spec scores ? So much for IPF being king of Spec.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 20, 2004 5:52:08 PM

>Itanium is becomming pretty popular in the high-end areas

Hmmm.. it's doing okayish in the highly volatile HPC market, where its FP performance shines, but it's not "breaking any pots" (Dutch expression) in the commercial market yet.

>It's killed off Alpha,

It's DEC and Compaq that killed it, not Itanium. Definitely not Itanium's performance or market acceptance; Alpha was killed before IPF even taped out.

>PA-RISC

Also here, it's HP that sacrificed PA-RISC, not customers opting for IA64 instead of PA-RISC. Not quite an achievement you can credit IPF for.

> MIPS

MIPS was killed by MIPS :) 

>and pretty much Sparc.

Don't rule out SPARC just yet. Latest Sparc64's are putting up a good fight, and Fujitsu's roadmap looks mighty impressive. I think Sun(jitsu) did the smart thing with dumping UltraSPARC and switching to Sparc64 instead. Too little, too late? Maybe, but I ain't betting on it yet. SPARC has a huge installed base, and even now with abysmal performance compared to its competitors it's still selling. Neither am I taking bets on Niagara or Rock btw..

> I'm not sure how likely it is to gain "mass acceptance",

Well, for its primarily intended market (commercial enterprise servers), it's really not hugely successful yet, no matter how you look at it.

>by which I assume you mean consumer desktop, but it's never designed to.

The eternal question :D . Itanium as a chip, no, definitely not. IPF/IA64 as an ISA? Yes, definitely. Itanium should have been the "Pentium Pro of the P2/3's". The numbers don't lie: the amount of money Intel poured into IPF just isn't warranted if it was destined to remain a high-end niche platform indefinitely. The volume (in CPUs, not systems!) just isn't there to recover the billions spent. Intel's reluctance to extend x86 to 64 bits should be quite a clear indication as well. IA64 was meant to replace x86 when and where 64-bit became an issue; that is something I am 100% convinced about.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 20, 2004 6:19:54 PM

Quote:
For every benchmark you post that achieves that, I will post one where a 300 MHz G3 based Apple iMac outperforms a 4 way Itanium 1.5/6M.

Under ideal, well-optimized code native to each processor? If that’s true, I want to see it.

Quote:
BTW, static scheduling is IMHO really not Itanium's biggest strength; if anything, it could be one of the main weaknesses of current implementations. Don't be too surprised if Tukwila or some other upcoming IPF variant does away with it and implements OoO. Its biggest strength is its simplicity/size (of the core) and its massive FP resources (as well as unearthly amounts of cache that would even make a VIA C3 fly).

Why would static scheduling be an issue? The compiler is building the binaries as best as it sees fit. All the chip has to do is run it, which saves machine ticks and logic computation time. Win-win on resource management IMO.

I have to disagree, it's all in the logic; you can have a machine with half the transistors of the Itanium and have superb logic in it. With good code and compilers the machine will haul ass regardless.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
July 20, 2004 6:46:43 PM

>Why would static scheduling be a issue? The compiler is
>building the binaries as best as it sees fit. All the chip
>has to do is run it, which saves machine ticks and logic
>computation time. Win win on resource management IMO

Nope. No matter how good the compiler, you can only predict so much at compile time. At run time, workloads come into play which cannot be predicted. Why do you think just-in-time compilers like those used with Java or .NET are as fast as, or sometimes faster than, statically compiled binaries, even though the code has to be compiled first and not just executed? Execution time of Java code using a good JIT is significantly lower than statically compiled C++ code, think about that one. The reason is simple: JIT compilers "know" more than any static compiler ever could, even when using PGO.

>I have to disagree, its all in the logic; you can have a
>machine with half the transistors as the Itanium and have
>superb logic in it.

Question is, what do a few hundred thousand transistors to enable OoO matter these days? Millions of transistors are spent on cache or other features that may give a few percent performance boost. By comparison, OoO looks pretty damn cheap to me. I firmly expect Itanium to embrace it one day; predication doesn't rule out OoO at all (though it might well end up being quicker by not using the predicate features, ironically enough).

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 20, 2004 7:23:02 PM

Quote:
No matter how good the compiler, you can only predict so much at compile time. At run time, workloads come into play which can not be predicted.

IA-64's speed comes from programming concepts that were totally unheard of in Intel-based PC hardware design when Merced was announced back in 1994. No longer is the responsibility of speeding things up delegated to silicon logic alone. IA-64 now allows, or rather requires, software to convey hardware usage logic directly to the CPU. IA-64 accomplishes this by redefining the instruction format into EPIC design, whereby the very nature of instruction encodings tell the CPU which parts of the chip will be used to process data. This has the serious side-effect of relegating instruction ordering, logic unit usage, and optimization techniques directly to the compiler or Assembly programmer's back

It’s not stupid silicon; in fact, it’s the smartest logic Intel has ever created, and the apparent pitfall really isn’t that serious.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
July 20, 2004 7:34:52 PM

LOL, you should really stop reading all that Intel PR. Since when is it "radically new" to conceive a CPU that requires optimized software for best performance? LMAO. It's also not something to be proud of either; it's just another way of saying it sucks at typical, less than optimal code, also called "most real world code".

I can see the VIA C4 announcement: "Using radically new design paradigms and leveraging the ever increasing performance and still untapped potential of better and smarter compilers, VIA designed the groundbreaking C4 with half the number of physical registers, half the cache and only 2/3 of the clockspeed of the C3. Yet using the latest, greatest compilers, PGO and handwritten optimized code, it can achieve similar or better performance than the otherwise identical C3 using 10 year old GCC compilers. This fantastic new technological breakthrough from VIA allows you to do more with less, and the only price you pay is that it doesn't run your existing binaries anymore!"

I'm sure if Intel wrote such a PR piece, you'd buy into it..

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 20, 2004 7:43:31 PM

Oh btw, you nicely ignored my JIT argument against static compilation. In 1994 no one would have believed you if you claimed dynamically compiled JIT or bytecode could ever be faster than statically compiled (PGO) code either.... Intel's idea might have been a good one ten or twenty years ago, when transistor count was expensive and OoO weighed heavily on both the transistor budget and complexity, but today, when $2 microcontrollers can do OoO, omitting it in favor of predication is an anachronism. Thank God it's not a cornerstone of IPF, or it would be dead as a duck.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 20, 2004 8:13:34 PM

Quote:
LOL, you should really stop reading all that Intel PR. Since when is it "radically new" to conceive a CPU that requires optimized software for best performance? LMAO. It's also not something to be proud of either; it's just another way of saying it sucks at typical, less than optimal code, also called "most real world code".

But of course, you must be an IC engineer or an assembly writer to comment on the product. I will leave you be, then.

Quote:
Oh btw, you nicely ignored my JIT argument against static compilation.

I don’t know anything about JIT compilers other than they don’t seem to do what they promise most of the time *cough* Nvidia *cough*.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
July 20, 2004 8:43:16 PM

when i said mass acceptance, i meant in the workstation/server market it was originally targeted to take. today it looks like it will be held to the highest end hpc market. intel wanted this chip to drive into the workstation market, and i wouldn't be surprised if they also looked to bring a derivative of it to the desktop market when they thought 64bit was ready. it's obvious they wanted IA-64 to take off, not something like AMD64 or EM64T.
July 20, 2004 9:35:07 PM

>I don’t know anything about JIT compilers other than they
>don’t seem to do what they promise most of the
>time*cough*Nvidia*cough*.

A Just-In-Time compiler compiles at run time. So instead of the traditional way, where you write code in C++ or any other high level language, then compile it (link it,..) and thereby turn it into a binary file (machine code) which can be executed by the CPU, a JIT will compile (byte)code as it is needed, *just* before executing it. There is no persistent intermediate binary file that gets generated beforehand.

That a JIT can (and does) outperform a static compiler is rather amazing if you consider it has to compile the code first, before being able to execute it, and has far less time to optimize than a static (PGO) compiler that can take its time optimizing the code for as long as it wants. Yet compilation+execution of bytecode using a JIT is often as fast as or faster than only execution of (e.g.) C++ compiled binary code that may have spent countless hours in a PGO compiler at some point.
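
The underlying idea (observe what the program actually does at run time, then swap in a faster path for the hot case) can be sketched in a few lines of Python. This is only an illustration of runtime profiling plus specialization; it is not how any real JVM or .NET runtime is implemented, and `HotPathDispatcher`, `generic_eval` and the threshold of 3 are hypothetical names and values.

```python
# Toy illustration of runtime profiling + specialization, the idea behind a JIT.
# Not a model of any real JVM/CLR; all names here are hypothetical.
from collections import Counter

class HotPathDispatcher:
    def __init__(self, generic, specializations, threshold=3):
        self.generic = generic                  # slow, fully general path
        self.specializations = specializations  # op -> specialized fast version
        self.threshold = threshold              # calls before an op counts as "hot"
        self.profile = Counter()                # call counts gathered at run time
        self.installed = {}                     # fast paths switched on so far

    def __call__(self, op, a, b):
        fast = self.installed.get(op)
        if fast is not None:
            return fast(a, b)                   # hot op: skip the generic decode
        self.profile[op] += 1
        if self.profile[op] >= self.threshold and op in self.specializations:
            self.installed[op] = self.specializations[op]
        return self.generic(op, a, b)

def generic_eval(op, a, b):
    # Slow path: re-decides what to do on every single call.
    if op == "add":
        return a + b
    if op == "mul":
        return a * b
    raise ValueError(op)

dispatch = HotPathDispatcher(generic_eval, {"add": lambda a, b: a + b}, threshold=3)
results = [dispatch("add", i, 1) for i in range(5)]
print(results, sorted(dispatch.installed))
```

A static compiler has to guess which operations will be hot; the dispatcher above simply counts them, which is the sense in which a runtime "knows" more.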

The reason is what I described above, and it's the same reason predication isn't faster than OoO, and really not such a terrific idea. At least not today.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 21, 2004 12:53:25 AM

Thx for the heads up on JIT compilers, I really don't have anything to say other than Intel will eventually build a compiler for IA-64 as they have for IA-32 machines.

Xeon

<font color=red>Post created with being a dickhead in mind.</font color=red>
<font color=white>For all emotional and slanderous statements contact THG for all law suits.</font color=white>
July 21, 2004 1:38:52 AM

Quote:
BTW, static scheduling is IMHO really not Itanium's biggest strength; if anything, it could be one of the main weaknesses of current implementations. Don't be too surprised if Tukwila or some other upcoming IPF variant does away with it and implements OoO. Its biggest strength is its simplicity/size (of the core) and its massive FP resources (as well as unearthly amounts of cache that would even make a VIA C3 fly).


Somewhat true but not. If by "massive FP resources", you mean 2 FMAC units, those can be found in the PPC970 and the Power4/Power5. With the exception of SMT-enabled, dual-core Power5 with its massive 24MB of L3 cache, none of the rest, with the same amount of execution hardware and much more advanced OoOE engine, even came close to the FP execution rate of Itanium.

Explicit parallelism is simply very good for your average FP intensive code. There's usually a lot of parallelism, latencies don't matter much, so you can effectively hide them by scheduling and interleaving instructions, and it's all usually pretty predictable. Very suitable for static compilation. The complex nature of FP instructions, however, makes it very difficult to extract parallelism out of them on-the-fly, especially in the high-speed MPU's of today. VLIW definitely benefits here. Why do you think practically all GPU's use a VLIW native ISA?

Quote:
Nope. No matter how good the compiler, you can only predict so much at compile time. At run time, workloads come into play which cannot be predicted. Why do you think just-in-time compilers like those used with Java or .NET are as fast as, or sometimes faster than, statically compiled binaries, even though the code has to be compiled first and not just executed? Execution time of Java code using a good JIT is significantly lower than statically compiled C++ code, think about that one. The reason is simple: JIT compilers "know" more than any static compiler ever could, even when using PGO.


JIT's have been often faster than statically compiled binaries in long-running programs such as databases or webservers. This is because they have a long time to profile the code and machine and determine how best to compile. This, however, doesn't mean they're best for everything. While it is possible for JIT's to be faster than static compilation, in the majority of applications out there, it simply isn't, at least, not modern JIT's. Future implementations may bring better results.

Quote:
Question is, what do a few hundred thousand transistors to enable OoO matter these days? Millions of transistors are spent on cache or other features that may give a few percent performance boost. By comparison, OoO looks pretty damn cheap to me. I firmly expect Itanium to embrace it one day; predication doesn't rule out OoO at all (though it might well end up being quicker by not using the predicate features, ironically enough).


Not all transistors are created equal. Transistors devoted to logic are much much more power-hungry and prone to failure than those devoted to cache. You could effectively put millions and millions of transistors towards cache and, for the same amount of power/heat, you may only be able to afford a few tens of thousands for logic (depending on what type of logic, OoOE logic is usually pretty complex and power hungry). And OoOE logic is not a constant-size in terms of complexity. The wider your MPU or the longer your pipeline, the more complex your OoOE logic needs to be. Can you imagine having to do register renaming on the already existing 128 programmable registers defined in IA-64?
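
For readers who haven't seen register renaming, a minimal sketch of the mechanism follows. It is a toy under stated assumptions: the `rename` function, the 8-architectural/16-physical sizes and the three-instruction program are made up for illustration, and real rename hardware also has to free registers, checkpoint the map for mispredicted branches, and so on. It does not model the cost argument above; it only shows what the map table does, and why that table needs one entry per architectural register (8 on x86, 128 on IA-64).

```python
# Toy register renamer: map architectural registers to fresh physical ones.
# A sketch only; real hardware also frees registers, checkpoints the map, etc.
from typing import List, Tuple

def rename(program: List[Tuple[int, int, int]],
           num_arch: int, num_phys: int) -> List[Tuple[int, int, int]]:
    """program holds (dest, src1, src2) architectural register indices."""
    if num_phys < num_arch:
        raise ValueError("need at least one physical register per architectural one")
    mapping = list(range(num_arch))          # one map entry per architectural reg
    free = list(range(num_arch, num_phys))   # physical registers not yet in use
    renamed = []
    for dest, s1, s2 in program:
        p1, p2 = mapping[s1], mapping[s2]    # read sources through the current map
        if not free:
            raise RuntimeError("out of physical registers: the pipeline would stall")
        new = free.pop(0)
        mapping[dest] = new                  # later readers of dest see the new copy
        renamed.append((new, p1, p2))
    return renamed

# r1 = r2 + r3 ; r2 = r4 + r5 (write-after-read on r2) ; r1 = r2 + r1 (write-after-write on r1)
prog = [(1, 2, 3), (2, 4, 5), (1, 2, 1)]
print(rename(prog, num_arch=8, num_phys=16))
```

After renaming, the second instruction no longer has to wait for the first one to read r2, because it writes to a different physical register.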

There are only a few things dynamic optimization can handle that static can't, one of which is latency hiding. However, with even cache latencies reaching 20+ cycles nowadays, OoOE simply can't help hide that anymore. Its benefits grow smaller and smaller.

Quote:
The reason is what I described above, and it's the same reason predication isn't faster than OoO, and really not such a terrific idea. At least not today.


Again, depends on the complexity of instructions and the effectiveness of a dynamic OoOE engine. An MPU's reorder window only contains maybe 80 instructions on the most advanced scheduler I know of (Prescott), and even then, your average ILP achieved is what, 2-3 IPC? Dynamic optimization is simply very limited and, while they may improve in time, they're currently outclassed by static compilation in all but a few tasks (business-class applications that run for days and allows JIT's to profile and optimize).
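
The point about a limited reorder window can be illustrated with a toy simulation. Everything in it is a simplifying assumption: single-cycle latency for every instruction, a window defined as the oldest not-yet-issued instructions, and an 8-instruction program made of two independent dependency chains. It says nothing about real Prescott numbers; it only shows that the scheduler cannot exploit parallelism it cannot see.

```python
# Toy model of a window-limited out-of-order scheduler (single-cycle latency).
# All parameters and the example program are made up for illustration.
from typing import List

def simulate_ipc(deps: List[List[int]], window: int, width: int) -> float:
    """deps[i] lists the earlier instructions that instruction i depends on.
    Each cycle, up to `width` ready instructions are issued, chosen from the
    `window` oldest instructions that have not been issued yet."""
    if window < 1 or width < 1:
        raise ValueError("window and width must be at least 1")
    n = len(deps)
    if n == 0:
        return 0.0
    done_cycle = [0] * n                     # 0 = not issued yet; cycles start at 1
    cycle = 0
    remaining = n
    while remaining:
        cycle += 1
        waiting = [i for i in range(n) if done_cycle[i] == 0][:window]
        ready = [i for i in waiting
                 if all(0 < done_cycle[d] < cycle for d in deps[i])]
        for i in ready[:width]:
            done_cycle[i] = cycle
            remaining -= 1
    return n / cycle

# Two independent chains of four instructions each, laid out one after the other.
deps = [[], [0], [1], [2], [], [4], [5], [6]]
for w in (1, 4, 8):
    print(w, simulate_ipc(deps, window=w, width=2))
```

In this model the second chain only helps once the window is large enough to reach it, whereas a compiler could have interleaved the two chains in the instruction stream up front.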

But all of this is really moot now. Single-threaded performance has been pretty much pushed to its limit. Either via really wide designs such as IA-64 or long-but-narrow designs such as Netburst. The future (as IBM recently showed) is CMP/SMT. And with power limitations being what they are, a simple, less power-hungry core like IA-64 implementations, has many advantages over x86 MPU's. You could literally make a dual-core Itanium that isn't much bigger than today's (considering just how much of it is cache anyway). The same could not be said about a dual-core Prescott or Dothan though.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
July 21, 2004 7:10:04 AM

> With the exception of SMT-enabled, dual-core Power5 with
>its massive 24MB of L3 cache, none of the rest, with the
>same amount of execution hardware and much more advanced
>OoOE engine, even came close to the FP execution rate of
>Itanium.

While I'm not disagreeing on your main point here, it is a bit disingenuous to put it that way, since SPECFP scores are single threaded, so neither Power5's multithreading nor multicore abilities matter here. The L3 likely helps a lot, but it's off-die, hence quite a bit slower than IPF's large on-die caches.

>VLIW definitely benefits here. Why do you think practically
>all GPU's use a VLIW native ISA?

VLIW is not directly related to being inline or predication, which was the topic here. Don't know enough about modern GPU's to comment though, are they inline (I think, but not sure), or do they support predication (think not, but again, not sure)?

>JIT's have been often faster than statically compiled
>binaries in long-running programs such as databases or
>webservers. This is because they have a long time to
>profile the code and machine and determine how best to
>compile.

"long time" may be a bit misleading here, actually they have much less time to compile than a static compiler, but what you meant to say is exactly my point. In spite of the lack of time to optimize, the generated code is typically considerably faster because at run time you can do optimizations and profiling you can not do at compile time. This is my whole argument to state static scheduling (predication) will never be optimal, no matter how good the compiler, so purely relying on it for extracting ILP, and omitting OoOE is not a silver bullet. I'd WAG an itanium without predicate registers but with OoO and decent branch prediction logic would be (considerably) faster than current Itaniums at the expense of moderately higher complexity.

>This, however, doesn't mean they're best for everything.
>While it is possible for JIT's to be faster than static
>compilation, in the majority of applications out there, it
>simply isn't, at least, not modern JIT's. Future
>implementations may bring better results.

I think you should give current JIT's a second look. Any benchmark/review I've seen lately gives roughly equal and often better performance using Java/JIT (or .NET) than compiled C++. Keep in mind, Java/JIT execution times include the compilation AND garbage collection, which indicates the execution time as such is faster with JIT than statically compiled code for most (if not all) code out there. That being said, I'm not sure how much time a JIT compiler spends on compilation and garbage collection percentage-wise, but I assume it's nontrivial.

>Not all transistors are created equal.

True, but I was talking about core logic anyhow, not cache.

> The wider your MPU or the longer your pipeline, the more
>complex your OoOE logic needs to be.

Yes, but Itanium is anything but deeply pipelined. It is quite wide though. 128 registers is quite a bit, but I don't think making Itanium OoO would be more costly for Itanium than for, say, the P4. Time will tell, but I expect IPF to implement OoO over the next few years; predication just isn't a substitute, and with each process shrink, OoO gets even cheaper.

>Dynamic optimization is simply very limited and, while they
>may improve in time, they're currently outclassed by static
>compilation in all but a few tasks

Thing is, they are anything but mutually exclusive! I also disagree with your premises. Any current cpu design goes to extreme lengths to avoid pipeline stalls for good reason..

>But all of this is really moot now. Single-threaded
>performance has been pretty much pushed to its limit

Single threaded performance still matters, and OoOE is one way to improve it that has been exploited by every (modern) architecture except IPF. It's low hanging fruit IMHO..

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 21, 2004 8:47:51 AM

Quote:
While I'm not disagreeing on your main point here, it is a bit disingenuous to put it that way, since SPECFP scores are single threaded, so neither Power5's multithreading nor multicore abilities matter here. The L3 likely helps a lot, but it's off-die, hence quite a bit slower than IPF's large on-die caches.

Not quite. SpecFP is written without any multithreaded code; however, compilers are allowed to auto-parallelize and generate threads (such as with ICC 8.0). This has increased the P4's SpecFP score with HT over non-HT scores, although the SMT implementation on the P4 is nowhere near as performance-oriented as on the Power5.

As for the L3 cache of the Power5, it has been improved significantly. Quite significantly actually. The L3 latency of the old Power4 was about twice as high as that of McKinley/Madison. Word is that the Power5's L3 cache has close to 1/3 the latency of the Power4's.

Quote:
VLIW is not directly related to being inline or predication, which was the topic here. Don't know enough about modern GPU's to comment though, are they inline (I think, but not sure), or do they support predication (think not, but again, not sure)?

VLIW does explicit parallelization upon code generation, which, in turn, would require that compilers use some form of prediction/inlining. Of course, most modern GPU's use a JIT (drivers) to optimize for their VLIW architectures (something .Net will be able to do much more effectively with IA-64 than with scalar architectures).

Quote:
"long time" may be a bit misleading here, actually they have much less time to compile than a static compiler, but what you meant to say is exactly my point. In spite of the lack of time to optimize, the generated code is typically considerably faster because at run time you can do optimizations and profiling you can not do at compile time. This is my whole argument to state static scheduling (predication) will never be optimal, no matter how good the compiler, so purely relying on it for extracting ILP, and omitting OoOE is not a silver bullet. I'd WAG an itanium without predicate registers but with OoO and decent branch prediction logic would be (considerably) faster than current Itaniums at the expense of moderately higher complexity.

Erm, long time being runtime. JIT's don't have much time to compile but they have all the time in the world to profile. If you're doing the same loop over and over again, the JIT may be slower for the first few iterations, but given a long enough runtime, it can adapt its compiling methods to perform better. This requires, of course, that the application has a long runtime (and is not something like an office application).

And as I've already mentioned, OoOE's benefits are becoming smaller and they are, by no means, as powerful as profiling done by JIT's. The level of profiling and optimization done by JIT's is simply beyond what hardware can do. Your window for optimizing using OoOE is about 80 or so instructions for the most advanced scheduler that I know of out there (Prescott). Your JIT can optimize the entire code structure of an arbitrary size. The more advanced you make your dynamic optimization methods, the more complexity and heat you'll have on-chip. Running it in software, however, costs you nothing except memory footprint (something there is plenty of) and possibly some processor resources (but considering just how much of a modern MPU's die remains idle, I'd say there's much to spare).

Again, JIT's along with a VLIW, simple core architecture is a proven method under certain applications (3D graphics and rendering). All modern GPU's use this method.

Quote:
I think you should give current JIT's a second look. Any benchmark/review I've seen lately gives roughly equal and often better performance using Java/JIT (or .NET) than compiled C++. Keep in mind, Java/JIT execution times include the compilation AND garbage collection, which indicates the execution time as such is faster with JIT than statically compiled code for most (if not all) code out there. That being said, I'm not sure how much time a JIT compiler spends on compilation and garbage collection percentage-wise, but I assume it's nontrivial.

Again, for what applications? For databases and webservers? Definitely. For office applications and/or scientific computing? Hardly. I say bring on the benchmarks if you truly have them. I'm a heavy Java programmer myself and I'll admit, most applications I write are more sluggish than their c++/ICC counterparts.

Quote:
Yes, but Itanium is anything but deeply pipelined. It is quite wide though. 128 registers is quite a bit, but I don't think making Itanium OoO would be more costly for Itanium than for, say, the P4. Time will tell, but I expect IPF to implement OoO over the next few years; predication just isn't a substitute, and with each process shrink, OoO gets even cheaper.

The P4's nowhere near as wide as McKinley/Madison and replicating 8 programmable registers using 128 physical registers is a lot *lot* simpler than replicating 128 programmable registers for register renaming. Logic required to replicate register files while still maintaining flat access grows *exponentially*. And looking at Netburst, such things only serve to cause greater power consumption.

Quote:
Thing is, they are anything but mutually exclusive! I also disagree with your premises. Any current cpu design goes to extreme lengths to avoid pipeline stalls for good reason..

And achieves very (relatively) little. Optimizations upon compile can speed your program up by many folds. By factors of 10's perhaps using a good compiler with the proper optimizations. OoOE on even the most advanced scheduler out there generates perhaps 2-3 IPC worth of ILP if it's lucky and that's *with* a lot of optimizations that the compiler has to do to tweak the instruction stream. This compared to the in-order, scalar designs of the Pentium days offers perhaps a 2-3 fold improvement in ILP *with* proper static optimizations upon compile time.

You tell me which brought more improvement.

Quote:
Single threaded performance still matters, and OoOE is one way to improve it that has been exploited by every (modern) architecture except IPF. It's low hanging fruit IMHO..

The costs for OoOE keep growing for wider or longer designs and the benefits are growing smaller and smaller relatively. Especially as compiler technology improves. It simply isn't worth it anymore for the most part.

A good JIT in conjunction with a very open ISA (VLIW) will be much better for much of today's performance-demanding applications (multimedia, scientific computing, 3D graphics, etc.). IA-64 offers the hardware for this; only time will tell whether Intel invests in runtimes to take advantage of that.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
July 21, 2004 9:37:34 AM

>Not quite. SpecFP is written without any multithreaded
>code, however, compilers are allowed to auto-parallelize
>and generate threads (such as with ICC 8.0). This has
>increased the P4's SpecFP score with HT over non-HT scores.

Are you sure? Check this:
<A HREF="http://www.spec.org/osg/cpu2000/results/res2004q2/cpu20..." target="_new"> specFP score Dell </A>. It's the highest P4 SpecFP I could find, and HT is DISABLED. SMT/CMT helped on older Spec95, but current compilers have not yet enabled any speedups for Spec2000 using SMT.

> the first few iterations, the JIT may be slower, but given
>a long enough runtime, the JIT can adapt its compiling
>methods to perform better. This requires, of course, that
>the application has a long runtime (and is not something
>like an Office Applications)

That's nonsense. "Long runtime" has nothing to do with it. Profiling can happen after just 1 or 2 cycles, much in the same way a trace cache helps, and not only for "long", repetitive tasks.

> For office applications and/or scientific computing?
>Hardly. I say bring on the benchmarks if you truly have
>them.

http://www.idiom.com/~zilla/Computer/javaCbenchmark.htm...
http://www.javaworld.com/javaworld/jw-02-1998/jw-02-jpe...
http://sys-con.com/story/?storyid=45250&DE=1

One quote: "JVM startup time was included in these results. That means even with JVM startup time, Java is still faster than C++ in many of these tests."

Let the myth that Java is slow rest now. It isn't, and not only for web serving or databases. .NET shows similar or even better results.

For your other arguments, let's just agree to disagree. Time may tell...

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 21, 2004 10:04:06 AM

Oh, and one more thing: most of those links are pretty old, 2001 or earlier. JITs have been improving by leaps and bounds...

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 22, 2004 1:03:57 AM

Quote:
Are you sure? Check this:
specFP score Dell. It's the highest P4 SpecFP I could find, and HT is DISABLED. SMT/CMT helped on the older Spec95, but current compilers have not yet enabled any speedups for Spec2000 using SMT.


Replaced by this (http://www.spec.org/cpu2000/results/res2004q3/cpu2000-2...) with SMT enabled. The SMT-enabled 3.4EE (http://www.spec.org/cpu2000/results/res2004q3/cpu2000-2...) also scores higher. Notice that all the recent submissions have enabled SMT and use ICC 8.0.

Quote:
That's nonsense. "Long runtime" has nothing to do with it. Profiling can happen after just 1 or 2 cycles, much in the same way a trace cache helps, and not only for "long", repetitive tasks.


Erm, not in software it can't. It takes more than 1 or 2 cycles for an application to even receive back the result of a calculation, let alone figure anything out about it. Fine-granularity dynamic optimizations can occur in hardware only. How would a JIT profile code that has only issued 2 instructions? How could it even process the instructions to do that profiling? JITs aren't run in parallel with the main application (perhaps on SMT systems, but that hardly means they run exactly in parallel); they're switched in and out. To properly optimize an application for the specific architecture, long-term profiles need to be collected (in order to find the critical section), and optimization occurs there.
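
To illustrate what "long-term profiles" means in practice, here is a minimal sketch of counter-based hot-spot detection, the general idea behind tiered JITs. Everything in it is an assumption made up for illustration: the `HotSpotProfiler` class, the `threshold` value, and the "optimized" set that stands in for real recompilation. None of it is taken from an actual JVM or runtime.

```python
from collections import Counter


class HotSpotProfiler:
    """Toy model: count calls and only 'optimize' code that proves hot."""

    def __init__(self, threshold=1000):
        self.threshold = threshold  # assumed value, purely illustrative
        self.calls = Counter()      # per-function invocation counts
        self.optimized = set()      # functions promoted to the fast tier

    def run(self, name, func, *args):
        self.calls[name] += 1
        # The profile only becomes meaningful after many invocations.
        if name not in self.optimized and self.calls[name] >= self.threshold:
            self.optimized.add(name)  # stand-in for recompiling with optimizations
        return func(*args)


def square(x):
    return x * x


profiler = HotSpotProfiler(threshold=1000)

# After only two calls there is nothing to base an optimization on.
for i in range(2):
    profiler.run("kernel", square, i)
early = "kernel" in profiler.optimized

# After many more calls the function crosses the threshold.
for i in range(1000):
    profiler.run("kernel", square, i)
late = "kernel" in profiler.optimized
```

In this toy model the decision to optimize is driven purely by how often a function has run, which is why a handful of executions gives the profiler nothing to work with.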

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
July 22, 2004 1:23:07 AM

Quote:
imgod2u

Yes you are, good sir.

Xeon

July 22, 2004 6:30:23 AM

>Replaced by this with SMT enabled.

Indeed. Hmm, well at least grant me this is a VERY recent development; I don't read spec submissions every few weeks :) Interesting though, I'd like to see how it performs without HT on the same compiler. Also, while ICC might be able to take some advantage of SMT on single threaded FP code, I am unsure if IBM's compilers are as clever.

>Erm, not in software it can't. It takes more than 1 or 2
>cycles for an application to even receive back the result
>of a calculation

I didn't mean that as in "clock cycles", obviously. I should have used the word "iterations" instead. Still, the links ought to show that JIT performance isn't any worse than statically compiled code.

= The views stated herein are my personal views, and not necessarily the views of my wife. =
July 22, 2004 7:08:14 AM

Quote:
Indeed. Hmm, well at least grant me this is a VERY recent development; I don't read spec submissions every few weeks :) Interesting though, I'd like to see how it performs without HT on the same compiler. Also, while ICC might be able to take some advantage of SMT on single threaded FP code, I am unsure if IBM's compilers are as clever.

Yes, but then again, the Power5's submissions are more recent than any.

As for IBM's compilers, I'm not sure of the exact details, but the core of the Power5 remains almost the same as the Power4. Other than the faster cache, SMT and multi-core, I'm not sure what else could account for such a dramatic increase in SpecFP scores.

Quote:
I didn't mean that as in "clock cycles", obviously. I should have used the word "iterations" instead. Still, the links ought to show that JIT performance isn't any worse than statically compiled code.

The links showed that under certain circumstances, the best JIT performance is about on par with less-than-aggressive optimizations using GCC rather than ICC on a Pentium 4, a target that ICC has traditionally been much better at optimizing for. It does show promise, but that hardly qualifies as the best static vs the best JIT.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
July 22, 2004 7:49:42 AM

> I'm not sure what else could account for such a dramatic
>increase in SpecFP scores.

Note that it's not only SpecFP scores that have increased by such a magnitude; TPC-C, SpecJBB, SAP and other scores are also out of this world.

> It does show promise, but that hardly qualifies as the
>best static vs the best JIT.

The initial point was that JITs often create code that runs faster than the output of static compilers. If JIT performance, including VM launch time and compilation time, is overall on par with the execution time of statically compiled code, I think this point is proven.
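
A back-of-the-envelope model makes the "on par overall" reasoning concrete. The function names and every number below are made-up assumptions in arbitrary time units, not measurements from the linked benchmarks.

```python
import math


def total_time_jit(startup, compile_time, per_iter, iterations):
    """Total JIT run time: one-off VM startup and compilation, plus the work itself."""
    return startup + compile_time + per_iter * iterations


def total_time_static(per_iter, iterations):
    """Total run time of statically compiled code: just the work itself."""
    return per_iter * iterations


def break_even_iterations(startup, compile_time, jit_per_iter, static_per_iter):
    """Smallest iteration count at which the JIT run is no slower overall.

    Returns None if the JIT-generated code is not faster per iteration,
    in which case the one-off overhead is never paid back.
    """
    gain = static_per_iter - jit_per_iter
    if gain <= 0:
        return None
    return math.ceil((startup + compile_time) / gain)


# Hypothetical figures: the JIT code is faster per iteration but pays overhead up front.
n = break_even_iterations(startup=0.5, compile_time=0.25,
                          jit_per_iter=0.75, static_per_iter=1.0)
```

Under this model, being on par in total time despite the fixed overhead implies that the generated code itself is faster per iteration.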

= The views stated herein are my personal views, and not necessarily the views of my wife. =