Intel's Pentium Performance Hangs on a Hyper-Thread
1. Introduction

Intel has quietly begun to deviate from its self-perpetuated myth that processor performance is based on clock speed alone: Intel's new 3.06 GHz Pentium 4 offers Hyper-Threading (HT), a technology that Intel says delivers up to a 24% performance gain independent of clock speed. The launch comes two years after Intel debuted the Pentium 4 with an excessively long 20-stage pipeline that boosted clock speed to the detriment of performance per clock.

Intel will likely continue to spend a substantial portion of its billion-dollar marketing budget to keep alive, in the average consumer's mind, the myth that processor performance is measured by clock speed alone. But after the debut of HT technology in its desktop CPU line, incremental performance gains in new Pentium processors will no longer be measured by pure megahertz levels, Intel said.

Enter Hyper-Threading

Simply put, HT allows one physical processor to serve as two logical processors, while the OS and other programs are tricked into thinking that there are actually two working physical processors. The benefits are twofold: multitasking improves, and individual multithreaded applications can speed up as well. When multitasking, HT will allow you to take just about any combination of desktop applications, run them simultaneously, and get some level of benefit, depending on the applications you are running, as measured by task-completion benchmarks.

For example, say you put together a slide show with music. While the music encodes in the background, you manipulate images in the foreground. With HT, the background encoding finishes faster while the imaging application in the foreground stays quick and responsive. In the past, you may have had to wait in an unresponsive environment, and in some cases the foreground application may have ground to a halt.

For office applications, virus scanning and encryption cranked up to the maximum consume a great deal of computing power. When you are working on basic tasks in the foreground with a virus scan running in the background, opening a large PowerPoint file can take a long time. With HT, opening the presentation might take just a few seconds instead of the several minutes it takes without HT.

2. Unintentional Benefits For HT-ready Applications

Several applications that exist today, such as Adobe Photoshop and Windows Media Decoder, will benefit from HT even though their developers were not aware of the potential benefit when writing them.

While many applications were written specifically for multithreaded, dual-processor systems, some gained multiple threads simply as a convenience for the developers debugging the code. Because those developers had dual-processor workstations and were writing with multiple threads, the debugging process went faster. End users on a single-processor machine obviously didn't see any benefit from that, but once HT is turned on, they do. All of a sudden, these applications perform better in stand-alone mode.

The Pentium 4 Trojan Horse?

But for some observers, HT will help the Pentium 4 compensate for several of its less-than-stellar benchmark results against the Athlon and the Pentium III.

One of the best-known features of the Pentium 4 is its extremely long pipeline. The pipeline of the Pentium III, for example, has 10 stages and the Athlon's has 11, while the Pentium 4 has no fewer than 20. Because of this long pipeline, the Pentium 4's performance has lagged, megahertz for megahertz, behind the Pentium III and the Athlon in most office applications (an issue on which AMD has tried to capitalize with its Athlon XP line since its introduction last year).

The performance gap has even triggered ongoing class-action lawsuits. This summer, class-action lawyers sued Intel, Gateway, and Hewlett-Packard. Technically, the suit covers anybody in the U.S. who has ever bought or leased a PC containing a Pentium 4, so the plaintiffs may potentially number in the hundreds of thousands. But now, Intel says HT allows its long-pipeline architecture to come to full fruition. Could HT be Intel's Trojan Horse that will render the Pentium 4's performance commensurate with its clock speed?

3. Two Threads Are Better Than One

By making a single physical processor serve as two logical processors, HT represents a departure from traditional CPU performance improvements, which involve either boosting the clock speed or improving cache designs.

HT provides a second logical processor in a single physical package, so there are two separate architectural states that share one set of physical execution resources. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors in a multiprocessor system. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.


With two copies of the architectural state on each physical processor, a dual-processor system appears to have four logical processors. The two logical processors in each package share that package's execution units, branch predictors, control logic, and buses.

Each logical processor has its own interrupt controller. Interrupts sent to a specific logical processor are handled only by that logical processor.
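
To make the software view concrete, here is a minimal Python sketch of how a program might ask the OS how many logical processors it sees. `os.cpu_count()` is a standard-library call; the helper `estimate_physical_packages` and its default of two logical processors per package are assumptions made for illustration, and the estimate holds only if HT is enabled on every package.

```python
import os


def logical_processor_count() -> int:
    """Return the number of logical processors the OS reports (at least 1)."""
    # os.cpu_count() may return None when the count cannot be determined.
    return os.cpu_count() or 1


def estimate_physical_packages(logical: int, logical_per_package: int = 2) -> int:
    """Rough estimate of physical packages.

    ASSUMPTION: every package exposes `logical_per_package` logical
    processors, as it would with HT enabled everywhere.
    """
    return max(1, logical // logical_per_package)


count = logical_processor_count()
print(f"OS reports {count} logical processor(s)")
print(f"Estimated physical packages: {estimate_physical_packages(count)}")
```

On a dual-processor HT system, the OS would report four logical processors and the estimate would come out to two packages. The program itself cannot tell logical from physical processors with this call alone, which is exactly the point of the design.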

4. The Pipeline

Intel is happy to note that the Pentium 4's 20-stage pipeline is long enough that its resources can be replicated, divided, or shared at each stage to process more than one thread simultaneously.

The front end of the pipeline, for example, is responsible for delivering instructions to the later pipe stages. The OS, for its part, schedules and dispatches threads of code to each logical processor. When no thread is dispatched to a logical processor, that logical processor is kept idle.


When a thread is scheduled and dispatched to a logical processor, HT utilizes the necessary processor resources to execute the thread.

When a second thread is scheduled and dispatched to the other logical processor, resources are replicated, divided, or shared as needed to execute it. As each thread finishes, the operating system idles the unused logical processor, freeing resources for the one still running.

To optimize performance in multi-processor systems with HT, the OS can be configured to schedule and dispatch threads to alternate physical processors before dispatching to different logical processors on the same physical processor.
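
The dispatch order described above can be sketched in a few lines of Python. This is a toy model, not how any real OS scheduler is written: the `(package, logical_index)` tuples and the function `pick_logical_cpu` are hypothetical names introduced for illustration.

```python
from typing import Dict, Optional, Tuple

# A logical processor is identified by (physical_package, logical_index).
LogicalCpu = Tuple[int, int]


def pick_logical_cpu(busy: Dict[LogicalCpu, bool]) -> Optional[LogicalCpu]:
    """Pick an idle logical CPU, preferring packages with no busy sibling."""
    idle = [cpu for cpu, is_busy in busy.items() if not is_busy]
    if not idle:
        return None  # every logical processor already has a thread

    def busy_siblings(cpu: LogicalCpu) -> int:
        # Count busy logical processors in the same physical package.
        package = cpu[0]
        return sum(1 for (p, _), b in busy.items() if p == package and b)

    # Fewest busy siblings first; ties broken by lowest (package, index).
    return min(idle, key=lambda cpu: (busy_siblings(cpu), cpu))


# Two physical packages, two logical processors each, all idle at the start.
state = {(0, 0): False, (0, 1): False, (1, 0): False, (1, 1): False}
order = []
while (cpu := pick_logical_cpu(state)) is not None:
    state[cpu] = True
    order.append(cpu)
print(order)  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The second thread lands on the other physical package rather than on the sibling logical processor, so two threads never share one set of execution resources while a whole package sits idle.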

5. A History Of Parallelism

Parallelism and multi-threaded computation, the concepts on which HT is based, are of course anything but new. In fact, Intel's Xeon server processors have employed HT since the technology's debut earlier this year. Thread-level parallelism is now used by Intel, Sun, IBM, and Compaq processors.

By definition, parallelism boosts performance by performing independent tasks simultaneously. Since the mid-1990s, Intel has employed processor parallelism to milk as much performance as possible from its processors' silicon for server applications.

Specifically, HT is based on Thread-Level Parallelism (TLP). One established TLP technique switches chip resources from the currently executing thread to a new thread when the current thread initiates a long-latency operation. This reduces the likelihood of long pipeline stalls by allowing the second thread to execute while the long-latency operation of the first thread completes.

Switching processor resources from one thread to another incurs a performance penalty, however, since the current thread's instructions must be flushed or drained from the pipeline. The thread's architectural state must then be preserved, the new logical processor must be activated, and instructions from the new thread must be supplied to the processor's resources. These steps can take up to 40 clock cycles to complete.
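
A little arithmetic shows what that penalty means in practice. The sketch below uses the 40-cycle upper bound from the text and the 3.06 GHz clock of the new Pentium 4; the 200-cycle stall in the usage lines is an assumed figure chosen only for illustration, and the function names are hypothetical.

```python
SWITCH_COST_CYCLES = 40   # upper bound quoted in the text
CLOCK_HZ = 3.06e9         # the 3.06 GHz Pentium 4 discussed in the article


def switch_cost_ns(cycles: int = SWITCH_COST_CYCLES, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a thread-switch cost in clock cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


def cycles_recovered(stall_cycles: int, switch_cycles: int = SWITCH_COST_CYCLES) -> int:
    """Cycles a second thread can use during a stall after one switch is paid for."""
    return max(0, stall_cycles - switch_cycles)


print(f"{switch_cost_ns():.2f} ns per switch")  # roughly 13 ns at 3.06 GHz
print(cycles_recovered(200))  # ASSUMED 200-cycle stall: 160 cycles recovered
print(cycles_recovered(30))   # stall shorter than the switch: nothing recovered
```

Under these assumptions, switching only pays off when the stall is comfortably longer than the switch itself, which is why the technique targets long-latency operations.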

With HT, however, multi-processor-capable software applications can run unmodified with twice as many logical processors to use. Each logical processor can respond to interrupts independently. The first logical processor can track one software thread while the second logical processor tracks another simultaneously. Because the two threads share one set of execution resources, HT can use resources that would otherwise sit idle if only one thread were executing. The result is increased utilization of the execution resources within each physical processor package.

For example, one logical processor can execute a floating-point operation while the other logical processor executes an addition and a load operation. HT is complementary to multi-processor (MP) systems because the operating system can schedule separate threads to execute simultaneously not only on each physical processor, but on each logical processor as well.
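
The floating-point example can be turned into a toy Python model of two threads sharing one set of execution units. It is a deliberately crude sketch: each instruction is just the name of the unit it needs, each thread issues at most one instruction per cycle, and a conflict over the same unit is resolved by simple alternation. None of this reflects the Pentium 4's actual issue logic, and the function names are hypothetical.

```python
from typing import List


def cycles_one_at_a_time(a: List[str], b: List[str]) -> int:
    """Run thread a, then thread b, one instruction per cycle."""
    return len(a) + len(b)


def cycles_shared(a: List[str], b: List[str]) -> int:
    """Run both threads together on one shared set of execution units."""
    i = j = cycles = 0
    while i < len(a) or j < len(b):
        cycles += 1
        if i < len(a) and j < len(b):
            if a[i] != b[j]:
                # Different units needed: both threads issue this cycle.
                i += 1
                j += 1
            elif cycles % 2:
                i += 1  # same unit wanted: thread a wins on odd cycles
            else:
                j += 1  # thread b wins on even cycles
        elif i < len(a):
            i += 1      # only thread a has work left
        else:
            j += 1      # only thread b has work left
    return cycles


fp_thread = ["fp", "fp", "fp"]
int_thread = ["int", "load", "int"]
print(cycles_one_at_a_time(fp_thread, int_thread))  # 6
print(cycles_shared(fp_thread, int_thread))         # 3
print(cycles_shared(["fp", "fp"], ["fp", "fp"]))    # 4, no gain when units collide
```

In this model, the gain comes entirely from the two threads wanting different units. When both compete for the same unit, sharing buys nothing, which matches the article's point that HT fills otherwise idle resources rather than adding new ones.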

This improves overall performance and system response because many parallel threads can be dispatched sooner, with twice as many logical processors available to the system. Even so, those logical processors still share one set of execution resources, so a second physical processor with its own dedicated execution resources will typically deliver a greater performance gain. In other words, HT is complementary to multi-processing by offering greater parallelism within each processor in the system, but it is not a replacement for dual or multi-processing.

6. Overall: More To Life Than Clock Speeds


HT does not represent a radical shift away from Intel's existing Pentium 4 architecture, nor will it require new programming skill sets. Users stand only to benefit, and the transition to an HT architecture will remain largely transparent to developers. Applications already written for multi-processor systems will use HT's two logical processors without modification. According to Sysmark tests run by Tom's Hardware, HT does indeed offer significant performance boosts independent of clock speed gains in the 3.06 GHz Pentium 4. That boost alone also does not take into account multitasking, such as the time it takes to open a large PowerPoint file while a heavy-duty virus check runs in the background.

However, Windows XP is the only version of Windows that recognizes HT, so while existing multi-threaded applications may be able to take advantage of HT, they will not do so on older versions of Windows.

In summary, with HT offering so many performance gains independent of clock speed, perhaps Intel is on the verge of ending its emphasis on the megahertz race with AMD. At the very least, Intel has begun to attempt to convey to the average user that there is more to life than clock speeds alone.