AMD Bulldozer Review: FX-8150 Gets Tested

A Shared Front-End And Dual Integer Cores

Sharing The Front-End

As I already mentioned, Bulldozer’s instruction fetch and decode stages are shared between both of its cores. AMD uses interleaved multi-threading to track the thread ID of each instruction in flight, decide which thread most needs work completed, and perform an operation on behalf of that thread. It’s able to switch on a per-cycle basis to keep progress moving on both threads.
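To make the per-cycle switching concrete, here is a minimal sketch of a thread selector. AMD hasn't published the actual selection policy, so the rule used here (serve the thread with the fewest decoded operations waiting, and break ties in favor of the thread that wasn't served last) is purely an assumption, as are the names `pick_thread`, `simulate`, and `drain_rates`.

```python
def pick_thread(pending, last):
    """Pick the thread with the fewest waiting ops.

    Hypothetical policy: ties go to the thread that was NOT served last cycle.
    """
    return min(range(len(pending)), key=lambda tid: (pending[tid], tid == last))


def simulate(cycles, drain_rates, width=4):
    """Return which thread the shared front-end serves on each cycle.

    drain_rates: ops each core retires per cycle (illustrative numbers only).
    width: ops delivered to the chosen thread per cycle (four decoders).
    """
    pending = [0] * len(drain_rates)
    schedule = []
    last = None
    for _ in range(cycles):
        tid = pick_thread(pending, last)
        pending[tid] += width          # front-end works for this thread
        schedule.append(tid)
        last = tid
        for t, rate in enumerate(drain_rates):
            pending[t] = max(0, pending[t] - rate)  # cores consume ops
    return schedule


if __name__ == "__main__":
    print(simulate(4, [2, 2]))   # two equally hungry threads alternate
    print(simulate(6, [4, 1]))   # a faster-draining thread gets served more often
```

With equal drain rates the two threads simply alternate; when one core consumes work faster, the selector spends more cycles on it without starving the other thread.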

Branch prediction is guided by 512-entry L1 and 5000-entry L2 branch target buffers (BTBs). The prediction pipeline runs ahead of fetch, populating a queue of future fetch addresses and keeping it as full as possible. There are actually two queues, one for each thread, ensuring there’s always work to be done. The instruction fetch pipeline then pulls addresses from the prediction queue.
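A two-level BTB can be pictured as a small, fast table backed by a larger one. The sketch below is only an illustration of that idea: the fully associative organization, the LRU replacement, and the rule that an L1 victim drops into L2 are all assumptions, and the class name `TwoLevelBTB` is made up for this example. Only the 512 and 5000 entry counts come from the text above.

```python
from collections import OrderedDict


class TwoLevelBTB:
    """Toy two-level branch target buffer (hypothetical organization)."""

    def __init__(self, l1_entries=512, l2_entries=5000):
        self.l1 = OrderedDict()   # branch address -> predicted target
        self.l2 = OrderedDict()
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries

    @staticmethod
    def _insert(table, capacity, pc, target):
        """Insert as most recently used; return the evicted (pc, target) or None."""
        table[pc] = target
        table.move_to_end(pc)
        if len(table) > capacity:
            return table.popitem(last=False)  # evict least recently used
        return None

    def update(self, pc, target):
        """Record a resolved branch; an L1 victim falls back into L2."""
        evicted = self._insert(self.l1, self.l1_entries, pc, target)
        if evicted is not None:
            self._insert(self.l2, self.l2_entries, *evicted)

    def predict(self, pc):
        """Return (target, level) where level is 'L1', 'L2', or 'miss'."""
        if pc in self.l1:
            self.l1.move_to_end(pc)
            return self.l1[pc], "L1"
        if pc in self.l2:
            target = self.l2.pop(pc)
            self.update(pc, target)   # promote the entry back into L1
            return target, "L2"
        return None, "miss"


if __name__ == "__main__":
    btb = TwoLevelBTB(l1_entries=2, l2_entries=4)  # tiny sizes for the demo
    for pc, target in [(0x10, 0x100), (0x20, 0x200), (0x30, 0x300)]:
        btb.update(pc, target)
    print(btb.predict(0x10))  # found in the larger second-level table
    print(btb.predict(0x10))  # now promoted to the first level
```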

Those addresses enter the fetch pipeline’s 64 KB two-way instruction cache, which is shared between both threads (the threads compete dynamically for access to it). Next, Bulldozer’s fetch queue feeds x86 instructions to a decode pipeline composed of four x86 decoders that, in turn, dispatch up to four operations per cycle to the schedulers.

When a fetch misses (that is, the requested instructions aren’t available in the instruction cache), a request is sent to the L2 cache and forwarded to system memory if necessary. That’s a big latency hit. So, while the request is in flight, fetch addresses further down the prediction queue are looked up to see whether they’ll hit or not. If they’ll miss as well, a subsequent request is sent to L2 as the first one is coming back, overlapping instruction miss requests.
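A quick back-of-the-envelope model shows why overlapping misses matters. The latency and spacing figures below are placeholders rather than measured Bulldozer numbers, and the function names are invented for this sketch. It simply compares waiting for each miss in turn against issuing later requests while the first is still outstanding.

```python
def serialized_stall(misses, latency):
    """Cycles lost if each miss waits for the previous one to finish."""
    return misses * latency


def overlapped_stall(misses, latency, issue_gap):
    """Cycles lost if each later request is issued issue_gap cycles after
    the previous one, while earlier requests are still in flight.

    Assumes issue_gap <= latency; both values are illustrative.
    """
    if misses == 0:
        return 0
    return latency + (misses - 1) * issue_gap


if __name__ == "__main__":
    # Hypothetical: three instruction cache misses in a row, 20-cycle L2 latency,
    # later requests spaced 5 cycles apart.
    print(serialized_stall(3, 20))
    print(overlapped_stall(3, 20, issue_gap=5))
```

Under those made-up numbers, overlapping cuts the stall in half, and the gap widens as more consecutive misses pile up.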

Dual Integer Cores

From the front-end, decoded operations make their way to one of two independent integer cores, where they execute fully out-of-order. The two cores each come equipped with two execution units and two address generation units.

Each core also features its own 16 KB way-predicted L1 data cache. Both cores include 32-entry L1 data translation lookaside buffers (TLBs) backed by a 1024-entry, eight-way L2 TLB that lives in the logic shared by both cores. Finally, each of the two integer cores employs out-of-order load/store units capable of two 128-bit loads per cycle or one 128-bit store per cycle.
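Those load/store figures translate into peak per-core bandwidth with some simple arithmetic. The sketch below assumes a 3.6 GHz clock for illustration; the clock value and the helper name `peak_gb_per_s` are assumptions, not figures from this page, and real throughput will land below these theoretical peaks.

```python
def peak_gb_per_s(ops_per_cycle, bits_per_op, clock_hz):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_cycle = ops_per_cycle * bits_per_op // 8
    return bytes_per_cycle * clock_hz / 1e9


if __name__ == "__main__":
    CLOCK_HZ = 3_600_000_000  # assumed clock, for illustration only

    # Two 128-bit loads per cycle = 32 bytes per cycle per core.
    print(peak_gb_per_s(2, 128, CLOCK_HZ))
    # One 128-bit store per cycle = 16 bytes per cycle per core.
    print(peak_gb_per_s(1, 128, CLOCK_HZ))
```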

Chris Angelini is an Editor Emeritus at Tom's Hardware US. He edits hardware reviews and covers high-profile CPU and GPU launches.
