AMD Bulldozer Review: FX-8150 Gets Tested

A Shared Front-End And Dual Integer Cores

Sharing The Front-End

As I already mentioned, Bulldozer’s instruction fetch and decode stages are shared between the two cores in each module. AMD uses interleaved multi-threading to track the thread ID of each instruction in flight, decide which thread most needs work completed, and perform an operation on behalf of that thread. It’s able to switch on a per-cycle basis to keep both threads making progress.
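To make the per-cycle switching a little more concrete, here is a toy Python model of a shared front-end choosing between two threads. AMD doesn't spell out how it decides which thread "most needs work," so the policy below (service whichever thread has the fewest operations buffered) is my own assumption, as are the function names, the four-wide delivery, and the drain rates.

```python
from collections import deque


def pick_thread(queues, last):
    """Return the index of the thread to service this cycle.

    Assumed policy (not AMD's documented one): the thread with the fewest
    buffered operations is the most starved, so it goes first. Ties go to
    whichever thread was not serviced on the previous cycle.
    """
    return min(range(len(queues)), key=lambda t: (len(queues[t]), t == last))


def simulate(cycles, drain_rates, width=4):
    """Run a shared front-end for `cycles` cycles.

    drain_rates[t] is how many operations thread t's core consumes per
    cycle (hypothetical numbers). Returns how many cycles each thread
    was serviced by the front-end.
    """
    queues = [deque() for _ in drain_rates]
    serviced = [0] * len(drain_rates)
    last = -1
    for _ in range(cycles):
        t = pick_thread(queues, last)
        queues[t].extend([t] * width)  # every operation carries its thread ID
        serviced[t] += 1
        last = t
        # Each core drains its own queue independently of the other.
        for i, rate in enumerate(drain_rates):
            for _ in range(min(rate, len(queues[i]))):
                queues[i].popleft()
    return serviced
```

With two equally hungry threads, `simulate(100, [2, 2])` returns `[50, 50]`, a strict alternation. Give one core a higher drain rate, as in `simulate(100, [4, 1])`, and the front-end spends more of its cycles on that thread.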

AMD actually decouples the branch target predictor from the instruction fetch stage, allowing it to run ahead, independent of any stalls that occur in the fetch pipeline. More important, AMD says, is that decoupling those components enables a feature called prediction-directed instruction prefetch, characterized by a high level of accuracy and energy efficiency.

Branch prediction is guided by 512-entry L1 and 5000-entry L2 branch target buffers (BTBs). That pipeline is responsible for predicting ahead to populate a queue of future fetch addresses and keep it as full as possible. There are actually two queues, one for each thread, ensuring there’s always work to be done. The instruction fetch pipeline then pulls addresses from the prediction queue.
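The producer/consumer relationship between the predictor and the fetch pipeline can be sketched in a few lines of Python. This is only an illustration: the queue depth and fetch-window size are placeholders I picked (AMD didn't share them here), a single dictionary stands in for both BTB levels, and the class and method names are hypothetical.

```python
from collections import deque

QUEUE_DEPTH = 8   # hypothetical; the actual queue depth isn't given
FETCH_BYTES = 32  # hypothetical size of one fetch window


class PredictionQueue:
    """One thread's queue of predicted fetch addresses."""

    def __init__(self, start_pc, btb):
        self.next_pc = start_pc
        self.btb = btb  # maps a fetch address to its predicted branch target
        self.addresses = deque()

    def predict_ahead(self):
        """Predictor side: queue one future fetch address if there is room."""
        if len(self.addresses) >= QUEUE_DEPTH:
            return False
        self.addresses.append(self.next_pc)
        # A BTB hit redirects to the predicted target; otherwise fall through
        # to the next sequential fetch window.
        self.next_pc = self.btb.get(self.next_pc, self.next_pc + FETCH_BYTES)
        return True

    def fetch(self):
        """Fetch side: pull the oldest predicted address, or None if empty."""
        return self.addresses.popleft() if self.addresses else None
```

Creating one `PredictionQueue` per thread mirrors the two queues described above. Because `predict_ahead()` never waits on `fetch()`, the predictor keeps filling the queue even while the fetch side is stalled, and only stops once the queue is full.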

Those addresses enter the fetch pipeline’s 64 KB two-way instruction cache, which is shared between both threads (the threads compete dynamically for access to it). Next, Bulldozer’s fetch queue feeds x86 instructions to a decode pipeline composed of four x86 decoders that, in turn, dispatch up to four operations per cycle to the schedulers.

When a fetch address misses (that is, the instructions it points to aren’t available in the instruction cache), a request is sent to the L2 cache and forwarded to system memory if necessary. That’s a big latency hit. So, while the request is in flight, fetch addresses further along in the prediction queue are looked up to see whether they’ll hit or miss. If they’ll miss as well, a subsequent request is sent to L2 as the first instruction is coming back, overlapping instruction miss requests.
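Some back-of-the-envelope arithmetic shows why overlapping those requests matters. The two Python functions below compare the stall cycles when misses are handled one after another against the case where later misses are requested while the first is still outstanding. The latency and the gap between requests are made-up parameters, and the model assumes the memory system can service requests in parallel, which is a simplification on my part.

```python
def serial_stall(misses, latency):
    """Cycles lost if each miss is discovered only after the previous returns."""
    return misses * latency


def overlapped_stall(misses, latency, issue_gap=1):
    """Cycles lost if later misses are requested while the first is in flight.

    Assumes one new request can be sent every `issue_gap` cycles and that
    the requests are serviced in parallel (both hypothetical).
    """
    if misses == 0:
        return 0
    return latency + (misses - 1) * issue_gap
```

For three misses at a hypothetical 20-cycle latency, `serial_stall(3, 20)` gives 60 cycles, while `overlapped_stall(3, 20)` gives 22.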

Dual Integer Cores

From the front-end, decoded operations make their way to one of two independent integer cores, where they execute fully out-of-order. The two cores each come equipped with two execution units and two address generation units.

Each core also features its own 16 KB way-predicted L1 data cache. Both cores also include 32-entry L1 data translation lookaside buffers (TLBs) backed by a 1024-entry, eight-way L2 TLB that lives in the logic shared by both cores. Finally, each of the two integer cores employs out-of-order load/store units capable of two 128-bit loads/cycle or one 128-bit store/cycle.
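To put those TLB and load/store figures in perspective, a few lines of Python work out how much memory each TLB level can map and how many bytes the load/store units can move per cycle. The 4 KB page size is my assumption (it is the standard x86 page size, but the article's figures don't say which page sizes the entries hold), and the variable names are just for illustration.

```python
PAGE_BYTES = 4 * 1024  # assumption: standard 4 KB x86 pages


def tlb_reach(entries, page_bytes=PAGE_BYTES):
    """Bytes of address space a TLB can map without a miss."""
    return entries * page_bytes


l1_reach = tlb_reach(32)    # 32 entries   * 4 KB = 128 KB
l2_reach = tlb_reach(1024)  # 1024 entries * 4 KB = 4 MB

# Peak load/store bandwidth per integer core, in bytes per cycle.
load_bytes_per_cycle = 2 * 128 // 8   # two 128-bit loads  = 32 bytes
store_bytes_per_cycle = 1 * 128 // 8  # one 128-bit store  = 16 bytes
```

Under that page-size assumption, the L1 TLB covers 128 KB, comfortably more than the 16 KB L1 data cache it sits beside, and the shared L2 TLB covers 4 MB.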

Chris Angelini
Chris Angelini is an Editor Emeritus at Tom's Hardware US. He edits hardware reviews and covers high-profile CPU and GPU launches.
  • btto
    yeah finaly, now i'll read it
  • ghnader hsmithot
    nOT Bad AMd!
  • jdwii
    Been so long and i'm kinda sad.
  • compton
    Not many surprises but I've been waiting for a long, long time for this. I hope this is just the first step to a more competitive AMD.
  • ghnader hsmithot
    At least its almost as good as Nehalem.
  • gamerk316
    Dissapointing. Predicted it ages ago though. PII X6 is a better value.
  • As I expected - failure.
  • AbdullahG
    I see the guys from the BD Rumors are here. As many others are, I'm disappointed.
  • iam2thecrowe
    for the gaming community this is a FLOP.
  • phump
    FX-4100 looks like a good alternative to the 955BE. Same price, higher clock, and lower power profile.