A Shared Front-End And Dual Integer Cores
Sharing The Front-End
As I already mentioned, Bulldozer’s instruction fetch and decode stages are shared between both of its cores. AMD uses interleaved multi-threading to track the thread ID of each instruction in flight, decide which thread most needs work completed, and perform an operation on behalf of that thread. It’s able to switch on a per-cycle basis to keep progress moving on both threads.
AMD actually decouples the branch target predictor from the instruction fetch stage, allowing it run ahead, independent of any stalls that occur in the fetch pipeline. More important, AMD says, is that decoupling those components enables a feature called prediction-directed instruction prefetch, characterized by a high level of accuracy and energy efficiency.
Branch prediction is guided by 512-entry L1 and 5000-entry L2 branch target buffers (BTBs). That pipeline is responsible for predicting ahead to populate a queue of future fetch addresses, and keep it as full as possible. There are actually two queues—one for each thread—ensuring there’s always work to be done. The instruction fetch pipeline then pulls addresses from the prediction queue.
Those addresses enter the fetch pipeline’s 64 KB two-way instruction cache, which is shared between both threads (the threads compete dynamically for access to it). Next, Bulldozer’s fetch queue feeds x86 instructions to a decode pipeline composed of four x86 decoders that, in turn, dispatch up to four operations per cycle to the schedulers.
When a miss occurs (that is, it’s not available in the instruction cache), a request is sent to the L2 cache and forwarded to system memory if necessary. That’s a big latency hit. So, while the request is in flight, fetch addresses further in the prediction queue are looked up to see if they’ll hit or not. If they’ll miss as well, a subsequent request is sent to L2 as the first instruction is coming back, overlapping instruction miss requests.
Dual Integer Cores
From the front-end, decoded operations make their way to one of two independent integer cores, where they execute fully out-of-order. The two cores each come equipped with two execution units and two address generation units.
Each core also features its own 16 KB way-predicted L1 data cache. Moreover, both cores include 32-entry L1 data translation lookaside buffers (TLBs) backed by a 1024-entry, eight-way L2 TLB that lives in the logic shared by both cores. Thirdly, each of the two integer cores employs out-of-order load/store units capable of two 128-bit loads/cycle or one 128-bit store/cycle.