Silvermont, as we've established, is built on an out-of-order execution engine, a change with huge performance ramifications compared to the in-order Saltwell (a design that, remember, remains competitive with other SoCs shipping today). Intel continues to lean on macro-op execution for more efficient handling of certain x86 instruction combinations, though.
The 32 nm Saltwell execution pipeline is 16 stages long, and because it's in-order, every macro-op traverses the whole thing, even those that never touch the cache access stages. As a result, a branch mispredict wastes 13 cycles. In Silvermont, an op that doesn't need the cache can bypass those access stages and proceed straight to execution, so a mispredict burns only 10 cycles.
Each Silvermont core receives a number of tweaks and improvements, from larger branch predictors to reworked execution units and bigger caches. A lot of effort went into identifying instructions that were on the slower side in Intel's Bonnell design. Silvermont improves many of them, reducing latency and increasing throughput. Floating-point adds are down several cycles each, packed double-precision SIMD results arrive in four clocks instead of nine, and signed multiplies are sped up significantly. All told, Intel claims per-core IPC is about 50% higher across a wide swath of workloads. For perspective, consider the jump from Sandy Bridge to Ivy Bridge, where we saw single-digit IPC gains comparing two CPUs running at the same frequency. A 50% boost is outright massive.
But of course, Atom typically shows up in multi-core configurations. When the processor family first launched, it was a single-core chip. Not long after, Intel introduced a dual-core model, also manufactured at 45 nm. When it came time to adopt 32 nm, only dual-core versions surfaced. And as the company advances its process technology, more parallelized configurations become viable. In fact, Silvermont can scale as high as eight physical cores.
Now, the L2 cache is tightly coupled to the cores, yielding low latency and high bandwidth. Intel’s architects didn’t want to share that cache across more than two cores, though. So they went with a module-based approach. Each little building block includes a pair of cores and 1 MB of L2 cache shared between them (previous Atom processors had 512 KB of L2 per core). Individual cores, the L2 cache, and the interface between the cores and cache can all be power-gated. The cores in a module can even run at different frequencies, though they’ll operate symmetrically by default.
Modules communicate over a point-to-point in-die interface (IDI) with independent read and write channels, replacing the front-side bus topology altogether. Incidentally, Intel credits its IDI as one of the keys to the modularity of the Nehalem/Westmere generation, and it seems that a lot of work from the "big" core space is benefiting Atom here.
Looking at a core architecture optimized for single-threaded performance and a modular approach to scalability, Intel chose to drop Hyper-Threading. Including the technology would have increased power consumption in single-threaded workloads. So the company bypassed SMT altogether, favoring more cores to boost performance in parallelized tasks.
At the same time, Intel's engineers brought the instruction set architecture up to the 2010 Westmere class, a four-year jump from the original Atom design's Merom-compatible ISA. SSE4.1, SSE4.2, and POPCNT (which operates on integer registers) are part of this update, bolstering Atom's performance picture. AES-NI acceleration and Secure Key (including the RDRAND instruction and Digital Random Number Generator) make it in as well.
Virtualization acceleration evolves from basic VT-x support to the technology's second generation, introduced with Nehalem, adding Extended Page Tables. Virtual Processor IDs in the TLBs and Unrestricted Guest mode (which allows KVM guests to run real-mode and unpaged code natively when EPT is enabled) are part of that same evolution.