Into The Core And Cache, AVX-512
The Skylake Microarchitecture
Skylake's architectural advantages over Broadwell are well-known from our desktop explorations. However, Intel added more to this particular implementation, such as AVX-512 and its re-worked cache hierarchy.
First, a quick refresher. The Skylake design first seen on mainstream desktops is fairly similar to Broadwell, which preceded it. Intel made a number of tweaks to promote instruction-level parallelism, though. Architects widened the front end and improved the execution engine with a larger re-order buffer, scheduler, integer register file, and greater retire performance. Improved decoders, branch predictor (reduced penalties for wrong jumps), and deeper load/store buffers are also among the list of enhancements. Skylake features a wider integer pipeline than its predecessor, along with a beefed-up micro-op cache (1.5x bandwidth and larger instruction window) that has an 80% hit rate.
While that was all well and good, the changes Intel makes to this revamped Skylake architecture are equally exciting.
The Skylake-SP Core - Bolt It On
Intel enables AVX-512 by fusing port 0 and 1 into a single 512b execution unit (doubling throughput) and extending port 5 to add the second FMA unit outside of the core. That's represented as "Extended AVX" in the slide below. This facilitates up to 32 double-precision and 64 single-precision FLOPS per core.
The original core design could have supported higher L1 throughput. However, that wasn't required until Intel added the second FMA unit. Doubling the core's compute capability necessitated more throughput to prevent stalls. So, Intel doubled L1-D load and store bandwidth to keep each core fed. Now we get up to two 64-byte loads and one 64-byte store per cycle.
In addition to Skylake's 256KB per-core L2 cache, Intel tacks on 768KB more outside of the core. Our diagram isn't just a lazy mock-up. Rather, Intel specifies that the added blocks are physically outside of the original core. Data stored on the external L2 cache consequently suffers a two-cycle penalty compared to the internal L2. Still, going from 256KB to 1MB of L2 is a big deal. Greater capacity and better caching algorithms more than make up for slightly higher latency in the bigger performance picture.
Rejiggering The Cache Hierarchy
And then there's the re-architected cache hierarchy. The quadrupled L2 is still private to each core, so none of its data is shared with other cores. Each core is also associated with less shared L3 cache (from 2.5MB to 1.375MB per core).
In an inclusive hierarchy, cache lines stored in L2 are duplicated in the L3. Considering the increased L2 capacity, an inclusive hierarchy would populate most of the L3 cache with duplicated data. To offset the increased L2 capacity, Intel transitioned to a non-inclusive caching scheme that doesn't require all data in the L2 to be duplicated in L3. Now the L3 serves as an overflow cache instead of the primary cache (L2 is now primary). That means only cache lines that are shared across multiple cores are duplicated in the L3 cache. The L2 cache remains 16-way associative, but the L3 drops from 20-way to 11-way.
The increased L2 capacity is a boon for virtualization workloads, largely because they don't have to share the larger private L2 cache. This also grants threaded workloads access to more data per thread, reducing data movement through the mesh.
The L2 miss-per-instruction ratio is reduced in most workloads, while the L3 miss ratio purportedly remains comparable to Broadwell-EP (this varies by workload, of course). Intel provided latency benchmarks along with SPECint*_rate data to represent its L2 and L3 hit rates in common applications.
L3 latency does increase because there are more cores in the overall design. Also, the L3 cache operates at a lower frequency to match the mesh topology.
Intel adds AVX-512 support, which debuted with Knights Landing, to its Skylake design. However, the company doesn't support all 11 instructions. Instead, it targets specific feature sets for different market segments.
The vector unit goes from 256 bits wide to 512, the operand registers are doubled from 16 to 32, and eight mask registers are added, among other improvements. All of that returns solid gains in single- and double-precision compute performance compared to previous generations.
Processing AVX instructions is power-hungry work, so these CPUs operate at different clock rates based on the work they're doing. With the addition of AVX-512, there are now three sets of base and Turbo Boost frequencies (non-AVX, AVX 2.0, AVX-512), and they vary on a per-core basis. Intel's performance slide highlights that, even at lower AVX frequencies, overall compute rates increase appreciably with the instruction set. AVX-512 also dominates in both GFLOPS-per-watt and GFLOPS-per-GHz metrics.
MORE: Best CPUs
MORE: All CPU Content