Scheduling, Core Parking, And Throttling, Oh My!
It's no secret that AMD’s Bulldozer architecture failed to set the PC world on fire (AMD Bulldozer Review: FX-8150 Gets Tested). The design spreads eight integer cores across four modules, with each pair of cores sharing its module's resources. AMD credits this approach with lower power consumption than a full eight-core design would need, and it showed off plenty of benchmarks at its press events to demonstrate that the configuration is truly competitive in the right tests. At the end of the day, though, we were left unimpressed with Bulldozer's position relative to the competition, even though we gave it a fighting chance in our System Builder Marathon, Dec. 2011: $1200 Enthusiast PC.
In our launch story, we made it very clear that Windows 7 is not optimized for the module-based layout that Bulldozer employs. Chris talked to representatives at Microsoft who were able to confirm the operating system's behavior, and he ran the developer build of Windows 8 to confirm the next-gen OS would handle the FX family differently. From that review:
"According to Arun Kishan, software design engineer at Microsoft, each module is currently detected as two cores that are scheduled equally. So, in a dual-threaded application, you might see one active module and three idle modules—great for optimizing power, but theoretically less ideal from a performance standpoint. This also plays havoc with AMD’s claim that, when only one thread is active, it has full access to shared resources. Adding just one additional thread could tie up those shared resources, even as multiple other modules sit idle.
Microsoft is looking to change that behavior moving forward, though. Arun says that the dual-core modules have performance characteristics more similar to SMT than physical cores, so the company is looking to detect and treat them the same as Hyper-Threading in the future. The implications there would be significant. Performance would unquestionably improve, while AMD’s efforts to spin down idle modules would be made less effective."
This explanation does make sense for certain workloads. Two threads running on two separate modules have access to two front ends (and two FPUs), while two threads running on a single module must share both the front end and the FPU. A smarter OS might know the most effective way of distributing the load, which AMD stated would be a feature of Windows 8. Fortunately, Microsoft released a hotfix to address some of what was purportedly going wrong in Windows 7.
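To make the difference concrete, here is a minimal Python sketch of the two placement policies. It is a toy model and not Microsoft's scheduler: the function names, the fixed four-module layout, and the assumption that a module-unaware scheduler fills cores in index order (the worst case for a dual-threaded application) are all ours, for illustration only.

```python
# Toy model of an FX-8150: four modules with two integer cores each.
MODULES = 4
CORES_PER_MODULE = 2


def place_threads(n_threads, smt_aware):
    """Return the (module, core) slots chosen for n_threads runnable threads."""
    if n_threads > MODULES * CORES_PER_MODULE:
        raise ValueError("more threads than cores")
    if smt_aware:
        # Use the first core of every module before any module's second core.
        order = [(m, c) for c in range(CORES_PER_MODULE) for m in range(MODULES)]
    else:
        # Assumption: all eight cores look equal, so fill them in index order.
        # This packs the first two threads into a single module.
        order = [(m, c) for m in range(MODULES) for c in range(CORES_PER_MODULE)]
    return order[:n_threads]


def shared_modules(placement):
    """Count modules where two threads contend for the front end and FPU."""
    counts = {}
    for module, _core in placement:
        counts[module] = counts.get(module, 0) + 1
    return sum(1 for n in counts.values() if n > 1)


for aware in (False, True):
    slots = place_threads(2, aware)
    print(f"smt_aware={aware}: slots={slots}, shared modules={shared_modules(slots)}")
```

In this model, two threads end up sharing one module under the unaware policy and land on separate modules under the SMT-aware one. The trade-off is also visible: the aware policy wakes two modules instead of one, which is exactly the power-versus-performance tension described in the quote above.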
There remained, however, a latency penalty whenever a task that would have been scheduled to an already-active module is instead sent to a “parked” core, which first has to wake up. Microsoft introduced a second hotfix to address that issue. Put together, these two patches should help overcome the Bulldozer architecture's performance issues in lightly-threaded applications.
We followed up with Microsoft yet again for comment, and heard back that the core scheduler does indeed now recognize AMD's modules as SMT sets. However, the Windows 7 patches still should not be taken as an indication of how FX will behave under Windows 8. Apparently, there will be additional scheduler improvements that relate to how SMT is treated.
Although Microsoft helped AMD improve how the Bulldozer architecture is handled in less demanding workloads, there is still an issue we've seen on the FX-8150, where the 3.6 GHz part throttles down to 3.3 GHz under a full load. That’s probably considered a power-saving feature in densely-packed 2U servers. However, desktop users have the option to disable this strange step backward through the HPC Mode options exposed through recent firmware updates.
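For a sense of scale, the short Python calculation below works out how large that clock drop is. The two frequencies come from the text. Treating performance as proportional to clock speed is our simplifying assumption, so the result is best read as a rough upper bound on the loss rather than a measured figure.

```python
base_ghz = 3.6       # FX-8150 rated base clock, from the text
throttled_ghz = 3.3  # clock observed under full load, from the text

# Relative clock reduction, as a percentage of the base clock.
drop_pct = (base_ghz - throttled_ghz) / base_ghz * 100
print(f"{drop_pct:.1f}% lower clock under full load")
```

That works out to a reduction of roughly 8%, which is large enough to matter in fully threaded benchmarks if the workload scales with frequency.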
Our goal today is to find out how the latest ecosystem improvements help AMD's early-adopting customers. We're going to test with the benchmarks we normally run, too, and not the hand-picked titles AMD is using to illustrate the gains enabled by Microsoft's new patches.