Interview with 4A Games & Metro: Last Light Redux
Introduced back in 2013, Metro: Last Light is an older title that we continue to use as a benchmark because of its ability to brutalize graphics hardware. 4A Games’ 4A Engine is no longer shiny and new, but it is threaded. What does that mean? Dmitry Lymar, director at 4A Games, answered a few of our questions regarding Metro and its underlying engine.
Tom’s Hardware: We understand that one of the motivators for the 4A Engine was X-Ray’s (GSC Game World’s foundation for the S.T.A.L.K.E.R. series) lack of parallelization. Can you go into more depth about how 4A uses threading and what benefits that confers?
Dmitry: Right from the design stage of the 4A Engine we made correct decisions regarding parallelization, and I am still happy with the system. However, we've had to go through several iterations of adapting/optimizing the engine for target hardware.
Conceptually, our model is really simple. We have a few control threads responsible mostly for spawning tasks (or work-items). Other threads are just workers; they "steal" those tasks from the queue and execute them out of order. Many of those tasks spawn further tasks if they need to. The control threads become workers as well when they do their "serial" processing and spawn a lot of tasks.
Usually there are almost no synchronization points, apart from the one at the start of a frame. That is why the scalability is really good.
There are some auxiliary threads as well, but they are mostly I/O threads with little to no compute on them.
As for variations we've used in the past:
Xbox 360: Two control threads, four workers + auxes
PlayStation 3: Two control, 12 workers (actually they were "fibers," or user-mode threads, and 12 of them were required to hide the latency of offloading to SPUs for our code) + SPUs + auxes
Xbox One/PlayStation 4: Three control, three workers + one latency-tolerant worker + auxes
PC (for Redux games): One dedicated D3D API submission thread, two or three control (depends on hardware thread count), all other available cores/threads become workers + some aux + some GPU driver threads + a lot of random threads from other processes
PC is slightly more complicated because of oversubscription risks (which result in visible stutter for us), as we do not control what is already running on the user's PC.
Tom’s Hardware: Is there an upper bound to the number of host processing cores the engine can schedule to?
Dmitry: No, there is no upper bound. However, there is a lower bound. We can run on a dual-thread CPU, but in that case there is already internal oversubscription, which causes a lower framerate and stutter. Realistically, a quad-threaded CPU is the minimum for us, with eight-thread CPUs vastly preferred.
Tom’s Hardware: In the two images attached (we shared some of our graphs with Dmitry), it’s easy to see some scaling up to ~six cores at 1920x1080, though even on a GeForce GTX 1080, graphics becomes the bottleneck at 2560x1440 (this is one reason Tom’s Hardware still benchmarks with Metro). Given the predominantly graphics-bound nature of many games, what makes a threaded engine so important?
Dmitry: Regarding the graphs: since we mostly optimize CPU performance on the current-gen consoles (as they are slower), to see our engine's real scalability you would have to underclock your CPUs to approximately console-level performance.
Obviously, on such powerful CPUs we will easily be GPU-bound, especially as graphics gets special treatment on the PC from us (read: we try to load more powerful GPUs with better graphics), but that is not the case for consoles. The CPU can be a very real bottleneck on Xbox One/PS4.
Tom’s Hardware: As an influencer, and understanding that companies like Intel and AMD work within power budgets, would you advise gamers to prioritize more cores (as in the Haswell-E/Broadwell-E platform) or higher clock rates/more IPC throughput (as in Skylake) if forced to choose?
Dmitry: For us, a quad-core/eight-thread i7 is probably the sweet spot on PC, mostly because we can sometimes be bound by the D3D submission thread, so more threads barely help. As a side note, measuring only throughput/framerate is not the right thing to do for gaming. Framerate stability/smoothness is of equal priority. For example, a higher-clocked i5 can give a higher average framerate, but a lower-clocked i7 can deliver a more even framerate, depending on the machine configuration, of course.
Metro: Last Light Redux
Although it doesn’t seem to matter whether you use an eight- or 10-core CPU, both do demonstrate an advantage over our six-core configuration at 1920x1080. In turn, that processor is notably faster than a four-core Skylake-based design.
The dual-core CPU technically satisfies 4A Games’ minimum requirement, but it also severely handicaps our GeForce GTX 1080 (as evidenced by an average frame rate lower than the other four systems’ minimums). It’s not as though two cores incur huge frame time spikes in this metric, but true to Dmitry’s comments, frame rates are lower and you clearly see a less-smooth experience on-screen.
A bump up to 2560x1440 has such an impact on graphics performance that more than four cores no longer matter. In fact, even Broadwell-E’s big L3 caches can’t keep the Skylake architecture’s IPC advantage from shining alongside GeForce GTX 1080.
That’s not to say two Skylake cores can catch up. They can’t. At least not yet. But now we’re seeing 4A Games’ emphasis on higher-quality graphics shrink the gap between host processors.
Incredibly, the graphics load at 3840x2160 is such that core count matters little. It’s the Skylake architecture’s ability to get more done per clock cycle that gives GeForce GTX 1080 a touch extra headroom. The result is that we see our four-core sample land in first place, followed by the two-core config. The Broadwell-E chips file in after.
These platforms are essentially tied in the average frame rate and frame rate over time graphs. But the dual-core arrangement does exhibit greater frame time variance just as Dmitry promised, making it a great illustration of his point.