Google Stadia Port Troubles Blamed on the Linux Kernel Scheduler
Bad penguin.
If anyone cares about millisecond-long delays more than gamers, it's game developers. They know a millisecond can make a big difference in how a game plays. That's bad news for Google Stadia, because a developer recently claimed that the Linux kernel scheduler can cause stalls in games ported to the platform.
A developer named Malte Skarupke publicized the problem on Monday. In a blog post, Skarupke explained how he became aware of the issue and what he did to address it (shout-out to Phoronix for spotting the post).
This is the high-level overview Skarupke provided before offering more technical details about the issue:
"I overheard somebody at work complaining about mysterious stalls while porting Rage 2 to Stadia. The only thing those mysterious stalls had in common was that they were all using spinlocks. I was curious about that because I happened to be the person who wrote the spinlock we were using. The problem was that there was a thread that spent several milliseconds trying to acquire a spinlock at a time when no other thread was holding the spinlock. Let me repeat that: The spinlock was free to take, yet a thread took multiple milliseconds to acquire it. In a video game, where you have to get a picture on the screen every 16ms or 33ms (depending on if you’re running at 60Hz or 30Hz), a stall that takes more than a millisecond is terrible. Especially if you’re literally stalling all threads."
Skarupke said he spent months investigating the issue before concluding that "most mutex implementations are really good, that most spinlock implementations are pretty bad and that the Linux scheduler is OK but far from ideal." He eventually decided to apply the band-aid solution of switching from a spinlock to a mutex.
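The appeal of that band-aid is that a mutex puts a waiting thread to sleep in the kernel, so the scheduler knows exactly which thread to wake when the lock is released, rather than hoping a spinning thread happens to be on a CPU at the right moment. As a rough sketch of the swap using the C++ standard library (the surrounding struct and names are made up for illustration, not taken from the Rage 2 code):

#include <mutex>

// Hypothetical shared structure from a game engine; only the locking changes.
struct FrameJobQueue {
    std::mutex lock;   // previously a custom spinlock
    // ... job storage ...

    void push() {
        std::lock_guard<std::mutex> guard(lock); // sleeps under contention instead of spinning
        // ... enqueue work for the frame ...
    }
};

An uncontended std::mutex acquisition is typically just an atomic operation, and only contention falls back to the kernel, which is part of why Skarupke found most mutex implementations "really good."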
More information is available in Skarupke's blog post, which is worth a read for anyone curious about how much difference a few milliseconds of latency can make in a game, especially on a streaming platform like Stadia, and how developers try to solve those problems. Hopefully it becomes less of an issue in the future.
Nathaniel Mott is a freelance news and features writer for Tom's Hardware US, covering breaking news, security, and the silliest aspects of the tech industry.
mitch074
...and it was debunked by Linus Torvalds himself, here: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723 and here: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189752 He explains, for one, that the time measurement is plain wrong, and second that using spinlocks in user space is in 99% of cases a bad idea.

The original developer's rant simply amounts to "using spinlocks on Windows works as I want it, but not on Linux". From what can be found, it would seem that this is a bug, not a feature, of the Windows kernel.
bit_user
mitch074 said: "...and it was debunked by Linus Torvalds himself"
That may be, but...

mitch074 said: "and second that using spinlocks in user space is in 99% of cases a bad idea."
Seems consistent with what the original author concluded, which is sound advice. I never use spinlocks. Quoting the original post: "So what should we do based on the above? I hope to have convinced you to avoid spinlocks, even for very short sections."

mitch074 said: "The original developer's rant simply amounts to 'using spinlocks on Windows works as I want it, but not on Linux'. From what can be found, it would seem that this is a bug, not a feature, of the Windows kernel."
Um, not really. He was debugging a real problem, which was stuttering that he didn't see on Windows. Even if his measurement methodology was flawed, I think he still reached the best conclusion, and he had some interesting observations about the different Linux schedulers he tried (not the timing data, which Torvalds debunked, but the more casual observations) and about realtime thread scheduling (in short: don't).

Edit: I do think it's pretty sad to see how much time that developer spent trying to optimize a fundamentally bad approach. However, that ultimately serves to reinforce his conclusion and Torvalds' message. Had his analysis not been so involved, I doubt this would've gotten quite so much attention.
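To make the point about user-space spinlocks concrete: the kernel has no idea a spinning thread is waiting on anything, so if the thread holding the lock gets preempted, every waiter can burn its entire timeslice without anyone making progress. The following is a small, self-contained C++ sketch (illustrative only, not code from the blog post or the forum thread) that deliberately starts more spinning threads than there are cores to make that preemption likely:

#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
long long counter = 0;  // protected by lock_flag

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        while (lock_flag.test_and_set(std::memory_order_acquire)) {
            // Spin in user space. If the lock holder has been preempted,
            // we burn CPU here while nothing moves forward.
        }
        ++counter;  // critical section
        lock_flag.clear(std::memory_order_release);
    }
}

int main() {
    // Oversubscribe: far more spinning threads than hardware cores makes
    // preemption of the lock holder (and long waits on a "free" lock) likely.
    unsigned n = 4 * std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i) threads.emplace_back(worker, 100000);
    for (auto& t : threads) t.join();
    std::printf("counter = %lld\n", counter);
}

Contrast that with a mutex, where the same oversubscription just means more threads sleeping in the kernel's wait queue until the lock is handed to one of them.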