AMD responds to buggy AI software complaints, releases firmware documentation for RDNA3 GPUs — pressure from TinyCorp spurred the sudden posting

RDNA 3 Silicon (Image credit: AMD)

Following AMD's confirmation last month that it was moving forward with plans to open source its GPU stack after complaints from Tiny Corp, we've finally seen the first significant step: the release of MES documentation on GPUOpen.com. MES stands for Micro Engine Scheduler, the component that schedules graphics and compute work on the GPU. The new documentation specifically covers AMD's RDNA 3 GPUs.

While this is indeed one of the things Tiny Corp asked for, they have already noted on Twitter that they are bypassing the MES in their backend, along with most of the MEC. Tiny Corp, for those unfamiliar, is focused on building powerful AI workstations with a smaller footprint and a lower price than the current top end of AI hardware.

.@AMD @amdradeon released some MES documentation today! (it's on GPUOpen) A good start, but we are bypassing the MES now in our "AMD" backend. We are even bypassing most of the MEC. Can you document the PM4 packets and what happens after you poke COMPUTE_DISPATCH_INITIATOR? pic.twitter.com/WabsNG3q3S May 9, 2024

So, it seems that Tiny Corp's past prodding of AMD was successful, but not fast enough to stop the company from implementing an MES workaround. Tiny Corp's goal to "commoditize the petaflop" with the help of high-performance AMD- or Nvidia-powered AI boxes should get a little easier with further open-sourcing of AMD's software and documentation. The MES firmware to go along with this documentation is expected to be released in the coming weeks and is likely being held up only by legal concerns as it goes open source.

Additional parts of the AMD Radeon software stack are expected to be open-sourced throughout the year, echoing earlier statements from AMD. Tiny Corp previously seemed displeased with AMD's progress so far, but AMD may yet be able to win over Tiny Corp and other prosumers seeking Nvidia alternatives. Raw hardware power dictates that the AMD-based Tiny Box should be easily on par with Nvidia for much cheaper, but the current status of AMD's software stack prevents that from being true.

Tiny Corp and others hope to push AMD into open-sourcing its software stack to make these software issues easier to diagnose and fix. With any luck, this can be done efficiently enough to start giving Nvidia some real competition in the GPU compute market, especially where AI and similar workloads are concerned.

Christopher Harper has been a successful freelance tech writer specializing in PC hardware and gaming since 2015, and ghostwrote for various B2B clients in High School before that. Outside of work, Christopher is best known to friends and rivals as an active competitive player in various eSports (particularly fighting games and arena shooters) and a purveyor of music ranging from Jimi Hendrix to Killer Mike to the Sonic Adventure 2 soundtrack.

9 Comments Comment from the forums

bit_user

Credit to AMD, I suppose, but I think there's no way you can ever win with someone like George Hotz. In fact, showing weakness in the face of his demands almost guarantees that he and others will try the same tactics, as long as there's ever anything they want from AMD (which is pretty much always going to be the case).

I'm not saying AMD should've ignored Hotz, but they should've politely acknowledged his initial complaints and then followed up with him as per their normal partner support process, which certainly isn't done via social media.

Meanwhile, I'd bet most of AMD's current and future competitors are studying this new source code drop to see what tricks and secrets they can learn about AMD's GPUs.
Reply
sivaseemakurthi

What if the issues are in hardware but not in software? Does opening up the drivers help in that case? Also why couldn't AMD ask Hotz to just sign an NDA and get the info? Maybe AMD thought making their software open will help with improving the quality!
Reply
bit_user

sivaseemakurthi said:
What if the issues are in hardware but not in software? Does opening up the drivers help in that case?
In my personal experience of developing firmware for a buggy ASIC, not a chance. Us firmware engineers would see some inexplicable behavior, triple-check our code and make sure it's not our fault, and then raise an issue with the hardware team.

At that point, they'd look at our characterization and go away to study the relevant Verilog code. Typically we had already reproduced the issue in an RTL-level simulation and they'd be studying the logfiles it generated as well. Then, they'd return with a set of things for us to try. Ideally, we'd work around the issue on the first try, but sometimes it required a few iterations.

The chance of an outsider debugging hardware issues in a production environment, not even on a testbench with a logic analyzer hooked up, and with virtually no in-depth knowledge about the hardware implementation, just seems very remote to me. Even if they're able to characterize the symptoms of the bug and fumble their way into a mitigation, it's difficult to know whether you have a reliable workaround without someone confirming the root cause of the problem.

sivaseemakurthi said:
Also why couldn't AMD ask Hotz to just sign an NDA and get the info?
Good questions. I assume the level of effort in packaging up the code + tools + docs for partners to fiddle around with is probably similar to open sourcing, except the latter would require an IP review to determine whether you're giving away any IP that AMD doesn't own or giving away any of AMD's crown jewels.

The fact that Hotz was pushing for AMD to open source this stuff suggests to me that AMD wasn't even prepared to give anyone access under NDA, or else I don't know why they wouldn't have done exactly that.

sivaseemakurthi said:
Maybe AMD thought making their software open will help with improving the quality!
Eh, I'm not going to venture any further into AMD's mindset. I've probably said more than enough about that.
Reply
edzieba

bit_user said:
I'm not saying AMD should've ignored Hotz, but they should've politely acknowledged his initial complaints and then followed up with him as per their normal partner support process, which certainly isn't done via social media.
The problem is, that did happen. But the outcome was 'wait for the next release' followed by a release that did not fix the bugs.
The social media spat was after the normal avenues of communication came up empty, and as a result of basically "if you're not going to bother fixing it, then give us the code so we can fix it ourselves".

AMD may have considered Tiny Corp to be too small to be worth dedicating resources to, but that has resulted in a massive own-goal for AMD. For a company positioning themselves as the cheaper and more open alternative to Nvidia in the DL space - and Tiny Corp initially choosing them on that basis and explicitly eschewing offering an Nvidia solution - they have now gained a reputation for "if you're not ordering hundreds of thousands of units then screw your bug reports", and a previously anyone-but-Nvidia supplier now offering an Nvidia solution as the explicitly 'it's more expensive, but at least it works' option. If a company that explicitly wanted to use your cards has to start a public slapfight just to get not even support, but just the tools to fix the problem themselves, what chance does anyone else have?

The smaller business and hobbyist markets may not offer the nice slabs of profit that hyperscalers do, but if you spurn those markets then you effectively offer your competitors as the only option to new entrants. If your potential customers grew up playing with Nvidia-based cards, and then get the cold shoulder from AMD when trying to scale up their little bedroom hobby into a business, then they're certainly not going to be considering you in a good light when you want to scale out to a datacentre or multiple datacentres, and particularly not when they've already spent years building their software stack with a competitor's API suite. This was Nvidia's big win with CUDA: the API you run on your massive farm of H100s and the API you can run on the 4050 you're playing with LLMs on is the same API.
Reply
TechLurker

edzieba said:
AMD may have considered Tiny Corp to be too small to be worth dedicating resources to, but that has resulted in a massive own-goal for AMD. For a company positioning themselves as the cheaper and more open alternative to Nvidia in the DL space - and Tiny Corp initially choosing them on that basis and explicitly eschewing offering an Nvidia solution - they have now gained a reputation for "if you're not ordering hundreds of thousands of units then screw your bug reports", and a previously anyone-but-Nvidia supplier now offering an Nvidia solution as the explicitly 'it's more expensive, but at least it works' option. If a company that explicitly wanted to use your cards has to start a public slapfight just to get not even support, but just the tools to fix the problem themselves, what chance does anyone else have?
Let's not forget that Tiny Corp explicitly wants to use consumer-grade GPUs to bypass the buy-in cost of enterprise-grade accelerators, and yet still demands the same level of service that enterprise users get but consumer users do not. AMD definitely did not need to pay service to them, but they did anyway, since it makes sense for them to do so in the long run. If anything the more important issue is that Tiny Corp is still running their AI cabinets off Threadripper anyway, and it's more important AMD not kill Threadripper again.

The whole reason they're badgering AMD is because there is no one else interested in catering to them. NVIDIA would tell them to screw off and buy their AI accelerators, while Intel just isn't in a position to devote resources to converting their gaming GPUs to AI tasks when they're focused on making in-roads into the gaming sector and prioritizing enterprise clients with their own AI offerings. You'll note that not once did Tiny Corp harass NVIDIA about their offerings or trying to get them to unlock access to their GPUs for better optimization.

If anything, the fact AMD has remained quiet and slowly but steadily reviewed their code for what can be open-sourced is admirable, refusing to completely play into Hotz' hands. They've already put top level engineers on AMD's side in contact with Tiny Corp's team, which is normally reserved for enterprise, and at this point Hotz is just being a whiny prick to try and get sympathy on his side. And to AMD's benefit, having open-sourced some non-sensitive stuff could lead to more novel coding for AI or whatever on their hardware, for those that like to experiment with such besides Tiny Corp.
Reply
bit_user

edzieba said:
The problem is, that did happen. But the outcome was 'wait for the next release' followed by a release that did not fix the bugs.
The social media spat was after the normal avenues of communication came up empty, and as a result of basically "if you're not going to bother fixing it, then give us the code so we can fix it ourselves".
I think that exposed a deep flaw in Tiny's business plan. They decided to take a product (AMD's gaming GPUs), which hadn't previously been used for serious AI training, and use them to undercut the other solutions on the market. It was naive to think there wouldn't be any technical hurdles encountered, or that they could all be cleared within the aggressive timeline of a typical startup. Especially if they hadn't previously gotten AMD on board with their plan and committed to devoting extra resources, from the outset.

I'm actually less critical of what Tiny/Hotz did after that, because it's unsurprising to see desperate people do reckless things. At this point, my criticism switches to how AMD handled it. Not only did they essentially do everything that Tiny asked, but it didn't even make a difference, in the end. It was infeasible for AMD to open source its MES firmware on a timescale that was meaningful to Tiny, if you look at everything involved in doing something like that, which is yet another reason why it was silly for them to cave to this demand.

edzieba said:
AMD may have considered Tiny Corp to be too small to be worth dedicating resources to,
How do you know they didn't? From the sound of it, AMD absolutely did have people working on the issues Tiny raised! Bugs cannot be fixed on a deterministic timescale. Bugs involving hardware & firmware are some of the most tricky. The mere fact that AMD didn't get all of the bugs fixed when Tiny demanded doesn't mean nobody was working on them. Hotz and AMD both referred to meetings they had and work that AMD did to try and address Tiny's issues. However, due to your obvious bias, you completely ignore all of that, because it doesn't suit your narrative.

edzieba said:
a previously anyone-but-Nvidia supplier now offering an Nvidia solution as the explicitly 'it's more expensive, but at least it works' option.
That's not correct. They went to market with both solutions on offer, after bypassing MES. Of course, after all the noise they made, they had to downplay the AMD-based solution.

edzieba said:
If a company that explicitly wanted to use your cards has to start a public slapfight just to get not even support,
You're really buying into Hotz' exaggerated narrative, and then you're exaggerating it even on top of what even he said!

This just shows why it was a lost cause for AMD ever to engage with someone like Hotz. There's no upside for them, and engaging him only brings more attention and appearance of legitimacy to his position. Furthermore, it sounds like whatever happened with this whole affair, you'd find some way to trash AMD for it. At first, I thought you had just gotten caught up in the drama of the whole affair, but I can now see that your anti-AMD hate runs deeper than that.
Reply
bit_user

TechLurker said:
If anything the more important issue is that Tiny Corp is still running their AI cabinets off Threadripper anyway, and it's more important AMD not kill Threadripper again.
No, they never said anything about using ThreadRipper. Tiny's plan always involved lower core-count EPYCs, because that's how you get the most PCIe lanes per $.
https://tinygrad.org/#tinybox
I'm not sure if they said, but I suspect they're Milan (Zen 3), because it offers PCIe 4.0 and that's all they need. Paying extra for a CPU & platform with PCIe 5.0 would be silly, so unless AVX-512 support was super important (which I doubt), it'd be better to go with a Zen 3 solution and just add cores until the host CPU isn't a bottleneck.

TechLurker said:
The whole reason they're badgering AMD is because there is no one else interested in catering to them. NVIDIA would tell them to screw off and buy their AI accelerators,
You might be right, if that had been their plan from the outset. As it played out, he did ultimately offer a Nvidia option and perhaps they decided to go along with it for the PR win, after all the negative press that was generated about AMD. If Nvidia had rebuffed Tiny after all that, then their proponents like @edzieba could not turn this into an allegory about how AMD sux and nobody should ever use anything other than Nvidia.

TechLurker said:
while Intel just isn't in a position to devote resources to converting their gaming GPUs to AI tasks
No, the issue with Intel is their A770 simply lacks the performance and memory to be a viable alternative. Tiny acknowledged this and they're right.

As far as driver support goes, Intel has a workstation/datacenter version of the same dGPUs with a 5 year support contract. From what I've seen, the support for running compute workloads on Intel's dGPUs has always been in better shape than their gaming support. I only have experience with running compute jobs on their iGPUs, but my interest in their dGPUs has always been for compute - so that's the aspect I've followed most closely.

I would also point out that my dealings with Intel's partner support team have been top notch. We weren't doing anything as ambitious as Tiny, but their support was competent and responsive and their compute support on their iGPUs proved to be stable and mature.
Reply
edzieba

bit_user said:
I think that exposed a deep flaw in Tiny's business plan. They decided to take a product (AMD's gaming GPUs), which hadn't previously been used for serious AI training, and use them to undercut the other solutions on the market. It was naive to think there wouldn't be any technical hurdles encountered, or that they could all be cleared within the aggressive timeline of a typical startup.
That's the core problem: On the Nvidia side of things, AI training and inferencing is explicitly supported across all GPU lines. As we saw in this very situation, on the AMD side of things it's 'supported' until you have an actual problem, at which point 'support' becomes 'pulling teeth'.

bit_user said:
How do you know they didn't? From the sound of it, AMD absolutely did have people working on the issues Tiny raised! Bugs cannot be fixed on a deterministic timescale. Bugs involving hardware & firmware are some of the most tricky. The mere fact that AMD didn't get all of the bugs fixed when Tiny demanded doesn't mean nobody was working on them.
The problem wasn't that bugs weren't fixed "when Tiny demanded", the problem was AMD releasing a "here, we fixed the bugs" release that did not actually fix the bugs. That's what prompted the escalation in the first place!

TechLurker said:
yet still demands the same level of service that enterprise users get but consumer users do not
Consumers on the green side of the fence can expect support for AI training and inference on their non-enterprise cards. Nvidia even release their own software to do so. That Tiny expects AMD to support what they claim to be available on Radeon GPUs and what is supported by their competition is not an 'enterprise level of support' but the minimum support expected for features advertised.

TechLurker said:
The whole reason they're badgering AMD is because there is no one else interested in catering to them. NVIDIA would tell them to screw off and buy their AI accelerators [...] You'll note that not once did Tiny Corp harass NVIDIA about their offerings or trying to get them to unlock access to their GPUs for better optimization.
Because they didn't need to in the first place. They're using 4090s, and it works. No need to give up on direct support and yell on twitter if there is not a problem in the first place.
Customer preference has also been clear: 5:1 in favour of it.
Reply
bit_user

edzieba said:
on the AMD side of things it's 'supported' until you have an actual problem, at which point 'support' becomes 'pulling teeth'.
If we're to take Hotz' word for it, yet we know he's given to exaggeration and creating drama.

edzieba said:
The problem wasn't that bugs weren't fixed "when Tiny demanded", the problem was AMD releasing a "here, we fixed the bugs" release that did not actually fix the bugs. That's what prompted the escalation in the first place!
If AMD didn't have Hotz' code to test with, then it's actually pretty hard to know if you truly fixed a bug. If you've never looked at a bug report, gone into the code, found and fixed what you thought was the cause, and then marked it as fixed... only for the reporter to inform you that the problem was not fixed, then I say you've never worked as a software developer. Fixing bugs is sometimes an iterative process, especially if we can't reproduce a bug in our own development environment.

edzieba said:
Consumers on the green side of the fence can expect support for AI training and inference on their non-enterprise cards.
It would be interesting to see how many bugs have been reported on CUDA, over the years, and compare how responsive Nvidia is at fixing them vs. AMD's current rate.
Reply
