Update 11/1/18: Added thermal analysis of the GDDR6 memory modules and FLIR images.
UPDATE #2 11/1/18: Corrected incorrect image and text.
Product launches rarely go off without a hitch. But knowing that provides little comfort when you've spent $1,200 on Nvidia's flagship graphics card only to have it fail right away. That appears to be what's happening to numerous people who bought the RTX 2080 Ti at launch, forcing Nvidia to respond.
Digital Trends reported this week that several RTX 2080 Ti customers on Nvidia's support forum and Reddit have complained about various issues with their graphics cards. The exact problems vary: people have reported "crashes, black screens, blue screen of death issues, artifacts and cards that fail to work entirely," the report said. Nvidia's replacement cards seem to suffer from similar problems too.
We don't know exactly what is behind these reported problems. They don't appear to be limited to the RTX 2080 Ti Founders Edition card, as several people said they had problems with cards from Nvidia's add-in partners as well. That could indicate a problem with the underlying Turing architecture. It could also stem from manufacturing defects in the GPUs themselves, or from unrelated issues at each company.
But this is sure to leave a sour taste in people's mouths. It's one thing to spend more than a grand on a graphics card knowing that its full potential, whether sheer performance or broader support for new features like ray tracing and DLSS, won't be realized for years. It's another thing entirely when the card stops working shortly after launch. At that point, someone could've just bought a GTX 1080 Ti and been done with it.
Of course, all the usual caveats about online kvetching apply here as well. People experiencing these problems are more likely to post on a support forum or Reddit than people whose RTX 2080 Ti is working perfectly. Many of these claims haven't been verified, either, and if trolls can spread enough disinformation to influence elections, they can almost certainly sow doubt about Nvidia's flagship product.
That doesn't mean these complaints aren't legitimate, though. The RTX launch has been anything but smooth, with numerous delays to the RTX 2080 Ti's debut. Questions about which Turing card someone ought to get are darn near irrelevant if the flagship (and some RTX 2080 cards as well) is reportedly failing right out of the gate.
Nvidia, for its part, has treated these complaints like business as usual. The company told us that "it's not an increasing number of users" affected by this problem, saying "it's not broad." It added that "we are working with each user individually like we do always."
Like other product launches, this one got a little messy, which is no surprise. At least Nvidia appears to be standing by.
Is Memory to Blame?
Tom's Hardware Germany analyzed its infrared images of the RTX 2080 Ti reference card to investigate rumors that overheating Micron GDDR6 packages are causing the errors. Thermal measurements indicate that the M6 and M7 GDDR6 modules could run hot during an extended 100% load. These modules sit directly over internal power supply tracks embedded in the PCB, which run between the PWM nodes and the GPU socket. Because our image measures the PCB rather than the GDDR6 modules themselves, the modules could run hotter than shown, due to the high currents flowing through those embedded power tracks and heat migrating from the VRMs.
We don't know many of the technical details of Micron's GDDR6 packages, but we do know they have a maximum safe operating temperature of 95°C. The symptoms of the failures also seem to support the theory that overheated memory is to blame. Most of the complaints describe failures after the cards had been in use for some time, and in some cases, the cards even work correctly again after a cooling-off period. The problems also appear to be more prevalent on partner cards that use cheaper cooling solutions.