Nvidia GPU Failures Caused By Material Problem, Sources Claim

Source: Tom's Hardware US

Chicago (IL) - When Nvidia announced in early July that it had noticed a higher than normal failure rate in some of its notebook chips, investors reacted with concern, sending the company's stock down 22%. The stock recovered after Nvidia apparently demonstrated good control of the issue and announced a one-time charge of almost $200 million. But what seems to be a closed chapter and a black eye for the company could be a much more serious problem that is just taking off. Several industry sources confirmed to TG Daily what some publications have been reporting for some time: In contrast to Nvidia’s claims that only a limited number of GPUs are affected, sources indicated that "most" recent Nvidia GPUs carry the problem and a chance of failure, pushing the potential damage into stratospheric regions.

We have been chasing the Nvidia GPU problem for quite some time, trying to shed more light on an issue about which Nvidia refuses to release any meaningful information, other than the statement that a limited number of notebook GPUs are affected. Charlie Demerjian from The Inquirer has been reporting for some time that Nvidia’s problem may be much larger than the company admits. Demerjian wrote that, in addition to the notebooks currently being repaired, G84/G86 GPUs may show failures and even G92 and G94 chips could be affected. After several weeks of digging, it seems that Demerjian’s claims may not be as far from the truth as some have claimed. There is a lot of speculation in the market, fueled by Nvidia’s decision not to reveal any details about the source of the problem. But the general consensus across the industry sources we talked to is that a material problem may be the reason for the trouble and, depending on whom you believe, between 15 and 75 million GPUs could be affected.

According to our sources, the failures are caused by the solder bumps that connect the I/O terminations of the silicon chip to the pads on the substrate. In Nvidia’s GPUs, these solder bumps are made of high-lead solder. The thermal mismatch between the chip and the substrate has grown substantially in recent chip generations, apparently leading to fatigue cracking. Add into the equation a growing chip size (double the chip dimension, quadruple the stress on the bump) as well as generally hotter chips and you may have the perfect storm to push high-lead solder beyond its limits. Apparently, problems arise at what Nvidia claims to be "extreme temperatures" and what we hear may be temperatures not too much above 70 degrees Celsius.

The theory that high-lead solder bumps are in fact at fault is supported by Nvidia's order, in the last week of July, to switch immediately from high-lead to eutectic solders. Eutectic solders are believed to solve the problem of fatigue cracking. This material is often chosen in such cases because chip designers already have experience with it. Further out in the future, chip designers will have to consider RoHS exclusions and a transition to lead-free bumps using materials such as tin-silver. We are speculating here, but a sudden switch of the material could bring additional problems for Nvidia, as such a switch raises electro-migration concerns and requires substantial design work and testing. At a minimum, Nvidia would have to review its power delivery to the chip to avoid high-current bumps. We were not able to obtain any information on whether this has been done.

As far as we are told, ATI has been using eutectic solders for some time and appears not to be experiencing a similar problem. However, Nvidia’s sudden switch to eutectic solders may have limited the availability of the material, impacting AMD production and putting the chip fabs in the middle. There are questions about why Nvidia may have missed potential high-lead issues, and may have missed them for quite some time. There is no doubt that all Nvidia chips were tested according to JEDEC rules. Only Nvidia knows why this issue, if high-lead solder is actually the problem, slipped through.

If we assume for a moment that high-lead solder is the cause, then there is this question: Which chips are affected, and are only notebook GPUs affected? According to our sources, both desktop chips and notebook chips are affected, but the issue is most likely to pop up in notebook chips because of the greater material strain, amplified by repeated on-and-off cycling. We heard that G84, G86 and G92 GPUs could show failures, but we were not able to confirm G94s. Technically, Nvidia would have to replace all those GPUs, and the total number is somewhere north of 70 million. But since the issue tends to show up only in notebooks, it is unlikely that there will be any desktop replacements, and therefore we are looking at a number closer to 15 million (notebook) GPUs. Take into account that the repair of such a notebook will cost Nvidia at least $150-$250 and the damage could easily run into the billions of dollars.
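The damage estimate above is simple multiplication, and a short sketch makes the range explicit. The function name `repair_cost_range` is our own, and the unit counts and per-repair costs are the article's rough figures, not confirmed numbers. Applying the notebook repair cost to all 70 million GPUs is purely a hypothetical upper bound, since desktop replacements are considered unlikely.

```python
def repair_cost_range(units, cost_low, cost_high):
    """Return the (low, high) total repair cost in dollars for a number of units."""
    return units * cost_low, units * cost_high


# Figures quoted in the article (estimates from unnamed sources).
NOTEBOOK_GPUS = 15_000_000   # notebook GPUs likely to need repair
ALL_GPUS = 70_000_000        # all potentially affected GPUs
COST_LOW, COST_HIGH = 150, 250  # dollars per notebook repair

notebook_low, notebook_high = repair_cost_range(NOTEBOOK_GPUS, COST_LOW, COST_HIGH)
print(f"Notebooks only: ${notebook_low / 1e9:.2f}B to ${notebook_high / 1e9:.2f}B")

# Hypothetical upper bound: assumes every affected GPU costs as much to fix
# as a notebook repair, which the article does not claim.
all_low, all_high = repair_cost_range(ALL_GPUS, COST_LOW, COST_HIGH)
print(f"All GPUs (assumed same cost): ${all_low / 1e9:.2f}B to ${all_high / 1e9:.2f}B")
```

For the notebook-only case this gives roughly $2.25 billion to $3.75 billion, which brackets the $3 billion figure used later in the article.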

At this time we only know that Nvidia has made a switch from high-lead to eutectic solder; everything else is speculation as long as it is not confirmed by Nvidia. However, the level of detail relating to the material switch is surprising and lends a certain credibility to these sources.

The other question, of course, is how often and in which cases those GPUs actually fail. If Nvidia is right and failure rates are in fact low, then the $200 million that was allocated to repair affected notebooks should be appropriate. If we assume that Nvidia pays about $200 per repair and that 100% of the potential damage is in the neighborhood of $3 billion, then Nvidia’s $200 million allocation suggests that substantially less than 10% of (notebook) GPUs are showing failures.
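The "less than 10%" figure can be checked with a few lines of arithmetic. The sketch below uses only the article's assumed numbers ($200 million reserve, about $200 per repair, $3 billion total potential damage); the function name `implied_failure_rate` is ours, and the result says nothing about actual failure rates, only about what the size of the reserve implies.

```python
def implied_failure_rate(reserve, cost_per_repair, total_damage):
    """Fraction of at-risk units that a repair reserve can cover.

    All arguments are in dollars. The number of at-risk units is derived
    from the total potential damage divided by the per-repair cost.
    """
    repairs_funded = reserve / cost_per_repair
    units_at_risk = total_damage / cost_per_repair
    return repairs_funded / units_at_risk


# The article's assumptions, not confirmed figures.
rate = implied_failure_rate(
    reserve=200_000_000,         # Nvidia's one-time charge
    cost_per_repair=200,         # assumed cost per notebook repair
    total_damage=3_000_000_000,  # damage if every notebook GPU failed
)
print(f"Implied share of notebook GPUs failing: {rate:.1%}")
```

Under these assumptions the reserve pays for about one million repairs out of roughly 15 million notebook GPUs, or just under 7%.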

A big problem would arise if failure rates are in fact higher than expected and Nvidia is trying to contain the problem by playing it down, avoiding a massive recall that could inflict a lot of damage on the company’s finances: $3 billion is almost twice what Nvidia currently has in the bank.

So, what does this mean for you? Obviously, only Nvidia knows how serious the problem really is, and there is virtually no way of telling whether your Nvidia-based notebook with an affected GPU will show failures or not, as this will depend on the temperatures the GPU reaches. If it does show failures, however, you should contact your vendor and ask for a replacement, provided you are still covered by a warranty.

  • -5 Hide
    thatmymp5 , August 26, 2008 10:09 AM
    not new for me oredi cos i pick and stick on Intel Graphic Accelator Chipset on my Singtel's internet subscription Free of Charge Dell Latitude D430 instead of upgrading to D630! hee hee! is this call Karma? hee hee..
  • 1 Hide
    outlw6669 , August 26, 2008 10:25 AM
    ^
    You sir make my brain hurt.
  • 0 Hide
    Anonymous , August 26, 2008 10:55 AM
    I have a Asus G1S-A1 and my gpu temp goes over 100, then my computer shuts off, only on games like crysis though.
  • 1 Hide
    spaztic7 , August 26, 2008 12:30 PM
    I have a toaster...
  • 0 Hide
    crockdaddy , August 26, 2008 3:54 PM
    I have a credit card that I am still paying off from buying my 7800GTX from three years ago.
  • 0 Hide
    nekatreven , August 26, 2008 7:23 PM
    *moves things around to make more room around the gpu exhaust port*
  • 0 Hide
    Anonymous , August 26, 2008 11:18 PM
    i baught a failing brand new inno3d 8800gt 1gb cant even run a game at stock clocks with stock cooling ... lol
  • 0 Hide
    jhansonxi , August 27, 2008 1:28 AM
    I'm a cheapskate and recently bought a 7300GT. You bleed a lot less with trailing-edge technology.
  • 0 Hide
    Luscious , August 27, 2008 6:07 AM
    Too many assumptions Wolfgang!!! Give me more facts and less speculation. I hardly think The Inquirer is the most credible source out there, or the best tech-savvy publication.

    Article rant aside, this is just great! Now I can't buy any Nvidia-based laptop out there because they refuse to disclose the GPU's that are affected. Nvidia needs to get their act together otherwise it's HP/Dell that will end up with pissed-off customers and an RMA pile so high it'll need it's own warehouse.
  • 0 Hide
    Anonymous , November 10, 2008 9:06 AM
    Prescient comments Luscious. Check out this HP thread of many ticked off customers. http://forums11.itrc.hp.com/service/forums/bizsupport/questionanswer.do?admit=109447626+1226314225626+28353475&threadId=1274587 Also note the references to the class action suit. Do you see the posts containing the links to the law suit web site? No, of course not. Forum moderators have been busy deleting those posts!!! I would advise everyonr to steer clear of any Nvidia GPU's until this mess is resolved. ATI have a perfectly good product that has no known issues. One of the posters is a personal friend and she is going to let her now useless $2300 HP laptop rot on a shelf and purchase a Dell with an ATI GPU. I sincerely pity those people who can't afford that option.
  • 0 Hide
    efffect , March 9, 2009 10:59 PM
    Im would like to know if there is a chip compatible replacement for the laptop GPUs effected. I do laptop component level repairs and there will likely be quite a few of these units coming in after warranties have run out with a blown GPU, I wont be doing any dodgy work so I would only be repairing them if I can get a chip with improved reliability to replace it with.

    Ive seen the press release about the "NB8E-SET" which doesnt state if they are directly compatible with any previous chip.

    Any clues anyone?
  • 0 Hide
    Anonymous , July 10, 2011 3:34 PM
    u didnt read the information in the artice well effect theres noting worng with the gpu chip its the soldering thats fualty if u resolder the chip it should work and if u get cunning im sure u could solder it with the proper solder that should of been used in first place and if ur not that advanced try finding a way to put pressure on chip to hold down this may work u can also try heating chip up its some tricks if u know wat ur doing read about xboxes red ring of death and the way to fix it be very simular just that it will be tempory fix unless u use a better solder or cool it better