Researchers Chart Alarming Decline in ChatGPT Response Quality

ChatGPT quality declines
(Image credit: Future)

In recent months there has been a groundswell of anecdotal evidence and general murmurings concerning a decline in the quality of ChatGPT responses. A team of researchers from Stanford and UC Berkeley decided to determine whether there was indeed degradation and come up with metrics to quantity the scale of detrimental change. To cut a long story short, the dive in ChatGPT quality certainly wasn't imagined.

Three distinguished academics, Matei Zaharia, Lingjiao Chen, and James Zou, were behind the recently published research paper How Is ChatGPT's Behavior Changing Over Time? (PDF) Earlier today, Computer Science Professor at UC Berkeley, Zaharia, took to Twitter to share the findings. He startlingly highlighted that "GPT -4's success rate on 'is this number prime? think step by step' fell from 97.6% to 2.4% from March to June."

GPT-4 became generally available about two weeks ago and was championed by OpenAI as its most advanced and capable model. It was quickly released to paying API developers, claiming it could power a range of new innovative AI products. Therefore, it is sad and surprising that the new study finds it so wanting of quality responses in the face of some pretty straightforward queries.

We have already given an example of GPT-4's superlative failure rate in the above prime number queries. The research team designed tasks to measure the following qualitative aspects of ChatGPT's underlying large language models (LLMs) GPT-4 and GPT-3.5. Tasks fall into four categories, measuring a diverse range of AI skills while being relatively simple to evaluate for performance.

  • Solving math problems
  • Answering sensitive questions
  • Code generation
  • Visual reasoning

An overview of the performance of the Open AI LLMs is provided in the chart below. The researchers quantified GPT-4 and GPT-3.5 releases across their March 2023 and June 2023 releases.

(Image credit: Matei Zaharia, Lingjiao Chen, James Zou)

It is clearly illustrated that the "same" LLM service answers queries quite differently over time. Significant differences are seen over this relatively short period. It remains unclear how these LLMs are updated and if changes to improve some aspects of their performance can negatively impact others. See how much 'worse' the newest version of GPT-4 is compared to the March version in three testing categories. It only enjoys a win of a small margin in visual reasoning.

(Image credit: Matei Zaharia, Lingjiao Chen, James Zou)

Some may be unbothered about the variable quality observed in the 'same versions' of these LLMs. However, the researchers note, "Due to the popularity of ChatGPT, both GPT-4 and GPT-3.5 have been widely adopted by individual users and a number of businesses." Therefore, it isn't beyond the bounds of possibility that some GPT-generated information can affect your life.

The researchers have voiced their intent to continue to assess GPT versions in a longer study. Perhaps Open AI should monitor and publish its own regular quality checks for its paying customers. If it can't be clearer about this, it may be necessary for business or governmental organizations to keep an check on some basic quality metrics for these LLMs, which can have significant commercial and research impacts.

AI and LLM tech isn't a stranger to surprising issues, and with the industry's data pilfering claims and other PR quagmires, it currently seems to be the latest 'wild west' frontier on connected life and commerce.

Mark Tyson
News Editor

Mark Tyson is a news editor at Tom's Hardware. He enjoys covering the full breadth of PC tech; from business and semiconductor design to products approaching the edge of reason.

  • PlaneInTheSky
    I wrote a few months ago that AI gets worse over time, because it has to resort to less and less reliable data to train on.

    Machine learning needs so much data because it has no idea what it's reading. The Chinese room argument can easily prove this, and also proves how flawed the Turing test is.

    Because of the complete lack of intelligence involved in machine learning, it can not interpret, and has to go through tons of data before it will react with any consistency.

    Eventually it will unconsciously start to train on preexisting AI data and you get a similar effect to that game you played as a kid, where 1 person is told to tell the next person a sentence, and then the next person,'ll know it's a very unreliable way to transmit data because bits and pieces get left out or added. That's what AI eventually does, and it becomes incredibly unreliable.

    Few seem to be aware that AI gets worse over time. Researchers won't tell you because they will lose their funding, nor will hardware companies.

    An article on the Atlantic finally spoke about it.

    "The language model they tested, too, completely broke down. The program at first fluently finished a sentence about English Gothic architecture, but after nine generations of learning from AI-generated data, it responded to the same prompt by spewing gibberish: “architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.” For a machine to create a functional map of a language and its meanings, it must plot every possible word, regardless of how common it is. “In language, you have to model the distribution of all possible words that may make up a sentence,” Papernot said. “Because there is a failure over multiple generations of models, it converges to outputting nonsensical sequences.”

    In other words, the programs could only spit back out a meaningless average—like a cassette that, after being copied enough times on a tape deck, sounds like static. As the science-fiction author Ted Chiang has written, if ChatGPT is a condensed version of the internet, akin to how a JPEG file compresses a photograph, then training future chatbots on ChatGPT’s output is “the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.”
  • peachpuff
    Nothing wrong with that, it's just learning to be as dumb as it's users.
  • bigdragon
    I wonder how much impact various data scraping countermeasures are having on GPT and its AI siblings. I know a lot of artists who have put subtle noise filters on their art and some angry developers who spread unrelated or bogus code on question answer sites. A lot of people said no to having their data or contributions taken for AI without attribution or compensation. Many have been working on fighting back.

    From a certain point of view, AI can simultaneously be improving -- as in not becoming "dumber" -- while also providing worse outcomes resulting from sucking up inferior data.
  • RedBear87
    PlaneInTheSky said:
    I wrote a few months ago that AI gets worse over time, because it has to resort to less and less reliable data to train on.
    I don't think that's the issue here, unless they've silently introduced a ChatGPT 4.5 or something, the same model should be able to answer questions about prime number identifications with similar accuracy over time. It's more likely that this is a side effect of the guardrails that they keep adding on top of the model to keep the moralists and censors of our own Dark Ages satisfied.
  • brandonjclark
    I literally cancelled my OpenAI GPTPlus subscription due to my now-confirmed hypothesis it was being nuetered.
  • USAFRet
    The intern in my office vehemently, strongly defends his use of ChatGPT.
    (banned in the office, but he uses it for a LOT of other things)

    Taking a Biology class, and for the final, has to write a paper.
    One of the office denizens asked..."So you're going to use the ChatGPT?"
    'Absolutely! 1000%!'

    "So, you're not learning anything about biology, just learning how to ask the questions to ChatGPT, to produce a semi crappy answer."

    'Yes, but I still get credit for the class'

    The rest of us laughed.
  • InvalidError
    The joys of generalistic AI: you train it on a presumably curated known-good data set and when a new iteration of the AI regresses between generations, you have no clue why because it is all just billions parameters in matrices where most parameters have no specific link to any single thing.
  • oofdragon
    The only time I tried chatgpt it couldn't add numbers correctly.. I mean really, the most simple thing. But of course it's dmb, It was created being fed with left wing chicken donuts to push the agenda, what did you guys expect
  • bit_user
    Thanks for covering this. Please keep an eye on this important story.
  • vehekos
    It's the return of Lysenkoism. When it is forced to adhere to a political agenda, is forced to abandon reason.