Apple says generative AI cannot think like a human - research paper pours cold water on reasoning models
Large reasoning models fail at complex tasks.

Apple researchers have tested advanced AI reasoning models, known as large reasoning models (LRMs), in controlled puzzle environments and found that while they outperform 'standard' large language models (LLMs) on moderately complex tasks, both fail completely as complexity increases.
The researchers from Apple, a company that is not exactly at the forefront of AI development, believe that current LRMs and LLMs have fundamental limits in their ability to generalize reasoning, or rather, to think the way humans do.
Apple researchers studied how advanced AI models — the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs — handle increasingly complex problem-solving tasks. They moved beyond standard math and coding benchmarks and designed controlled puzzle environments, such as Tower of Hanoi and River Crossing, where they could precisely adjust problem complexity. Their goal was to evaluate not just final answers but also the internal reasoning processes of these models, comparing them to standard large language models under equal computational conditions. Through the puzzles, they aimed to uncover the true strengths and fundamental limits of AI reasoning.
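The paper's actual test harness isn't reproduced here, but the appeal of puzzles like Tower of Hanoi is easy to see in code: a single parameter (the number of disks) sets the difficulty, and any proposed answer can be checked mechanically, move by move. Below is a minimal, illustrative Python sketch of such an environment; the function names are ours, not Apple's, and it is only meant to show how complexity can be dialed up while keeping grading objective.

```python
# Illustrative sketch only -- not Apple's harness. It shows how a Tower of Hanoi
# environment scales difficulty with one knob (number of disks) and how a model's
# proposed move list can be verified step by step.

def optimal_move_count(n_disks: int) -> int:
    """The minimum number of moves for n disks is 2^n - 1, so difficulty grows exponentially."""
    return 2 ** n_disks - 1

def is_valid_solution(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Replay a list of (source peg, target peg) moves and check that every step is
    legal and that all disks end up on the last peg."""
    pegs = [list(range(n_disks, 0, -1)), [], []]   # peg 0 starts with all disks, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                           # cannot move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                           # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # solved only if everything sits on peg 2

# Difficulty ramps quickly: 3 disks need 7 moves, 10 need 1,023, 20 need 1,048,575.
for n in (3, 10, 20):
    print(n, "disks ->", optimal_move_count(n), "moves minimum")
```

A checker like this is what lets researchers grade the reasoning process rather than just the final answer: the first illegal move in a model's output pinpoints exactly where its logic went off the rails.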
Apple researchers discovered that LRMs perform differently depending on problem complexity. On simple tasks, standard LLMs without explicit reasoning mechanisms were more accurate and efficient, delivering better results with fewer compute resources. However, as problem complexity increased to a moderate level, models equipped with structured reasoning, such as Chain-of-Thought prompting, gained the advantage and outperformed their non-reasoning counterparts. When complexity grew further, both types of models failed completely: their accuracy dropped to zero regardless of the available compute resources. (Keep in mind that the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs have limitations when it comes to their training.)
A deeper analysis of the reasoning traces revealed inefficiencies and unexpected behavior. Initially, reasoning models used longer thought sequences as problems became harder, but near the failure point, they surprisingly shortened their reasoning effort even when they had sufficient compute capacity left. Moreover, even when explicitly provided with correct algorithms, the models failed to reliably execute step-by-step instructions on complex tasks, exposing weaknesses in logical computation. The study also found that model performance varied significantly between familiar and less-common puzzles, suggesting that success often depended on training data familiarity rather than true generalizable reasoning skills.
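To make the "explicitly provided with correct algorithms" finding concrete: the classic recursive solution to Tower of Hanoi is only a few lines, so stating the algorithm is trivial; what blows up is faithfully executing it for hundreds or thousands of steps without a single ordering mistake. A hedged Python sketch of that algorithm (ours, not code from the paper):

```python
# Illustrative sketch -- the kind of "correct algorithm" a model can be handed.
# The recursion below produces the optimal move list, but carrying it out for,
# say, 15 disks means emitting 32,767 moves in exactly the right order.

def hanoi_moves(n_disks: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the optimal sequence of (source peg, target peg) moves for n disks."""
    if n_disks == 0:
        return []
    return (
        hanoi_moves(n_disks - 1, src, dst, aux)    # park the n-1 smaller disks on the auxiliary peg
        + [(src, dst)]                             # move the largest disk to the target peg
        + hanoi_moves(n_disks - 1, aux, src, dst)  # stack the n-1 disks back on top of it
    )

moves = hanoi_moves(4)
print(len(moves), "moves for 4 disks:", moves[:5], "...")   # 15 moves, starting (0, 1), (0, 2), ...
```

Knowing this recursion and reliably unrolling it are very different demands, which is the gap the study's reasoning traces appear to expose.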

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.
Konomi: Doesn't need a research paper to tell you that, just common sense. It isn't a human, therefore we shouldn't expect it to think like one. But I suppose people need to find some validation in their efforts to make AI more than a meme.
SomeoneElse23: Is that a POP we're hearing from the AI bubble?
"AI", as it is today, is now basically Search 2.0.
Kindaian: That mimics precisely my experience.
When I ask simple and common things, the LLM answers with an adequate response. When I ask more complex and niche things, the response is just random garble.
Also, for code / programming, it is not only quite useless, but also dangerous. The samples they base their answers on often carry disclaimers like "don't use this in prod", but the LLM won't understand what that means and will just give you something full of vulnerabilities.
To add insult to injury, the LLMs don't reason about the code they are building. They will not use best practices or patterns, or consider code re-usability or maintenance. They won't optimize for speed, memory footprint, or reliability. They will just give you an answer, and if it happens to work, consider it a bizarre coincidence.
Otherwise you are shooting yourself in the foot!
With regard to the datasets used for training, and considering that people are using the regurgitated output of LLMs to "create" more content, the signal-to-noise ratio will only degrade over time, which means the LLMs will become worse, not better, over time.
shady28: This jibes with the Atari 2600 beating ChatGPT 4.0 at beginner-level chess.
From my experience, AI is more like a powerful data aggregator; that is to say, it matches patterns, cross-indexes things, and puts together something that appears correct and accurate (but it may not be).
This does make it an incredibly powerful search engine, able to not only find what you're looking for but pull related information from multiple sources into a more complete consolidated answer.
However, I've seen many instances where, if the same question is asked in two different ways, a different result comes up because the model decided the question was better answered by one source vs. another.
It's a far cry from intelligence able to solve a complex problem that hasn't been solved before or answer a question that hasn't been asked before, and any answer is somewhat dubious if it's used for anything important.
SomeoneElse23: I've had the same experience.
Kindaian said: "Also, for code / programming, it is not only quite useless, but also dangerous. [...]"
I stopped using ChatGPT for coding help after it lied to me 3 times about code that simply did not work.
Since they put so much effort into making "AI" friendly, they should at least change the programming-related answers to include a disclaimer:
"I think this answer may be correct. But it may not be. It may have serious flaws or bad practices in it. Use at your own risk."
baboma: >I stopped using ChatGPT for coding help after it lied to me 3 times about code that simply did not work.
To channel Jobs, you're doing it wrong. Don't use ChatGPT for coding help. (Disclaimer: Yes, I also tried.)
There are dedicated code-help AIs and best practices to avail yourself of. Read what the pros and experts are saying. There is lots of good advice to learn from. Here's one, off the cuff:
https://fly.io/blog/youre-all-nuts/
A Stoner: I thought this was widely known.
The circular logic that happens when I try to discuss anything complex with LLMs is unbelievable. Then again, after talking to some humans, I think Apple is giving too many humans too much credit. Some of them seem to be just as programmed and incapable of real thought as the LLMs are.