Apple says generative AI cannot think like a human - research paper pours cold water on reasoning models
Large reasoning models fail at complex tasks.

Apple researchers have tested advanced AI reasoning models, known as large reasoning models (LRMs), in controlled puzzle environments and found that while they outperform 'standard' large language models (LLMs) on moderately complex tasks, both fail completely as complexity increases.
The researchers from Apple, a company that is not exactly at the forefront of AI development, believe that current LRMs and LLMs have fundamental limits in their ability to generalize reasoning, or rather, to think the way humans do.
Apple researchers studied how advanced AI models — the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs — handle increasingly complex problem-solving tasks. They moved beyond standard math and coding benchmarks and designed controlled puzzle environments, such as Tower of Hanoi and River Crossing, where they could precisely adjust problem complexity. Their goal was to evaluate not just final answers but also the internal reasoning processes of these models, comparing them to standard large language models under equal computational conditions. Through the puzzles, they aimed to uncover the true strengths and fundamental limits of AI reasoning.
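The paper's actual test harness isn't reproduced here, but the appeal of puzzles like Tower of Hanoi is easy to see in code: a single parameter (the number of disks) sets the difficulty, and any proposed answer can be checked mechanically, move by move. Below is a minimal, illustrative Python sketch of such an environment; the function names are ours, not Apple's, and it is only meant to show how complexity can be dialed up while keeping grading objective.

```python
# Illustrative sketch only -- not Apple's harness. It shows how a Tower of Hanoi
# environment scales difficulty with one knob (number of disks) and how a model's
# proposed move list can be verified step by step.

def optimal_move_count(n_disks: int) -> int:
    """The minimum number of moves for n disks is 2^n - 1, so difficulty grows exponentially."""
    return 2 ** n_disks - 1

def is_valid_solution(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Replay a list of (source peg, target peg) moves and check that every step is
    legal and that all disks end up on the last peg."""
    pegs = [list(range(n_disks, 0, -1)), [], []]   # peg 0 starts with all disks, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                           # cannot move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                           # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))  # solved only if everything sits on peg 2

# Difficulty ramps quickly: 3 disks need 7 moves, 10 need 1,023, 20 need 1,048,575.
for n in (3, 10, 20):
    print(n, "disks ->", optimal_move_count(n), "moves minimum")
```

A checker like this is what lets researchers grade the reasoning process rather than just the final answer: the first illegal move in a model's output pinpoints exactly where its logic went off the rails.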
Apple researchers discovered that LRMs perform differently depending on problem complexity. On simple tasks, standard LLMs without explicit reasoning mechanisms were more accurate and efficient, delivering better results with fewer compute resources. However, as problem complexity increased to a moderate level, models equipped with structured reasoning, such as Chain-of-Thought prompting, gained the advantage and outperformed their non-reasoning counterparts. When complexity grew further, both types of models failed completely: their accuracy dropped to zero regardless of the available compute resources. (Keep in mind that the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs have limitations when it comes to their training.)
A deeper analysis of the reasoning traces revealed inefficiencies and unexpected behavior. Initially, reasoning models used longer thought sequences as problems became harder, but near the failure point, they surprisingly shortened their reasoning effort even when they had sufficient compute capacity left. Moreover, even when explicitly provided with correct algorithms, the models failed to reliably execute step-by-step instructions on complex tasks, exposing weaknesses in logical computation. The study also found that model performance varied significantly between familiar and less-common puzzles, suggesting that success often depended on training data familiarity rather than true generalizable reasoning skills.
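To make the "explicitly provided with correct algorithms" finding concrete: the classic recursive solution to Tower of Hanoi is only a few lines, so stating the algorithm is trivial; what blows up is faithfully executing it for hundreds or thousands of steps without a single ordering mistake. A hedged Python sketch of that algorithm (ours, not code from the paper):

```python
# Illustrative sketch -- the kind of "correct algorithm" a model can be handed.
# The recursion below produces the optimal move list, but carrying it out for,
# say, 15 disks means emitting 32,767 moves in exactly the right order.

def hanoi_moves(n_disks: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the optimal sequence of (source peg, target peg) moves for n disks."""
    if n_disks == 0:
        return []
    return (
        hanoi_moves(n_disks - 1, src, dst, aux)    # park the n-1 smaller disks on the auxiliary peg
        + [(src, dst)]                             # move the largest disk to the target peg
        + hanoi_moves(n_disks - 1, aux, src, dst)  # stack the n-1 disks back on top of it
    )

moves = hanoi_moves(4)
print(len(moves), "moves for 4 disks:", moves[:5], "...")   # 15 moves, starting (0, 1), (0, 2), ...
```

Knowing this recursion and reliably unrolling it are very different demands, which is the gap the study's reasoning traces appear to expose.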

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.
Konomi: Doesn't need a research paper to tell you that, just common sense. It isn't a human, therefore we shouldn't expect it to think like one. But I suppose people need to find some validation in their efforts to make AI more than a meme.
SomeoneElse23: Is that a POP we're hearing from the AI bubble?
"AI", as it is today, is now basically Search 2.0.
Kindaian: That mimics precisely my experience.
When I ask simple and common things, the LLM answers with an adequate response. When I ask more complex and niche things, the response is just random garble.
Also, for code / programming, it is not only quite useless, but also dangerous. The samples they base their answers on often carry disclaimers like "don't use this in prod", but the LLM won't understand what that means and will just give you something full of vulnerabilities.
To add insult to injury, the LLMs don't reason about the code they are building. They will not use best practices or patterns, or consider code re-usability or maintenance. They won't optimize for speed, memory footprint, or reliability. They will just give you an answer, and if it happens to work, consider it a bizarre coincidence.
Otherwise you are shooting yourself in the foot!
With regard to the datasets used for training, and considering that people are using the regurgitated output of LLMs to "create" more content, the signal-to-noise ratio will only degrade over time, which means the LLMs will become worse, not better, over time.
shady28: This jibes with the Atari 2600 beating ChatGPT 4.0 at beginner-level chess.
From my experience, AI is more like a powerful data aggregator; that is to say, it matches patterns, cross-indexes things, and puts together something that appears correct and accurate (but it may not be).
This does make it an incredibly powerful search engine, able to not only find what you're looking for but pull related information from multiple sources into a more complete consolidated answer.
However, I've seen many instances where, if the same question is asked in two different ways, a different result comes up because the model decided the question was better answered by one source vs. another.
It's a far cry from intelligence able to solve a complex problem that hasn't been solved before or answer a question that hasn't been asked before, and any answer is somewhat dubious if it's used for anything important.
SomeoneElse23: I've had the same experience.
Kindaian said: "Also, for code / programming, it is not only quite useless, but also dangerous. [...]"
I stopped using ChatGPT for coding help after it lied to me 3 times about code that simply did not work.
Since they put so much effort into making "AI" friendly, they should at least change the programming-related answers to include a disclaimer:
"I think this answer may be correct. But it may not be. It may have serious flaws or bad practices in it. Use at your own risk."
baboma: >I stopped using ChatGPT for coding help after it lied to me 3 times about code that simply did not work.
To channel Jobs, you're doing it wrong. Don't use ChatGPT for coding help. (Disclaimer: Yes, I also tried.)
There are dedicated code-help AIs and best practices to avail yourself of. Read what the pros and experts are saying. There is lots of good advice to learn from. Here's one, off the cuff:
https://fly.io/blog/youre-all-nuts/
A Stoner: I thought this was widely known.
The circular logic that happens when I try to discuss anything complex with LLMs is unbelievable. Then again, after talking to some humans, I think Apple is giving too many humans too much credit. Some of them seem to be just as programmed and incapable of real thought as the LLMs are.