June 8 news: Apple's machine learning research team published a research paper on June 6, local time, arguing that existing AI models do not genuinely think or reason, but instead rely on pattern matching and memorization, especially on complex tasks.

Apple researchers systematically evaluated existing cutting-edge "large reasoning models" such as OpenAI o3-mini, DeepSeek-R1, Anthropic's Claude 3.7 Sonnet Thinking, and Google Gemini Thinking.
The study found that while these models can generate detailed "chains of thought" and show strengths on medium-complexity tasks, their reasoning ability has a fundamental limitation: once problem complexity exceeds a certain threshold, model performance collapses completely to "zero accuracy".
In addition, the number of tokens the models spend on "thinking" during inference actually decreases as difficulty increases, even when ample token budget remains, a phenomenon that points to a fundamental limitation of current reasoning approaches.
This article, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models through the Lens of Problem Complexity" by Parshin Shojaee et al. The study shows that current industry evaluations of these models focus on mathematical and programming benchmarking, focusing on the accuracy of the final answer, but this tends to ignore the problem of data contamination and fails to provide insight into the structure and quality of internal reasoning trajectories.
The researchers instead employed a series of controlled puzzle-solving environments that allow precise manipulation of compositional complexity while keeping the logical structure of the task constant. This made it possible not only to analyze final answers but also to examine the internal reasoning traces, giving a deeper view of how these models "think".
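To illustrate what such a controlled environment can look like, below is a minimal Python sketch of a Tower of Hanoi setup in which the number of disks serves as the single complexity knob and a verifier replays a model-proposed move sequence. This is an assumption-laden sketch for illustration only, not the paper's actual evaluation harness; the function names and interface are hypothetical.

```python
# Minimal sketch of a controlled puzzle environment: Tower of Hanoi.
# The number of disks n is the single complexity knob; the logical
# structure of the task stays the same as n grows.

def initial_state(n_disks):
    """Three pegs; all disks start on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]

def is_valid_move(state, src, dst):
    """A move is valid if src is non-empty and its top disk is smaller
    than the top disk on dst (or dst is empty)."""
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def verify_solution(n_disks, moves):
    """Replay a model-proposed move sequence and check whether it
    solves the puzzle. Returns (solved, moves_executed_before_error)."""
    state = initial_state(n_disks)
    for i, (src, dst) in enumerate(moves):
        if not is_valid_move(state, src, dst):
            return False, i
        state[dst].append(state[src].pop())
    solved = state[2] == list(range(n_disks, 0, -1))
    return solved, len(moves)

# Example: sweep complexity by increasing the disk count and score
# whatever move sequence a model returns for each instance.
if __name__ == "__main__":
    optimal = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]  # n = 3
    print(verify_solution(3, optimal))  # (True, 7)
```

Because the verifier checks every intermediate move rather than only the final answer, a setup like this can expose where in the reasoning trace a model goes wrong, which is the kind of analysis the authors say final-answer benchmarks cannot provide.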
The research team suggests that model performance can be divided into three stages:
- Low-complexity tasks: standard large models (e.g., the non-thinking version of Claude 3.7) perform better;
- Medium-complexity tasks: Large Reasoning Models (LRMs) with thinking mechanisms hold the advantage;
- High-complexity tasks: both types of models collapse into complete failure.
In particular, the study found that LRMs are limited in performing exact computation, fail to use explicit algorithms, and reason inconsistently across different puzzles.
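For context on what "using an explicit algorithm" would mean in this setting, the classic recursive procedure below enumerates the optimal Tower of Hanoi move sequence (2^n - 1 moves). This is a standard textbook algorithm shown purely for illustration, assuming Tower of Hanoi as the example puzzle; it is not code from the paper.

```python
def hanoi_moves(n, src=0, dst=2, aux=1):
    """Yield the optimal (src, dst) move sequence for n disks.
    The sequence length grows as 2**n - 1, which is what makes the
    disk count a clean complexity knob."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, dst, src)

# For n = 3 this yields the 7-move sequence used in the verifier sketch above.
print(list(hanoi_moves(3)))
# [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
```

The point of the finding is that a short, mechanical procedure like this solves instances of any size, yet the evaluated models still collapse once the required move sequence grows long, which is why the authors read the failure as a limit of the reasoning process rather than of task knowledge.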
Overall, the study not only questions the current paradigm of evaluating LRMs on established mathematical benchmarks, but also highlights the need for more carefully controlled experimental setups. Through its controlled puzzle environments, it offers insight into the capabilities and limitations of language-based reasoning models and points the way for future research.
According to the researchers, "These findings highlight the strengths and limitations of existing LRMs, raising questions about the nature of reasoning in these systems that have important implications for their design and deployment."
References: