June 8 news: Apple's machine learning research team published a research paper on June 6, local time, arguing that existing AI models do not genuinely think or reason, but instead rely on pattern matching and memorization, especially on complex tasks.

Apple researchers systematically evaluated existing cutting-edge "large reasoning models" such as OpenAI o3-mini, DeepSeek-R1, Anthropic's Claude 3.7 Sonnet Thinking, and Google Gemini Thinking.
The study found that while these models can generate detailed "chains of thought" and show strengths on medium-complexity tasks, their reasoning ability has a fundamental limitation: once problem complexity exceeds a certain threshold, model performance collapses completely to "zero accuracy".
In addition, the number of tokens the models spend on "thinking" during inference actually decreases as difficulty increases, even when ample token budget remains, a phenomenon that points to a fundamental limitation of current reasoning approaches.
This article, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models through the Lens of Problem Complexity" by Parshin Shojaee et al. The study shows that current industry evaluations of these models focus on mathematical and programming benchmarking, focusing on the accuracy of the final answer, but this tends to ignore the problem of data contamination and fails to provide insight into the structure and quality of internal reasoning trajectories.
The researchers instead employed a series of controlled puzzle-solving environments that allow precise manipulation of compositional complexity while keeping the logical structure of the task constant. This made it possible not only to analyze final answers but also to examine the internal reasoning traces, giving a deeper view of how these models "think".
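To illustrate what such a controlled environment can look like, below is a minimal Python sketch of a Tower of Hanoi setup in which the number of disks serves as the single complexity knob and a verifier replays a model-proposed move sequence. This is an assumption-laden sketch for illustration only, not the paper's actual evaluation harness; the function names and interface are hypothetical.

```python
# Minimal sketch of a controlled puzzle environment: Tower of Hanoi.
# The number of disks n is the single complexity knob; the logical
# structure of the task stays the same as n grows.

def initial_state(n_disks):
    """Three pegs; all disks start on peg 0, largest at the bottom."""
    return [list(range(n_disks, 0, -1)), [], []]

def is_valid_move(state, src, dst):
    """A move is valid if src is non-empty and its top disk is smaller
    than the top disk on dst (or dst is empty)."""
    if not state[src]:
        return False
    return not state[dst] or state[src][-1] < state[dst][-1]

def verify_solution(n_disks, moves):
    """Replay a model-proposed move sequence and check whether it
    solves the puzzle. Returns (solved, moves_executed_before_error)."""
    state = initial_state(n_disks)
    for i, (src, dst) in enumerate(moves):
        if not is_valid_move(state, src, dst):
            return False, i
        state[dst].append(state[src].pop())
    solved = state[2] == list(range(n_disks, 0, -1))
    return solved, len(moves)

# Example: sweep complexity by increasing the disk count and score
# whatever move sequence a model returns for each instance.
if __name__ == "__main__":
    optimal = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]  # n = 3
    print(verify_solution(3, optimal))  # (True, 7)
```

Because the verifier checks every intermediate move rather than only the final answer, a setup like this can expose where in the reasoning trace a model goes wrong, which is the kind of analysis the authors say final-answer benchmarks cannot provide.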
The research team suggests that model performance can be divided into three stages:
- Low-complexity tasks: standard large models (e.g., the non-thinking version of Claude 3.7) perform better;
- Medium-complexity tasks: Large Reasoning Models (LRMs) with thinking mechanisms hold the advantage;
- High-complexity tasks: both types of models collapse into complete failure.
In particular, the study found that LRMs are limited in performing exact computation, fail to use explicit algorithms, and reason inconsistently across different puzzles.
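For context on what "using an explicit algorithm" would mean in this setting, the classic recursive procedure below enumerates the optimal Tower of Hanoi move sequence (2^n - 1 moves). This is a standard textbook algorithm shown purely for illustration, assuming Tower of Hanoi as the example puzzle; it is not code from the paper.

```python
def hanoi_moves(n, src=0, dst=2, aux=1):
    """Yield the optimal (src, dst) move sequence for n disks.
    The sequence length grows as 2**n - 1, which is what makes the
    disk count a clean complexity knob."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, dst, src)

# For n = 3 this yields the 7-move sequence used in the verifier sketch above.
print(list(hanoi_moves(3)))
# [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
```

The point of the finding is that a short, mechanical procedure like this solves instances of any size, yet the evaluated models still collapse once the required move sequence grows long, which is why the authors read the failure as a limit of the reasoning process rather than of task knowledge.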
Overall, the study not only questions the current paradigm of evaluating LRMs on established mathematical benchmarks, but also highlights the need for more carefully controlled experimental setups. Through its controlled puzzle environments, it offers insight into the capabilities and limitations of language-based reasoning models and points the way for future research.
According to the researchers, "These findings highlight the strengths and limitations of existing LRMs, raising questions about the nature of reasoning in these systems that have important implications for their design and deployment."
References: