1AI has learned that TeleAI-t1-preview, the complex-reasoning large model from the Artificial Intelligence Institute of China Telecom (TeleAI), has been officially released and will soon be available on the Tianyi AI open platform. TeleAI-t1-preview is trained with reinforcement learning; by introducing thinking paradigms such as exploration and reflection, it dramatically improves the model's accuracy on complex problems such as logical reasoning and mathematical derivation.

According to the official announcement, TeleAI-t1-preview scored 60 on the American math competition benchmark AIME 2024 and 93.8 on the MATH500 math benchmark, outperforming models such as OpenAI o1-preview and GPT-4o by a wide margin. On the graduate-level question-answering test GPQA Diamond, TeleAI-t1-preview scored above GPT-4o and is on par with Claude 3.5 Sonnet.
The evaluation showed that, given a problem from the Nine Chapters on the Mathematical Art, TeleAI-t1-preview was able to comprehend the classical Chinese text, render it into modern Chinese, and then give the mathematical derivation and the answer.

It is reported that in this process TeleAI-t1-preview can combine figurative and abstract thinking to visualize the scenario involved and aid its understanding of the problem. Beyond that, it can also rigorously convert between ancient and modern units of measurement.
To ensure that the reasoning process is accurate and effective, TeleAI introduced several innovative training strategies:
- Data preparation phase: A high-quality reasoning dataset, centered on mathematics and supplemented with multidisciplinary data, was collected and constructed to ensure that the model can adapt to different types of reasoning tasks.
- Judge Model: A dedicated Judge Model was trained to analyze and assess the correctness of the model's long chains of thought, providing guidance for reflection and error correction (a minimal sketch of such step-level checking appears after this list).
- SFT (supervised fine-tuning) phase: MCTS (Monte Carlo Tree Search) is used to construct high-quality long-reasoning data, combining the per-step accuracy rate and the solution length to select the optimal complete path; this guarantees the accuracy of the final answer while effectively lengthening the chain of thought into a more fine-grained reasoning process (see the second sketch below). At the same time, the Judge Model analyzes low-correctness paths in the reasoning process and guides the model to reflect on and correct wrong reasoning steps, yielding high-quality chain-of-thought data for SFT training.
- Reinforcement learning phase: A Rule-based Reward Model was additionally constructed to provide sufficiently accurate reward signals, further enhancing the model's logical reasoning ability through online reinforcement learning algorithms (see the third sketch below).
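
The announcement does not disclose how the Judge Model is invoked. As a minimal sketch in Python, one plausible design has the judge score a growing prefix of the chain of thought and flag the first suspect step for reflection; every identifier here (`judge.score`, the threshold) is a hypothetical assumption, not TeleAI's API.

```python
# Hypothetical sketch of step-level checking with a Judge Model. The real
# TeleAI interface is not public; `judge.score` and the threshold are assumptions.
from typing import List

def first_faulty_step(judge, question: str, steps: List[str],
                      threshold: float = 0.5) -> int:
    """Return the index of the first reasoning step the judge deems likely
    wrong, or -1 if every step passes. `judge.score(text)` is assumed to
    return the probability that the partial chain of thought is correct."""
    context = question
    for i, step in enumerate(steps):
        context += "\n" + step
        if judge.score(context) < threshold:  # assumed judge API
            return i  # guide the model to reflect on and redo this step
    return -1
```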
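
Likewise, the exact criterion that combines per-step accuracy with solution length is not published. One illustrative reading is to score each complete MCTS rollout by its mean step accuracy plus a small length bonus, so accuracy dominates while longer, more fine-grained derivations win ties; the formula and the weight below are assumptions, not TeleAI's published rule.

```python
# Illustrative path selection over complete MCTS rollouts: accuracy dominates,
# and a small length bonus prefers more fine-grained derivations. The scoring
# formula and the 0.01 weight are assumptions, not TeleAI's published rule.
from dataclasses import dataclass
from typing import List

@dataclass
class Path:
    steps: List[str]            # reasoning steps of one complete rollout
    step_accuracy: List[float]  # estimated per-step correctness in [0, 1]

def select_optimal_path(paths: List[Path], length_bonus: float = 0.01) -> Path:
    def score(p: Path) -> float:
        if not p.step_accuracy:
            return float("-inf")  # ignore degenerate empty rollouts
        mean_acc = sum(p.step_accuracy) / len(p.step_accuracy)
        return mean_acc + length_bonus * len(p.steps)
    return max(paths, key=score)
```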
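
Finally, a rule-based reward model on math problems typically means verifiable checks rather than a learned scorer. The sketch below assumes a `\boxed{...}` answer convention and an exact-match rule; the specific rules, formats, and reward values are assumptions, not the published TeleAI design.

```python
# Illustrative rule-based reward for online RL on math problems: verifiable
# rules replace a learned reward model. The `\boxed{...}` convention, the
# rules, and the reward values are assumptions, not the published design.
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    # Rule 1: the response must expose a final answer in parseable form.
    match = re.search(r"\\boxed\{([^{}]+)\}", response)
    if match is None:
        return -1.0  # penalize malformed output
    # Rule 2: exact match against the reference answer earns full reward.
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

A reward of this kind would then score sampled responses inside an online reinforcement learning loop, matching the article's description of "sufficiently accurate reward signals".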