The Qwen2.5-Math-PRM models, at 72B and 7B sizes, perform significantly better than open-source process reward models (PRMs) of the same size. The 7B Qwen2.5-Math-PRM even exceeds GPT-4o in its ability to identify reasoning errors. Alongside the models, the Qwen team presented and open-sourced ProcessBench, the first step-level evaluation benchmark: 3,400 mathematical problem test cases, each with a step-by-step reasoning trace annotated by human experts, for comprehensively assessing a model's ability to identify erroneous steps. On the ProcessBench evaluation, the 72B and 7B Qwen2.5-Math-PRM models hold a significant advantage; the 7B version surpasses not only same-sized open-source PRMs but also the closed-source GPT-4o-0806, opening new avenues for process supervision in reasoning.
Open-source links: https://github.com/QwenLM/Qwen2.5-Math
https://huggingface.co/Qwen/Qwen2.5-Math-PRM-72B
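ProcessBench tests whether a model can locate the earliest erroneous step in a reasoning trace. The sketch below illustrates that step-level scoring idea in miniature; the scorer here is a hypothetical stand-in (a real PRM such as Qwen2.5-Math-PRM would produce the per-step scores), and the function names and threshold are illustrative assumptions, not the official API.

```python
# Conceptual sketch of step-level error identification, as evaluated by
# ProcessBench: a process reward model (PRM) scores each reasoning step,
# and the first step scored below a threshold is flagged as the error.
from typing import Callable, List, Optional

def first_error_step(steps: List[str],
                     score_fn: Callable[[str], float],
                     threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`,
    or None if every step passes."""
    for i, step in enumerate(steps):
        if score_fn(step) < threshold:
            return i
    return None

# Toy scorer standing in for a real PRM: it simply penalizes steps
# containing an obviously wrong intermediate result.
def toy_scorer(step: str) -> float:
    return 0.1 if "2 + 2 = 5" in step else 0.9

trace = ["Let x = 2 + 2.", "So 2 + 2 = 5.", "Therefore x = 5."]
print(first_error_step(trace, toy_scorer))  # -> 1 (second step is wrong)
```

A real evaluation run would replace `toy_scorer` with per-step probabilities from the PRM and compare the flagged index against the human-annotated error step.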
