{"id":26215,"date":"2025-01-04T13:50:16","date_gmt":"2025-01-04T05:50:16","guid":{"rendered":"https:\/\/www.1ai.net\/?p=26215"},"modified":"2025-01-04T13:50:16","modified_gmt":"2025-01-04T05:50:16","slug":"ai%e7%bc%96%e7%a8%8b%e8%83%bd%e5%8a%9b%e5%93%aa%e5%ae%b6%e5%bc%ba%ef%bc%9f%e9%98%bf%e9%87%8c%e9%80%9a%e4%b9%89%e5%8d%83%e9%97%ae-qwen-%e6%8e%a8-codeelo-%e5%9f%ba%e5%87%86%ef%bc%8copenai-o1-mini","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/26215.html","title":{"rendered":"Which AI Has the Strongest Programming Ability? Alibaba Tongyi Qianwen (Qwen) Launches the CodeElo Benchmark; OpenAI o1-mini Takes First Place, Beating 90% of Human Programmers"},"content":{"rendered":"<p>On January 4, 2025, Alibaba's <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e9%80%9a%e4%b9%89%e5%8d%83%e9%97%ae\" title=\"View articles tagged with Tongyi Qianwen\" target=\"_blank\" >Tongyi Qianwen<\/a> (Qwen) team released the CodeElo benchmark, which measures the competitive-programming ability of large language models (LLMs) with an Elo rating system directly comparable to that of human <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e7%a8%8b%e5%ba%8f%e5%91%98\" title=\"View articles tagged with programmer\" target=\"_blank\" >programmers<\/a>.<\/p>\n<p><strong>Project Background<\/strong><\/p>\n<p>Generating and completing code is one of the main application scenarios for large language models, but accurately assessing their real programming ability remains challenging at this stage.<\/p>\n<p>Existing benchmarks, including LiveCodeBench and USACO, have limitations: they lack robust private test cases, do not support special judging mechanisms, and often rely on inconsistent execution environments.<\/p>\n<p><strong>CodeElo: Leveraging CodeForces for a More Accurate LLM Evaluation System<\/strong><\/p>\n<p>To address these challenges, the Qwen research team has <strong>introduced the CodeElo benchmark, designed to assess LLMs' competitive-programming level using an Elo 
rating system that is directly comparable to human programmers' ratings.<\/strong><\/p>\n<p>CodeElo's problems come from the CodeForces platform, which is known for its rigorous programming competitions. By submitting solutions directly to CodeForces, CodeElo ensures accurate evaluation, avoids issues such as false positives, and supports problems that require special judging mechanisms. In addition, because its Elo ratings mirror CodeForces' human rankings, the performance of LLMs and human contestants can be compared directly.<\/p>\n<p><strong>CodeElo's three core elements: comprehensiveness, robustness and standardization<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-26216\" title=\"02dcafa2j00spjuu50032d000sg009vp\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/01\/02dcafa2j00spjuu50032d000sg009vp.jpg\" alt=\"02dcafa2j00spjuu50032d000sg009vp\" width=\"1024\" height=\"355\" \/><\/p>\n<p>CodeElo is built on three key elements:<\/p>\n<ul>\n<li><strong>Comprehensive problem selection.<\/strong> Problems are categorized by contest division, difficulty level, and algorithm tag to provide a well-rounded assessment.<\/li>\n<li><strong>Robust evaluation.<\/strong> Submitted code is tested directly on the CodeForces platform, using its special judging mechanisms to ensure accurate verdicts; this removes the need for hidden test cases and provides reliable feedback.<\/li>\n<li><strong>Standardized rating calculation.<\/strong> The Elo rating system evaluates code correctness, accounts for problem difficulty, and penalizes errors to incentivize high-quality solutions, providing a careful and effective tool for evaluating coding models.<\/li>\n<\/ul>\n<p><strong>Test Results<\/strong><\/p>\n<p>After testing 30 open-source LLMs and 3 proprietary LLMs, OpenAI's o1-mini model performed best, with an Elo score of 1578, outperforming 90% of human participants; among the 
open-source models, QwQ-32B-Preview topped the list with a score of 1261.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-26217\" title=\"3b7e3b9dj00spjuum00bpd000sg00mop\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/01\/3b7e3b9dj00spjuum00bpd000sg00mop.jpg\" alt=\"3b7e3b9dj00spjuum00bpd000sg00mop\" width=\"1024\" height=\"816\" \/><\/p>\n<p>However, many of the models still struggled even with simple problems and typically ranked in the bottom 20% relative to human participants. The analysis showed that the models performed well in categories such as mathematics and implementation, but fell short in dynamic programming and tree algorithms.<\/p>\n<p>In addition, the models perform better when coding in C++, which matches the preferences of competitive programmers. These results highlight areas where LLMs still need improvement.<\/p>","protected":false},"excerpt":{"rendered":"<p>On January 4, Alibaba's Tongyi Qianwen (Qwen) team launched the CodeElo benchmark to assess the programming level of large language models (LLMs) by comparing them with human programmers under an Elo rating system. Project Background Generating and completing code is one of the main application scenarios for large language models, but accurately assessing their real programming ability remains challenging at this stage. Existing benchmarks, including LiveCodeBench and USACO, have limitations: they lack robust private test cases, do not support special judging mechanisms, and often rely on inconsistent execution environments. 
CodeElo: Leveraging CodeForces for a More Accurate LLM Evaluation System Note: Qwen<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[1124,1410,331],"collection":[],"class_list":["post-26215","post","type-post","status-publish","format-standard","hentry","category-news","tag-ai","tag-1410","tag-331"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/26215","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=26215"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/26215\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=26215"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=26215"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=26215"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=26215"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}