{"id":32935,"date":"2025-04-14T10:40:11","date_gmt":"2025-04-14T02:40:11","guid":{"rendered":"https:\/\/www.1ai.net\/?p=32935"},"modified":"2025-04-14T10:40:11","modified_gmt":"2025-04-14T02:40:11","slug":"ai%e6%8e%a8%e7%90%86%e6%a8%a1%e5%9e%8b%e5%85%b4%e8%b5%b7%ef%bc%8c%e5%9f%ba%e5%87%86%e6%b5%8b%e8%af%95%e6%88%90%e6%9c%ac%e9%a3%99%e5%8d%87","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/32935.html","title":{"rendered":"Benchmarking Costs Soar as AI 'Reasoning' Models Emerge"},"content":{"rendered":"<p>As artificial intelligence (<a href=\"https:\/\/www.1ai.net\/en\/tag\/ai\" title=\"[View articles tagged with [AI]]\" target=\"_blank\" >AI<\/a>) technology continues to evolve, so-called \u201c<a href=\"https:\/\/www.1ai.net\/en\/tag\/%e6%8e%a8%e7%90%86\" title=\"[View articles tagged with [reasoning]]\" target=\"_blank\" >reasoning<\/a>\u201d <a href=\"https:\/\/www.1ai.net\/en\/tag\/ai%e6%a8%a1%e5%9e%8b\" title=\"[View articles tagged with [AI models]]\" target=\"_blank\" >AI models<\/a> have become a research hotspot. These models can think through problems step by step, much as humans do, and are considered more capable than non-reasoning models in specific fields such as physics. 
However, <strong>this advantage comes with high testing costs, making it difficult to independently validate these models.<\/strong><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-32936\" title=\"5bddaecdj00suospo002ad000v900j1p\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/04\/5bddaecdj00suospo002ad000v900j1p.jpg\" alt=\"5bddaecdj00suospo002ad000v900j1p\" width=\"1125\" height=\"685\" \/><\/p>\n<p>According to data from the third-party AI testing firm Artificial Analysis, evaluating OpenAI's o1 reasoning model on seven popular AI <a href=\"https:\/\/www.1ai.net\/en\/tag\/%e5%9f%ba%e5%87%86%e6%b5%8b%e8%af%95\" title=\"[View articles tagged with [benchmarks]]\" target=\"_blank\" >benchmarks<\/a> (including MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME 2024 and MATH-500) cost $2,767.05 (note: approximately RMB 20,191 at current exchange rates). Evaluating Anthropic's \u201chybrid\u201d reasoning model Claude 3.7 Sonnet cost $1,485.35 (approximately RMB 10,839), while testing OpenAI's o3-mini-high cost $344.59 (approximately RMB 2,514). Although some reasoning models are relatively cheap to test, such as OpenAI's o1-mini, which required only $141.22 (approximately RMB 1,030), the cost of testing reasoning models remains high overall. To date, Artificial Analysis has spent approximately $5,200 (approximately RMB 37,945) evaluating about a dozen reasoning models, nearly double the $2,400 it spent analyzing more than 80 non-reasoning models.<\/p>\n<p>OpenAI's non-reasoning GPT-4o model, released in May 2024, cost only $108.85 to evaluate, compared to $81.41 for Claude 3.5 Sonnet, the non-reasoning predecessor to Claude 3.7 Sonnet. 
George Cameron, co-founder of Artificial Analysis, told TechCrunch that the organization plans to increase its testing budget as more AI labs develop reasoning models. \"At Artificial Analysis, we run hundreds of evaluations per month and have a sizable budget for that,\" Cameron said, \"and we expect that to increase as models are released more frequently.\"<\/p>\n<p>Artificial Analysis isn't the only organization facing rising AI testing costs. Ross Taylor, CEO of AI startup General Reasoning, recently spent $580 evaluating Claude 3.7 Sonnet with about 3,700 unique prompts, and he estimates that a single full run of MMLU-Pro, a question set designed to assess a model's language comprehension, would cost more than $1,800. \"We're moving toward a world where a lab reports x% on a benchmark while spending y amount of compute, but academics have far less compute than y,\" Taylor wrote in a recent post on X. \"No one is going to be able to replicate these results.\"<\/p>\n<p>So why is it so expensive to test reasoning models? <strong>The main reason is that they generate a large number of tokens.<\/strong> A token represents a fragment of text; the word \"fantastic\", for example, might be split into the syllables \"fan\", \"tas\" and \"tic\". According to Artificial Analysis, OpenAI's o1 generated more than 44 million tokens in the company's benchmarks, roughly eight times the amount generated by GPT-4o. Most AI companies charge by the token, so costs add up quickly.<\/p>\n<p>In addition, modern benchmarks typically elicit a large number of tokens from models because they contain questions involving complex, multi-step tasks. Jean-Stanislas Denain, a senior researcher at Epoch AI, says this is because today's benchmarks have become more complex, even though the number of questions per benchmark has decreased overall. 
\"They typically try to assess a model's ability to perform real-world tasks, such as writing and executing code, browsing the Internet, and using a computer,\" Denain says. Denain also noted that the most expensive models have seen their cost per token increase over time. For example, Anthropic's Claude 3 Opus, released in May 2024, was the most expensive model at the time, costing $75 per million output tokens. OpenAI's GPT-4.5 and o1-pro, released earlier this year, cost $150 and $600 per million output tokens, respectively.<\/p>\n<p>\"While the performance of models has improved over time and the cost of reaching a given level of performance has certainly dropped dramatically, you still need to pay more if you want to evaluate the biggest and best model at any given time,\" Denain said. Many AI labs, including OpenAI, offer benchmarking organizations free or subsidized access to models for testing purposes. But some experts say this can compromise the fairness of test results: even without evidence of manipulation, the mere involvement of AI labs could undermine the integrity of evaluation scores.<\/p>","protected":false},"excerpt":{"rendered":"<p>As AI technology continues to develop, so-called \u201creasoning\u201d AI models have become a research hotspot. These models can think through problems step by step like humans, and are considered more capable than non-reasoning models in specific areas, such as physics. However, this advantage comes with high testing costs, making it difficult to independently validate these models. 
According to data provided by the third-party AI testing firm Artificial Analysis, evaluating OpenAI's o1 reasoning model on seven popular AI benchmarks (including MMLU-Pro, GPQA Diamond, Humanity's Last Exam, Live<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[411,167,5192,6258],"collection":[],"class_list":{"0":"post-32935","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"hentry","6":"category-news","7":"tag-ai","9":"tag-5192","10":"tag-6258"},"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/32935","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=32935"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/32935\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=32935"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=32935"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=32935"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=32935"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}