{"id":35160,"date":"2025-05-14T11:21:11","date_gmt":"2025-05-14T03:21:11","guid":{"rendered":"https:\/\/www.1ai.net\/?p=35160"},"modified":"2025-05-14T11:21:11","modified_gmt":"2025-05-14T03:21:11","slug":"openai-%e5%85%ac%e5%b8%83-ai-%e5%81%a5%e5%ba%b7%e5%9f%ba%e5%87%86%ef%bc%8c%e6%96%b0%e6%a8%a1%e5%9e%8b%e5%aa%b2%e7%be%8e%e4%ba%ba%e7%b1%bb%e5%8c%bb%e7%94%9f","status":"publish","type":"post","link":"https:\/\/www.1ai.net\/en\/35160.html","title":{"rendered":"OpenAI Announces AI Health Benchmarks, New Model Rivals Human Doctors"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-35161\" title=\"172431cbj00sw8emb002kd000u000i9m\" src=\"https:\/\/www.1ai.net\/wp-content\/uploads\/2025\/05\/172431cbj00sw8emb002kd000u000i9m.jpg\" alt=\"172431cbj00sw8emb002kd000u000i9m\" width=\"1080\" height=\"657\" \/><\/p>\n<p>A few days ago, <a href=\"https:\/\/www.1ai.net\/en\/tag\/openai\" title=\"[View articles tagged with [OpenAI]]\" target=\"_blank\" >OpenAI<\/a> announced an <a href=\"https:\/\/www.1ai.net\/en\/tag\/ai\" title=\"[View articles tagged with [AI]]\" target=\"_blank\" >AI<\/a> health system evaluation standard, <a href=\"https:\/\/www.1ai.net\/en\/tag\/healthbench\" title=\"[View articles tagged with [HealthBench]]\" target=\"_blank\" >HealthBench<\/a>, which is claimed to be \"the most iconic of <a href=\"https:\/\/www.1ai.net\/en\/tag\/agi\" title=\"[View articles tagged with [AGI]]\" target=\"_blank\" >AGI<\/a>\".<\/p>\n<p>Specifically, HealthBench is designed to better measure the performance of AI systems in healthcare. 
The benchmark, created by OpenAI together with 262 professional doctors from 60 countries, contains 5,000 real medical conversations, each with a customized scoring rubric for doctors to evaluate the model's responses.<\/p>\n<p>In HealthBench's model performance results, <strong>OpenAI's own o3 model ranked first, with Grok 3 and Gemini 2.5 Pro in second and third place, respectively; o3 also outperformed Claude 3.7 Sonnet.<\/strong> For its part, OpenAI says that its frontier models have improved their HealthBench performance by 28% in recent months.<\/p>\n<p>To assess reliability, OpenAI evaluated each model's worst-case performance across k samples on HealthBench, that is, the lowest score among k sampled responses. The results show that o3's worst-case score at 16 samples is more than double that of GPT-4o.<\/p>\n<p>The most interesting part is the head-to-head comparison between large models and real doctors. The 262 doctors were divided into two groups: those who answered \"on their own\" and those who \"relied on AI for higher quality answers\". The AI-generated answers were then compared with the answers from both groups to assess the models' accuracy, professionalism, practicality, and so on.<\/p>\n<p><strong>Starting with the September 2024 models (o1-preview, GPT-4o), the AI-generated responses alone outperformed the responses of physicians working \"on their own\".<\/strong><\/p>\n<p><strong>And with the April 2025 models (o3, GPT-4.1), there is no longer a significant difference in quality between the AI-generated responses and those of physicians who \"relied on AI for higher quality answers\".<\/strong><\/p>","protected":false},"excerpt":{"rendered":"<p>A few days ago, OpenAI announced an AI health system evaluation standard, 'HealthBench', which is claimed to be 'the most iconic of AGI'. Specifically, HealthBench aims to better measure the performance of AI systems in the medical field. 
The benchmark was created by OpenAI in collaboration with 262 professional doctors from 60 countries and includes 5,000 real medical conversations, each with a customized scoring rubric for doctors to evaluate the model's responses. In HealthBench's model performance results, OpenAI's own o3 model ranked first with the highest score, followed by Grok 3 and Gemini 2.5.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[146],"tags":[151,411,6586,190],"collection":[],"class_list":["post-35160","post","type-post","status-publish","format-standard","hentry","category-news","tag-agi","tag-ai","tag-healthbench","tag-openai"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/35160","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/comments?post=35160"}],"version-history":[{"count":0,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/posts\/35160\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/media?parent=35160"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/categories?post=35160"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/tags?post=35160"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.1ai.net\/en\/wp-json\/wp\/v2\/collection?post=35160"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}