OpenAI Announces AI Health Benchmarks, New Model Rivals Human Doctors

A few days ago, OpenAI announced HealthBench, a benchmark for evaluating AI systems in healthcare, billed as "the most AGI-iconic" of its kind.

HealthBench is designed to better measure how AI systems perform in healthcare. OpenAI created the benchmark with 262 professional doctors in 60 countries around the world. It contains 5,000 real medical conversations, each with a customized, doctor-written scoring rubric used to evaluate the model's responses.

In HealthBench's model performance results, OpenAI's own o3 model ranked first, with Grok 3 and Gemini 2.5 Pro in second and third place, respectively; o3 also outperformed Claude 3.7 Sonnet. For its part, OpenAI says that its frontier models have improved their HealthBench performance by 28% in recent months.

To measure reliability, OpenAI evaluated each model's worst-case performance across k sampled responses on HealthBench. The results show that o3's worst-case score at 16 samples is more than double that of GPT-4o.
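The worst-case-at-k idea can be sketched in a few lines. This is only an illustration under stated assumptions, not OpenAI's implementation: the function name `worst_at_k` and the toy scores are hypothetical, each response is assumed to already have a rubric score between 0 and 1, and the sketch simply takes the lowest score among the first k samples for each conversation and averages those minimums.

```python
from statistics import mean


def worst_at_k(scores_per_example, k):
    """Average, over conversations, of the lowest score among the first k samples.

    Hypothetical helper for illustration; the official metric may be
    computed differently.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    worst = []
    for scores in scores_per_example:
        if len(scores) < k:
            raise ValueError("each conversation needs at least k sampled scores")
        # Keep only the weakest of the k sampled responses.
        worst.append(min(scores[:k]))
    return mean(worst)


# Made-up scores: two conversations, four sampled responses each.
toy_scores = [
    [0.9, 0.7, 0.8, 0.4],
    [0.6, 0.65, 0.5, 0.55],
]

for k in (1, 2, 4):
    print(k, worst_at_k(toy_scores, k))
```

Because a minimum can only stay the same or fall as more samples are drawn, this score never rises as k grows. That is why it serves as a reliability measure: a model that occasionally gives a poor answer is penalized more heavily at larger k.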

The most interesting part is the head-to-head comparison between large models and real doctors. The 262 doctors were divided into two groups: those who answered "on their own" and those who "relied on AI for higher quality answers". The AI-generated answers were then compared with the answers from both groups to assess the models' accuracy, professionalism, practicality, and so on.

Starting with the September 2024 models (o1-preview, GPT-4o), AI-generated responses alone outperformed the responses of physicians answering on their own.

And with the April 2025 models (o3, GPT-4.1), there is no longer a significant difference in quality between the AI-generated responses and the responses of physicians who relied on AI to produce higher-quality answers.
