OpenAI announces AI health benchmarks, new model rivals human doctors

A few days ago, OpenAI announced HealthBench, a set of evaluation criteria for AI health systems.

Specifically, HealthBench is designed to better measure the performance of AI systems in healthcare. The benchmark, created by OpenAI together with 262 physicians who have practiced in 60 countries, contains 5,000 realistic medical conversations, each paired with a custom scoring rubric written by doctors and used to grade the model's responses.
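The article does not spell out how a rubric turns into a number. As a rough sketch only, assume each rubric criterion carries a point value (positive for desirable behavior, negative for undesirable behavior), a grader marks each criterion as met or not met, and a response's score is the points earned divided by the maximum positive points. The names `Criterion`, `rubric_score`, and `benchmark_score` are hypothetical and are not taken from OpenAI's code.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: int  # assumed: positive = desirable, negative = undesirable
    met: bool    # grader's judgment for this particular response


def rubric_score(criteria):
    """Points earned on met criteria, divided by the maximum positive points."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return earned / max_points


def benchmark_score(per_conversation_scores):
    """Mean score over conversations, clipped to [0, 1] (an assumed convention)."""
    mean = sum(per_conversation_scores) / len(per_conversation_scores)
    return min(1.0, max(0.0, mean))


example = [
    Criterion("advises seeking emergency care", 10, True),
    Criterion("asks how long symptoms have lasted", 5, False),
    Criterion("recommends an unsafe dosage", -8, False),
]
print(rubric_score(example))  # 10 of 15 possible points
```

Under this assumption, a response that triggers negative criteria can score below zero on a single conversation, which is why the sketch clips only the overall mean.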

In HealthBench's model performance results, OpenAI's own o3 model ranked first, with Grok 3 and Gemini 2.5 Pro in second and third place, respectively; o3 also outperformed Claude 3.7 Sonnet. For its part, OpenAI says that its frontier models have improved their HealthBench performance by 28% in recent months.

For reliability, OpenAI evaluated each model's worst-case performance across k sampled responses on HealthBench. The results show that o3's worst-case score at 16 samples is more than double that of GPT-4o.
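To make the worst-case-at-k idea concrete, here is a minimal sketch. It assumes that each conversation has n scored sample responses and that the metric is the expected minimum score when k of those n samples are drawn without replacement, averaged over conversations. This is one reasonable way to compute such a number, not necessarily OpenAI's exact procedure, and the function names are hypothetical.

```python
from math import comb


def expected_min_of_k(samples, k):
    """Expected minimum of k scores drawn without replacement from `samples`."""
    n = len(samples)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of samples")
    ordered = sorted(samples)
    total = comb(n, k)
    # The i-th smallest score (0-indexed) is the minimum of a k-subset exactly
    # when the other k-1 picks come from the n-i-1 larger scores.
    return sum(x * comb(n - i - 1, k - 1) for i, x in enumerate(ordered)) / total


def worst_at_k(per_conversation_samples, k):
    """Average the expected worst-of-k score over all conversations."""
    mins = [expected_min_of_k(s, k) for s in per_conversation_samples]
    return sum(mins) / len(mins)


scores = [[0.8, 0.6, 0.4, 0.2], [0.9, 0.9, 0.5, 0.1]]
for k in (1, 2, 4):
    print(k, round(worst_at_k(scores, k), 3))
```

With k = 1 this reduces to the ordinary mean score, and as k grows the value falls toward each conversation's single worst sample, which is why it serves as a reliability measure.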

The most interesting part is the head-to-head comparison between large models and human doctors. The 262 doctors were divided into two groups: those who answered "on their own" and those who "relied on AI for higher quality answers". The AI-generated answers were then compared with the answers from both groups to assess the models on accuracy, professionalism, practicality, and other dimensions.

Starting with the September 2024 models (o1-preview, GPT-4o), AI-generated responses alone outperformed the responses written by physicians working "on their own".

And with the April 2025 models (o3, GPT-4.1), there is no longer a significant difference in quality between the AI-generated responses and the responses from physicians who "relied on AI for higher quality answers".
