healthbench: evaluating llms towards improved human health.
contributing physician to openai's open-source benchmark for evaluating llm performance and safety in healthcare. 262 physicians, 5,000 multi-turn conversations, 48,562 rubric criteria.
overview
HealthBench is an open-source benchmark developed by OpenAI to assess both the performance and safety of large language models in healthcare contexts. it encompasses 5,000 multi-turn conversations and 48,562 unique rubric criteria spanning emergencies, clinical data transformation, global health, and more.
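rubric-based grading can be sketched roughly as follows: each physician-written criterion carries a point value (negative points penalise harmful content), and a response's score is the points it earns over the maximum positive points available. the criterion texts and point values below are illustrative assumptions, not actual HealthBench rubric items.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str
    points: int  # negative points penalise harmful or incorrect content

def score_response(met: list[bool], rubric: list[RubricCriterion]) -> float:
    """Score one response against a rubric: achieved points over
    maximum possible positive points, clipped to [0, 1]."""
    achieved = sum(c.points for c, m in zip(rubric, met) if m)
    max_points = sum(c.points for c in rubric if c.points > 0)
    return max(0.0, min(1.0, achieved / max_points)) if max_points else 0.0

# hypothetical rubric for an emergency-care conversation
rubric = [
    RubricCriterion("advises calling emergency services", 10),
    RubricCriterion("asks about symptom onset and duration", 5),
    RubricCriterion("recommends an unverified home remedy", -5),
]
print(score_response([True, False, True], rubric))  # (10 - 5) / 15 ≈ 0.33
```

this is a minimal sketch of the scoring idea only; the published benchmark grades criteria with a model-based grader across thousands of conversations.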
role
contributing physician evaluator, the only australian clinician on the team. evaluated the accuracy and quality of model-generated health-related responses as part of openai's advanced human data initiatives, designed to improve the safety of ai models on health-related questions.
262 physicians contributed to the benchmark across multiple health contexts and behavioural dimensions including accuracy, instruction following, and communication.
key findings from the benchmark
- physician-authored rubrics reveal significant gaps between model confidence and clinical reasoning.
- gpt-3.5 turbo scored 16%, gpt-4o scored 32%, and o3 reached 60%: substantial progress, but far from solved.
- healthbench hard (the most challenging subset) tops out at 32% for the best current model.
- gpt-4.1 nano outperforms gpt-4o while costing 25x less.
the benchmark is released under cc by 4.0 and is designed to be a living evaluation that grows with the field.