healthbench: evaluating llms towards improved human health.
contributing physician to openai's open-source benchmark for evaluating llm performance and safety in healthcare. 262 physicians, 5,000 multi-turn conversations, 48,562 rubric criteria.
overview
HealthBench is an open-source benchmark developed by OpenAI to assess both the performance and safety of large language models in healthcare contexts. it encompasses 5,000 multi-turn conversations and 48,562 unique rubric criteria spanning emergencies, clinical data transformation, global health, and more.
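rubric-based grading can be sketched roughly as follows: each physician-written criterion carries a point value (negative points penalise harmful content), and a response's score is the points it earns over the maximum positive points available. the criterion texts and point values below are illustrative assumptions, not actual HealthBench rubric items.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str
    points: int  # negative points penalise harmful or incorrect content

def score_response(met: list[bool], rubric: list[RubricCriterion]) -> float:
    """Score one response against a rubric: achieved points over
    maximum possible positive points, clipped to [0, 1]."""
    achieved = sum(c.points for c, m in zip(rubric, met) if m)
    max_points = sum(c.points for c in rubric if c.points > 0)
    return max(0.0, min(1.0, achieved / max_points)) if max_points else 0.0

# hypothetical rubric for an emergency-care conversation
rubric = [
    RubricCriterion("advises calling emergency services", 10),
    RubricCriterion("asks about symptom onset and duration", 5),
    RubricCriterion("recommends an unverified home remedy", -5),
]
print(score_response([True, False, True], rubric))  # (10 - 5) / 15 ≈ 0.33
```

this is a minimal sketch of the scoring idea only; the published benchmark grades criteria with a model-based grader across thousands of conversations.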
role
contributing physician evaluator, the only australian clinician on the team. evaluated the accuracy and quality of model-generated health-related responses as part of openai's advanced human data initiatives, designed to improve the safety of ai models on health-related questions.
262 physicians contributed to the benchmark across multiple health contexts and behavioural dimensions including accuracy, instruction following, and communication.
key findings from the benchmark
- physician-authored rubrics reveal significant gaps between model confidence and clinical reasoning.
- gpt-3.5 turbo scored 16%, gpt-4o scored 32%, and o3 reached 60%: substantial progress, but far from solved.
- healthbench hard (the most challenging subset) tops out at 32% for the best current model.
- gpt-4.1 nano outperforms gpt-4o while costing 25x less.
the benchmark is released under cc by 4.0 and is designed to be a living evaluation that grows with the field.