doctors disagree with each other more than you think.
we analysed 60,896 physician judgments grading ai medical outputs. doctors disagreed 22.5% of the time, and most of that disagreement was unexplainable.
If you ask two qualified doctors to evaluate the same AI medical output, using the same rubric, looking at the same case, they’ll disagree roughly once in every four or five gradings.
Not because one of them is wrong. Because clinical judgment, the thing we treat as the gold standard for evaluating AI, is noisier than almost anyone in the industry wants to admit.
22.5%. That’s the disagreement rate.
60,896 judgments
I was one of 262 physicians selected to contribute to OpenAI’s HealthBench, a large-scale benchmark for evaluating clinical AI, the only Australian clinician in the group. I connected with Satya Borgohain, a senior AI engineer previously at Relevance and Mutinex, and we got curious: when doctors disagree on whether an AI response is good enough, where does that disagreement actually come from? Is it the doctors? The rubrics? The cases themselves?
OpenAI had released the full HealthBench meta-evaluation dataset publicly, which gave us something rare: 60,896 physician judgments across 29,511 cases, graded by 186 doctors, with enough metadata to actually decompose disagreement by source.
We wrote a paper called Decomposing Physician Disagreement in HealthBench. The findings challenged my assumptions.
The entire premise of clinical AI evaluation rests on the assumption that a physician’s judgment is a reliable gold standard. When someone says “we had a doctor review the outputs,” the implied message is that the review is definitive. But if you put two qualified clinicians in front of the same case and the same rubric, they’ll disagree almost a quarter of the time. Emergency medicine, internal medicine, paediatrics, psychiatry, primary care. Across the board.
where the disagreement lives
We used crossed random-effects models to break down the sources of variance. Three possible sources: the rubric (what question you’re asking), the physician (who’s doing the grading), and the case itself.
The rubric explains 15.8% of the variance. Some rubrics are just harder to agree on. Fair enough.
The physician explains 2.4%. Physician identity, the individual doctor’s training and experience and clinical judgment, accounts for 2.4% of the variance in grading.
The remaining 81.8% is case-level residual. The specific interaction between a particular doctor, a particular rubric, and a particular case. The stuff that changes every time.
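For intuition, here’s a toy version of that kind of decomposition: simulate a crossed rubric × physician grid with variance components chosen to echo the paper’s split, then recover the shares with a method-of-moments two-way ANOVA. Everything in it, the sample sizes, the effect sizes, the variable names, is illustrative; the paper’s actual analysis used crossed random-effects models, not this shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rubrics, n_physicians = 50, 40

# hypothetical standard deviations picked so the true variance shares
# roughly echo the paper's 15.8% / 2.4% / 81.8% split (illustrative only)
sd_rubric, sd_physician, sd_resid = 0.40, 0.155, 0.905

rubric_eff = rng.normal(0, sd_rubric, n_rubrics)
phys_eff = rng.normal(0, sd_physician, n_physicians)
y = (rubric_eff[:, None] + phys_eff[None, :]
     + rng.normal(0, sd_resid, (n_rubrics, n_physicians)))

# two-way ANOVA with one observation per cell: method-of-moments estimates
grand = y.mean()
msa = n_physicians * ((y.mean(axis=1) - grand) ** 2).sum() / (n_rubrics - 1)
msb = n_rubrics * ((y.mean(axis=0) - grand) ** 2).sum() / (n_physicians - 1)
mse = ((y - y.mean(axis=1, keepdims=True)
          - y.mean(axis=0, keepdims=True) + grand) ** 2).sum() \
      / ((n_rubrics - 1) * (n_physicians - 1))

var_rubric = max((msa - mse) / n_physicians, 0.0)   # rubric component
var_phys = max((msb - mse) / n_rubrics, 0.0)        # physician component
total = var_rubric + var_phys + mse
shares = {"rubric": var_rubric / total,
          "physician": var_phys / total,
          "residual": mse / total}
print(shares)  # residual share dominates, as in the paper
```

The point of the exercise is the shape of the answer, not the numbers: however you slice the grid, most of the variance refuses to attach to either the rubric axis or the physician axis.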
We threw everything we had at explaining that 81.8%. Physician specialty. Whether the rubric used normative language (“should,” “must”). The medical specialty of the case. Surface-level text features. Semantic embeddings of the cases and responses.
None of it made a meaningful dent.
The dominant source of physician disagreement in clinical AI evaluation is, as far as we can measure, irreducible to observable features. It’s something about the specific meeting point of that doctor, that rubric, and that case, and we can’t predict it.
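The flavour of check we kept running can be sketched like this: fit a classifier on observable features and see whether it predicts disagreement better than chance. This version is deliberately synthetic, the labels are independent of the features by construction, so the cross-validated AUC sits near 0.5. The feature names and data are invented for illustration, and the paper’s actual feature set (specialties, normative language, embeddings) was richer than this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 4000

# hypothetical observable features for each grading event (invented names):
X = np.column_stack([
    rng.integers(0, 2, n),     # rubric uses normative language ("should"/"must")
    rng.integers(0, 5, n),     # grader specialty code
    rng.normal(300, 80, n),    # response length in words
])
# disagreement at the base rate, generated independently of X,
# mimicking a world where the features carry no signal
y = rng.binomial(1, 0.225, n)

auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc").mean()
print(round(auc, 2))  # near 0.5 by construction: chance-level prediction
```

An AUC pinned to 0.5 is what “none of it made a meaningful dent” looks like in practice: the model has features, it just can’t use them.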
what’s fixable and what isn’t
We classified cases by the type of uncertainty present. Some had what we called reducible uncertainty: missing context, ambiguous phrasing, information gaps that could theoretically be closed with better evaluation design. Others had irreducible uncertainty: genuine medical ambiguity where reasonable doctors would legitimately disagree because the medicine itself doesn’t have a clear answer.
Reducible uncertainty more than doubled the odds of disagreement. The odds ratio was 2.55: if the case had missing context or unclear wording, the odds that the grading physicians disagreed were roughly two and a half times higher.
Irreducible uncertainty, the genuine medical ambiguity, had essentially no effect. Odds ratio of 1.01. A flat line.
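To make the 2.55 concrete, here’s a back-of-the-envelope odds-ratio calculation. The counts below are invented to land near the paper’s figure, they are not the actual dataset tallies:

```python
# hypothetical 2x2 table (illustrative counts, not the paper's data):
# among cases WITH reducible uncertainty
disagree_reducible = 420
agree_reducible = 600
# among cases WITHOUT reducible uncertainty
disagree_clean = 280
agree_clean = 1020

odds_reducible = disagree_reducible / agree_reducible  # odds of disagreement, reducible cases
odds_clean = disagree_clean / agree_clean              # odds of disagreement, clean cases
odds_ratio = odds_reducible / odds_clean
print(round(odds_ratio, 2))  # 2.55
```

Note the distinction the odds ratio preserves: it compares odds, not raw probabilities, which is why “two and a half times the odds” is not quite the same claim as “two and a half times as likely.”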
I assumed cases where the medicine was genuinely ambiguous would be the main driver. It seemed obvious. Ambiguous medicine, ambiguous grading. But the data showed something different. The real driver was the evaluation setup, not the medicine. Missing information. Unclear instructions. Scenarios that didn’t give the grading physician enough context to make a confident call.
This is actually good news. You can’t fix genuine medical ambiguity. But you can write better scenarios and tighter rubrics. The agreement ceiling in clinical AI evaluation is largely structural, not medical.
open questions
We couldn’t explain the 81.8% residual. Maybe it’s something about how individual physicians interact with specific clinical contexts in ways their specialty or demographics don’t capture. Maybe it’s the private heuristics and pattern-matching each doctor develops over thousands of consultations. I also don’t know how these findings translate beyond HealthBench into other evaluation contexts where you have the patient in front of you and can ask follow-up questions.
There’s a deeper question the paper touches but doesn’t resolve. If doctors disagree with each other this often, and the disagreement is mostly unexplainable, what does “correct” even mean in clinical AI evaluation? We’re measuring AI outputs against a human standard that isn’t as stable as we assumed. That doesn’t mean physician judgment is useless. But it is noisy.
Overall, the data points to two things.
The first is that we can reduce disagreement by building better evals. Tighter prompts, more context, clearer rubrics. The reducible uncertainty finding tells us there’s real ground to gain here, and it’s worth gaining.
The second is that healthcare is chaotic. Presentations are messy, ambiguous, and context-dependent. Even through a text-based interface like chatting to an LLM, that messiness doesn’t go away. Disagreement isn’t a bug in the evaluation; it’s a feature of the domain. Any system for evaluating clinical AI needs to make peace with that rather than pretend it can be engineered out.