paper · february 26, 2026

decomposing physician disagreement in healthbench.

original research investigating why physicians disagree when evaluating medical ai outputs. 81.8% of case-level disagreement is unexplained by observable features, but closing information gaps could cut the rest significantly.

healthbench physician-disagreement ai-evaluation inter-rater-reliability

overview

this paper, co-authored with satya borgohain, investigates the sources of disagreement among physicians when evaluating medical ai outputs using the healthbench dataset.

the core question: when two physicians look at the same ai-generated health response and disagree on whether it’s good — why?

key findings

  • rubric identity explains 15.8% of label variance but only 3.6–6.9% of disagreement variance
  • physician identity accounts for just 2.4% of disagreement — individual bias is not the main driver
  • 81.8% of case-level disagreement is not explained by healthbench’s metadata labels, rubric language, medical specialty, or embedding representations
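the variance shares above come from decomposing disagreement by grouping factors like rubric identity. a minimal sketch of that idea, using synthetic data and a simple between-group / total variance ratio (an eta-squared-style estimate); the numbers and group structure here are illustrative, not the paper's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic disagreement scores: 40 rubric items x 30 cases each (illustrative only)
n_rubrics, n_cases = 40, 30
rubric_effect = rng.normal(0, 0.3, n_rubrics)            # weak rubric-level signal
scores = rubric_effect[:, None] + rng.normal(0, 1.0, (n_rubrics, n_cases))

# fraction of variance explained by rubric identity:
# variance of per-rubric means over total variance
grand_mean = scores.mean()
between = ((scores.mean(axis=1) - grand_mean) ** 2).mean()
total = scores.var()
explained = between / total
print(f"variance explained by rubric identity: {explained:.1%}")
```

with a weak group effect like this, the explained share comes out small, which is the qualitative shape of the paper's finding: most disagreement variance sits outside the observable groupings.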

physicians agree on the clearly good and the clearly bad outputs; disagreement concentrates on the borderline cases, an inverted-u pattern (auc = 0.689).
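the inverted-u pattern can be sketched by simulating cases whose disagreement probability peaks at mid-range quality, then scoring each case by its closeness to the midpoint and computing auc with a rank comparison. everything here is synthetic and illustrative; the 0.689 in the paper comes from its own predictor, not this one:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic cases: quality in [0, 1]; disagreement likeliest near 0.5 (inverted-u)
n = 2000
quality = rng.uniform(0, 1, n)
p_disagree = 0.1 + 0.6 * (1 - 2 * np.abs(quality - 0.5))  # peaks at quality = 0.5
disagree = rng.random(n) < p_disagree

# score = closeness to the midpoint; auc = P(score_pos > score_neg) by pairwise ranks
score = -np.abs(quality - 0.5)
pos, neg = score[disagree], score[~disagree]
auc = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print(f"auc = {auc:.3f}")
```

an auc well above 0.5 here just says "distance from the midpoint predicts disagreement", which is what an inverted-u implies.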

the actionable insight

physician-validated uncertainty categories reveal a critical distinction:

  • reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (or = 2.55, p < 10⁻²⁴)
  • irreducible uncertainty (genuine clinical ambiguity) shows no significant effect (or = 1.01, p = 0.90)
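the odds ratios above compare disagreement rates between cases with and without an uncertainty flag. a minimal sketch of how such an odds ratio is computed from a 2x2 contingency table, on synthetic data whose effect size is chosen to roughly resemble the reducible-uncertainty finding (the rates and sample size are assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic cases: those flagged for reducible uncertainty disagree more often
n = 5000
flagged = rng.random(n) < 0.3
p = np.where(flagged, 0.45, 0.25)          # higher disagreement rate when flagged
disagree = rng.random(n) < p

# odds ratio from the 2x2 table: (a/b) / (c/d)
a = np.sum(flagged & disagree)             # flagged, disagreed
b = np.sum(flagged & ~disagree)            # flagged, agreed
c = np.sum(~flagged & disagree)            # unflagged, disagreed
d = np.sum(~flagged & ~disagree)           # unflagged, agreed
odds_ratio = (a / b) / (c / d)
print(f"odds ratio = {odds_ratio:.2f}")
```

an odds ratio near 1 (as with the irreducible category) means the flag carries no information about disagreement; a ratio above 2 means flagged cases have more than double the odds of disagreement.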

the implication: closing information gaps in evaluation scenarios could meaningfully lower disagreement, whereas inherent clinical ambiguity offers no such leverage. better-designed evals, not better-calibrated physicians, are the lever.