going deep on the things that matter.
original research investigating why physicians disagree when evaluating medical ai outputs. 81.8% of that disagreement is unexplained by observable features, but closing information gaps could reduce it significantly.
contributing physician to openai's open-source benchmark for evaluating llm performance and safety in healthcare. 262 physicians, 5,000 multi-turn conversations, 48,562 rubric criteria.