Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

MedQA

A medical question-answering benchmark developed with Graphite Digital and based on the USMLE examination format. The evaluation has two phases: an unbiased baseline assessment on 2,000 medical questions, and a bias injection phase that tests how models handle racial bias in medical contexts. The questions cover graduate-level medical knowledge, and the second phase examines how racial bias affects model performance and medical decision-making.
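
The two-phase design described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the `Item` schema, the `predict` callable, and the choice to report the difference between phases are hypothetical, and the page does not say how the bias-injected questions are built or how the published scores combine the two phases.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not the benchmark's own)."""
    question: str
    options: dict   # maps option key to option text, e.g. {"A": "...", "B": "..."}
    answer: str     # key of the correct option


def accuracy(items, predict):
    """Fraction of items for which predict(question, options) returns the answer key."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(predict(it.question, it.options) == it.answer for it in items)
    return correct / len(items)


def two_phase_report(baseline, biased, predict):
    """Score the baseline and bias-injected sets and report the change between them."""
    base = accuracy(baseline, predict)
    bias = accuracy(biased, predict)
    return {"baseline": base, "biased": bias, "delta": bias - base}
```

In this sketch, `predict` stands in for a call to the model under test, and a negative `delta` would mean the model answered fewer questions correctly once the biased text was added.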

Model Performance

Rank  Score
#1    96.5%
#2    96.3%
#3    96.3%
#9    95.8%
#16   93.1%
#18   92.9%
#20   92.5%
#21   92.5%
#22   92.5%
#23   92.1%
#24   91.6%
#25   91.4%
#26   91.2%
#28   90.8%
#29   90.6%
#30   90.6%
#31   90.3%
#32   90.2%
#34   90.1%
#35   89.5%
#37   88.2%
#39   87.4%
#40   87.2%
#41   86.7%
#43   84.0%
#44   83.9%
#45   83.9%
#46   83.2%
#48   80.9%
#49   80.5%
#52   78.2%
#53   76.5%
#54   76.2%
#55   74.8%
#56   72.4%
#57   58.5%
#58   51.4%