Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

GPQA

Measures accuracy on the most challenging subset of GPQA (Graduate-Level Google-Proof Q&A), a benchmark of difficult, expert-level questions.

Model Performance

Rank   Accuracy
#2     88.9%
#4     84.1%
#6     83.8%
#8     80.8%
#9     80.8%
#10    80.8%
#14    77.4%
#19    73.7%
#20    73.7%
#21    72.7%
#22    72.0%
#23    71.4%
#25    70.4%
#26    70.0%
#27    68.3%
#35    64.9%
#36    64.9%
#37    64.0%
#39    62.3%
#40    62.0%
#41    59.9%
#42    59.2%
#44    56.5%
#45    55.2%
#46    55.2%
#47    53.6%
#48    53.5%
#49    52.8%
#53    46.1%
#54    45.5%
#56    44.4%
#57    38.7%
#58    37.7%
#59    33.7%
#60    33.3%
#61    26.9%
#62    25.6%
#63    17.2%
#64    5.7%
#65    5.7%
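
Each score above is an accuracy: the share of benchmark questions a model answered correctly, expressed as a percentage. The following is a minimal sketch of that calculation, assuming answers are stored as multiple-choice letters. The function name `accuracy_percent` and the sample data are hypothetical and do not come from this page or from any listed model.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count positions where the predicted choice equals the reference choice.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return round(100.0 * correct / len(answers), 1)


# Hypothetical data for illustration only (3 of 5 predictions match).
sample_answers = ["A", "C", "B", "D", "A"]
sample_predictions = ["A", "C", "D", "D", "B"]
sample_score = accuracy_percent(sample_predictions, sample_answers)
print(f"{sample_score}%")
```

Rounding to one decimal place mirrors how the scores are displayed here; how the page itself rounds or aggregates results is not stated.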