Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

GPQA

Measures accuracy on the most challenging subset of the Graduate-Level Google-Proof Q&A (GPQA) benchmark, a set of difficult, expert-level multiple-choice questions.
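The page does not say how each score is computed. As a rough sketch, assuming a score is plain exact-match accuracy over multiple-choice questions, the calculation could look like the following. The function name `accuracy`, the answer labels, and the sample data are all hypothetical and are not taken from the explorer.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted choice matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    # Count exact matches between each predicted label and the keyed label.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical data: five questions with choices labelled A-D.
predicted = ["A", "C", "B", "D", "A"]
answer_key = ["A", "C", "D", "D", "B"]
print(f"{accuracy(predicted, answer_key):.1%}")  # prints 60.0%
```

Real evaluations may differ in details such as prompt format, answer extraction, or averaging over repeated runs, so reported percentages are not necessarily reproducible from this formula alone.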

Model Performance

Rank   Accuracy
#1     84.1%
#2     84.1%
#3     78.1%
#4     78.1%
#7     73.7%
#8     72.0%
#17    64.9%
#18    64.9%
#19    64.0%
#20    62.3%
#21    62.0%
#22    59.9%
#23    59.2%
#25    56.5%
#26    55.2%
#27    55.2%
#28    53.6%
#29    52.8%
#30    48.1%
#33    44.4%
#34    38.7%
#35    37.7%
#36    33.7%
#37    33.3%
#38    26.9%
#39    25.6%
#40    17.2%
#41    5.7%
#42    5.7%