Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

SimpleBench

A multiple-choice text benchmark of basic reasoning on which non-specialist humans (high-school level) consistently outperform state-of-the-art language models. It covers spatio-temporal reasoning, social intelligence, and linguistic adversarial robustness.

Model Performance

Rank  Score
#3    60.0%
#5    56.7%
#6    56.7%
#7    53.1%
#12   44.9%
#13   40.1%
#15   34.5%
#16   31.0%
#17   31.0%
#18   30.9%
#22   27.2%
#23   27.1%
#24   27.0%
#25   25.1%
#26   23.5%
#27   23.0%
#29   22.5%
#30   22.1%
#31   19.9%
#32   18.9%
#33   18.9%
#34   18.1%
#36   10.7%