Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

Humanity's Last Exam

A comprehensive benchmark developed by the Center for AI Safety and Scale AI to assess expert-level reasoning and knowledge across diverse fields. The dataset contains 3,000 questions contributed by nearly 1,000 subject-matter experts from 500+ institutions in 50 countries; 10% of the questions require both image and text comprehension.

Model Performance

Rank  Score
#1    27.0%
#2    27.0%
#3    24.0%
#4    24.0%
#7    20.0%
#8    20.0%
#9    19.0%
#13   15.0%
#14   15.0%
#15   15.0%
#20   11.0%
#21   11.0%
#28   8.0%
#29   8.0%
#33   7.0%
#34   6.0%
#35   6.0%
#36   5.0%
#37   5.0%
#38   5.0%
#40   5.0%
#42   5.0%
#43   5.0%
#44   5.0%
#45   5.0%
#46   5.0%
#48   5.0%
#49   4.0%
#50   4.0%
#51   4.0%
#52   4.0%
#55   4.0%
#57   4.0%
#58   4.0%
#59   4.0%
#60   3.0%
#61   3.0%
#62   3.0%