Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

Humanity's Last Exam

A comprehensive benchmark developed by the Center for AI Safety and Scale AI to assess expert-level reasoning and knowledge across diverse fields. The dataset contains 3,000 questions contributed by nearly 1,000 subject-matter experts across 500+ institutions in 50 countries; about 10% of the questions require combined image and text comprehension.
Model Performance

#2: 40.0%
#4: 34.0%
#7: 28.0%
#8: 27.0%
#9: 27.0%
#10: 24.0%
#11: 24.0%
#12: 23.0%
#13: 23.0%
#14: 22.0%
#20: 19.0%
#21: 19.0%
#22: 19.0%
#26: 17.0%
#28: 15.0%
#30: 15.0%
#31: 14.0%
#32: 13.0%
#33: 13.0%
#36: 12.0%
#39: 11.0%
#40: 11.0%
#41: 11.0%
#43: 11.0%
#51: 8.0%
#52: 8.0%
#54: 7.0%
#55: 7.0%
#56: 7.0%
#58: 7.0%
#59: 6.0%
#60: 6.0%
#61: 6.0%
#62: 6.0%
#64: 5.0%
#65: 5.0%
#66: 5.0%
#67: 5.0%
#68: 5.0%
#69: 5.0%
#71: 5.0%
#72: 5.0%
#73: 5.0%
#74: 5.0%
#78: 4.0%
#80: 4.0%
#82: 4.0%
#83: 4.0%
#84: 4.0%
#85: 4.0%
#86: 4.0%
#87: 4.0%
#88: 4.0%
#89: 4.0%
#90: 4.0%
#91: 4.0%
#92: 4.0%
#93: 3.0%
#94: 3.0%
#95: 3.0%
#96: 3.0%
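The list above can be summarized programmatically. The following is a minimal sketch that loads the rank-to-score pairs shown on this page (the model names were not preserved in this extract, so ranks stand in for them) and reports basic distribution statistics; it assumes nothing beyond the numbers listed.

```python
from statistics import mean, median

# Rank -> accuracy (%) pairs, transcribed from the leaderboard above.
scores = {
    2: 40.0, 4: 34.0, 7: 28.0, 8: 27.0, 9: 27.0, 10: 24.0, 11: 24.0,
    12: 23.0, 13: 23.0, 14: 22.0, 20: 19.0, 21: 19.0, 22: 19.0,
    26: 17.0, 28: 15.0, 30: 15.0, 31: 14.0, 32: 13.0, 33: 13.0,
    36: 12.0, 39: 11.0, 40: 11.0, 41: 11.0, 43: 11.0, 51: 8.0,
    52: 8.0, 54: 7.0, 55: 7.0, 56: 7.0, 58: 7.0, 59: 6.0, 60: 6.0,
    61: 6.0, 62: 6.0, 64: 5.0, 65: 5.0, 66: 5.0, 67: 5.0, 68: 5.0,
    69: 5.0, 71: 5.0, 72: 5.0, 73: 5.0, 74: 5.0, 78: 4.0, 80: 4.0,
    82: 4.0, 83: 4.0, 84: 4.0, 85: 4.0, 86: 4.0, 87: 4.0, 88: 4.0,
    89: 4.0, 90: 4.0, 91: 4.0, 92: 4.0, 93: 3.0, 94: 3.0, 95: 3.0,
    96: 3.0,
}

vals = list(scores.values())
print(f"models listed: {len(vals)}")
print(f"best score:    {max(vals):.1f}%")
print(f"worst score:   {min(vals):.1f}%")
print(f"median score:  {median(vals):.1f}%")
print(f"mean score:    {mean(vals):.1f}%")
```

The wide spread between the top score (40.0%) and the long tail of single-digit scores is what the summary makes visible at a glance: most listed models score below the midpoint of the range.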