Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

HumanEval

Evaluates a model's ability to generate a correct Python solution on the first attempt (pass@1), given only a function signature and docstring; a solution counts as correct if it passes the problem's unit tests.
Source:

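For illustration, a minimal sketch of what a HumanEval-style task looks like, loosely based on the benchmark's first problem (has_close_elements): the model sees only the signature and docstring and must produce the body, which is then run against reference unit tests. The completion and the simplified test harness shown here are assumptions for illustration, not the benchmark's exact reference code.

```python
# Prompt given to the model: signature and docstring only.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # Example model completion (everything below the docstring):
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Simplified check: the attempt counts as a pass only if every
# reference assertion holds.
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3)
assert not has_close_elements([1.0, 2.0, 3.0], 0.5)
print("passed")
```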
Model Performance

Rank    Score
#1      99.0%
#2      99.0%
#6      99.0%
#7      99.0%
#9      98.0%
#10     98.0%
#11     98.0%
#12     98.0%
#14     97.0%
#15     97.0%
#17     97.0%
#18     97.0%
#19     97.0%
#20     96.0%
#22     96.0%
#23     95.0%
#25     95.0%
#26     94.0%
#28     93.0%
#30     92.0%
#31     91.0%
#32     91.0%
#33     90.0%
#34     90.0%
#35     90.0%
#36     90.0%
#38     88.0%
#39     87.0%
#40     86.0%
#41     86.0%
#42     85.0%
#43     85.0%
#44     82.0%
#45     71.0%
#46     71.0%
#47     70.0%
#49     34.0%
#50     13.0%
#51     0.0%
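The percentages above appear to be pass@1 rates, i.e. the estimated probability that a single sampled completion passes all of a problem's unit tests. As a reference, here is a minimal sketch of the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), where n is the number of samples drawn per problem and c the number of those that pass; the benchmark_score helper and the example numbers are illustrative assumptions.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k).
    # With k = 1 this reduces to the simple pass rate c / n.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    # results holds one (n, c) pair per problem; the benchmark score is
    # the mean pass@k over all problems, reported as a percentage.
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Example: 3 problems, 10 samples each, with 10, 7, and 0 passing samples.
print(benchmark_score([(10, 10), (10, 7), (10, 0)], k=1))  # ~56.7
```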