Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

SciCode

A scientific code generation benchmark that evaluates models' ability to solve realistic research problems across 16 subdomains in physics, mathematics, materials science, biology, and chemistry. Problems are derived from real research scenarios rather than exam-style questions.

Model Performance

Rank    Score
#3      46.0%
#4      46.0%
#5      43.0%
#6      43.0%
#8      42.0%
#10     41.0%
#11     41.0%
#13     41.0%
#14     41.0%
#15     41.0%
#16     40.0%
#25     38.0%
#26     38.0%
#29     37.0%
#30     37.0%
#31     37.0%
#32     36.0%
#33     36.0%
#34     36.0%
#35     36.0%
#36     35.0%
#37     35.0%
#38     34.0%
#39     34.0%
#40     33.0%
#43     32.0%
#45     31.0%
#47     30.0%
#48     30.0%
#49     29.0%
#50     29.0%
#51     28.0%
#52     27.0%
#53     26.0%
#54     26.0%
#55     23.0%
#56     23.0%
#57     23.0%
#58     21.0%
#60     12.0%
#61     0.0%