Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

SWE-bench

Evaluates a model's ability to resolve real-world software engineering issues from GitHub repositories. Models are tested on their capacity to generate code patches that fix actual bugs and implement requested features in open-source projects.
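
As a rough illustration of how a percentage score of this kind can be derived, the sketch below assumes each issue yields a single pass/fail outcome (the generated patch either resolves the issue or it does not) and that the score is the share of issues resolved. The function name `resolve_rate`, the issue identifiers, and the sample outcomes are hypothetical and are not taken from this page or from any benchmark harness.

```python
def resolve_rate(outcomes):
    """Return the percentage of issues marked resolved, rounded to one decimal.

    `outcomes` maps a (hypothetical) issue identifier to True when the
    generated patch resolved the issue and False otherwise.
    """
    if not outcomes:
        raise ValueError("no outcomes to score")
    resolved = sum(1 for ok in outcomes.values() if ok)
    return round(100.0 * resolved / len(outcomes), 1)


# Made-up outcomes for five issues, for illustration only.
outcomes = {
    "issue-1": True,
    "issue-2": False,
    "issue-3": True,
    "issue-4": False,
    "issue-5": True,
}

print(f"{resolve_rate(outcomes)}%")  # prints "60.0%"
```

Under this assumption, a score such as 58.6% would mean that a little under six in ten issues were resolved by the model's patches.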

Model Performance

Rank   Score
#2     58.6%
#3     58.6%
#4     49.8%
#5     49.8%
#6     47.4%
#8     42.0%
#9     42.0%
#10    34.2%
#14    0.2%