Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

Chatbot Arena Coding

Evaluates a model's coding ability through head-to-head comparisons on the Chatbot Arena platform, where human judges assess code quality, correctness, and implementation approach.
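The page does not say how head-to-head votes are turned into a ranking. As a rough illustration only, the sketch below aggregates pairwise votes with an Elo-style update, which is one common approach for this kind of comparison data. The function names, the `k` factor, the starting rating of 1000, and the sample votes are all assumptions made for this example; none of them come from the page or from the platform.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(battles, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Aggregate pairwise votes into ratings.

    battles: iterable of (model_a, model_b, winner), where winner is
    'a', 'b', or 'tie'. k and initial are illustrative defaults.
    """
    ratings = defaultdict(lambda: initial)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = expected_score(r_a, r_b)
        s_a = outcome[winner]
        # Move each rating toward the observed result; the two updates
        # are equal and opposite, so the total rating is conserved.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical votes between placeholder models.
    votes = [
        ("model-x", "model-y", "a"),
        ("model-y", "model-z", "tie"),
        ("model-x", "model-z", "a"),
    ]
    for name, rating in sorted(elo_ratings(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Elo updates depend on the order in which votes are processed; an order-independent fit such as a Bradley-Terry model is a common alternative when all votes are available up front.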

Model Performance

Rank  Score
#2    105.0%
#4    102.6%
#7    100.0%
#8    99.4%
#10   97.1%
#12   95.4%
#13   95.1%
#14   95.0%
#17   95.0%
#18   95.0%
#19   94.9%
#20   94.4%
#21   94.3%
#22   94.3%
#23   93.3%
#24   93.3%
#25   92.9%
#26   91.7%
#27   91.7%
#29   91.3%
#31   90.9%
#33   90.1%
#34   88.7%
#35   88.6%
#36   88.4%
#37   88.4%
#40   86.4%
#41   86.0%
#42   85.4%
#43   85.1%
#44   85.1%
#45   84.4%
#47   83.9%
#50   83.4%
#53   83.1%
#54   82.9%
#56   82.6%
#57   81.6%
#58   81.4%
#59   80.9%
#60   80.4%
#63   79.9%
#65   77.9%
#66   77.3%
#67   77.1%
#68   76.6%
#70   73.0%
#71   72.4%
#72   69.6%
#73   69.1%
#74   67.0%
#75   63.1%
#76   61.7%
#77   57.1%
#78   52.0%
#79   45.0%
#80   42.7%
#81   39.6%