Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

LiveBench (Reasoning)

Evaluates performance on LiveBench reasoning tasks, including a harder version of the Web of Lies task from Big-Bench Hard, and Zebra Puzzles.

Model Performance

Rank    Score
#1      98.2%
#3      93.7%
#4      91.1%
#9      87.6%
#10     84.8%
#12     83.7%
#13     82.6%
#14     81.7%
#16     78.5%
#26     59.4%
#27     59.2%
#28     58.4%
#29     57.8%
#30     56.4%
#31     54.9%
#32     54.4%
#33     53.2%
#35     49.1%
#36     48.8%
#37     48.2%
#38     44.4%
#39     44.2%
#41     43.2%
#42     42.8%
#43     42.3%
#44     42.2%
#45     42.0%
#46     40.9%
#48     39.7%
#49     39.2%
#50     36.3%
#51     33.9%
#52     33.8%
#53     26.2%
#54     21.6%