Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

LiveBench (Reasoning)

Evaluates performance on LiveBench reasoning tasks, including a harder version of the Web of Lies task from Big-Bench Hard, and Zebra Puzzles.

Model Performance

Rank    Score
#1      98.2%
#3      94.7%
#6      91.1%
#7      91.0%
#10     87.6%
#11     83.1%
#12     82.6%
#13     78.5%
#15     77.6%
#18     63.0%
#19     59.2%
#20     57.8%
#21     56.4%
#22     54.9%
#23     54.9%
#24     54.4%
#25     49.1%
#26     48.8%
#27     44.4%
#30     42.0%
#31     36.3%
#32     33.8%
#33     26.2%