Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

IFBench

A benchmark that evaluates how well precise instruction following generalizes to 58 new, diverse, and challenging verifiable out-of-domain constraints. IFBench tests whether language models can follow human instructions precisely, particularly output constraints such as 'only answer with yes or no' or 'mention the word abrakadabra at least 3 times'.
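
To illustrate what "verifiable" means here, the two example constraints quoted above can each be checked by a small deterministic function. The sketch below is only an illustration: the function names `only_yes_or_no` and `mentions_word_at_least` are hypothetical, and this is not IFBench's actual verification code.

```python
import re


def only_yes_or_no(response: str) -> bool:
    """Check the constraint 'only answer with yes or no'.

    Assumption: case is ignored, and surrounding whitespace plus
    trailing '.' or '!' are tolerated.
    """
    cleaned = response.strip().strip(".!").strip().lower()
    return cleaned in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, minimum: int) -> bool:
    """Check the constraint 'mention the word <word> at least <minimum> times'.

    Assumption: matching is case-insensitive and on whole words only.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= minimum


if __name__ == "__main__":
    print(only_yes_or_no("Yes."))
    print(only_yes_or_no("Yes, because it is faster."))
    print(mentions_word_at_least(
        "abrakadabra, abrakadabra, Abrakadabra!", "abrakadabra", 3))
```

Because each check returns a plain boolean, a response either satisfies the constraint or it does not, with no human judgment required.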

Model Performance

Rank    Score
#2      73.0%
#3      73.0%
#4      71.0%
#5      71.0%
#6      71.0%
#7      69.0%
#11     66.0%
#14     54.0%
#15     54.0%
#17     51.0%
#21     48.0%
#22     47.0%
#23     46.0%
#24     46.0%
#25     45.0%
#26     43.0%
#28     43.0%
#29     42.0%
#30     41.0%
#31     40.0%
#32     40.0%
#33     40.0%
#34     39.0%
#35     39.0%
#36     37.0%
#37     35.0%
#38     31.0%
#39     31.0%
#41     24.0%