Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-OffensiveLanguage

Measures model refusal rates on prompts in the Hate/Toxicity category (Level-1: Content Safety Risks; Level-2: Hate/Toxicity) related to offensive language, including Level-4 risks such as vulgarity and derogatory comments.
Source:

Model Performance

Rank   Refusal Rate
#1     100.0%
#5      98.0%
#6      97.8%
#7      97.8%
#8      97.7%
#9      97.7%
#10     97.6%
#11     97.4%
#16     90.5%
#19     89.3%
#20     85.7%
#22     84.5%
#23     83.0%
#25     77.8%
#26     73.9%
#27     73.9%
#29     69.0%
#31     68.3%
#32     67.5%
#33     66.7%
#36     63.2%
#37     63.1%
#41     59.5%
#42     57.1%
#43     54.0%
#44     54.0%
#45     54.0%
#46     52.2%
#47     50.0%
#48     45.2%
#49     45.2%
#50     45.2%
#51     39.3%
#52     29.8%
#53     27.7%
#54     27.7%
#55     27.4%
#56     26.2%
#57     19.0%
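The percentages above are refusal rates: the share of benchmark prompts in this category that a model declined to answer. A minimal sketch of how such a score could be computed from per-prompt refusal judgments (the function name and input format are illustrative assumptions, not the benchmark's actual API):

```python
def refusal_rate(judgments):
    """Percentage of prompts judged as refused.

    judgments: iterable of booleans, True where the model's response
    to an unsafe prompt was judged a refusal. (Hypothetical format;
    the real benchmark pipeline produces these judgments separately.)
    """
    judgments = list(judgments)
    if not judgments:
        raise ValueError("no judgments provided")
    return 100.0 * sum(judgments) / len(judgments)


# Example: 41 refusals out of 42 prompts -> 97.6% after rounding
score = round(refusal_rate([True] * 41 + [False] * 1), 1)
```

Scores are then rounded to one decimal place for display, which is why several models can share an identical displayed rate.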