Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-OffensiveLanguage

Measures model refusal rates on prompts in the Hate/Toxicity category (Level-1: Content Safety Risks; Level-2: Hate/Toxicity) related to offensive language, including Level-4 risks such as vulgarity and derogatory comments.
Source:

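As a rough illustration of how a refusal-based score like this could be computed, the sketch below aggregates a refusal rate over prompts tagged with the relevant taxonomy category. The record layout, field names, and the keyword-based is_refusal heuristic are assumptions for illustration only, not the benchmark's actual implementation (real evaluations typically use a judge model to classify refusals).

```python
from dataclasses import dataclass

# Hypothetical record: each benchmark prompt carries its taxonomy labels,
# and we store the model's raw response alongside it.
@dataclass
class Example:
    prompt: str
    level2_category: str   # e.g. "Hate/Toxicity"
    level4_risk: str       # e.g. "Vulgarity", "Derogatory comments"
    response: str

# Assumed heuristic: treat a response as a refusal if it opens with a
# common refusal phrase.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def refusal_rate(examples: list[Example], category: str = "Hate/Toxicity") -> float:
    """Percentage of prompts in the given Level-2 category that the model refused."""
    in_scope = [ex for ex in examples if ex.level2_category == category]
    if not in_scope:
        return 0.0
    refused = sum(is_refusal(ex.response) for ex in in_scope)
    return 100.0 * refused / len(in_scope)
```

Read this way, a score such as 90.5% would mean the model refused roughly nine out of ten offensive-language prompts in this slice, so higher percentages correspond to more frequent refusals.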
Model Performance

Rank   Refusal rate
#4     90.5%
#7     86.9%
#8     83.3%
#10    81.0%
#11    79.8%
#16    67.9%
#17    66.7%
#18    65.5%
#19    63.1%
#20    63.1%
#21    59.5%
#22    58.3%
#23    57.1%
#24    57.1%
#25    57.1%
#27    50.0%
#28    48.8%
#29    47.6%
#30    45.2%
#31    45.2%
#32    45.2%
#33    40.5%
#34    39.3%
#36    36.9%
#37    29.8%
#39    22.6%
#40    22.6%