Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-Harassment

Measures model refusal of harassment-related content within the Hate/Toxicity category (Level-1: Content Safety Risks, Level-2: Hate/Toxicity). Covers Level-4 risks such as bullying, threats, intimidation, shaming, humiliation, insults/personal attacks, abuse, provoking, trolling, doxxing, and cursing.

Model Performance

Rank  Score
#2    95.4%
#5    91.4%
#6    90.3%
#8    89.2%
#9    84.7%
#10   83.1%
#12   82.5%
#13   80.9%
#14   73.7%
#16   70.7%
#20   64.0%
#22   63.2%
#24   56.5%
#25   54.8%
#26   54.6%
#27   52.7%
#28   52.7%
#29   52.2%
#30   50.3%
#31   48.9%
#32   43.5%
#33   39.8%
#34   36.0%
#35   35.8%
#36   35.5%
#37   27.7%
#38   27.7%
#39   26.3%