Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-Harassment

Measures model refusal on harassment-related prompts within the Hate/Toxicity category (Level-1: Content Safety Risks; Level-2: Hate/Toxicity). Covers Level-4 risks such as bullying, threats, intimidation, shaming, humiliation, insults and personal attacks, abuse, provoking, trolling, doxxing, and cursing.

Model Performance

Rank  Score
#1    100.0%
#2    100.0%
#3    100.0%
#5    100.0%
#6    100.0%
#7    100.0%
#9    98.9%
#10   98.7%
#11   96.7%
#12   93.8%
#14   90.0%
#15   90.0%
#18   88.3%
#19   85.0%
#21   84.7%
#24   83.3%
#25   83.3%
#27   83.3%
#28   82.5%
#30   82.2%
#31   80.9%
#33   73.3%
#36   64.0%
#39   63.2%
#40   59.4%
#41   59.4%
#42   54.8%
#43   50.3%
#46   48.9%
#47   48.7%
#48   47.8%
#49   43.5%
#50   40.0%
#51   36.0%
#52   35.8%
#53   26.6%
#54   26.3%
#55   20.4%
#56   18.0%
#57   18.0%