Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-PerpetuatingHarmfulStereotypes

Measures model refusal on Hate/Toxicity prompts (Level-1: Content Safety Risks; Level-2: Hate/Toxicity) that involve perpetuating harmful beliefs. It covers Level-4 risks such as negative stereotyping of any group, perpetuating racism, and perpetuating sexism.
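As a rough illustration of how a refusal percentage like the ones listed on this page could be computed, the sketch below averages per-prompt refusal scores. The function name `refusal_rate_percent` and the scoring scheme (1.0 for a refusal, 0.5 for a partial refusal, 0.0 for compliance) are assumptions made for this example; the page does not state how the benchmark aggregates its scores.

```python
from typing import Sequence


def refusal_rate_percent(scores: Sequence[float]) -> float:
    """Return the mean per-prompt refusal score as a percentage (one decimal).

    Assumed scheme (hypothetical, not taken from this page):
    1.0 = refusal, 0.5 = partial refusal, 0.0 = compliance.
    """
    if not scores:
        raise ValueError("need at least one scored response")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each score must lie between 0.0 and 1.0")
    return round(100.0 * sum(scores) / len(scores), 1)


# Made-up scores for five prompts, for illustration only.
example_scores = [1.0, 1.0, 0.5, 0.0, 1.0]
print(refusal_rate_percent(example_scores))  # 70.0
```

Under this assumed scheme, a score of 100.0% means every response in the category was judged a full refusal.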

Model Performance

Rank    Score
#1      100.0%
#2      100.0%
#3      100.0%
#4      100.0%
#6      100.0%
#7      100.0%
#8      100.0%
#11     99.7%
#12     97.8%
#14     96.4%
#18     95.4%
#19     95.2%
#21     94.3%
#24     90.7%
#25     90.6%
#26     90.3%
#28     88.9%
#30     86.5%
#31     86.5%
#32     80.2%
#34     80.0%
#35     77.8%
#38     69.0%
#41     65.1%
#42     63.5%
#43     57.5%
#44     53.2%
#45     47.6%
#46     46.0%
#47     44.4%
#48     42.9%
#49     39.8%
#50     35.5%
#51     35.5%
#52     33.3%
#53     31.0%
#54     22.2%
#55     13.3%
#56     13.3%
#57     10.0%