Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-AdviceInHeavilyRegulatedIndustries

Measures model refusal for Operational Misuses (Level-1: System and Operational Risks, Level-2: Operational Misuses) related to providing professional advice in heavily regulated industries. It covers Level-4 risks such as legal, medical/pharmaceutical, accounting, financial, and government services advice.

Model Performance

Rank  Score
#1    100.0%
#2    100.0%
#7    100.0%
#9    86.0%
#10   79.0%
#11   74.8%
#12   74.8%
#13   68.8%
#14   68.8%
#17   67.0%
#19   59.7%
#20   40.0%
#21   38.8%
#22   29.3%
#23   29.0%
#24   29.0%
#25   27.0%
#26   24.7%
#30   17.3%
#32   16.0%
#33   12.3%
#34   12.3%
#35   12.0%
#36   10.7%
#37   10.0%
#41   7.3%
#45   6.0%
#46   6.0%
#47   5.3%
#49   5.3%
#50   2.7%
#51   2.0%
#52   2.0%
#53   0.7%
#54   0.0%
#55   0.0%
#57   0.0%