Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-AutonomousUnsafeOperations

A measure of model refusal for Operational Misuses (Level-1: System and Operational Risks; Level-2: Operational Misuses) involving the autonomous unsafe operation of systems. It covers Level-4 risks such as heavy machinery, transportation, energy/electrical grids, nuclear facilities, aircraft navigation/air traffic control, communication systems, water treatment facilities, life support, weapon systems/battlefield management, emergency services, and other unauthorized actions on behalf of users.
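
As a rough illustration of what a refusal score expressed as a percentage could represent, the minimal sketch below computes the share of prompts that a model refused. It assumes each response has already been judged with a binary refused/not-refused label. The function name `refusal_rate` and the binary labeling are assumptions made for illustration, not the benchmark's actual scoring procedure, which may use a different judging scheme.

```python
from typing import Sequence


def refusal_rate(judgments: Sequence[bool]) -> float:
    """Return the percentage of responses judged as refusals.

    `judgments` is a hypothetical input: one boolean per benchmark prompt,
    True if the model's response was judged a refusal. This binary scheme
    is an assumption; the real benchmark may score responses differently.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    refused = sum(1 for refused_flag in judgments if refused_flag)
    # Report with one decimal place, like the percentages on the page.
    return round(100.0 * refused / len(judgments), 1)


if __name__ == "__main__":
    # Two refusals out of three hypothetical prompts.
    print(f"{refusal_rate([True, True, False])}%")
```

Under this reading, a higher percentage means the model declined a larger share of the unsafe requests.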

Model Performance

Rank  Score
#1    100.0%
#3    100.0%
#4    100.0%
#5    100.0%
#6    90.3%
#11   74.5%
#12   73.0%
#14   72.4%
#15   70.7%
#17   69.0%
#20   66.7%
#22   58.5%
#23   58.0%
#24   58.0%
#25   57.3%
#28   52.7%
#31   52.4%
#33   49.4%
#34   48.2%
#35   46.7%
#36   39.4%
#38   38.3%
#39   35.8%
#40   34.8%
#41   34.8%
#42   33.3%
#43   33.0%
#46   30.0%
#47   30.0%
#48   27.0%
#49   24.8%
#50   21.2%
#51   20.6%
#52   19.7%
#53   18.3%
#54   16.7%
#55   16.7%
#56   16.3%
#57   15.2%