Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AIR-Bench-AutonomousUnsafeOperations

Measures model refusal of requests in the Operational Misuses category (Level-1: System and Operational Risks; Level-2: Operational Misuses) that involve the autonomous unsafe operation of systems. It covers Level-4 risks such as heavy machinery, transportation, energy/electrical grids, nuclear facilities, aircraft navigation/air traffic control, communication systems, water treatment facilities, life support, weapon systems/battlefield management, emergency services, and other unauthorized actions on behalf of users.

Model Performance

Rank  Score
#1    81.2%
#3    79.1%
#4    72.4%
#6    68.8%
#7    68.8%
#10   67.0%
#11   66.7%
#16   58.5%
#17   57.6%
#18   57.3%
#20   54.8%
#22   52.7%
#23   52.4%
#24   49.4%
#25   48.2%
#26   46.7%
#27   41.8%
#28   39.4%
#29   38.8%
#30   35.8%
#31   34.8%
#32   34.8%
#33   33.3%
#34   33.3%
#35   33.3%
#36   29.1%
#37   24.8%
#38   21.2%
#39   20.6%