Benchmark Details - Model Trust Scores

AIR-Bench-Unauthorizedprivacyviolationssensitivedata

This benchmark combines two metrics, UnauthorizedPrivacyViolations and SensitiveData. Both measure model refusal in the Privacy category (Level-1: Legal and Rights-Related Risks, Level-2: Privacy). UnauthorizedPrivacyViolations covers unauthorized privacy violations, with Level-4 risks including unauthorized generation, disclosure, distribution, collection/gathering/theft, processing, inference/synthesis, non-consensual tracking/monitoring/stalking/spyware, and model attacks (membership inference, model inversion). SensitiveData covers types of sensitive data, with Level-4 categories including personally identifiable information, health data, location data, demographic data, biometric data (facial recognition), educational records, financial records, behavioral/preference data, and communication records.
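As a rough illustration of how a refusal-based score of this kind could be computed, the sketch below assumes that each metric is the percentage of prompts the model refused and that the combined score is the unweighted mean of the two. The page does not state the aggregation rule, so the averaging step, the function names, and the sample outcomes are all hypothetical.

```python
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Percentage of judged responses that were refusals."""
    if not refused:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(refused) / len(refused)


def combined_score(
    privacy_violations: Sequence[bool],
    sensitive_data: Sequence[bool],
) -> float:
    """ASSUMPTION: unweighted mean of the two sub-metric refusal rates.

    The source only says the benchmark is a "combination of two metrics";
    the real aggregation may differ.
    """
    return (refusal_rate(privacy_violations) + refusal_rate(sensitive_data)) / 2.0


# Hypothetical judged outcomes (True = model refused), not real benchmark data.
upv_outcomes = [True, True, True, False]  # 3 of 4 refused -> 75.0
sd_outcomes = [True, True, True, True]    # 4 of 4 refused -> 100.0

score = combined_score(upv_outcomes, sd_outcomes)
print(f"{score:.1f}%")  # prints 87.5%
```

Under this assumption, a model reaches 100.0% only if it refuses every prompt in both sub-metrics.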

Model Performance

Rank | Model | Score
#1 | Claude 3 Sonnet | 100.0%
#2 | Claude 4.0 Opus (Thinking) | 100.0%
#3 | Claude 4.0 Opus | 100.0%
#4 | Claude Sonnet 4.5 (Thinking) | 100.0%
#5 | Claude Sonnet 4.5 | 100.0%
#6 | GPT-4.1 | 100.0%
#7 | Gemini 2.5 Pro (Thinking) | 100.0%
#8 | Grok 3 (Thinking) | 100.0%
#9 | GPT OSS 120B | 100.0%
#10 | OpenAI o1 | 100.0%
#11 | OpenAI o4 Mini (Medium Effort) | 100.0%
#12 | OpenAI o3 (Medium Effort) | 100.0%
#13 | Grok 3 | 100.0%
#14 | Gemini 2.5 Pro | 100.0%
#15 | OpenAI o3 (High Effort) | 100.0%
#16 | OpenAI o4 Mini (High Effort) | 100.0%
#17 | GPT-5 (Thinking) | 98.7%
#18 | GPT-5 | 98.7%
#19 | Claude Haiku 4.5 (Thinking) | 98.0%
#20 | Claude Haiku 4.5 | 98.0%
#21 | Claude 4.0 Sonnet | 96.7%
#22 | Claude 4.0 Sonnet (Thinking) | 96.7%
#23 | Granite 3.0 | 90.0%
#24 | Gemini 2.0 Pro | 90.0%
#25 | Claude 3.5 Haiku | 90.0%
#26 | GPT-4.5 | 86.7%
#27 | Claude 3.5 Sonnet | 84.8%
#28 | Claude 3 Opus | 83.3%
#29 | Kimi K2 | 81.6%
#30 | Claude 3.7 Sonnet | 80.3%
#31 | Claude 3.7 Sonnet (Thinking) | 80.3%
#32 | Gemini 3.1 Pro Preview | 73.3%
#33 | Gemini 2.0 Flash | 68.0%
#34 | Gemini 2.0 Flash Thinking (Jan 2025) | 68.0%
#35 | Gemini 2.0 Flash (Thinking) | 68.0%
#36 | OpenAI o3 Mini (Medium Effort) | 67.8%
#37 | OpenAI o3 Mini (High Effort) | 67.8%
#38 | GPT-4o Mini | 66.9%
#39 | Llama 3.1 405B | 64.3%
#40 | Llama 4 Maverick 17B | 63.9%
#41 | Qwen 3 | 60.2%
#42 | Qwen 3 (Thinking) | 60.2%
#43 | Gemini 1.5 Pro | 57.4%
#44 | GPT-4o | 56.0%
#45 | DeepSeek R1 | 55.5%
#46 | Gemini 2.5 Flash | 52.2%
#47 | Gemini 2.5 Flash (Thinking) | 52.2%
#48 | Grok 3 Mini | 50.0%
#49 | Llama 3.3 70B | 44.9%
#50 | DeepSeek V3 | 41.1%
#51 | GPT-3.5 Turbo | 38.9%
#52 | Mistral Large 2 | 36.6%
#53 | Grok 4 | 33.3%
#54 | Grok 4 (Thinking) | 33.3%
#55 | Cohere Command R+ | 31.1%
#56 | Cohere Command A | 31.0%
#57 | OpenAI o1 Mini | 26.7%
