Model Explorer
Explore benchmark performance of various AI models
Models
Claude-3-Opus
Claude-3.5-Haiku
Claude-3.5-Sonnet-1022
Claude-3.7-Sonnet
Claude-3.7-Sonnet-Thinking
Claude-4.0-Opus
Claude-4.0-Opus-Thinking
Claude-4.0-Sonnet
Claude-4.0-Sonnet-Thinking
Claude-4.1-Opus-Thinking
Cohere-Command-A
Cohere-Command-R-Plus
DeepSeek-R1
DeepSeek-V3-0324
GPT-3.5-Turbo
GPT-4-mini
GPT-4.1
GPT-4o-0513
GPT-5
GPT-5-Thinking
GPT-5-mini
GPT-5-mini-Thinking
GPT-5-nano
GPT-5-nano-Thinking
GPT-OSS-120B
Gemini-2.0-Flash
Gemini-2.0-Pro-0121
Gemini-2.5-Flash
Gemini-2.5-Flash-Thinking
Gemini-2.5-Pro-0325
Gemini-2.5-Pro-0605
Gemini-2.5-Pro-Thinking
Grok-3-Beta
Grok-3-Mini-Beta
Grok-4
Grok-4-Thinking
Kimi-K2-Instruct
Llama-2-7B
Llama-3.1-405B
Llama-3.3-70B
Llama-4-Maverick-17B
Magistral-Medium-3.1
Mistral-Large-2
Mistral-Medium-3.1
OpenAI-O1-1217
OpenAI-O1-mini
OpenAI-O3-high
OpenAI-O3-medium
OpenAI-O3-mini-high
OpenAI-O3-mini-medium
OpenAI-O4-mini-high
OpenAI-O4-mini-medium
Phi-4
Qwen-3
Qwen-3-Thinking
GPT-4.1
OpenAI's GPT-4.1 model, a successor to earlier models in the GPT-4 series
Performance by Benchmark
Capability Benchmarks
Scores in chart order (benchmark labels not recoverable): 61.0%, 39.6%, 78.1%, 83.0%, 85.1%, 65.3%, 68.4%, 71.2%, 24.6%, 52.8%, 96.0%, 5.0%, 43.0%, 81.9%, 55.9%, 73.2%, 54.5%, 62.4%, 44.4%, 54.7%, 87.7%, 78.3%, 72.4%, 87.2%, 79.0%, 47.4%, 38.0%, 27.0%, 78.4%, 13.0%, 47.0%
Safety Benchmarks
Scores in chart order (benchmark labels not recoverable): 67.4%, 77.8%, 18.3%, 62.7%, 64.0%, 90.1%, 16.7%, 93.3%, 83.3%
Capability & Safety Benchmarks
91.2%