Benchmark Explorer
Explore how models perform on various benchmarks
Benchmarks
Capability Benchmarks
AA-LCR
AIME
ARC-AGI
CaseLaw
Chatbot Arena (Win Rate)
Chatbot Arena AAII
Chatbot Arena Coding
Chatbot Arena Vision
ContractLaw
CorpFin
FinanceAgent
GPQA
HumanEval
Humanity's Last Exam
IFBench
IOI
LegalBench
LiveBench (Agentic Coding)
LiveBench (Average)
LiveBench (Coding)
LiveBench (Data Analysis)
LiveBench (Instruction Following)
LiveBench (Language)
LiveBench (Math)
LiveBench (Reasoning)
LiveCodeBench
MGSM
MMLU Pro
MMMU
Math500
MortgageTax
SAGE
SWE-bench
SciCode
SimpleBench
TaxEval
Terminal-Bench Hard
Vals Index
Vals Multimodal Index
Vibe Code Bench
τ²-Bench Telecom
Safety Benchmarks
AIR-Bench-AcademicDishonesty
AIR-Bench-AdultContent
AIR-Bench-AdviceInHeavilyRegulatedIndustries
AIR-Bench-AutomatedDecisionmaking
AIR-Bench-AutonomousUnsafeOperations
AIR-Bench-Availability
AIR-Bench-CelebratingSuffering
AIR-Bench-ChildSexualAbuse
AIR-Bench-Confidentiality
AIR-Bench-DepictingViolence
AIR-Bench-DeterringDemocraticParticipation
AIR-Bench-DiscriminationprotectedCharacteristics
AIR-Bench-DisempoweringWorkers
AIR-Bench-DisruptingSocialOrder
AIR-Bench-EndangermentHarmOrLossOfLife
AIR-Bench-Erotic
AIR-Bench-Fraud
AIR-Bench-FraudulentSchemes
AIR-Bench-Harassment
AIR-Bench-HateSpeechIncitingViolence
AIR-Bench-HighRiskFinancialActivities
AIR-Bench-IllegalRegulatedSubstances
AIR-Bench-IllegalServicesExploitation
AIR-Bench-InfluencingPolitics
AIR-Bench-Integrity
AIR-Bench-MilitaryAndWarfare
AIR-Bench-Misdisinformation
AIR-Bench-Misrepresentation
AIR-Bench-Monetized
AIR-Bench-NonconsensualNudity
AIR-Bench-OffensiveLanguage
AIR-Bench-OtherIllegalunlawfulActivity
AIR-Bench-PerpetuatingHarmfulStereotypes
AIR-Bench-PoliticalPersuasion
AIR-Bench-SowingDivision
AIR-Bench-SpecificTypesOfRights
AIR-Bench-SuicidalAndNonsuicidalSelfinjury
AIR-Bench-SupportingMaliciousOperations
AIR-Bench-TypesOfDefamation
AIR-Bench-Unauthorizedprivacyviolationssensitivedata
AIR-Bench-UnfairMarketPractices
AIR-Bench-ViolentActs
AIR-Bench-WeaponUsageDevelopment
Capability & Safety Benchmarks
MedQA
Speed & Latency Metrics
Median Tokens/s
Cost & Pricing Metrics
Blended Price (USD/1M Tokens)
Median Tokens/s
The median number of tokens processed per second, measuring the model's throughput in generating responses.
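As a rough illustration of the metric described above, the following sketch computes a median tokens-per-second figure from a handful of timed responses. The function name `median_tokens_per_second`, the `(output_tokens, elapsed_seconds)` sample layout, and the sample values are assumptions made for this example; the source does not say how the measurements behind the leaderboard are collected.

```python
from statistics import median


def median_tokens_per_second(samples):
    """Return the median throughput over (output_tokens, elapsed_seconds) pairs.

    The sample layout is hypothetical; it is not taken from the leaderboard's source.
    """
    # Skip samples with a non-positive duration to avoid division by zero.
    rates = [tokens / seconds for tokens, seconds in samples if seconds > 0]
    if not rates:
        raise ValueError("no valid samples")
    return median(rates)


# Made-up measurements, for illustration only.
samples = [(512, 2.0), (1024, 4.5), (256, 1.0), (800, 4.0)]
result = median_tokens_per_second(samples)
print(round(result, 1))
```

The median is less sensitive than the mean to a few unusually slow or fast responses, which is one plausible reason to report it for throughput.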
Model Performance
Rank | Model | Median Tokens/s
#1 | GPT OSS 120B | 253.0
#2 | Gemini 2.5 Flash (Thinking) | 235.0
#3 | Gemini 3.1 Flash Lite Preview | 234.0
#4 | Grok 4.20 (Reasoning) | 230.0
#5 | Grok 4.20 | 214.0
#6 | GPT-5.4 Mini | 210.0
#7 | Gemini 2.5 Flash | 209.0
#8 | Grok 3 Mini | 191.0
#9 | GPT-5.4 Nano | 191.0
#10 | Devstral Small 2 | 185.0
#11 | Gemini 3.0 Flash | 163.0
#12 | GPT-5 | 143.1
#13 | GPT-5 Nano (Thinking) | 139.0
#14 | GPT-5 Nano | 135.0
#15 | OpenAI o3 Mini (Medium Effort) | 131.0
#16 | MiMo V2 Flash | 127.0
#17 | Gemini 2.5 Pro (Thinking) | 124.0
#18 | Llama 4 Maverick 17B | 124.0
#19 | GPT-5.1 Codex Max | 123.0
#20 | OpenAI o3 Mini (High Effort) | 123.0
#21 | OpenAI o4 Mini (Medium Effort) | 122.0
#22 | OpenAI o4 Mini (High Effort) | 122.0
#23 | Llama 2 7B | 116.8
#24 | GPT-5.2 Codex | 116.0
#25 | Gemini 3.1 Pro Preview | 110.0
#26 | Claude Haiku 4.5 (Thinking) | 101.0
#27 | Claude Haiku 4.5 | 95.0
#28 | GPT-3.5 Turbo | 93.6
#29 | OpenAI o1 | 91.0
#30 | Llama 3.3 70B | 85.0
#31 | GPT-4.1 | 82.0
#32 | Grok 3 | 81.0
#33 | GLM-4.7 | 78.0
#34 | Devstral 2 | 78.0
#35 | OpenAI o3 (High Effort) | 77.0
#36 | OpenAI o3 (Medium Effort) | 77.0
#37 | Mistral Medium 3.1 | 76.0
#38 | GPT-5 (Thinking) | 74.0
#39 | GPT-5 Mini | 72.0
#40 | GPT-5 Mini (Thinking) | 69.0
#41 | GPT-5.3 Codex | 68.0
#42 | GPT-5.2 | 66.0
#43 | GPT-5.4 | 66.0
#44 | GPT-4o | 64.0
#45 | GLM-5 | 60.0
#46 | MiniMax M2.1 | 57.0
#47 | MiniMax M2.5 | 54.0
#48 | Qwen 3 | 53.0
#49 | Claude Opus 4.5 (Thinking) | 52.0
#50 | Claude 4.0 Sonnet (Thinking) | 51.0
#51 | MiniMax M2.7 | 51.0
#52 | Grok 4 | 49.0
#53 | Grok 4 (Thinking) | 49.0
#54 | Claude Sonnet 4.5 (Thinking) | 49.0
#55 | Mistral Large 3 | 48.0
#56 | Qwen 3 Max Preview | 47.0
#57 | Claude Opus 4.5 | 46.0
#58 | Qwen 3 Max (Thinking) | 46.0
#59 | Kimi K2.5 | 45.0
#60 | Claude Opus 4.6 (Thinking) | 45.0
#61 | Claude Sonnet 4.5 | 43.0
#62 | Claude 4.0 Sonnet | 43.0
#63 | Claude Sonnet 4.6 | 43.0
#64 | Cohere Command A | 43.0
#65 | Claude Opus 4.6 | 41.0
#66 | Qwen 3 (Thinking) | 40.0
#67 | Mistral Large 2 | 40.0
#68 | Kimi K2 | 39.0
#69 | GPT-4o Mini | 39.0
#70 | Claude 4.1 Opus (Thinking) | 35.0
#71 | Claude 4.0 Opus (Thinking) | 34.0
#72 | DeepSeek V3.2 (Thinking) | 33.0
#73 | DeepSeek V3.2 | 32.0
#74 | Claude 4.0 Opus | 32.0
#75 | Claude 4.1 Opus | 32.0
#76 | Llama 3.1 405B | 30.0
#77 | Phi-4 | 12.0
#78 | Claude 3 Sonnet | 0.0
#79 | Claude 3.5 Haiku | 0.0
#80 | Claude 3 Opus | 0.0
#81 | DeepSeek V3 (Mar 2025) | 0.0
#82 | DeepSeek V3 | 0.0
#83 | Claude 3.7 Sonnet (Thinking) | 0.0
#84 | Claude 3.5 Sonnet | 0.0
#85 | Claude 3.7 Sonnet | 0.0
#86 | GPT-4.5 | 0.0
#87 | DeepSeek R1 | 0.0
#88 | Cohere Command R+ | 0.0
#89 | Gemini 2.0 Flash (Thinking) | 0.0
#90 | Gemini 2.0 Pro | 0.0
#91 | Gemini 2.5 Pro (Jun 2025) | 0.0
#92 | Gemini 1.5 Pro | 0.0
#93 | Gemini 2.0 Flash | 0.0
#94 | Grok 3 (Thinking) | 0.0
#95 | Gemini 2.0 Flash Thinking (Jan 2025) | 0.0
#96 | Gemini 2.5 Pro | 0.0
#97 | Llama 2 13B | 0.0
#98 | Mistral Large | 0.0
#99 | Magistral Medium 3.1 | 0.0
#100 | Llama 2 70B | 0.0
#101 | OpenAI o1 Mini | 0.0
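
For readers who want to work with these numbers programmatically, here is a minimal sketch of re-ranking a few of the listed entries so that tied values share a rank. The helper `rank_by_throughput` and its `treat_zero_as_missing` flag are hypothetical. Dropping the 0.0 rows rests on the assumption that they mark unreported measurements rather than true zero throughput, which the source does not confirm.

```python
def rank_by_throughput(entries, treat_zero_as_missing=True):
    """Rank (model, tokens_per_second) pairs from fastest to slowest.

    Tied values share a rank (competition ranking: 1, 2, 2, 4).
    Treating 0.0 as missing is an assumption, not something the source states.
    """
    if treat_zero_as_missing:
        entries = [(model, value) for model, value in entries if value > 0]
    # sorted() is stable, so tied models keep their original relative order.
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    ranked = []
    for position, (model, value) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == value:
            rank = ranked[-1][0]  # same value as the previous row: reuse its rank
        else:
            rank = position
        ranked.append((rank, model, value))
    return ranked


# A small subset of the leaderboard above.
leaderboard = [
    ("GPT OSS 120B", 253.0),
    ("Grok 3 Mini", 191.0),
    ("GPT-5.4 Nano", 191.0),
    ("Devstral Small 2", 185.0),
    ("Claude 3 Opus", 0.0),
]
ranked = rank_by_throughput(leaderboard)
for rank, model, value in ranked:
    print(f"#{rank} {model}: {value}")
```

Passing `treat_zero_as_missing=False` keeps the 0.0 rows at the bottom, as in the listing above.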