Benchmark Explorer
Explore how models perform on various benchmarks
Benchmarks
Capability Benchmarks
AA-LCR
AIME
Blended Price (USD/1M Tokens)
CaseLaw
Chatbot Arena (Win Rate)
Chatbot Arena Coding
Chatbot Arena Vision
ContractLaw
CorpFin
FinanceAgent
GPQA
HumanEval
Humanity's Last Exam
IFBench
IOI
LegalBench
LiveBench (Agentic Coding)
LiveBench (Average)
LiveBench (Coding)
LiveBench (Data Analysis)
LiveBench (Instruction Following)
LiveBench (Language)
LiveBench (Math)
LiveBench (Reasoning)
LiveCodeBench
MGSM
MMLU Pro
MMMU
Math500
Median Tokens/s
MortgageTax
SWE-bench
SciCode
SimpleBench
TaxEval
Terminal-Bench Hard
τ²-Bench Telecom
Safety Benchmarks
AIR-Bench-AcademicDishonesty
AIR-Bench-AdultContent
AIR-Bench-AdviceInHeavilyRegulatedIndustries
AIR-Bench-AutomatedDecisionmaking
AIR-Bench-AutonomousUnsafeOperations
AIR-Bench-Availability
AIR-Bench-CelebratingSuffering
AIR-Bench-ChildSexualAbuse
AIR-Bench-Confidentiality
AIR-Bench-DepictingViolence
AIR-Bench-DeterringDemocraticParticipation
AIR-Bench-DiscriminationprotectedCharacteristics
AIR-Bench-DisempoweringWorkers
AIR-Bench-DisruptingSocialOrder
AIR-Bench-EndangermentHarmOrLossOfLife
AIR-Bench-Erotic
AIR-Bench-Fraud
AIR-Bench-FraudulentSchemes
AIR-Bench-Harassment
AIR-Bench-HateSpeechIncitingViolence
AIR-Bench-HighRiskFinancialActivities
AIR-Bench-IllegalRegulatedSubstances
AIR-Bench-IllegalServicesExploitation
AIR-Bench-InfluencingPolitics
AIR-Bench-Integrity
AIR-Bench-MilitaryAndWarfare
AIR-Bench-Misdisinformation
AIR-Bench-Misrepresentation
AIR-Bench-Monetized
AIR-Bench-NonconsensualNudity
AIR-Bench-OffensiveLanguage
AIR-Bench-OtherIllegalunlawfulActivity
AIR-Bench-PerpetuatingHarmfulStereotypes
AIR-Bench-PoliticalPersuasion
AIR-Bench-SowingDivision
AIR-Bench-SpecificTypesOfRights
AIR-Bench-SuicidalAndNonsuicidalSelfinjury
AIR-Bench-SupportingMaliciousOperations
AIR-Bench-TypesOfDefamation
AIR-Bench-Unauthorizedprivacyviolationssensitivedata
AIR-Bench-UnfairMarketPractices
AIR-Bench-ViolentActs
AIR-Bench-WeaponUsageDevelopment
Capability & Safety Benchmarks
MedQA
GPQA
Measures accuracy on the most challenging subset of GPQA (Graduate-Level Google-Proof Q&A), a benchmark of difficult, expert-level questions.
Model Performance
Rank  Model  Score
#1  Grok-4-Thinking  84.1%
#2  Grok-4  84.1%
#3  OpenAI-O3-medium  78.1%
#4  OpenAI-O3-high  78.1%
#5  Gemini-2.5-Pro-Thinking  73.7%
#6  Gemini-2.5-Pro-0325  73.7%
#7  GPT-5-mini  73.7%
#8  Grok-3-Mini-Beta  72.0%
#9  Qwen-3-Max-Preview  70.4%
#10  Claude-4.1-Opus-Thinking  67.3%
#11  Claude-3.7-Sonnet-Thinking  67.1%
#12  OpenAI-O3-mini-high  66.7%
#13  OpenAI-O3-mini-medium  66.7%
#14  Claude-4.0-Sonnet-Thinking  66.0%
#15  OpenAI-O4-mini-high  66.0%
#16  OpenAI-O4-mini-medium  66.0%
#17  Grok-3-Beta  64.9%
#18  Grok-3-Think  64.9%
#19  OpenAI-O1-1217  64.0%
#20  Claude-4.0-Opus  62.3%
#21  Kimi-K2-Instruct  62.0%
#22  Claude-4.1-Opus  59.9%
#23  Claude-4.0-Sonnet  59.2%
#24  Llama-4-Maverick-17B  56.9%
#25  Claude-3.7-Sonnet  56.5%
#26  Qwen-3  55.2%
#27  Qwen-3-Thinking  55.2%
#28  Gemini-2.0-Flash  53.6%
#29  GPT-4.1  52.8%
#30  DeepSeek-V3-0324  48.1%
#31  Claude-3.5-Sonnet-1022  45.5%
#32  Gemini-2.5-Flash-Thinking  44.8%
#33  Gemini-1.5-Pro  44.4%
#34  DeepSeek-V3  38.7%
#35  Gemini-2.5-Flash  37.7%
#36  GPT-4o-0513  33.7%
#37  Llama-3.3-70B  33.3%
#38  Mistral-Large-2  26.9%
#39  GPT-4-mini  25.6%
#40  Claude-3.5-Haiku  17.2%
#41  Cohere-Command-A  5.7%
#42  Cohere-Command-R-Plus  5.3%
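Several models in the listing above share a score but sit on consecutive ranks (for example, Grok-4-Thinking and Grok-4 are both at 84.1%). When comparing models it can help to treat such entries as tied. The following is a minimal sketch, not the explorer's own ranking logic: it assumes standard competition ranking ("1, 1, 3") is the desired tie rule, and the function name `competition_rank` is hypothetical. The rows are the first eight entries copied from the listing.

```python
from itertools import groupby

# (model, score in percent) rows copied from the top of the GPQA listing.
rows = [
    ("Grok-4-Thinking", 84.1),
    ("Grok-4", 84.1),
    ("OpenAI-O3-medium", 78.1),
    ("OpenAI-O3-high", 78.1),
    ("Gemini-2.5-Pro-Thinking", 73.7),
    ("Gemini-2.5-Pro-0325", 73.7),
    ("GPT-5-mini", 73.7),
    ("Grok-3-Mini-Beta", 72.0),
]


def competition_rank(rows):
    """Return (rank, model, score) tuples in which tied scores share a rank.

    Assumption: standard competition ranking, so after a tie of k models
    the next rank skips ahead by k.
    """
    # sorted() is stable, so tied models keep their original relative order.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ranked = []
    position = 1
    for score, group in groupby(ordered, key=lambda r: r[1]):
        members = list(group)
        ranked.extend((position, model, score) for model, _ in members)
        position += len(members)
    return ranked


ranked = competition_rank(rows)
for rank, model, score in ranked:
    print(f"#{rank} {model} {score:.1f}%")
```

Under this tie rule the three models at 73.7% all share rank 5, and Grok-3-Mini-Beta keeps rank 8, matching its position in the original listing.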