Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

HumanEval

Evaluates a model's ability to generate a correct Python solution on the first attempt (pass@1), given only a function signature and docstring; a solution counts as correct if it passes the problem's unit tests.
Source:

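For illustration, a minimal sketch of what a HumanEval-style task looks like, loosely based on the benchmark's first problem (has_close_elements): the model sees only the signature and docstring and must produce the body, which is then run against reference unit tests. The completion and the simplified test harness shown here are assumptions for illustration, not the benchmark's exact reference code.

```python
# Prompt given to the model: signature and docstring only.
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # Example model completion (everything below the docstring):
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Simplified check: the attempt counts as a pass only if every
# reference assertion holds.
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0], 0.3)
assert not has_close_elements([1.0, 2.0, 3.0], 0.5)
print("passed")
```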
Model Performance

Rank    Score
#1      99.0%
#2      99.0%
#6      99.0%
#7      99.0%
#9      98.0%
#10     98.0%
#11     98.0%
#12     98.0%
#14     97.0%
#15     97.0%
#17     97.0%
#18     97.0%
#19     97.0%
#20     96.0%
#22     96.0%
#23     95.0%
#25     95.0%
#26     94.0%
#28     93.0%
#30     92.0%
#31     91.0%
#32     91.0%
#33     90.0%
#34     90.0%
#35     90.0%
#36     90.0%
#38     88.0%
#39     87.0%
#40     86.0%
#41     86.0%
#42     85.0%
#43     85.0%
#44     82.0%
#45     71.0%
#46     71.0%
#47     70.0%
#49     34.0%
#50     13.0%
#51     0.0%
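The percentages above appear to be pass@1 rates, i.e. the estimated probability that a single sampled completion passes all of a problem's unit tests. As a reference, here is a minimal sketch of the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), where n is the number of samples drawn per problem and c the number of those that pass; the benchmark_score helper and the example numbers are illustrative assumptions.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: pass@k = 1 - C(n - c, k) / C(n, k).
    # With k = 1 this reduces to the simple pass rate c / n.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    # results holds one (n, c) pair per problem; the benchmark score is
    # the mean pass@k over all problems, reported as a percentage.
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Example: 3 problems, 10 samples each, with 10, 7, and 0 passing samples.
print(benchmark_score([(10, 10), (10, 7), (10, 0)], k=1))  # ~56.7
```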