Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AA-LCR

The Artificial Analysis Long Context Reasoning (AA-LCR) dataset contains 100 hard text-based questions that require reasoning across multiple real-world documents, with each document set averaging ~100k input tokens. Questions are designed so that answers cannot be retrieved directly from the documents and must instead be reasoned out from multiple information sources.
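A dataset like this is typically scored with a simple evaluation loop: assemble the document set and question into one long prompt, query a model, and compare its answer to the reference. The sketch below is illustrative only; the function names, prompt format, and exact-match grading are assumptions, not the actual AA-LCR harness, and the model is stubbed.

```python
# Minimal sketch of a long-context QA evaluation loop (assumed design,
# not the official AA-LCR harness). The model call is a stub; a real
# run would substitute an LLM client.

def build_prompt(documents, question):
    """Concatenate a document set (~100k tokens in AA-LCR) with the question."""
    doc_block = "\n\n".join(
        f"--- Document {i + 1} ---\n{doc}" for i, doc in enumerate(documents)
    )
    return f"{doc_block}\n\nQuestion: {question}\nAnswer:"

def score(prediction, reference):
    """Normalized exact match; real graders are usually more lenient."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model_fn, dataset):
    """Return accuracy (%) of model_fn over (documents, question, answer) items."""
    correct = sum(
        score(model_fn(build_prompt(docs, q)), answer)
        for docs, q, answer in dataset
    )
    return 100.0 * correct / len(dataset)

# Toy usage with a stub "model" that always answers "42".
dataset = [
    (["Report A: the total is 40.", "Report B: add 2 to the total."],
     "What is the combined total?", "42"),
]
print(evaluate(lambda prompt: "42", dataset))  # → 100.0
```

Note that exact-match scoring is only a stand-in here; long-context reasoning benchmarks often use equivalence-based or LLM-assisted grading instead.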

Model Performance

Rank   Score
#1     76.0%
#2     76.0%
#3     69.0%
#4     69.0%
#5     68.0%
#6     68.0%
#8     67.0%
#11    66.0%
#15    61.0%
#16    55.0%
#19    51.0%
#20    51.0%
#21    50.0%
#22    46.0%
#24    44.0%
#26    41.0%
#27    40.0%
#28    40.0%
#29    36.0%
#31    31.0%
#32    29.0%
#33    28.0%
#34    20.0%
#35    18.0%
#36    15.0%
#37    5.0%
#38    0.0%
#40    0.0%