Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

Terminal-Bench Hard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks. The "hard" subset contains 47 challenging tasks that test agents' abilities to compile code, train models, configure servers, play games, and debug systems in scenarios representative of real-world problems and terminal use patterns.
Source:

Model Performance