Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

Terminal-Bench Hard

An agentic benchmark evaluating AI capabilities in terminal environments through software engineering, system administration, and data processing tasks. The "hard" subset contains 47 challenging tasks that test agents' abilities to compile code, train models, configure servers, play games, and debug systems in scenarios representative of real-world problems and terminal use patterns.
Source:

Model Performance