Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

τ²-Bench Telecom

A dual-control conversational AI benchmark that simulates technical support scenarios in which the agent and the user must coordinate their actions to resolve telecom service issues. It introduces a new paradigm for evaluating conversational AI: the user is simulated alongside the agent, and both can actively modify a shared world state.
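To make the dual-control idea concrete, here is a minimal sketch in Python. It is not the benchmark's actual code or API. The class `WorldState`, the two action functions, and the airplane-mode scenario are all hypothetical, chosen only to show a shared state that neither party can fix alone.

```python
from dataclasses import dataclass


@dataclass
class WorldState:
    """Hypothetical shared state; field names are illustrative assumptions."""
    line_active: bool = False    # account side, changeable only by the agent
    airplane_mode: bool = True   # device side, changeable only by the user

    def service_works(self) -> bool:
        # Service is restored only when both sides are in the right state.
        return self.line_active and not self.airplane_mode


def agent_reactivate_line(state: WorldState) -> None:
    """Agent-side action (hypothetical): reactivate the suspended line."""
    state.line_active = True


def user_disable_airplane_mode(state: WorldState) -> None:
    """User-side action (hypothetical): turn off airplane mode on the phone."""
    state.airplane_mode = False


def run_episode() -> list:
    """Apply one agent action, then one user action, recording service status."""
    state = WorldState()
    history = [state.service_works()]      # nothing fixed yet
    agent_reactivate_line(state)
    history.append(state.service_works())  # agent acted, user has not
    user_disable_airplane_mode(state)
    history.append(state.service_works())  # both parties have acted
    return history


if __name__ == "__main__":
    print(run_episode())
```

In this toy episode the issue is resolved only after both parties have acted, which is the coordination requirement the description refers to.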

Model Performance