Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

τ²-Bench Telecom

A dual-control conversational AI benchmark that simulates technical support scenarios in which the agent and the user must coordinate their actions to resolve telecom service issues. It introduces a new paradigm for evaluating conversational AI: both the agent and the user are simulated, and both can actively modify a shared world state.
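The dual-control idea described above can be illustrated with a toy example. This is only a minimal sketch, not the benchmark's actual code: the `WorldState` class, its fields, and the two tool functions are hypothetical names invented for illustration. It assumes a telecom issue that is resolved only when the agent and the user each change their own part of the shared state.

```python
from dataclasses import dataclass


@dataclass
class WorldState:
    """Hypothetical shared state that both parties can read and change."""
    airplane_mode: bool = True    # device-side setting, controlled by the user
    line_suspended: bool = True   # account-side setting, controlled by the agent

    def service_works(self) -> bool:
        # Service is restored only when both sides have done their part.
        return not self.airplane_mode and not self.line_suspended


def user_toggle_airplane_mode(state: WorldState) -> None:
    """User-side tool (hypothetical): only the user can touch the phone."""
    state.airplane_mode = not state.airplane_mode


def agent_resume_line(state: WorldState) -> None:
    """Agent-side tool (hypothetical): only the agent can change the account."""
    state.line_suspended = False


state = WorldState()
agent_resume_line(state)            # agent fixes the account side
half_done = state.service_works()   # still False: the phone is in airplane mode
user_toggle_airplane_mode(state)    # user follows the agent's instruction
resolved = state.service_works()    # True once both actions have been applied
```

In this sketch neither party can resolve the issue alone, which is the property that distinguishes a dual-control setup from one where only the agent acts on the environment.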

Model Performance

Rank  Score
#2    97.0%
#4    95.0%
#5    94.0%
#7    92.0%
#8    91.0%
#11   90.0%
#12   86.0%
#13   85.0%
#14   85.0%
#15   85.0%
#16   85.0%
#17   84.0%
#19   83.0%
#20   81.0%
#23   80.0%
#24   79.0%
#26   75.0%
#27   75.0%
#30   71.0%
#31   70.0%
#33   66.0%
#35   63.0%
#36   61.0%
#42   53.0%
#43   52.0%
#44   50.0%
#45   49.0%
#47   47.0%
#48   47.0%
#49   43.0%
#50   41.0%
#51   37.0%
#53   35.0%
#54   35.0%
#55   33.0%
#56   33.0%
#57   33.0%
#58   32.0%
#62   31.0%
#63   30.0%
#65   27.0%
#66   26.0%
#67   25.0%
#68   25.0%
#69   25.0%
#70   23.0%
#72   23.0%
#73   23.0%
#74   19.0%
#76   15.0%
#77   15.0%
#78   0.0%