Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AA-LCR

The Artificial Analysis Long Context Reasoning (AA-LCR) dataset contains 100 hard text-based questions that require reasoning across multiple real-world documents, with each document set averaging ~100k input tokens. Questions are designed so that answers cannot be retrieved directly from the documents and must instead be reasoned out from multiple information sources.
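A dataset like this is typically scored with a simple evaluation loop: assemble the document set and question into one long prompt, query a model, and compare its answer to the reference. The sketch below is illustrative only; the function names, prompt format, and exact-match grading are assumptions, not the actual AA-LCR harness, and the model is stubbed.

```python
# Minimal sketch of a long-context QA evaluation loop (assumed design,
# not the official AA-LCR harness). The model call is a stub; a real
# run would substitute an LLM client.

def build_prompt(documents, question):
    """Concatenate a document set (~100k tokens in AA-LCR) with the question."""
    doc_block = "\n\n".join(
        f"--- Document {i + 1} ---\n{doc}" for i, doc in enumerate(documents)
    )
    return f"{doc_block}\n\nQuestion: {question}\nAnswer:"

def score(prediction, reference):
    """Normalized exact match; real graders are usually more lenient."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model_fn, dataset):
    """Return accuracy (%) of model_fn over (documents, question, answer) items."""
    correct = sum(
        score(model_fn(build_prompt(docs, q)), answer)
        for docs, q, answer in dataset
    )
    return 100.0 * correct / len(dataset)

# Toy usage with a stub "model" that always answers "42".
dataset = [
    (["Report A: the total is 40.", "Report B: add 2 to the total."],
     "What is the combined total?", "42"),
]
print(evaluate(lambda prompt: "42", dataset))  # → 100.0
```

Note that exact-match scoring is only a stand-in here; long-context reasoning benchmarks often use equivalence-based or LLM-assisted grading instead.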

Model Performance

Rank   Score
#1     76.0%
#2     76.0%
#3     69.0%
#4     69.0%
#5     68.0%
#6     68.0%
#8     67.0%
#11    66.0%
#15    61.0%
#16    55.0%
#19    51.0%
#20    51.0%
#21    50.0%
#22    46.0%
#24    44.0%
#26    41.0%
#27    40.0%
#28    40.0%
#29    36.0%
#31    31.0%
#32    29.0%
#33    28.0%
#34    20.0%
#35    18.0%
#36    15.0%
#37    5.0%
#38    0.0%
#40    0.0%