Benchmark Explorer

Explore how models perform on various benchmarks

Benchmarks

AA-LCR

The Artificial Analysis Long Context Reasoning (AA-LCR) dataset includes 100 hard text-based questions that require reasoning across multiple real-world documents, with each document set averaging ~100k input tokens. Questions are designed so that answers cannot be retrieved directly from any single document and must instead be reasoned out by combining multiple information sources.
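As a rough illustration of the setup (not the official AA-LCR harness), the sketch below shows how one question might be evaluated: the full document set (~100k tokens) is placed in a single prompt followed by the question, and the model's answer is compared against a reference. The model name, document layout, and the exact-match scoring are all illustrative assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_long_context_question(documents: list[str], question: str) -> str:
    """Ask one AA-LCR-style question over a full document set."""
    context = "\n\n---\n\n".join(documents)  # entire document set, ~100k tokens
    prompt = (
        "Use the documents below to answer the question. The answer is not "
        "stated directly in any one document; reason across all of them.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any long-context model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def score(prediction: str, reference: str) -> float:
    # Naive exact-match scoring for illustration only; the benchmark's
    # actual grading procedure may differ.
    return float(prediction.strip().lower() == reference.strip().lower())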

Model Performance

Rank   Score
#1     76.0%
#2     76.0%
#3     76.0%
#5     74.0%
#11    69.0%
#13    68.0%
#14    68.0%
#15    67.0%
#16    67.0%
#18    66.0%
#21    66.0%
#25    65.0%
#28    61.0%
#29    59.0%
#30    59.0%
#32    59.0%
#34    58.0%
#35    58.0%
#36    55.0%
#37    55.0%
#40    51.0%
#41    51.0%
#42    51.0%
#43    50.0%
#44    48.0%
#45    48.0%
#46    47.0%
#47    46.0%
#49    44.0%
#50    44.0%
#53    40.0%
#54    39.0%
#56    38.0%
#57    37.0%
#58    36.0%
#59    36.0%
#60    36.0%
#61    35.0%
#63    31.0%
#64    31.0%
#65    30.0%
#66    30.0%
#67    29.0%
#68    28.0%
#69    25.0%
#70    24.0%
#71    24.0%
#72    23.0%
#73    20.0%
#74    20.0%
#75    18.0%
#76    18.0%
#77    15.0%
#78    5.0%
#80    0.0%