Spatial Reasoning: Germany

This benchmark evaluates an LLM's knowledge of German and Central European geography, focusing on relative positioning, proximity, and spatial orientation between cities and landmarks.

Jan 9, 2026 · 11 tasks · 110 models · $0.4142 · user_c636b9d7 · Link only

Results (Preliminary)

27 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

1. Claude Sonnet 4.5 (Anthropic): 97%
2. Claude Haiku 4.5 (Anthropic): 97%
3. Claude Opus 4.5 (Anthropic): 95%
4. Gemini 3 Flash Preview (Google): 95%
5. Kimi K2 Thinking (MoonshotAI): 95%

Model Comparison

Compare performance across models and prompts.

Model                    Developer    Via          Duration   Cost      Score
Claude Sonnet 4.5        Anthropic    OpenRouter   2.9s       $0.0254   97%
Claude Haiku 4.5         Anthropic    OpenRouter   1.7s       $0.0075   97%
Claude Opus 4.5          Anthropic    OpenRouter   3.4s       $0.0424   95%
Gemini 3 Flash Preview   Google       OpenRouter   1.3s       $0.0032   95%
Kimi K2 Thinking         MoonshotAI   OpenRouter   26.6s      $0.0059   95%
GPT-5 Mini               OpenAI       OpenRouter   12.0s      $0.0252   95%
GPT-5.2                  OpenAI       OpenRouter   4.2s       $0.0363   95%
Gemini 2.5 Pro           Google       OpenRouter   10.4s      $0.2173   95%
Claude 3.5 Haiku         Anthropic    OpenRouter   2.2s       $0.0059   94%
Gemini 2.5 Flash Lite    Google       OpenRouter   787ms      $0.0007   94%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart: best-value frontier; dot size = duration]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
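
One plausible reading of "best score at their price point" is a Pareto frontier over cost and score: a model is kept only if no cheaper (or equally priced) model scores at least as well. The sketch below applies that reading to the ten rows listed under Model Comparison. The names `Result` and `best_value_frontier` are hypothetical, the scores are the rounded percentages shown on the page, and this is an assumption about how such a frontier could be computed, not the page's actual method.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float   # cost as listed in Model Comparison
    score_pct: int    # rounded score as displayed on the page


# The ten rows visible in the Model Comparison section.
RESULTS = [
    Result("Claude Sonnet 4.5", 0.0254, 97),
    Result("Claude Haiku 4.5", 0.0075, 97),
    Result("Claude Opus 4.5", 0.0424, 95),
    Result("Gemini 3 Flash Preview", 0.0032, 95),
    Result("Kimi K2 Thinking", 0.0059, 95),
    Result("GPT-5 Mini", 0.0252, 95),
    Result("GPT-5.2", 0.0363, 95),
    Result("Gemini 2.5 Pro", 0.2173, 95),
    Result("Claude 3.5 Haiku", 0.0059, 94),
    Result("Gemini 2.5 Flash Lite", 0.0007, 94),
]


def best_value_frontier(results: list[Result]) -> list[Result]:
    """Return the models that no cheaper-or-equal model matches or beats on score."""
    frontier: list[Result] = []
    best_score = -1
    # Walk from cheapest to most expensive; on a cost tie, see the higher score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        # A model joins the frontier only if it beats every cheaper model's score.
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in best_value_frontier(RESULTS):
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

On these ten rows the sketch selects Gemini 2.5 Flash Lite, Gemini 3 Flash Preview, and Claude Haiku 4.5. The chart on the page may highlight a different set, since it can draw on more models and on unrounded scores.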

Token Usage

Average tokens used per model across all prompts.

Model              Via          Avg tokens   In    Out
GPT-5 Nano         OpenRouter   1,312        27    1,284
Gemini 2.5 Pro     OpenRouter   1,007        22    985
GLM 4.7            OpenRouter   947          27    920
gpt-oss-20b        OpenRouter   714          87    627
Kimi K2 Thinking   OpenRouter   670          44    626
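
The averages above appear to be the sum of input and output tokens, with the GPT-5 Nano row off by one, presumably from rounding. A minimal sketch that checks this and computes the share of tokens spent on output follows; the `TOKEN_USAGE` mapping and the `output_share` helper are hypothetical names introduced for illustration.

```python
# (average total, average input, average output) tokens, as listed above.
TOKEN_USAGE = {
    "GPT-5 Nano": (1312, 27, 1284),
    "Gemini 2.5 Pro": (1007, 22, 985),
    "GLM 4.7": (947, 27, 920),
    "gpt-oss-20b": (714, 87, 627),
    "Kimi K2 Thinking": (670, 44, 626),
}


def output_share(tokens_in: int, tokens_out: int) -> float:
    """Fraction of all tokens that were generated output."""
    return tokens_out / (tokens_in + tokens_out)


if __name__ == "__main__":
    for model, (avg, t_in, t_out) in TOKEN_USAGE.items():
        # Difference between the listed average and in + out (rounding slack).
        drift = avg - (t_in + t_out)
        print(f"{model}: output share {output_share(t_in, t_out):.1%}, drift {drift}")
```

For every listed model the output side dominates, which is consistent with short geography prompts answered by comparatively long responses.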