Niederstetten Benchmark

Tests LLMs' knowledge of a specific town in rural Germany.

Jan 28, 2026
5 tasks
110 models
$0.1588
user_c636b9d7
Link only

Results (Preliminary)


33 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

1. Gemini 3 Flash Preview (Google): 99%
2. Gemini 2.5 Pro (Google): 96%
3. DeepSeek V3.2 (DeepSeek): 92%
4. Claude Sonnet 4.5 (Anthropic): 85%
5. Mistral Large 3 2512 (Mistral): 72%


Model Comparison

Compare performance across models and prompts.

Model                   Developer  Provider    Duration  Cost     Score
Gemini 3 Flash Preview  Google     OpenRouter  1.6s      $0.0011  99%
Gemini 2.5 Pro          Google     OpenRouter  10.7s     $0.0484  96%
DeepSeek V3.2           DeepSeek   OpenRouter  6.6s      $0.0004  92%
Claude Sonnet 4.5       Anthropic  OpenRouter  3.8s      $0.0102  85%
Mistral Large 3 2512    Mistral    OpenRouter  3.9s      $0.0017  72%
Claude 3 Haiku          Anthropic  OpenRouter  1.4s      $0.0008  71%
DeepSeek V3 0324        DeepSeek   OpenRouter  8.4s      $0.0011  66%
Claude Opus 4.5         Anthropic  OpenRouter  3.8s      $0.0141  64%
Llama 3.3 70B Instruct  Meta       OpenRouter  3.1s      $0.0002  61%
Claude Haiku 4.5        Anthropic  OpenRouter  2.1s      $0.0033  60%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart legend: line = best-value frontier; highlighted dots = best value; dot size = duration]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
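
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is kept only if no cheaper (or equally priced) model scores at least as well. The sketch below assumes that reading and uses only the ten cost/score pairs listed under Model Comparison; the names `Result` and `pareto_frontier` are hypothetical, and the page's own chart may use a different rule or the full set of scored models.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float
    score_pct: int


# Cost and score values copied from the Model Comparison listing.
RESULTS = [
    Result("Gemini 3 Flash Preview", 0.0011, 99),
    Result("Gemini 2.5 Pro", 0.0484, 96),
    Result("DeepSeek V3.2", 0.0004, 92),
    Result("Claude Sonnet 4.5", 0.0102, 85),
    Result("Mistral Large 3 2512", 0.0017, 72),
    Result("Claude 3 Haiku", 0.0008, 71),
    Result("DeepSeek V3 0324", 0.0011, 66),
    Result("Claude Opus 4.5", 0.0141, 64),
    Result("Llama 3.3 70B Instruct", 0.0002, 61),
    Result("Claude Haiku 4.5", 0.0033, 60),
]


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return the models not beaten on score by any cheaper or equally priced model.

    Assumption: "best value" means Pareto-optimal on (lower cost, higher score).
    """
    frontier: list[Result] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, the higher score goes first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in pareto_frontier(RESULTS):
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

Duration is ignored here; adding it as a third criterion would require a full dominance check rather than the single sorted pass used above.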

Token Usage

Average tokens used per model across all prompts.

Model           Provider    Avg tokens  Input  Output
GPT-5 Nano      OpenRouter  1,594       17     1,577
Gemini 2.5 Pro  OpenRouter  979         12     967
gpt-oss-20b     OpenRouter  894         74     820
gpt-oss-120b    OpenRouter  737         74     663
GLM 4.7         OpenRouter  675         17     658
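
As a quick consistency check, the averages above can be verified to equal input plus output tokens, and the share of tokens spent on output can be computed per model. This is a minimal sketch using only the five rows listed here; `TOKEN_USAGE` and `output_share` are hypothetical names, not part of the benchmark site.

```python
# (model, average total, average input, average output), copied from the listing above.
TOKEN_USAGE = [
    ("GPT-5 Nano", 1594, 17, 1577),
    ("Gemini 2.5 Pro", 979, 12, 967),
    ("gpt-oss-20b", 894, 74, 820),
    ("gpt-oss-120b", 737, 74, 663),
    ("GLM 4.7", 675, 17, 658),
]


def output_share(tokens_in: int, tokens_out: int) -> float:
    """Fraction of the average token count that is model output."""
    return tokens_out / (tokens_in + tokens_out)


if __name__ == "__main__":
    for model, total, tokens_in, tokens_out in TOKEN_USAGE:
        # The reported average should be the sum of its two parts.
        assert tokens_in + tokens_out == total, model
        print(f"{model}: {output_share(tokens_in, tokens_out):.1%} output")
```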