Niederstetten Benchmark

Jan 28, 2026
5 tasks
110 models
$0.5276
karllorey

Results (Preliminary)

35 of 110 models on the leaderboard so far. More join with each arena vote.

Rank  Model                   Provider    Score
1     Gemini 3 Flash Preview  Google      96%
2     Gemini 2.5 Pro          Google      96%
3     DeepSeek V3.2           DeepSeek    90%
4     Auto Router             OpenRouter  86%
5     DeepSeek V3.2 Speciale  DeepSeek    74%

Prompt Details

Expand each prompt to see per-model responses and reasoning.

Model Comparison

Compare performance across models and prompts.

Model                   Provider    Via         Duration  Cost     Score
Gemini 3 Flash Preview  Google      OpenRouter  1.6s      $0.0011  96%
Gemini 2.5 Pro          Google      OpenRouter  10.7s     $0.0484  96%
DeepSeek V3.2           DeepSeek    OpenRouter  6.6s      $0.0004  90%
Auto Router             OpenRouter  OpenRouter  6.2s      $0.0208  86%
DeepSeek V3.2 Speciale  DeepSeek    OpenRouter  41.5s     $0.0084  74%
Claude Opus 4.5         Anthropic   OpenRouter  3.8s      $0.0141  74%
DeepSeek V3 0324        DeepSeek    OpenRouter  8.4s      $0.0011  72%
Claude Sonnet 4.5       Anthropic   OpenRouter  3.8s      $0.0102  68%
Mistral Large 3 2512    Mistral     OpenRouter  3.9s      $0.0017  68%
GLM 4.7                 Z.ai        OpenRouter  15.9s     $0.0065  68%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart legend: best value frontier; best value; dot size = duration]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
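
How the page picks its highlighted models is not stated. One plausible reading of "the best score at their price point" is a Pareto frontier over cost and score: a model is kept only if no cheaper-or-equal model scores at least as well. The sketch below applies that assumption to the ten rows listed under Model Comparison; the names `Result` and `value_frontier` are hypothetical, and duration is ignored here because the chart only encodes it as dot size.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float
    score_pct: int


# Cost and score values copied from the Model Comparison section.
RESULTS = [
    Result("Gemini 3 Flash Preview", 0.0011, 96),
    Result("Gemini 2.5 Pro", 0.0484, 96),
    Result("DeepSeek V3.2", 0.0004, 90),
    Result("Auto Router", 0.0208, 86),
    Result("DeepSeek V3.2 Speciale", 0.0084, 74),
    Result("Claude Opus 4.5", 0.0141, 74),
    Result("DeepSeek V3 0324", 0.0011, 72),
    Result("Claude Sonnet 4.5", 0.0102, 68),
    Result("Mistral Large 3 2512", 0.0017, 68),
    Result("GLM 4.7", 0.0065, 68),
]


def value_frontier(results: list[Result]) -> list[Result]:
    """Return models not dominated on (lower cost, higher score).

    Assumption: "best value" means Pareto-optimal on cost and score.
    """
    frontier: list[Result] = []
    best_score = -1
    # Walk from cheapest to most expensive; at equal cost, try the
    # higher score first so the weaker model is correctly dropped.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in value_frontier(RESULTS):
        print(f"{r.model}: {r.score_pct}% at ${r.cost_usd:.4f}")
```

Under this assumption only two of the ten listed models stay on the frontier: DeepSeek V3.2 (the cheapest) and Gemini 3 Flash Preview (the cheapest model with the top score). The chart on the page may use a different rule or include models not listed here.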

Token Usage

Average tokens used per model across all prompts.

Model                   Via         Avg total  Input  Output
GPT-5 Nano              OpenRouter  1,594      17     1,577
DeepSeek V3.2 Speciale  OpenRouter  1,403      17     1,386
Gemini 2.5 Pro          OpenRouter  979        12     967
gpt-oss-20b             OpenRouter  894        74     820
gpt-oss-120b            OpenRouter  737        74     663