Evalry Knowledge Benchmark

Jan 28, 2026
4 tasks
332 models
$0.1489
karllorey

Results (Preliminary)

8 of 332 models on the leaderboard so far. More join with each arena vote.

Model                          By             Score
Qwen3 VL 8B Instruct           Qwen           0%
Qwen3 30B A3B Thinking 2507    Qwen           0%
Hermes 4 405B                  NousResearch   0%
o1                             OpenAI         0%
GPT-4o (2024-11-20)            OpenAI         0%

Model Comparison

Compare performance across models and prompts.

Model                          By             Provider     Duration   Cost      Score
Qwen3 VL 8B Instruct           Qwen           OpenRouter   4.1s       $0.0001   0%
Qwen3 30B A3B Thinking 2507    Qwen           OpenRouter   23.9s      $0.0001   0%
Hermes 4 405B                  NousResearch   OpenRouter   1.3s       $0.0002   0%
o1                             OpenAI         OpenRouter   9.7s       $0.0781   0%
GPT-4o (2024-11-20)            OpenAI         OpenRouter   1.6s       $0.0008   0%
GPT-3.5 Turbo 16k              OpenAI         OpenRouter   881ms      $0.0005   0%
GPT-3.5 Turbo                  OpenAI         OpenRouter   1.1s       $0.0002   0%
o3 Pro                         OpenAI         OpenRouter   68.7s      $0.0689   0%
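
The eight per-model costs in the comparison add up to the $0.1489 shown in the page header. The page does not say that the header figure is computed this way, so treat the following as a consistency check under that assumption, not a description of how Evalry derives it. The dictionary name and layout are illustrative only.

```python
from decimal import Decimal

# Per-model cost in USD, copied from the Model Comparison listing.
# Decimal avoids binary floating-point rounding on these small amounts.
COST_USD = {
    "Qwen3 VL 8B Instruct": Decimal("0.0001"),
    "Qwen3 30B A3B Thinking 2507": Decimal("0.0001"),
    "Hermes 4 405B": Decimal("0.0002"),
    "o1": Decimal("0.0781"),
    "GPT-4o (2024-11-20)": Decimal("0.0008"),
    "GPT-3.5 Turbo 16k": Decimal("0.0005"),
    "GPT-3.5 Turbo": Decimal("0.0002"),
    "o3 Pro": Decimal("0.0689"),
}

def total_cost(costs):
    """Sum the per-model costs."""
    return sum(costs.values(), Decimal("0"))

if __name__ == "__main__":
    print(total_cost(COST_USD))  # 0.1489
```

Two models, o1 and o3 Pro, account for $0.1470 of that total.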

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart legend: best value frontier; best value; dot size = duration]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
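
One common way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is kept if no other model is at least as good on both axes and strictly better on one. The page does not document its actual selection rule, so the sketch below is an assumption, and the names `Run` and `value_frontier` are hypothetical.

```python
from typing import NamedTuple

class Run(NamedTuple):
    model: str
    cost_usd: float
    score_pct: float

def value_frontier(runs):
    """Return the runs that no other run dominates, cheapest first.

    Run `o` dominates run `r` when it scores at least as high, costs at
    most as much, and is strictly better on at least one of the two.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.score_pct >= r.score_pct
            and o.cost_usd <= r.cost_usd
            and (o.score_pct > r.score_pct or o.cost_usd < r.cost_usd)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)

# Cost and score values copied from the Model Comparison listing.
RUNS = [
    Run("Qwen3 VL 8B Instruct", 0.0001, 0.0),
    Run("Qwen3 30B A3B Thinking 2507", 0.0001, 0.0),
    Run("Hermes 4 405B", 0.0002, 0.0),
    Run("o1", 0.0781, 0.0),
    Run("GPT-4o (2024-11-20)", 0.0008, 0.0),
    Run("GPT-3.5 Turbo 16k", 0.0005, 0.0),
    Run("GPT-3.5 Turbo", 0.0002, 0.0),
    Run("o3 Pro", 0.0689, 0.0),
]

if __name__ == "__main__":
    for run in value_frontier(RUNS):
        print(run.model, run.cost_usd, run.score_pct)
```

Because every listed score is currently 0%, this rule reduces to picking the cheapest models: only the two $0.0001 entries survive. The frontier becomes informative once scores start to differ.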

Token Usage

Average tokens used per model across all prompts.

o1 (OpenRouter): 1,382 avg (107 in / 1,275 out)
o3 Pro (OpenRouter): 935 avg (98 in / 837 out)
Qwen3 30B A3B Thinking 2507 (OpenRouter): 565 avg (112 in / 453 out)
Qwen3 VL 8B Instruct (OpenRouter): 204 avg (87 in / 117 out)
GPT-3.5 Turbo (OpenRouter): 169 avg (99 in / 70 out)
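
In each entry the average is the sum of the input and output figures (for o1, 107 + 1,275 = 1,382). A small sketch that recomputes the totals and shows how much of each model's usage is output tokens; the `USAGE` table and function names are illustrative, and only the numbers come from the listing above.

```python
# (input tokens, output tokens) per model, averaged across prompts,
# copied from the Token Usage listing.
USAGE = {
    "o1": (107, 1275),
    "o3 Pro": (98, 837),
    "Qwen3 30B A3B Thinking 2507": (112, 453),
    "Qwen3 VL 8B Instruct": (87, 117),
    "GPT-3.5 Turbo": (99, 70),
}

def total_tokens(model):
    """Average total tokens: input plus output."""
    tokens_in, tokens_out = USAGE[model]
    return tokens_in + tokens_out

def output_share(model):
    """Fraction of the average total that is output tokens."""
    _, tokens_out = USAGE[model]
    return tokens_out / total_tokens(model)

if __name__ == "__main__":
    for name in USAGE:
        print(f"{name}: {total_tokens(name)} total, {output_share(name):.0%} output")
```

Input sizes are similar across models (87 to 112 tokens), so the spread in totals comes almost entirely from output length.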