Evalry Knowledge Benchmark

Jan 28, 2026

4 tasks

320 models

$0.1489

user_c636b9d7


Results (Preliminary)


8 of 320 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Qwen3 VL 8B Instruct (by Qwen): 4.1s, $0.0001

Qwen3 30B A3B Thinking 2507 (by Qwen): 23.9s, $0.0001

GPT-3.5 Turbo (by OpenAI): 1.1s, $0.0002

Hermes 4 405B (by Nous): 1.3s, $0.0002

GPT-3.5 Turbo 16k (by OpenAI): 881ms, $0.0005

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
Qwen3 VL 8B Instruct by Qwen on OpenRouter	4.1s	$0.0001	0%
Qwen3 30B A3B Thinking 2507 by Qwen on OpenRouter	23.9s	$0.0001	0%
GPT-3.5 Turbo by OpenAI on OpenRouter	1.1s	$0.0002	0%
Hermes 4 405B by Nous on OpenRouter	1.3s	$0.0002	0%
GPT-3.5 Turbo 16k by OpenAI on OpenRouter	881ms	$0.0005	0%
GPT-4o (2024-11-20) by OpenAI on OpenRouter	1.6s	$0.0008	0%
o3 Pro by OpenAI on OpenRouter	68.7s	$0.0689	0%
o1 by OpenAI on OpenRouter	9.7s	$0.0781	0%
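
The per-model costs in the table can be cross-checked against the $0.1489 figure shown at the top of the page. The following is a minimal sketch, assuming that figure is the sum of the eight per-model costs (the page does not say so explicitly). The names `ROWS`, `duration_seconds`, and `cost_usd` are illustrative and not part of Evalry. It also normalizes the mixed `ms`/`s` durations to seconds so rows can be compared.

```python
from decimal import Decimal

# Rows copied from the table above: (model, duration, cost).
ROWS = [
    ("Qwen3 VL 8B Instruct", "4.1s", "$0.0001"),
    ("Qwen3 30B A3B Thinking 2507", "23.9s", "$0.0001"),
    ("GPT-3.5 Turbo", "1.1s", "$0.0002"),
    ("Hermes 4 405B", "1.3s", "$0.0002"),
    ("GPT-3.5 Turbo 16k", "881ms", "$0.0005"),
    ("GPT-4o (2024-11-20)", "1.6s", "$0.0008"),
    ("o3 Pro", "68.7s", "$0.0689"),
    ("o1", "9.7s", "$0.0781"),
]


def duration_seconds(text: str) -> float:
    """Convert a duration such as '881ms' or '4.1s' to seconds."""
    if text.endswith("ms"):  # check 'ms' first, since it also ends with 's'
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized duration: {text!r}")


def cost_usd(text: str) -> Decimal:
    """Parse '$0.0001' into an exact Decimal (avoids float rounding)."""
    return Decimal(text.lstrip("$"))


total_cost = sum((cost_usd(cost) for _, _, cost in ROWS), Decimal("0"))
fastest = min(ROWS, key=lambda row: duration_seconds(row[1]))[0]
slowest = max(ROWS, key=lambda row: duration_seconds(row[1]))[0]

print(total_cost)  # 0.1489
print(fastest, "/", slowest)
```

`Decimal` is used rather than `float` so that the four-decimal dollar amounts add up exactly.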

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart: best value frontier. Highlighted dots = best value; dot size = duration.]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
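
The page does not state how the highlighted set is computed. One common reading of "best score at their price point" is a Pareto frontier over cost and score: a model is kept only if it scores strictly higher than every model that costs the same or less. The sketch below works under that assumption; `Result` and `best_value_frontier` are hypothetical names, and breaking ties by duration is an illustrative choice. The data are the eight rows from the Model Comparison table.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    seconds: float
    cost: float   # USD
    score: float  # percent


def best_value_frontier(results: List[Result]) -> List[Result]:
    """Keep each model that outscores every cheaper-or-equal-cost model."""
    # Cheapest first; within a price, higher score first, then faster first.
    ordered = sorted(results, key=lambda r: (r.cost, -r.score, r.seconds))
    frontier: List[Result] = []
    best_score = float("-inf")
    for r in ordered:
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


RESULTS = [
    Result("Qwen3 VL 8B Instruct", 4.1, 0.0001, 0.0),
    Result("Qwen3 30B A3B Thinking 2507", 23.9, 0.0001, 0.0),
    Result("GPT-3.5 Turbo", 1.1, 0.0002, 0.0),
    Result("Hermes 4 405B", 1.3, 0.0002, 0.0),
    Result("GPT-3.5 Turbo 16k", 0.881, 0.0005, 0.0),
    Result("GPT-4o (2024-11-20)", 1.6, 0.0008, 0.0),
    Result("o3 Pro", 68.7, 0.0689, 0.0),
    Result("o1", 9.7, 0.0781, 0.0),
]

frontier_models = [r.model for r in best_value_frontier(RESULTS)]
print(frontier_models)  # ['Qwen3 VL 8B Instruct']
```

Because every model currently scores 0%, the frontier under this definition collapses to the cheapest (and, on a tie, faster) model; it would only become informative once scores differ.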

Token Usage

Average tokens used per model across all prompts.

o1 (OpenRouter): 1,382 avg (107 in / 1,275 out)

o3 Pro (OpenRouter): 935 avg (98 in / 837 out)

Qwen3 30B A3B Thinking 2507 (OpenRouter): 565 avg (112 in / 453 out)

Qwen3 VL 8B Instruct (OpenRouter): 204 avg (87 in / 117 out)

GPT-3.5 Turbo (OpenRouter): 169 avg (99 in / 70 out)
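
Each average above is the sum of its input and output token counts. A small sketch, using only the five figures listed here, makes that arithmetic explicit and also derives the share of tokens spent on output. `USAGE`, `totals`, and `output_share` are illustrative names, not fields exposed by the page.

```python
# (input tokens, output tokens) per model, copied from the list above.
USAGE = {
    "o1": (107, 1275),
    "o3 Pro": (98, 837),
    "Qwen3 30B A3B Thinking 2507": (112, 453),
    "Qwen3 VL 8B Instruct": (87, 117),
    "GPT-3.5 Turbo": (99, 70),
}

# Average total tokens = input + output.
totals = {model: tin + tout for model, (tin, tout) in USAGE.items()}

# Fraction of each model's tokens that were generated output.
output_share = {model: tout / (tin + tout) for model, (tin, tout) in USAGE.items()}

for model in USAGE:
    print(f"{model}: {totals[model]} avg, {output_share[model]:.0%} output")
```

Input counts are similar across models (87 to 112 tokens), so the spread in the averages comes almost entirely from output length.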