Product Recommendation Bench

Benchmarks product recommendations for a diverse set of SaaS products.

Jan 13, 2026
9 tasks
110 models
$0.0645
user_c636b9d7
Public

Results (Preliminary)

30 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

1. Claude Sonnet 4.6 (by Anthropic): 100% score
2. GPT-5.1 Chat (by OpenAI): 85% score
3. Qwen3.5-9B (by Qwen): 83% score
4. Llama 4 Maverick (by Meta): 76% score
5. DeepSeek V3.2 (by DeepSeek): 75% score

Prompt Details

Expand each prompt to see per-model responses and reasoning.

Model Comparison

Compare performance across models and prompts.

Model               Developer   Duration   Cost      Score
Claude Sonnet 4.6   Anthropic   7.1s       $0.0045   100%
GPT-5.1 Chat        OpenAI      3.0s       $0.0022   85%
Qwen3.5-9B          Qwen        128.1s     $0.0011   83%
Llama 4 Maverick    Meta        5.2s       $0.0001   76%
DeepSeek V3.2       DeepSeek    9.8s       $0.0001   75%
DeepSeek V3.2 Exp   DeepSeek    9.4s       $0.0002   74%
GLM 5               Z.ai        14.4s      $0.0032   74%
DeepSeek V4 Flash   DeepSeek    2.7s       $0.0001   74%
GLM 5 Turbo         Z.ai        23.5s      $0.0091   68%
GPT-5.5             OpenAI      17.2s      $0.0198   67%

All models listed here are served on OpenRouter.

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart legend: Best value frontier; Best value; Size = duration]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
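One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is kept only if no other model is at least as cheap and scores higher. The sketch below applies that reading to the ten rows shown under Model Comparison. The dashboard does not state how it selects highlighted models, so the rule, the `Result` type, and the `value_frontier` function are assumptions made for illustration, and duration is ignored.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float   # cost as displayed in the comparison list
    score_pct: int    # score as displayed in the comparison list


# The ten rows shown under Model Comparison.
RESULTS = [
    Result("Claude Sonnet 4.6", 0.0045, 100),
    Result("GPT-5.1 Chat", 0.0022, 85),
    Result("Qwen3.5-9B", 0.0011, 83),
    Result("Llama 4 Maverick", 0.0001, 76),
    Result("DeepSeek V3.2", 0.0001, 75),
    Result("DeepSeek V3.2 Exp", 0.0002, 74),
    Result("GLM 5", 0.0032, 74),
    Result("DeepSeek V4 Flash", 0.0001, 74),
    Result("GLM 5 Turbo", 0.0091, 68),
    Result("GPT-5.5", 0.0198, 67),
]


def value_frontier(results):
    """Return the models not beaten by any equally cheap or cheaper model.

    Assumed rule (not confirmed by the dashboard): walk the models from
    cheapest to most expensive and keep one only if its score is strictly
    higher than every score seen so far.
    """
    frontier = []
    best_score = -1
    # Sort by cost ascending; among equal costs, highest score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in value_frontier(RESULTS):
        print(f"{r.model}: ${r.cost_usd:.4f}, {r.score_pct}%")
```

Under this assumed rule, the frontier for the listed rows is Llama 4 Maverick, Qwen3.5-9B, GPT-5.1 Chat, and Claude Sonnet 4.6. The chart's actual highlighting may differ, for example if it also weighs duration or includes models not shown in the list.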

Token Usage

Average tokens used per model across all prompts.

Qwen3.5-9B (OpenRouter): 3,903 avg (112 in / 3,791 out)
Auto Router (OpenRouter): 2,947 avg (100 in / 2,847 out)
Step 3.5 Flash (OpenRouter): 2,737 avg (129 in / 2,608 out)
GPT-5 Nano (OpenRouter): 2,659 avg (93 in / 2,566 out)
Qwen3.6 35B A3B (OpenRouter): 2,138 avg (104 in / 2,034 out)
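
In each entry above, the average is the sum of the input and output figures (for example, 112 + 3,791 = 3,903). A minimal sketch for parsing such a line and computing the output share of the total follows. The helper names `parse_token_line` and `output_share` are hypothetical, and the regular expression assumes the exact "N avg (N in / N out)" wording shown here.

```python
import re

# Matches e.g. "3,903 avg (112 in / 3,791 out)"; the format is assumed from the page.
TOKEN_LINE = re.compile(r"([\d,]+) avg \(([\d,]+) in / ([\d,]+) out\)")


def parse_token_line(line):
    """Return (total, tokens_in, tokens_out) as integers."""
    match = TOKEN_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognized token usage line: {line!r}")
    total, tokens_in, tokens_out = (
        int(group.replace(",", "")) for group in match.groups()
    )
    return total, tokens_in, tokens_out


def output_share(line):
    """Fraction of the average token count that is output tokens."""
    total, _, tokens_out = parse_token_line(line)
    return tokens_out / total


if __name__ == "__main__":
    line = "3,903 avg (112 in / 3,791 out)"
    total, tokens_in, tokens_out = parse_token_line(line)
    print(total == tokens_in + tokens_out, f"{output_share(line):.1%}")
```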