Product Recommendation Bench

Benchmarks product recommendations for a diverse set of SaaS products

Jan 13, 2026

9 tasks

110 models

$0.0645

user_c636b9d7

Public

Results (Preliminary)

30 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.

Top 5 models

Model	Duration	Cost	Score
Claude Sonnet 4.6 by Anthropic	7.1s	$0.0045	100%
GPT-5.1 Chat by OpenAI	3.0s	$0.0022	85%
Qwen3.5-9B by Qwen	128.1s	$0.0011	83%
Llama 4 Maverick by Meta	5.2s	$0.0001	76%
DeepSeek V3.2 by DeepSeek	9.8s	$0.0001	75%

Model Comparison

Compare performance across models and prompts.

Model	Duration	Cost	Score
Claude Sonnet 4.6 by Anthropic on OpenRouter	7.1s	$0.0045	100%
GPT-5.1 Chat by OpenAI on OpenRouter	3.0s	$0.0022	85%
Qwen3.5-9B by Qwen on OpenRouter	128.1s	$0.0011	83%
Llama 4 Maverick by Meta on OpenRouter	5.2s	$0.0001	76%
DeepSeek V3.2 by DeepSeek on OpenRouter	9.8s	$0.0001	75%
DeepSeek V3.2 Exp by DeepSeek on OpenRouter	9.4s	$0.0002	74%
GLM 5 by Z.ai on OpenRouter	14.4s	$0.0032	74%
DeepSeek V4 Flash by DeepSeek on OpenRouter	2.7s	$0.0001	74%
GLM 5 Turbo by Z.ai on OpenRouter	23.5s	$0.0091	68%
GPT-5.5 by OpenAI on OpenRouter	17.2s	$0.0198	67%

Value Analysis

Find models with the best balance of quality, cost, and speed.

[Chart: best value frontier. Legend: highlighted = best value; dot size = duration.]

Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
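
One way to read "best score at their price point" is as a Pareto frontier over cost and score: a model is kept only if no other model costs the same or less while scoring higher. The sketch below applies that reading to the ten rows listed under Model Comparison. The `Result` type and the `value_frontier` function are hypothetical names for illustration, the dominance rule is an assumption rather than the site's documented method, and the costs are the rounded values displayed on the page, so the chart's actual highlights (drawn from all scored models) may differ.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    duration_s: float
    cost_usd: float
    score_pct: int


# Rows copied from the Model Comparison table (displayed, rounded values).
RESULTS = [
    Result("Claude Sonnet 4.6", 7.1, 0.0045, 100),
    Result("GPT-5.1 Chat", 3.0, 0.0022, 85),
    Result("Qwen3.5-9B", 128.1, 0.0011, 83),
    Result("Llama 4 Maverick", 5.2, 0.0001, 76),
    Result("DeepSeek V3.2", 9.8, 0.0001, 75),
    Result("DeepSeek V3.2 Exp", 9.4, 0.0002, 74),
    Result("GLM 5", 14.4, 0.0032, 74),
    Result("DeepSeek V4 Flash", 2.7, 0.0001, 74),
    Result("GLM 5 Turbo", 23.5, 0.0091, 68),
    Result("GPT-5.5", 17.2, 0.0198, 67),
]


def value_frontier(results: list[Result]) -> list[Result]:
    """Return models not beaten on score by any model of equal or lower cost.

    Assumed rule (not confirmed by the source): walk the models from
    cheapest to most expensive and keep each one that raises the best
    score seen so far.
    """
    frontier: list[Result] = []
    best_score = -1
    # Cheapest first; among equal costs, highest score first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.score_pct)):
        if r.score_pct > best_score:
            frontier.append(r)
            best_score = r.score_pct
    return frontier


if __name__ == "__main__":
    for r in value_frontier(RESULTS):
        print(f"{r.model}\t${r.cost_usd:.4f}\t{r.score_pct}%")
```

Under this rule the frontier for the listed rows is Llama 4 Maverick, Qwen3.5-9B, GPT-5.1 Chat, and Claude Sonnet 4.6. Duration is ignored here, matching the chart's use of dot size rather than an axis for speed; a stricter variant could treat duration as a third dimension.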

Token Usage

Average tokens used per model across all prompts.

Model	Provider	Avg tokens	Input	Output
Qwen3.5-9B	OpenRouter	3,903	112	3,791
Auto Router	OpenRouter	2,947	100	2,847
Step 3.5 Flash	OpenRouter	2,737	129	2,608
GPT-5 Nano	OpenRouter	2,659	93	2,566
Qwen3.6 35B A3B	OpenRouter	2,138	104	2,034