Character Frequency Bench

Tests an LLM's ability to accurately count and categorize specific characters, symbols, and patterns within strings. This benchmark evaluates tokenization-independent visual processing and precise sub-string analysis.
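
The page does not show the task prompts, so the following is only a minimal sketch of how a reference answer for a counting or categorizing task could be computed. The function names, the four categories, and the sample strings are illustrative assumptions, not the benchmark's actual grader.

```python
from collections import Counter


def count_char(text: str, target: str) -> int:
    """Count non-overlapping occurrences of `target` in `text`."""
    return text.count(target)


def categorize(text: str) -> dict[str, int]:
    """Tally characters into coarse categories (assumed, not from the benchmark)."""
    counts: Counter[str] = Counter()
    for ch in text:
        if ch.isalpha():
            counts["letter"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        elif ch.isspace():
            counts["whitespace"] += 1
        else:
            counts["symbol"] += 1
    return dict(counts)


if __name__ == "__main__":
    print(count_char("strawberry", "r"))  # 3
    print(categorize("a1 b2!"))
```

Because the count runs over characters rather than tokens, the reference answer does not depend on how any model's tokenizer splits the string.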

Jan 9, 2026
11 tasks
110 models
$0.7334
karllorey
Link only

Results (Preliminary)

29 of 110 models on the leaderboard so far. More join with each arena vote.

1. gpt-oss-20b (OpenAI): 100%
2. Kimi K2 Thinking (MoonshotAI): 100%
3. DeepSeek V3 0324 (DeepSeek): 100%
4. GPT-5.2 (OpenAI): 100%
5. Claude Opus 4.5 (Anthropic): 100%

Model Comparison

Compare performance across models and prompts.

Model                   Provider    Served via  Duration  Cost     Score
gpt-oss-20b             OpenAI      OpenRouter  9.0s      $0.0002  100%
Kimi K2 Thinking        MoonshotAI  OpenRouter  15.2s     $0.0032  100%
DeepSeek V3 0324        DeepSeek    OpenRouter  26.7s     $0.0079  100%
GPT-5.2                 OpenAI      OpenRouter  4.1s      $0.0185  100%
Claude Opus 4.5         Anthropic   OpenRouter  4.3s      $0.0437  100%
Claude Haiku 4.5        Anthropic   OpenRouter  1.9s      $0.0077  95%
gpt-oss-120b            OpenAI      OpenRouter  2.7s      $0.0003  91%
Llama 4 Maverick        Meta        OpenRouter  7.3s      $0.0018  91%
DeepSeek V3.2 Speciale  DeepSeek    OpenRouter  38.1s     $0.0047  91%
GLM 4.7                 Z.ai        OpenRouter  12.6s     $0.0063  91%

Value Analysis

Find models with the best balance of quality, cost, and speed.

Chart: best value frontier (dot size = duration). Highlighted models offer the best score at their price point. Larger dots take longer to produce a result.
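
The page does not state the exact rule used to highlight models. One plausible reading of "best score at their price point" is a Pareto frontier over cost and score: a model is kept only if no cheaper-or-equal model scores at least as well. The sketch below applies that assumed rule to the ten models listed under Model Comparison. The function name and the tie handling (the first model wins on identical cost and score) are illustrative choices.

```python
def best_value_frontier(models):
    """Return names of models that no cheaper-or-equal model matches or beats.

    `models` is an iterable of (name, cost_usd, score_pct) tuples.
    """
    frontier = []
    best_score = float("-inf")
    # Sweep from cheapest to most expensive; at equal cost, higher score first.
    for name, cost, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Cost and score values copied from the Model Comparison listing.
MODELS = [
    ("gpt-oss-20b", 0.0002, 100),
    ("Kimi K2 Thinking", 0.0032, 100),
    ("DeepSeek V3 0324", 0.0079, 100),
    ("GPT-5.2", 0.0185, 100),
    ("Claude Opus 4.5", 0.0437, 100),
    ("Claude Haiku 4.5", 0.0077, 95),
    ("gpt-oss-120b", 0.0003, 91),
    ("Llama 4 Maverick", 0.0018, 91),
    ("DeepSeek V3.2 Speciale", 0.0047, 91),
    ("GLM 4.7", 0.0063, 91),
]

if __name__ == "__main__":
    print(best_value_frontier(MODELS))  # ['gpt-oss-20b']
```

Under this rule, gpt-oss-20b is the only frontier model among the ten shown, since it is both the cheapest and tied for the top score. Duration is ignored here; the chart encodes it only as dot size.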

Token Usage

Average tokens used per model across all prompts.

Model                   Served via  Avg tokens  In  Out
Llama 3.2 1B Instruct   OpenRouter  2,645       39  2,606
DeepSeek V3.2 Speciale  OpenRouter  1,057       28  1,029
DeepSeek V3 0324        OpenRouter  842         27  815
GPT-5 Nano              OpenRouter  838         29  809
GLM 4.7                 OpenRouter  745         29  715