Benchmarking LLMs' ability to detect and understand German memes across a wide range of questions.
27 of 110 models scored automatically so far. Arena votes unlock the rest and refine the ranking.
Expand each prompt to see per-model responses and reasoning.
Compare performance across models and prompts.
Find models with the best balance of quality, cost, and speed.
Average tokens used per model across all prompts.
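As a rough illustration of the per-model token average described above, here is a minimal sketch. It assumes each response is logged as a `(model, prompt_id, tokens)` tuple; the record layout, the function name `average_tokens_per_model`, and the sample values are all hypothetical and are not taken from the benchmark itself.

```python
from collections import defaultdict


def average_tokens_per_model(records):
    """Return {model: mean tokens per response} from (model, prompt_id, tokens) tuples."""
    totals = defaultdict(int)   # summed tokens per model
    counts = defaultdict(int)   # number of responses per model
    for model, _prompt_id, tokens in records:
        totals[model] += tokens
        counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


# Hypothetical records for illustration only, not real benchmark data.
records = [
    ("model-a", "prompt-1", 120),
    ("model-a", "prompt-2", 180),
    ("model-b", "prompt-1", 300),
]
averages = average_tokens_per_model(records)
```

Under this assumption, a model that answered only some prompts is averaged over the prompts it actually answered, not over all prompts in the benchmark.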