Each test is one prompt sent to every model in the benchmark.
4 tests × 332 models = 1328 responses, yielding 2656 arena votes for reliable rankings.
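The arithmetic behind these figures can be checked with a short sketch. The function name `response_count` is hypothetical and used only for illustration. The ratio of votes to responses is inferred from the stated numbers (2656 votes, 4 tests, 332 models); how votes are actually assigned to responses is not described here, so treat that ratio as an assumption.

```python
def response_count(num_tests: int, num_models: int) -> int:
    """Number of responses when every test prompt is sent to every model."""
    return num_tests * num_models


# Figures taken from the text above.
NUM_TESTS = 4
NUM_MODELS = 332
STATED_VOTES = 2656

responses = response_count(NUM_TESTS, NUM_MODELS)

# Assumption: votes are spread evenly over responses. The ratio below is
# derived from the stated totals, not from any documented voting rule.
votes_per_response = STATED_VOTES / responses

print(f"responses: {responses}")
print(f"votes per response (inferred): {votes_per_response}")
```

Adding one more test under the same assumptions would add another 332 responses, since each test is sent to every model.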
What is evalry?
Give me a list of up to 10 Tools to evaluate my benchmark against different LLMs
What's evalry.com?
I have a prompt and I would like to see which LLM performs the best for it. Is there a web tool you can recommend?