Our Response to 'The Leaderboard Illusion' Writeup

Recently, a writeup titled “The Leaderboard Illusion” has been circulating, making several claims and recommendations about the Chatbot Arena leaderboard. We are grateful for the feedback, and our ongoing discussions with the authors have already led to plans to improve Chatbot Arena.

Arena’s mission is to provide truthful and scientific evaluation of models across diverse domains, grounded in real-world use. This guiding principle shapes how we design our systems and policies to support the AI research and development community. Clear communication and shared understanding help move the field forward, and we’re committed to being active, thoughtful contributors to that effort. As such, we always welcome the opportunity to bring more transparency into how the platform works and what could be improved. The writeup raises thoughtful, constructive points and recommendations, and we’re actively considering them as part of our ongoing work.

To begin, we are excited to address some of the recommendations raised in the writeup head-on. Here is an outline of our preliminary plans:

While we welcome feedback and open discussion, the piece also contains several incorrect claims. We believe it’s important to address these points of factual disagreement directly. Our goal is not to criticize, but to help strengthen the reliability of AI evaluations. Rather than seeing these critiques as conflict, we see them as an opportunity for collaboration: a chance to clarify our approach, share data, and learn together to help paint a fuller picture for analysis.

Below is a breakdown of the factual concerns we identified that affect the claims in the paper. We have shared these concerns directly with the authors, are in active and productive conversation with them, and are working together to amend the paper:

Finally, we offer one more clarification. This is not a factual disagreement, simply an explanation, for those who haven’t read our policy, of how models are sampled.
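To make the idea of weighted sampling concrete, here is a minimal illustrative sketch. This is not Arena’s actual implementation or its real weights; the model names and the boosted-weight value are hypothetical, assumed only for the example. It shows the general technique of drawing a pair of distinct models for a head-to-head battle in proportion to per-model sampling weights:

```python
import random

# Hypothetical weights, for illustration only (NOT Arena's real policy values).
# The assumed idea: a newly listed model gets a higher weight so it
# accumulates votes quickly, while established models share lower weights.
MODEL_WEIGHTS = {
    "new-model-a": 3.0,  # boosted while under initial evaluation (assumption)
    "model-b": 1.0,
    "model-c": 1.0,
}

def sample_battle_pair(weights, rng=random):
    """Draw two distinct models for one battle, each chosen
    proportionally to its sampling weight."""
    models = list(weights)
    # Pick the first model by weighted choice over all models.
    first = rng.choices(models, weights=[weights[m] for m in models], k=1)[0]
    # Pick the second by weighted choice over the remaining models.
    rest = [m for m in models if m != first]
    second = rng.choices(rest, weights=[weights[m] for m in rest], k=1)[0]
    return first, second
```

Over many draws, a model with weight 3.0 appears in battles far more often than one with weight 1.0, which is the effect a new-model boost is designed to have.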

We stand by the integrity and transparency of the Chatbot Arena platform. We welcome constructive feedback, especially when it helps us all build better tools for the community. However, it’s crucial that such critiques are based on accurate data and a correct understanding of our publicly stated policies and methodologies.

Our commitment to truthful, scientific evaluation drives our continuous efforts to refine Arena’s evaluation mechanisms, ensure methodological transparency, and foster trust across the AI ecosystem.

We encourage everyone to review our policy, research paper, and open datasets. Our goal remains to provide a valuable, community-driven resource for LLM evaluation.