Recently, a writeup titled “The Leaderboard Illusion” has been circulating with several claims and recommendations about the Chatbot Arena leaderboard. We are grateful for the feedback and have plans to improve Chatbot Arena as a result of our ongoing discussions with the authors.
Arena’s mission is to provide truthful and scientific evaluation of models across diverse domains, grounded in real-world uses. This guiding principle shapes how we design our systems and policies to support the AI research and development community. Clear communication and shared understanding help move the field forward, and we’re committed to being active, thoughtful contributors to that effort. As such, we always welcome the opportunity to bring more transparency into how the platform works and what could be improved. The writeup raises several thoughtful, constructive points and recommendations, and we are actively considering them as part of our ongoing work.
To begin, we are excited to address some of the recommendations raised in the writeup head on. Here is an outline of our preliminary plans:
Since March 2024, our policy has established rules for pre-release testing. In a future policy release, we will state explicitly that all model providers are allowed to test multiple variants of their models pre-release, subject to our system's constraints.
We will increase clarity about how models are retired from battle mode and explicitly mark which models are retired.
Previously, we announced pre-release-tested models on the leaderboard once 2,000 votes had accumulated since the beginning of testing. While the selection bias vanishes rapidly under continuous testing with fresh user feedback, we will now mark a model's score as "provisional" until an additional 2,000 fresh votes have been collected after the model's release, whenever more than 10 models were pre-release tested in parallel.
While we welcome feedback and open discussion, the piece also contains several incorrect claims. We believe it is important to address these points of factual disagreement directly. Our goal is not to criticize, but to help strengthen the reliability of AI evaluations. Rather than seeing these critiques as conflict, we see them as an opportunity for collaboration: a chance to clarify our approach, share data, and learn together to help paint a fuller picture for analysis.
Below is a breakdown of the factual concerns we identified that affect the claims in the paper. We have been in active and productive conversation with the authors about these concerns, have shared these directly with them, and are working together to amend the claims in the paper:
Claim: Open source models represent 8.8% on the leaderboard, implying proprietary models benefit most.
Truth: Official Chatbot Arena stats (published 2025/4/27) show Open Models at 40.9%. The writeup’s calculation is missing open-weight models (e.g., Llama, Gemma), significantly undercounting the open models.
Claim: Pre-release testing can boost Arena Score by 100+ points.
Truth: The numbers in the original plot are unrelated to Chatbot Arena. The plot is a simulation that draws Gaussians with mean 1200 and an arbitrarily chosen variance and plots their maximum as the number of Gaussians grows. The more Gaussians there are, the larger their maximum, and the numerical value of that maximum is driven by the variance the authors chose, not by anything in Chatbot Arena's policies or actual model performance, as the sketch below illustrates.
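To make this concrete, here is a minimal sketch of our own (not the writeup's code, and not Arena data) showing that the expected maximum of N Gaussian draws grows with N and scales with whatever standard deviation is assumed; the mean of 1200 mirrors the writeup's setup, while the sigma values below are arbitrary choices for illustration.

```python
# Minimal illustration: the expected maximum of N Gaussian "scores" grows
# with N and scales with the assumed standard deviation sigma. The values
# of sigma below are arbitrary, which is exactly the point.
import numpy as np

rng = np.random.default_rng(0)

def expected_max_gain(n_variants, sigma, mean=1200.0, trials=20_000):
    """Average (max - mean) over many simulated sets of n_variants scores."""
    draws = rng.normal(mean, sigma, size=(trials, n_variants))
    return draws.max(axis=1).mean() - mean

for sigma in (10, 30, 50):
    gains = {n: round(expected_max_gain(n, sigma), 1) for n in (1, 5, 10, 20)}
    print(f"sigma={sigma:>2}: expected gain of the maximum by N -> {gains}")
```

Doubling the assumed sigma roughly doubles the apparent "boost," so the size of the plotted effect reflects the simulation's variance assumption rather than any measurement from Arena.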
Truth: Boosts in a model's score due to pre-release testing are minimal. Because Arena is constantly collecting fresh data from new users, the selection bias quickly goes to zero. Our analysis shows the effect of pre-release testing is smaller than claimed with finite data (around +11 Elo after 50 tests and 3,000 votes) and diminishes to zero as fresh evaluation data accumulates. The "claimed effect" is a significant overstatement of the "true effect" under the Bradley-Terry model. See further technical analysis here, and the simulation sketch below.
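As a complement to the linked analysis, here is a simplified simulation sketch of our own (not Arena's actual Bradley-Terry fitting pipeline); it assumes k pre-release variants of identical true strength, each judged against a single reference model, with the best-looking variant released and its score re-estimated as fresh votes pool with its pre-release votes. All parameter values are illustrative assumptions.

```python
# Simplified illustration of why selection bias from pre-release testing
# decays as fresh votes accumulate. Assumptions (ours, for illustration):
# k identical variants, a single reference opponent, pooled re-estimation.
import numpy as np

rng = np.random.default_rng(0)

def elo_gap(p):
    """Elo-style gap implied by win probability p against the reference."""
    p = np.clip(p, 1e-3, 1 - 1e-3)
    return 400.0 * np.log10(p / (1.0 - p))

TRUE_P = 0.5          # every variant is truly equal to the reference
K, N0 = 20, 200       # 20 pre-release variants, 200 pre-release votes each
FRESH_STEPS = [0, 1_000, 3_000, 10_000]
TRIALS = 2_000

for fresh in FRESH_STEPS:
    biases = []
    for _ in range(TRIALS):
        # Pre-release: the provider keeps the variant with the best win rate.
        wins = rng.binomial(N0, TRUE_P, size=K)
        best = wins.argmax()
        # Post-release: fresh votes pool with the selected variant's votes.
        fresh_wins = rng.binomial(fresh, TRUE_P) if fresh else 0
        p_hat = (wins[best] + fresh_wins) / (N0 + fresh)
        biases.append(elo_gap(p_hat) - elo_gap(TRUE_P))
    print(f"fresh votes = {fresh:>6}: mean selection bias ~ {np.mean(biases):+.1f} Elo")
```

With these illustrative parameters, an initial inflation of a few tens of Elo points shrinks to low single digits once a few thousand fresh votes have accumulated; the exact numbers depend entirely on the assumed number of variants and vote counts, not on Arena's real traffic.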
Truth: Any non-trivial boost in Arena score has to come from substantial model improvements. Chatbot Arena helps providers identify their best models, and that is a good thing. A good benchmark should help people find the best model. Both model providers and the community benefit from getting this early feedback.
Claim: Submitting the same model checkpoint can lead to substantially different scores.
Truth: Submitting the same model checkpoint leads to scores within a reasonable confidence interval. The writeup omits the confidence intervals when reporting its Chatbot Arena results, even though we shared them with the authors, and there is no evidence that the rankings would differ. For the example cited, scores of 1069 (±27) and 1054 (±18/22) have overlapping confidence intervals, meaning the variation is within expected statistical noise, not indicative of substantially different underlying performance.
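For readers who want to verify the overlap, here is a quick arithmetic check using only the endpoints quoted above (reading ±18/22 as -18/+22):

```python
# Quick check that the two quoted confidence intervals overlap.
a_lo, a_hi = 1069 - 27, 1069 + 27   # 1042 .. 1096
b_lo, b_hi = 1054 - 18, 1054 + 22   # 1036 .. 1076
print("intervals overlap:", max(a_lo, b_lo) <= min(a_hi, b_hi))  # -> True
```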
Claim: Big labs are given preferential treatment in pre-release testing.
Truth: Models are treated fairly according to our model testing policy: any model provider can submit as many public and private variants as they would like, as long as we have the capacity for it. Larger labs naturally submit more models because they develop more models, but all providers have the same access. Counting vision models as well, we helped Cohere evaluate 9 pre-release models from January 2025 to the present, which is 2-3x more pre-release tests than labs like xAI and OpenAI.
Claim: Chatbot Arena has an "unstated policy" allowing preferential pre-release testing for select providers.
Truth: Chatbot Arena's policy regarding the evaluation of unreleased models has been publicly available for over a year, published on March 1, 2024. There's no secret or unstated policy. It has always been our policy to only publish the results for publicly available models.
Claim: A 112% performance gain can be achieved in Chatbot Arena by incorporating Chatbot Arena data.
Truth: The experiment cited for this gain was conducted on "Arena-Hard," a static benchmark of 500 data points that uses an LLM judge and no human labels. This is not representative of Chatbot Arena, and the claim is not supported by evidence with respect to Chatbot Arena.
Finally, we offer one more clarification. It is not a factual disagreement, just an explanation of how models are sampled for those who have not read our policy.
Clarification: The best models, regardless of provider, are upsampled to improve the user experience. The Arena policy referenced above describes in detail how we sample models in battle mode. It so happens that the biggest labs often have several of the best models, but as the plot from the writeup shows, we also maintain strong diversity and sample models from other providers. See the historical fraction of battles on a per-provider basis on our blog.
We stand by the integrity and transparency of the Chatbot Arena platform. We welcome constructive feedback, especially when it helps us all build better tools for the community. However, it’s crucial that such critiques are based on accurate data and a correct understanding of our publicly stated policies and methodologies.
Arena’s mission is to provide truthful and scientific evaluation of models across diverse domains, grounded in real-world uses. This commitment drives our continuous efforts to refine Arena’s evaluation mechanisms, ensure methodological transparency, and foster trust across the AI ecosystem.
We encourage everyone to review our policy, research paper and open datasets. Our goal remains to provide a valuable, community-driven resource for LLM evaluation.