Live and Community-Driven LLM Evaluation
Chatbot Arena (lmarena.ai) is an open-source project created by members from LMSYS and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain an open evaluation platform where any user can rate LLMs via pairwise comparisons under real-world use cases, and we publish a leaderboard periodically.
Chatbot Arena was first launched in May 2023 and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 800,000 votes. This extensive engagement has enabled the evaluation of more than 90 LLMs, including commercial models such as GPT-4 and Gemini/Bard and open-weight models such as Llama and Mistral, significantly enhancing our understanding of their capabilities and limitations.
Our periodic leaderboard and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs. Our commitment to open science is further demonstrated through the sharing of user preference data and one million user prompts, supporting research and model improvement.
We also collaborate with open-source and commercial model providers to bring their latest models to the community for preview testing. We believe this initiative helps advance the field and encourages user engagement, which in turn yields the votes needed to evaluate all the models in the Arena. Moreover, it gives the community an opportunity to test the models and provide anonymized feedback before they are officially released.
The platform’s infrastructure (FastChat) and evaluation tools, available on GitHub, emphasize our dedication to transparency and community engagement in the evaluation process. This approach not only enhances the reliability of our findings but also fosters a collaborative environment for advancing LLMs.
In our ongoing efforts, we feel obligated to establish policies that guarantee evaluation transparency and trustworthiness. Moreover, we actively involve the community in shaping any modifications to the evaluation process, reinforcing our commitment to openness and collaborative progress.
Open source: The platform (FastChat), including the UI frontend, the model serving backend, and the model evaluation and ranking pipelines, is open source and available on GitHub. This means that anyone can clone, audit, or run another instance of Chatbot Arena to produce a similar leaderboard.
Transparent: The evaluation process, including rating computation, identification of anomalous users, and LLM selection, is made publicly available so that others can reproduce our analysis and fully understand how the data is collected. Furthermore, we will involve the community in deciding any changes to the evaluation process.
Listing models on the leaderboard: The public leaderboard will only include models that are generally available to the public. Specifically, models must meet at least one of the following criteria to qualify as publicly released models:
Once a publicly released model is listed on the leaderboard, the model will remain accessible at lmarena.ai for at least two weeks for the community to evaluate it.
The leaderboard distinguishes between first-party endpoints and third-party endpoints:
We prioritize listing first-party endpoints by default but may include third-party endpoints under the following conditions:
Third-party endpoints will be explicitly labeled as “third-party” on the leaderboard.
Evaluating publicly released models. Evaluating such a model consists of the following steps:
Evaluating unreleased models: We collaborate with open-source and commercial model providers to bring their unreleased models to the community for preview testing.
Model providers can test their unreleased models anonymously, meaning the models’ names will be anonymized. A model is considered unreleased if its weights are neither open, nor available via a public API or service. Evaluating an unreleased model consists of the following steps:
If a model is publicly released while we are still testing it as an unreleased model, we immediately switch to the publicly released model evaluation process.
To ensure the leaderboard accurately reflects model rankings, we rely on live comparisons between models. Hence, we may deprecate models from the leaderboard one month after they are no longer available online or publicly accessible.
Sharing data with the community: We will periodically share data with the community. Specifically, we will share 20% of the arena vote data we have collected, including the prompts, the answers, the identity of the model providing each answer (if the model is or has been on the leaderboard), and the votes. This data will only be shared if users have explicitly consented to its inclusion in the public dataset. For models that have not appeared on the public leaderboard, we may still release data, but the model will be labeled as “anonymous”.
Sharing data with the model providers: Upon request, we will offer early data access to model providers who wish to improve their models. In particular, we will share with a model provider the data that includes their model’s answers. For battles, we may not reveal the opponent model and may use an “anonymous” label instead. If the model is not on the leaderboard at the time of sharing, the model’s answers will also be labeled as “anonymous”. Before sharing the data, we will remove user PII (e.g., using Azure PII detection for text).
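The public data-sharing rule above (consent required, a 20% share, and an “anonymous” label for models that have never been on the leaderboard) can be sketched in Python as follows. This is only an illustration, not the project's actual release pipeline: the function name `select_public_sample`, the record fields (`user_consented`, `model_a`, `model_b`), and the choice to take 20% of the consented votes rather than of all votes are assumptions made for the example.

```python
import random


def select_public_sample(votes, leaderboard_models, fraction=0.2, seed=0):
    """Pick a random share of consented votes and anonymize unlisted models.

    votes: list of dicts with hypothetical keys "user_consented",
           "model_a", and "model_b" (plus prompts, answers, vote, ...).
    leaderboard_models: set of models that are or have been on the leaderboard.
    """
    # Only votes whose users explicitly consented are eligible for release.
    consented = [v for v in votes if v.get("user_consented")]

    # Assumption: the share is taken from the consented votes.
    rng = random.Random(seed)
    sample = rng.sample(consented, int(len(consented) * fraction))

    released = []
    for vote in sample:
        record = dict(vote)  # copy so the input records are not modified
        for side in ("model_a", "model_b"):
            if record[side] not in leaderboard_models:
                record[side] = "anonymous"
        record.pop("user_consented", None)  # internal flag, not released
        released.append(record)
    return released
```

A fixed seed keeps the sample reproducible, which matters if the same subset needs to be regenerated later.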
Most LLM benchmarks are static, which makes them prone to contamination, as these LLMs are trained on most available data on the Internet. Chatbot Arena aims to alleviate this problem by providing live evaluation with a continuous stream of new prompts from real people. We also believe that the open nature of the platform will attract users that accurately reflect the broader set of LLM users and real use cases.
We will continuously add new models and retire old ones. It is not feasible to add every possible model due to the cost and limited scalability of our evaluation process, i.e., it might take too long to accumulate enough votes to accurately rate each model. Today, the decision to add new models is rather ad hoc: we add models based on the community’s perceived interest. We intend to formalize this process in the near future.
We seek to provide transparency by making all the tools and the platform we use open source. We invite the community to use our platform and tools to statistically reproduce our results.
We share 20% of the data to balance transparency with the need to prevent overfitting and benchmark leakage. Sharing the entire dataset could lead to models being overly optimized for a specific distribution. By providing a representative subset, we ensure researchers and developers gain meaningful insights while maintaining the integrity of the evaluation process. This policy is regularly reviewed and may be adapted based on community feedback to align with best practices.
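As a rough illustration of how a rating can be derived from pairwise votes, here is a minimal sketch of an online Elo update in Python. It is not taken from the FastChat pipelines and is not presented as the leaderboard's method; the function name `elo_ratings`, the `(model_a, model_b, winner)` record format, and the constants (K-factor 4, scale 400, initial rating 1000) are assumptions chosen for the example.

```python
from collections import defaultdict


def elo_ratings(battles, k=4.0, scale=400.0, initial=1000.0):
    """Compute online Elo ratings from pairwise votes.

    battles: iterable of (model_a, model_b, winner) tuples, where winner
             is "model_a", "model_b", or anything else for a tie.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / scale))
        if winner == "model_a":
            score_a = 1.0
        elif winner == "model_b":
            score_a = 0.0
        else:
            score_a = 0.5  # tie
        # Zero-sum update: what model_a gains, model_b loses.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


if __name__ == "__main__":
    votes = [
        ("model-x", "model-y", "model_a"),
        ("model-y", "model-z", "tie"),
        ("model-x", "model-z", "model_b"),
    ]
    for name, rating in sorted(elo_ratings(votes).items()):
        print(f"{name}: {rating:.2f}")
```

Online Elo depends on the order in which votes are processed, so reproducing a published ranking statistically would typically also involve an order-independent fit or resampling over the vote data.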
Chatbot Arena is funded only by gifts, in the form of money, cloud credits, or API credits. The gifts have no strings attached.
Feel free to send us an email or leave feedback on GitHub!