RedTeam Arena

An Open-Source, Community-driven Jailbreaking Platform

We are excited to launch RedTeam Arena, a community-driven red-teaming platform built in collaboration with Pliny and the BASI community!

Figure 1: RedTeam Arena with Bad Words at redarena.ai

RedTeam Arena is an open-source red-teaming platform for LLMs. Our plan is to provide games that people can play for fun while sharpening their red-teaming skills. The first game, Bad Words, challenges players to convince models to say target “bad words”. It already has strong community adoption, with thousands of users participating and competing for the top spot on the jailbreaker leaderboard.

We plan to release the data after a short responsible-disclosure delay. We hope this data will help the community probe the boundaries of AI models: how they can be controlled and convinced.

This is not a bug bounty program, and it is not your grandma’s jailbreak arena. Our goal is to serve and grow the red-teaming community, and to make this one of the largest crowdsourced red-teaming initiatives of all time. From our perspective, models that are easily persuaded are not worse: they are more controllable and less resistant to persuasion, which can be good or bad depending on your use case. It’s not black-and-white.

We need your help. Join our jailbreaking game at redarena.ai. All the code is open-sourced on GitHub. You can open issues and send feedback on Discord. You are welcome to propose new games or new bad words on X (just tag @lmsysorg and @elder_plinius so we see it)!

The Leaderboard

Figure 2: Leaderboard screenshot. Latest version at redarena.ai/leaderboard

People have been asking how we compute the leaderboards of players, models, and prompts. The idea is to treat every round of Bad Words as a 1v1 game between a player and a (prompt, model) pair, and to calculate corresponding Elo-style scores. Doing this naively is sample-inefficient and converges slowly, so we instead designed a new statistical method for this purpose (writeup coming!), which we describe below.

Observation model. Let \(T\) be the number of battles (“time-steps”), \(M\) be the number of models, \(P\) be the number of players, and \(R\) be the number of prompts. For each battle \(i \in [T]\), we observe a player, a model, and a prompt, each encoded as a one-hot indicator vector: \(X_i^{\rm Model} \in \mathbb{R}^M\), \(X_i^{\rm Player} \in \mathbb{R}^P\), and \(X_i^{\rm Prompt} \in \mathbb{R}^R\).
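To make this concrete, here is one natural instantiation; the sign convention below is our sketch, chosen so that the player competes against the (prompt, model) pair in a Bradley-Terry-style model, and the exact encoding in the implementation may differ:

\[
X_i = \bigl(-e_{m_i},\; e_{p_i},\; -e_{r_i}\bigr) \in \mathbb{R}^{M+P+R},
\qquad
\mathbb{P}(Y_i = 1 \mid X_i) = \sigma\!\left(X_i^\top \beta\right) = \frac{1}{1 + e^{-X_i^\top \beta}},
\]

where \(e_j\) is the \(j\)-th standard basis vector, \(m_i\), \(p_i\), and \(r_i\) index the model, player, and prompt in battle \(i\), \(Y_i = 1\) means the player won (the model said the bad word), and \(\beta\) stacks the scores of all models, players, and prompts.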

We then compute the Extended Online Arena Score, with the feature \(X_i\) being the concatenation of \(X_i^{\rm Model}\), \(X_i^{\rm Player}\), and \(X_i^{\rm Prompt}\), and the label \(Y_i\) being the outcome of battle \(i\).
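As a minimal sketch of what one such online update can look like (assuming the signed encoding above; the function names and learning rate are illustrative, not our production code):

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def featurize(m: int, p: int, r: int, M: int, P: int, R: int) -> np.ndarray:
    """Signed one-hot features for one battle: player vs. (prompt, model)."""
    x = np.zeros(M + P + R)
    x[m] = -1.0            # model entry (negative: opposes the player)
    x[M + p] = +1.0        # player entry
    x[M + P + r] = -1.0    # prompt entry (negative: opposes the player)
    return x

def update(beta: np.ndarray, x: np.ndarray, y: int, lr: float = 0.05) -> np.ndarray:
    """One online logistic-regression step on a single battle.

    beta: concatenated scores of all models, players, and prompts
    y:    1 if the player won the battle, 0 otherwise
    """
    p_win = sigmoid(x @ beta)            # predicted probability the player wins
    return beta + lr * (y - p_win) * x   # gradient step on the log-likelihood
```

Because the features are signed, a single step after a player win raises the player’s score while lowering the model’s and prompt’s scores, and a loss does the opposite, just as in classical Elo.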

That’s it! After updating the model coefficients in this way, we report them in the leaderboard tables on RedTeam Arena.

What’s next?

RedTeam Arena is a community-driven project, and we’re eager to grow it further with your help! Whether through raising GitHub issues, creating PRs here, or providing feedback on Discord, we welcome all your contributions!

Citation

@misc{chiang2024chatbot,
    title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
    author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
    year={2024},
    eprint={2403.04132},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}