We added image support to Chatbot Arena! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against each other.
In just two weeks, we have collected more than 17,000 user preference votes in over 60 languages. In this post we show the initial leaderboard and statistics, share some interesting conversations submitted to the arena, and include a short discussion of the future of the multimodal arena.
Table 1. Multimodal Arena Leaderboard (Timeframe: June 10th - June 25th, 2024). Total votes = 17,429. The latest and detailed version is available here.
Rank | Model | Arena Score | 95% CI | Votes |
---|---|---|---|---|
1 | GPT-4o | 1226 | +7/-7 | 3878 |
2 | Claude 3.5 Sonnet | 1209 | +5/-6 | 5664 |
3 | Gemini 1.5 Pro | 1171 | +10/-6 | 3851 |
3 | GPT-4 Turbo | 1167 | +10/-9 | 3385 |
5 | Claude 3 Opus | 1084 | +8/-7 | 3988 |
5 | Gemini 1.5 Flash | 1079 | +6/-8 | 3846 |
7 | Claude 3 Sonnet | 1050 | +6/-8 | 3953 |
8 | Llava 1.6 34B | 1014 | +11/-10 | 2222 |
8 | Claude 3 Haiku | 1000 | +10/-7 | 4071 |
This multimodal leaderboard is computed only from battles that contain an image, and in Figure 1 we compare the ranks of the models in the language arena vs. the vision arena. We see that the multimodal leaderboard ranking aligns closely with the LLM leaderboard, but with a few interesting differences. Our overall findings are summarized below:
Figure 1. Comparison of the model ranks in the language arena and the vision arena.
As a small note, you might also notice that the “Elo rating” column from earlier Arena leaderboards has been renamed to “Arena score.” Rest assured: nothing has changed in the way we compute this quantity; we just renamed it. (The reason for the change is that we were computing the Bradley-Terry coefficients, which are slightly different from the Elo score, and wanted to avoid future confusion.) You should think of the Arena score as a measure of model strength. If model A has an Arena score $s_A$ and model B has an Arena score $s_B$, you can calculate the win rate of model A over model B as \(\mathbb{P}(A \text{ beats } B) = \frac{1}{1 + e^{\frac{s_B - s_A}{400}}},\) where the number 400 is an arbitrary scaling factor that we chose in order to display the Arena score in a more human-readable format (as whole numbers). For additional information on how the leaderboard is computed, please see this notebook.
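As a rough illustration of the win-rate formula, here is a minimal Python sketch that plugs two Arena scores from Table 1 into it. The function name `win_rate` and its `scale` parameter are our own choices for this example and are not taken from the Arena codebase or the notebook; the sketch simply follows the formula as written in this post.

```python
import math


def win_rate(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Predicted probability that model A beats model B.

    Implements P(A beats B) = 1 / (1 + e^((s_B - s_A) / scale)),
    following the formula stated in this post (hypothetical helper).
    """
    return 1.0 / (1.0 + math.exp((score_b - score_a) / scale))


# Arena scores taken from Table 1.
gpt_4o = 1226
claude_35_sonnet = 1209

p = win_rate(gpt_4o, claude_35_sonnet)
print(f"P(GPT-4o beats Claude 3.5 Sonnet) = {p:.3f}")
```

Under this formula, the 17-point gap between the top two models corresponds to a predicted win rate of roughly 51%, which is a useful reminder that small differences in Arena score translate into near coin-flip matchups.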
Now that we’ve looked at the leaderboard results, let’s look at some of the data! We are currently working to better group this data into categories, but these are some common tasks we have noticed so far:
Note that we currently only consider conversations that pass our moderation filter. Below are some examples from these categories, along with some other fun examples we found!
Along with rolling out some fun new categories for the multimodal leaderboard, we look forward to adding multi-image support and supporting new modalities such as PDFs, video, and audio. Please feel free to join our Discord and tell us which features you want next!