Copilot Arena's Initial Leaderboard, Insights, and a New Prompting Method for Code Completions
As LLMs are increasingly embedded in production workflows, it’s time to rethink how we measure LLM capabilities to better reflect real-world usage. A few weeks ago, we launched Copilot Arena, a free AI coding assistant that provides paired responses from different state-of-the-art LLMs. We first introduced paired code completions and more recently rolled out inline editing, a feature where users can highlight code segments, write a prompt, and receive two diff-based suggestions for modifying that code.
Thus far, Copilot Arena has been downloaded 2.5K times on the VSCode Marketplace, served over 100K completions, and accumulated over 10K code completion battles. In this blog post, we’ll cover:

- Our initial leaderboard of code completion models
- Insights into how people use Copilot Arena
- A new prompting method that enables chat models to serve code completions
As an initial set of models, we selected 9 of the best models across multiple model providers, spanning open, code-specific, and commercial models. To ensure a fair comparison between models, we do the following…
Model | Arena Score | Confidence Intervals | Median Latency (s) |
---|---|---|---|
Deepseek V2.5 | 1074 | +16/-11 | 2.13 |
Claude Sonnet 3.5 (06/20) | 1053 | +18/-17 | 2.29 |
Codestral (05/24) | 1046 | +12/-10 | 1.01 |
Meta-Llama-3.1-405B-Instruct | 1024 | +17/-15 | 1.12 |
GPT-4o (08/06) | 1016 | +17/-20 | 0.75 |
Gemini-1.5-Pro-002 | 1014 | +19/-18 | 1.44 |
Meta-Llama-3.1-70B-Instruct | 1013 | +14/-15 | 0.88 |
Gemini-1.5-Flash-002 | 1005 | +16/-22 | 0.55 |
GPT-4o-mini (07/18) | 962 | +17/-15 | 0.74 |
Table 1. Arena scores and median latency of nine popular models, based on over 10K votes collected between October 16 and November 11, 2024. We color rows based on tiers determined by confidence intervals. Each model has at least 1K votes.
Table 1 presents the current code completion leaderboard and stratifies the models into tiers. Here are our main takeaways:
We follow the same leaderboard computation as the latest version of Chatbot Arena, which is based on learning Bradley-Terry coefficients that minimize loss when predicting whether one model will beat the other. Please check out this blog post for a more in-depth description.
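For readers who want the mechanics, here is a minimal sketch of how Bradley-Terry coefficients can be fit with off-the-shelf logistic regression. The toy battle data and the Elo-style scaling constants below are illustrative assumptions, not our exact pipeline.

```python
# A minimal sketch of Bradley-Terry leaderboard computation (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each battle: (model_a, model_b, winner), winner in {"model_a", "model_b"}.
battles = [
    ("deepseek-v2.5", "gpt-4o-mini", "model_a"),
    ("codestral", "claude-3.5-sonnet", "model_b"),
    ("gemini-1.5-flash", "llama-3.1-70b", "model_a"),
    # ... in practice, the ~10K collected votes
]

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# Design matrix: +1 for model A, -1 for model B; label is 1 if model A won.
X = np.zeros((len(battles), len(models)))
y = np.zeros(len(battles))
for row, (a, b, winner) in enumerate(battles):
    X[row, idx[a]] = 1.0
    X[row, idx[b]] = -1.0
    y[row] = 1.0 if winner == "model_a" else 0.0

# Bradley-Terry coefficients = logistic regression without an intercept.
lr = LogisticRegression(fit_intercept=False)
lr.fit(X, y)

# Map coefficients onto an Elo-like scale (scale and offset are conventions).
scores = 400 / np.log(10) * lr.coef_[0] + 1000
for model, score in sorted(zip(models, scores), key=lambda t: -t[1]):
    print(f"{model}: {score:.0f}")
```

Confidence intervals like those in Table 1 can then be obtained by bootstrapping this fit over resampled battles.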
Figure 2. Fraction of model A wins for all battles.
While the Arena scores (Table 1) do not explicitly factor in model latency, since both completions are shown simultaneously, we explore whether Arena scores correlate with latency; median latency is included as a separate column in the results. In general, we find that people don’t necessarily prefer faster models. However, this may be partly because code completions in Copilot Arena are only generated after a user pauses.
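As a rough illustration (not part of our formal analysis), one can check the rank correlation between the scores and latencies already reported in Table 1:

```python
# Rank correlation between Arena scores and median latencies from Table 1.
from scipy.stats import spearmanr

arena_scores = [1074, 1053, 1046, 1024, 1016, 1014, 1013, 1005, 962]
median_latency_s = [2.13, 2.29, 1.01, 1.12, 0.75, 1.44, 0.88, 0.55, 0.74]

rho, pval = spearmanr(arena_scores, median_latency_s)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# A positive rho means higher-ranked models tend to be slower, consistent with
# the observation that users do not simply prefer the fastest model.
```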
What kind of languages do people code in?
Most current Copilot Arena users code in Python, followed by JavaScript/TypeScript, HTML/Markdown, and C++. This statistic is determined based on the file extension.
Figure 3. Filetypes requested in Copilot Arena. Filetypes are determined based on file extension.
What kind of context lengths are we looking at?
The mean context length is 1002 tokens and the median is 560 tokens. This is much longer than the tasks considered in existing static benchmarks; for example, HumanEval has a median length of ~100 tokens.
Figure 4. Context length of files requested in Copilot Arena.
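For reference, context length here is measured in tokens. Below is a hedged sketch of such a measurement; the tokenizer choice (tiktoken’s cl100k_base) is an assumption for illustration, not necessarily what Copilot Arena uses.

```python
# Count tokens in the code surrounding the cursor (prefix + suffix).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def context_length(prefix: str, suffix: str) -> int:
    """Number of tokens in the file content around the completion point."""
    return len(enc.encode(prefix)) + len(enc.encode(suffix))

print(context_length("def add(a, b):\n    return ", "\n\nprint(add(1, 2))"))
```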
Are people biased towards the top completion? Yes. In fact, 82% of accepted completions were the top completion. We are still analyzing our data, but here are a few more of our insights.
Figure 5. Distribution of user response times. Most users are taking a few seconds to read the responses.
How many people are regular users? In total, we have had votes from 833 unique users, with 200-250 daily active users.
How do you handle ties in Arena? We do not currently have an option for people to select that both responses are equally good (or bad).
How do you handle models pre-trained on FiM? For Deepseek V2.5 and Codestral, we use their APIs, which directly support FiM.
Figure 6. (Top) Example of a code completion that requires infilling capabilities. (Bottom) Example of a formatting issue that chat models encounter when prompted to complete code given the prefix and suffix.
During real development, developers frequently modify or expand existing code rather than only writing code left to right. As such, “fill in the middle” (FiM) capabilities are critical for any model used to generate code completions in Copilot Arena. Many code-specific models, including DeepSeek and Codestral, are explicitly trained to perform FiM. However, most models in Copilot Arena are chat models that are not, and they therefore struggle to format a completion appropriately when provided with the prefix and suffix. We explore a simple prompting trick that allows chat models to perform code completions with a high success rate.
Model | PSM | SPM | Mask |
---|---|---|---|
Claude-3.5-sonnet | 0.67 (+0.16) | 0.66 (+0.15) | 0.66 (+0.14) |
GPT-4o-2024-08-06 | 0.71 (+0.02) | 0.55 (+0.19) | 0.62 (+0.12) |
GPT-4o-mini-2024-07-18 | 0.18 (+0.39) | 0.12 (+0.54) | 0.15 (+0.36) |
Gemini-1.5-pro-001 | 0.38 (+0.28) | 0.34 (+0.36) | 0.43 (-0.04) |
Gemini-1.5-flash-001 | 0.34 (+0.24) | 0.27 (+0.37) | 0.36 (+0.19) |
Llama-3.1-70B-Instruct | 0.14 (+0.46) | 0.15 (+0.48) | 0.12 (+0.27) |
Table 2: Fraction of well-formatted code completions under different prompt templates (PSM, SPM, Mask). Gains from our prompting method are shown in parentheses.
Evaluation Set-up. To verify that chat models would indeed struggle to perform FiM, we use the HumanEval-Infilling dataset as an imperfect proxy to benchmark chat models’ FiM capabilities. We adopt three prompt templates considered in prior work (e.g., Gong et al.): Prefix-Suffix-Middle (PSM), Suffix-Prefix-Middle (SPM), and Mask. Instead of measuring pass@1, we only consider whether the returned infilled code is formatted correctly.
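To make the set-up concrete, the snippet below shows roughly how the three templates can be rendered as chat prompts; the exact wording here is illustrative, not the verbatim prompts used in the evaluation.

```python
# Illustrative renderings of the PSM, SPM, and Mask prompt templates.
PSM = (
    "Complete the missing code between the prefix and suffix.\n"
    "### Prefix:\n{prefix}\n"
    "### Suffix:\n{suffix}\n"
    "### Middle:\n"
)

SPM = (
    "Complete the missing code between the prefix and suffix.\n"
    "### Suffix:\n{suffix}\n"
    "### Prefix:\n{prefix}\n"
    "### Middle:\n"
)

MASK = (
    "Fill in the <MASK> region so that the program is complete.\n"
    "{prefix}<MASK>{suffix}\n"
    "Return only the code that replaces <MASK>.\n"
)

prompt = PSM.format(prefix="def add(a, b):\n    ", suffix="\n\nprint(add(1, 2))")
```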
Chat models can’t naively FiM. Table 2 shows that standard prompt templates are insufficient for chat models to complete FiM tasks. This is not necessarily an indication that the models cannot code, as many SOTA chat models are clearly proficient coders. Instead, the vast majority of errors resulted from formatting issues or duplicated code segments rather than logical errors, indicating that these models do not generalize their code outputs to FiM tasks. Since retraining these models is not feasible, we explore prompting-based alternatives that enable them to complete FiM tasks.
Our solution significantly reduces formatting errors. Instead of forcing chat models to output code in a format unaligned with their training (e.g., FiM), we let the model generate a code snippet, a more natural output format, and then post-process it into a FiM completion. Our approach is as follows: in addition to the prompt templates above, the model is instructed to begin by re-outputting a portion of the prefix and to end with a portion of the suffix. We then match these portions against the input code and delete the repeated code. As Table 2 shows, the models make far fewer formatting errors, and these benefits hold regardless of the prompt template.
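The snippet below is a minimal sketch of the post-processing step, assuming character-level overlap matching between the model’s snippet and the surrounding code; the exact matching logic in Copilot Arena may differ.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def extract_middle(prefix: str, suffix: str, snippet: str) -> str:
    """Trim the re-output prefix/suffix portions to recover the FiM completion."""
    # Drop the tail of the prefix that the model repeated at the snippet's start.
    snippet = snippet[overlap(prefix, snippet):]
    # Drop the head of the suffix that the model repeated at the snippet's end.
    k = overlap(snippet, suffix)
    return snippet[:-k] if k else snippet

prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(1, 2))"
model_snippet = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
print(extract_middle(prefix, suffix, model_snippet))  # prints "return a + b"
```

Because the model emits a plain code snippet, a format it sees constantly during training, formatting errors drop sharply, and the trimming step removes the duplicated context before the completion is shown to the user.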
@misc{chi2024copilot,
  title={Copilot Arena},
  author={Wayne Chi and Valerie Chen and Wei-Lin Chiang and Anastasios N. Angelopoulos and Naman Jain and Tianjun Zhang and Ion Stoica and Chris Donahue and Ameet Talwalkar},
  year={2024},
}