Code Editing in Copilot Arena

Copilot Arena's Code Editing Leaderboard and Insights

Introduction

AI coding assistants are no longer limited to simple code completions; many can now directly edit code as well. Copilot Arena is no different: it supports not only paired code completions but also paired code edits. Unlike code completions, which appear automatically after short pauses, code edits are triggered manually by highlighting a code snippet and writing a short task description. Copilot Arena then presents two suggestions as code diffs, and the user votes between them.

To date, Copilot Arena has been downloaded over 8.5K times on the VSCode Marketplace! We recently released the Copilot Arena live leaderboard for code completions on lmarena.ai, and we are now releasing the code edit leaderboard, which is based on 3K votes across 6 top models.

Demo of Edits

Figure 1. Demo of Copilot Arena's edit functionality.

In this blogpost we will cover:

  1. Our initial code edit leaderboard and results
  2. Insights into how people use code edits
  3. What's next for Copilot Arena

Initial Leaderboard and Results

As an initial set of models, we selected 6 of the best models across multiple model providers, spanning open, code-specific, and commercial models. To ensure a fair comparison between models, we do the following…

Model                           Arena Score   Confidence Interval
Claude 3.5 Sonnet (10/22)       1058          +13/-15
GPT-4o (08/06)                  1024          +16/-20
GPT-4o-mini (07/18)             1011          +12/-15
Qwen2.5-Coder-32B-Instruct      1005          +15/-12
Gemini-1.5-pro-002              999           +14/-14
Meta-Llama-3.1-405B-Instruct    993           +19/-14

Table 1. Arena scores (Elo ratings) of six popular models based on over 3K votes. We color rows based on tiers determined by confidence intervals. Each model has at least 1K votes.
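As an aside, the tiering convention is worth spelling out. A minimal sketch, assuming the common leaderboard practice (which we mirror here for illustration, with the scores copied from Table 1): a model starts a new tier only when its confidence interval no longer overlaps with any model in the tier above it.

```python
# Illustrative tier grouping; scores and intervals are taken from Table 1.
scores = {  # model: (arena score, +upper margin, -lower margin)
    "Claude 3.5 Sonnet (10/22)": (1058, 13, 15),
    "GPT-4o (08/06)": (1024, 16, 20),
    "GPT-4o-mini (07/18)": (1011, 12, 15),
    "Qwen2.5-Coder-32B-Instruct": (1005, 15, 12),
    "Gemini-1.5-pro-002": (999, 14, 14),
    "Meta-Llama-3.1-405B-Instruct": (993, 19, 14),
}

def lower(m): return scores[m][0] - scores[m][2]
def upper(m): return scores[m][0] + scores[m][1]

tiers, current = [], []
for model in sorted(scores, key=lambda m: -scores[m][0]):
    # A new tier starts when this model is statistically below every
    # model already in the current tier (intervals do not overlap).
    if current and all(upper(model) < lower(m) for m in current):
        tiers.append(current)
        current = []
    current.append(model)
tiers.append(current)

for i, tier in enumerate(tiers, 1):
    print(f"Tier {i}: {', '.join(tier)}")
```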

Table 1 presents the current code edit leaderboard and stratifies the models into tiers. Here are our main takeaways:

Table 2. Average response lengths for each model.

Model win rate matrix

Figure 2. Fraction of model A wins for all battles.

We follow the same leaderboard computation as the latest version of Chatbot Arena, which is based on learning Bradley-Terry coefficients that minimize loss when predicting whether one model will beat another. Please check out this blog post for a more in-depth description.
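To make this concrete, here is a minimal sketch of the idea (not the exact Copilot Arena pipeline; the battle data below is made up): Bradley-Terry coefficients can be fit with an off-the-shelf logistic regression and rescaled to Elo-like scores, with confidence intervals obtained by bootstrapping the votes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy battle data: (model_a, model_b, 1 if model_a won else 0).
# Replicated so the bootstrap below always sees both outcomes.
battles = [
    ("claude-3.5-sonnet", "gpt-4o", 1),
    ("gpt-4o", "gemini-1.5-pro", 1),
    ("gemini-1.5-pro", "claude-3.5-sonnet", 0),
    ("gpt-4o", "claude-3.5-sonnet", 0),
] * 25

models = sorted({m for a, b, _ in battles for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

def fit_ratings(sample):
    # One row per battle: +1 for model A, -1 for model B; the logistic
    # regression coefficients are then Bradley-Terry log-strengths.
    X = np.zeros((len(sample), len(models)))
    y = np.zeros(len(sample))
    for row, (a, b, winner) in enumerate(sample):
        X[row, idx[a]], X[row, idx[b]], y[row] = 1.0, -1.0, winner
    lr = LogisticRegression(fit_intercept=False).fit(X, y)
    # Rescale log-strengths to an Elo-like scale centered at 1000.
    return 400 * lr.coef_[0] / np.log(10) + 1000

ratings = fit_ratings(battles)

# Confidence intervals: resample battles with replacement, refit,
# and take percentiles of the bootstrap ratings.
rng = np.random.default_rng(0)
boot = np.array([
    fit_ratings([battles[i] for i in rng.integers(len(battles), size=len(battles))])
    for _ in range(200)
])
low, high = np.percentile(boot, [2.5, 97.5], axis=0)

for m in sorted(models, key=lambda m: -ratings[idx[m]]):
    i = idx[m]
    print(f"{m}: {ratings[i]:.0f} (+{high[i] - ratings[i]:.0f}/-{ratings[i] - low[i]:.0f})")
```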

How do people use code edits?

For general information about how people use Copilot Arena, check out the first blogpost. Here, we will focus on code edit usage.

How long are the prompts that people write? We find that the median prompt length is 34 characters and the mean is 139 characters. Most prompts are fairly short and thus depend on the context that is highlighted. Compared to the chat messages users send in Chatbot Arena, prompts for inline edits tend to be much shorter; the model must instead infer the user's goals from the context (e.g., the highlighted code snippet).

Copilot Arena prompt length distribution

Figure 3. Distribution of prompt character lengths.

What context lengths are we looking at? We look at the distribution of code-to-edit lengths, computed as the number of highlighted tokens. The median is 138 tokens and the mean is 647 tokens. While there are some outliers, this indicates that most people highlight targeted portions of code for edits: this is much shorter than the full file length, which is typically closer to 4.5K tokens on average.

Copilot Arena highlighted length distribution

Figure 4. Distribution of highlighted token lengths.
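For reference, statistics like those behind Figures 3 and 4 can be computed as follows. This is a sketch under our own assumptions: the post does not specify the tokenizer, so we use tiktoken's cl100k_base as a stand-in, and the prompts and snippets below are placeholders rather than real Copilot Arena logs.

```python
import statistics
import tiktoken  # pip install tiktoken

# Placeholder data standing in for real usage logs.
prompts = ["fix the syntax please", "add docstring", "create test cases"]
highlighted_snippets = ["def f(x):\n    return x * 2\n", "print('hello')\n"]

# Prompt lengths are measured in characters (Figure 3).
char_lengths = [len(p) for p in prompts]
print("median prompt chars:", statistics.median(char_lengths))
print("mean prompt chars:", statistics.mean(char_lengths))

# Highlighted-code lengths are measured in tokens (Figure 4).
enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer
token_lengths = [len(enc.encode(s)) for s in highlighted_snippets]
print("median highlighted tokens:", statistics.median(token_lengths))
```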

What kind of edits are people trying to make? We find that users write prompts for code edits in multiple languages, predominantly English but also Russian, Chinese, and Spanish. Prompts are typically written in informal language and directed toward a specific goal. The distribution of edit activities can be found in Figure 5.

The main categories include:

  1. Resolve errors
    • E.g., “fix the syntax please”, “Cannot read properties of null (reading ‘image’)”
  2. Optimize code
    • E.g., “add style to pokemon-image”
  3. Write code or build on existing code
    • E.g., “create a node api to send an email”, “create a function to validate emailid and phone using regular expression”
  4. Code translation
    • E.g., “change this to react compound”, “convert this to oops”
  5. Test code
    • E.g., “create test cases”, “validate the input”
  6. Styling and formatting
    • E.g., “make this code beautiful”, “format that to be compatible with .md”
  7. Documentation and explanation
    • E.g., “explain this code”, “add docstring”

Figure 5. Distribution of code edit activities based on user prompts. Each square represents 1% of the total activities.
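The post does not specify how prompts were assigned to these categories (manual labeling or an LLM are both plausible). As one illustrative approach only, a simple keyword heuristic could bucket prompts like so; the patterns below are made up for this sketch, not the actual methodology:

```python
import re

# Hypothetical keyword patterns for bucketing edit prompts.
CATEGORY_PATTERNS = {
    "Resolve errors": r"\b(fix|error|bug|cannot|exception)\b",
    "Optimize code": r"\b(optimi[sz]e|refactor|improve|faster)\b",
    "Write code": r"\b(create|write|add|implement|build)\b",
    "Code translation": r"\b(convert|translate|port)\b",
    "Test code": r"\b(test|validate)\b",
    "Styling and formatting": r"\b(format|style|beautif\w*)\b",
    "Documentation and explanation": r"\b(explain|docstring|document|comment)\b",
}

def categorize(prompt: str) -> str:
    # First matching category wins, so pattern order matters.
    for category, pattern in CATEGORY_PATTERNS.items():
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            return category
    return "Other"

print(categorize("fix the syntax please"))     # Resolve errors
print(categorize("make this code beautiful"))  # Styling and formatting
print(categorize("convert this to oops"))      # Code translation
```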

What’s next?

We’re still actively collecting votes for code edits and will share deeper analyses in the future. We’re also looking into evaluating capabilities beyond code completions and code edits.

In general, we are always looking to improve Copilot Arena. Ping us to get involved!

Citation

@misc{chi2024copilot,
    title={Copilot Arena},
    author={Wayne Chi and Valerie Chen and Wei-Lin Chiang and Anastasios N. Angelopoulos and Naman Jain and Tianjun Zhang and Ion Stoica and Chris Donahue and Ameet Talwalkar},
    year={2024},
}