RepoChat Arena

A Live Benchmark for AI Software Engineers

Introduction

Navigating and understanding complex code bases is an integral part of using LLMs in a coding workflow. We launched RepoChat, a new arena that answers code-related queries using a user-provided GitHub repository link.

RepoChat lets models automatically retrieve relevant files from the given GitHub repository. It can resolve issues, review PRs, implement code, and answer higher-level questions about a repository, all without requiring users to provide extensive context.

Video 1. A demo of RepoChat.

So far, RepoChat has collected around 20k battles and over 4k votes. All statistics in this blog are calculated from conversations and votes collected between November 30, 2024 and February 11, 2025, inclusive.

| Metric | Value |
|---|---|
| # of conversations | 12,732 |
| # of votes | 4,857 |
| # of unique GitHub links (in convos) | 5,037 |
| # of unique GitHub links (in votes) | 2,465 |

Table 1. Number of conversations, votes, and unique GitHub links logged from 11/30/2024 - 02/11/2025.

In this blog we will cover:

• Initial leaderboards for the answer models and the retrievers
• How people use RepoChat
• How RepoChat works under the hood
• Further analysis and results, including style control and retriever control
• How the leaderboards are calculated

Initial Leaderboards

Since there are two separate components (retriever and answer model), we produce two separate leaderboards. Jump to the Leaderboard Calculation section for details about how the leaderboards are computed, and to Further Analysis and Results for additional leaderboards such as the style-controlled one. All leaderboards can be reproduced using our Google Colab notebook.

| Rank | Model | Arena Score | 95% CI | Votes | Org. |
|---|---|---|---|---|---|
| 1 | GPT 4o (20241120) | 1088 | +12 / -10 | 1647 | OpenAI |
| 2 | DeepSeek V3 | 1032 | +24 / -21 | 362 | DeepSeek |
| 3 | Claude 3.5 Sonnet (20241022) | 1015 | +16 / -10 | 1737 | Anthropic |
| 4 | Gemini 2.0 Flash Exp | 1005 | +15 / -13 | 1066 | Google |
| 5 | Gemini 1.5 Pro | 974 | +19 / -16 | 789 | Google |
| 6 | GPT 4o mini (20240718) | 970 | +24 / -17 | 521 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 914 | +13 / -22 | 750 | Meta |

Table 2. Arena ratings of seven popular models based on over 4K votes collected between November 30, 2024 and February 11, 2025. This leaderboard ranks the models as answer models, i.e., for generating the final response.

| Rank | Model | Arena Score | 95% CI | Org. |
|---|---|---|---|---|
| 1 | Gemini 1.5 Flash 002 | 1013 | +14 / -21 | Google |
| 2 | GPT 4o mini (20240718) | 987 | +21 / -14 | OpenAI |

Table 3. Arena ratings of the two retrievers based on 4K votes collected between November 30, 2024 and February 11, 2025. This leaderboard ranks the models as retrievers, i.e., for retrieving relevant files.

How do people use RepoChat?

What types of GitHub links do users input? RepoChat receives a wide range of GitHub links, including repositories, issues, pull requests, and others. We find that the vast majority (almost 80%) of user-input links are repository links, followed by issues and branches.

GitHub Link Categories

Figure 1. Breakdown of the GitHub link categories in RepoChat by percentage.

What programming languages do people ask about? The following statistic is calculated from the file extensions of the retrieved files, and serves as an indicator of the programming languages users most frequently ask about or code in. The abundance of Markdown files is primarily due to README files, which are often retrieved because they contain critical descriptions and instructions for the repository.
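As a rough illustration, a breakdown of this kind can be computed by counting retrieved files per extension; the extension-to-language mapping below is a hypothetical example, not the exact one used for Figure 2.

```
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-language mapping; the real mapping is larger.
EXT_TO_LANG = {
    ".py": "Python", ".md": "Markdown", ".js": "JavaScript",
    ".ts": "TypeScript", ".go": "Go", ".rs": "Rust", ".java": "Java",
}

def language_breakdown(retrieved_paths: list[str]) -> Counter:
    """Count retrieved files by language, keyed on file extension."""
    return Counter(
        EXT_TO_LANG.get(Path(p).suffix.lower(), "Other") for p in retrieved_paths
    )
```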

Programming Language Breakdown

Figure 2. Breakdown of retrieved file types by percentage.

What natural languages do the user queries contain? Most of our votes contain user queries in English, followed by Russian and Chinese.

Natural Language Breakdown

Figure 3. Breakdown of user query languages by count.

How long are the user queries? Query length varies significantly, ranging from a minimum of 1 token to a maximum of 1,406,325 tokens, with a median of 22 tokens. Short queries mostly consist of prose requesting implementations or instructions, whereas longer queries often include extensive code blocks.

Query Length Distribution

Figure 4. Distribution of user query length.

How long are the retrieved file contexts? The distribution of retrieved file context length is more concentrated, with a median of 8,870 tokens, a mean of 12,508, and a maximum of 126,329, along with the occasional empty retrieval.

Retrieved Context Length Distribution

Figure 5. Distribution of retrieved file context length.

What are the most common user query types?

For more examples, please refer to the appendix. We have also released a dataset of around 4k battles on Hugging Face.

How Does It Work?

Each generated answer is a collaboration between two separate LLMs. The retriever model extracts the relevant files from the GitHub repository according to the given user input. The extracted file contents are then concatenated with the user query and used as a prompt for the answer model, which generates the response.

RepoChat full pipeline

Figure 6. A graphical representation of the full RepoChat retriever->answer pipeline.

File Retrieval

We select two models as retrievers: gemini-1.5-flash-002 and gpt-4o-mini.

When a valid GitHub link is provided, the repository is cloned, and its file structure is extracted and formatted as an indented layout. The retriever model is then prompted to strategically identify files in the file structure that are potentially useful for answering the query, and to output them in decreasing order of relevance in a specified markdown format.

Retriever prompt:

The following files are found in the repository:
{file structure}
Please provide a list of files that you would like to search for answering the user query.
Enclose the file paths in a list in a markdown code block as shown below:
```
1. [[ filepath_1 ]]\n
2. [[ filepath_2 ]]\n
3. [[ filepath_3 ]]\n
...
```
Think step-by-step and strategically reason about the files you choose to maximize the chances of finding the answer to the query. Only pick the files that are most likely to contain the information you are looking for in decreasing order of relevance. Once you have selected the files, please submit your response in the appropriate format mentioned above (markdown numbered list in a markdown code block). The filepath within [[ and ]] should contain the complete path of the file in the repository.
{query}
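The sketch below illustrates the two sides of this step: formatting the cloned repository as an indented file tree for the prompt, and parsing the `[[ filepath ]]` entries the retriever returns. It is a minimal approximation of the pipeline described above; the function names and tree format are ours, not RepoChat's exact implementation.

```
import os
import re

def build_file_structure(repo_root: str) -> str:
    """Format the cloned repository's file tree as an indented layout."""
    lines = []
    for dirpath, _, filenames in os.walk(repo_root):
        rel = os.path.relpath(dirpath, repo_root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath)}/")
        for name in sorted(filenames):
            lines.append(f"{indent}  {name}")
    return "\n".join(lines)

def parse_retrieved_files(retriever_output: str) -> list[str]:
    """Extract file paths enclosed in [[ ... ]], preserving the model's relevance order."""
    return [m.strip() for m in re.findall(r"\[\[\s*(.*?)\s*\]\]", retriever_output)]
```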

Model Response Generation

The contents of relevant files are extracted, concatenated with the user query, and provided to the responding LLM as a prompt in a specified format. If the provided link is not a direct repository link but instead links to issues or pull requests, a query context section containing the content of the issue or PR thread is also included.

Answer model prompt format:

Here is a list of files in the repository that may help you answer the query:
{pairs of (file_name, file_content)}
___

[INSTRUCTION]

You are an expert software engineer. Answer the following user query using provided context retrieved from the {repository_name} repository.

[QUERY CONTEXT]

{issue/PR content}

[USER QUERY]

{user query}
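A minimal sketch of how the final prompt can be assembled from the retrieved files, the optional issue/PR context, and the user query, following the format above. The helper name and exact concatenation are illustrative assumptions rather than the actual RepoChat code.

```
def build_answer_prompt(files: list[tuple[str, str]], repository_name: str,
                        user_query: str, query_context: str | None = None) -> str:
    """Concatenate retrieved (file_name, file_content) pairs with the user query."""
    context = "\n\n".join(f"File: {name}\n{content}" for name, content in files)
    prompt = (
        "Here is a list of files in the repository that may help you answer the query:\n\n"
        f"{context}\n\n___\n\n[INSTRUCTION]\n\n"
        "You are an expert software engineer. Answer the following user query using "
        f"provided context retrieved from the {repository_name} repository.\n\n"
    )
    if query_context:  # present only for issue / pull request links
        prompt += f"[QUERY CONTEXT]\n\n{query_context}\n\n"
    prompt += f"[USER QUERY]\n\n{user_query}"
    return prompt
```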

Further Analysis and Results

Style Control

Style (length and Markdown formatting) significantly impacts model ranking. A better-formatted or more visually appealing answer, though not necessarily more correct, is more likely to win. The style-controlled leaderboard isolates writing style from content quality. For details on our methodology, refer to our earlier blog post on style control.
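For intuition, style control works by adding per-battle style features (for example, differences in answer length and in the amount of Markdown formatting) to the regression described in the Leaderboard Calculation section, so their effect is absorbed by separate coefficients. The feature definitions below are a simplified sketch, not the exact features from the style control methodology.

```
import re

def style_features(answer_a: str, answer_b: str) -> list[float]:
    """Normalized differences in length and Markdown element counts between two answers."""
    def stats(text: str) -> tuple[float, float]:
        length = float(len(text))
        # Count headers, bold markers, and list items as a crude proxy for formatting.
        md_elements = float(len(re.findall(r"^#{1,6}\s|\*\*|^[-*]\s", text, flags=re.M)))
        return length, md_elements

    (len_a, md_a), (len_b, md_b) = stats(answer_a), stats(answer_b)
    return [
        (len_a - len_b) / max(len_a + len_b, 1.0),   # length difference in [-1, 1]
        (md_a - md_b) / max(md_a + md_b, 1.0),       # formatting difference in [-1, 1]
    ]
```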

| Rank | Model | Arena Score | 95% CI | Org. |
|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (20241022) | 1031 | +28 / -17 | Anthropic |
| 2 | GPT 4o (20241120) | 1028 | +23 / -24 | OpenAI |
| 3 | DeepSeek V3 | 1016 | +27 / -25 | DeepSeek |
| 4 | Gemini 1.5 Pro | 1007 | +21 / -25 | Google |
| 5 | Gemini 2.0 Flash Exp | 997 | +20 / -23 | Google |
| 6 | GPT 4o mini (20240718) | 963 | +26 / -24 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 956 | +23 / -23 | Meta |

Table 4. Style-controlled arena ratings of the answer models.

Claude 3.5 Sonnet’s score and ranking have improved significantly with style control, claiming the top spot. GPT 4o’s score has decreased, erasing the large lead it held on the original leaderboard. Gemini 1.5 Pro has also seen a notable boost in score, climbing one rank higher.

Original vs. Style-Controlled Leaderboard

Figure 7. A comparison of the original and the style-controlled leaderboards of the answer models.

Why does style control affect models like Claude and GPT 4o so much?

This is likely because a significant portion of user queries are not focused on strictly code-writing tasks. Instead, many focus on code understanding—such as how to use the repository, its purpose, or other high-level questions—rather than specific code implementations. These types of queries shift the focus away from pure coding ability and instead place greater emphasis on organization, presentation, and overall style.

Below is an example battle between Claude 3.5 Sonnet (20241022) and GPT 4o (20241120) with identical retrievals.

The GitHub link is https://github.com/xtekky/gpt4free, and the user query is “Explain why in the new 0.4.0.0 version Blackbox provider is several times slower than in version 0.3.9.7”

Click to expand the example battle of Claude 3.5 Sonnet (20241022) vs. GPT 4o (20241120)

Claude 3.5 Sonnet (20241022) Response

Let me analyze the code from the provided context to explain the potential reasons for Blackbox provider's slower performance in version 0.4.0.0.

Looking at the Blackbox.py implementation, there are several key changes that could contribute to slower performance:

  1. Validation Process: The provider now implements a more complex validation process that requires fetching and parsing JavaScript files to obtain a validation token:
                @classmethod
                async def fetch_validated(cls):
                    cached_value = cls._load_cached_value()
    
                    async with aiohttp.ClientSession() as session:
                        # Tries multiple URLs
                        urls_to_try = [
                            "https://www.blackbox.ai",
                            "https://api.blackbox.ai"
                        ]
    
                        for base_url in urls_to_try:
                            try:
                                async with session.get(base_url) as response:
                                    if response.status != 200:
                                        continue
                                    page_content = await response.text()
                                    js_files = re.findall(r'static/chunks/\d{4}-[a-fA-F0-9]+\.js', page_content)
                                    # Additional JS file parsing...
                  
  2. Multiple URL Attempts: The code now tries multiple URLs and has to process multiple JavaScript files, which adds overhead:
    • It first tries blackbox.ai.
    • Then falls back to api.blackbox.ai.
    • For each URL, it has to download and parse multiple JS files.
  3. Caching Implementation: While there is a caching mechanism implemented:
                @classmethod
                def _load_cached_value(cls) -> str | None:
                    cache_file = cls._get_cache_file()
                    if cache_file.exists():
                        try:
                            with open(cache_file, 'r') as f:
                                data = json.load(f)
                                return data.get('validated_value')
                

    The cache might not be as effective as in previous versions, requiring more frequent validation token refreshes.

  4. Additional Model Support: The provider now supports many more models and features:
                vision_models = [default_vision_model, 'gpt-4o', 'gemini-pro', 'gemini-1.5-flash', 'llama-3.1-8b', 'llama-3.1-70b', 'llama-3.1-405b']
                web_search_models = ['blackboxai', 'meta-llama/Llama-3.3-70B-Instruct-Turbo', 'meta-llama/Meta-Llama-3.1-405B-Instruct-Lite-Pro']
                

    This expanded model support likely requires additional validation and processing logic.

To improve performance, you could:

  • Optimize the validation process to reduce unnecessary requests.
  • Implement more aggressive caching of the validation token.
  • Only load model information for the specific model being used.
  • Consider using connection pooling to reduce connection overhead.

The tradeoff seems to be between security/functionality and speed - the new version implements more robust validation and supports more features, but at the cost of performance.

Would you like me to elaborate on any of these points or provide specific suggestions for optimizing the provider's performance?

GPT 4o (20241120) Response

In version 0.4.0.0 of the gpt4free repository, the Blackbox provider is slower compared to version 0.3.9.7 likely due to several changes introduced in the newer implementation. Here's an analysis based on the provided context:


1. Enhanced fetch_validated Method in Blackbox

In version 0.4.0.0, the Blackbox provider introduces a more complex and robust mechanism for fetching and validating the validated_value. This involves:

  • Fetching multiple JavaScript files from the https://www.blackbox.ai or https://api.blackbox.ai endpoints.
  • Regex-based parsing of UUIDs in JavaScript files to extract the validated_value.
  • Checking the context around the UUID to ensure it is valid.
  • Retries and handling multiple URLs, which adds overhead.

This validation mechanism includes additional asynchronous calls and error handling. While it improves reliability, it significantly increases the response time.

Comparison to version 0.3.9.7: The earlier version likely used a simpler or cached approach for this process, avoiding the need to repeatedly process JavaScript files and perform context validation.


2. Increased Error Handling and Retrying

The new implementation introduces more robust error-handling mechanisms, including:

  • Retry loops for fetching JavaScript files and API calls.
  • Exponential backoff with random delays when processing requests.
  • Fallbacks for cached values if validation fails.

While this improves reliability and stability, each retry and delay adds to the total processing time.


3. Web Search Integration

The create_async_generator function now has an optional web search feature:

  • If web_search is enabled, the provider calls an external search engine (via get_search_message) to refine the user's query.
  • Although web_search is optional, if enabled, it introduces significant latency due to network calls and extra processing.

4. Increased Complexity in Request Data Preparation

The new implementation constructs a more detailed payload for the API request:

  • Agent modes (e.g., agentMode, trendingAgentMode) are retrieved and included in the payload.
  • Model-specific settings, including image generation and other configurations, are passed in the data.
  • Image processing (if images are included) involves converting to data URIs.

This additional processing step before sending the request increases the latency.


5. More Advanced Delay and Retry Mechanisms

When an API request fails, the system uses:

  • Exponential backoff delays: delay * (2 ** attempt) + random.uniform(0, 1)
  • Retries (default max_retries=3): Every failed request results in subsequent retries.

These features, while improving resilience, can slow down the overall response time if the API is unreliable or under heavy load.


6. Additional Debugging and Logging

The new version includes enhanced debugging and logging mechanisms (debug.log):

  • Each step, including validation, retries, and errors, is logged.
  • While useful for debugging, this can also slightly slow down execution, especially in development or verbose logging environments.

7. Potential API Changes on Blackbox's Side

If the Blackbox API itself has undergone changes between the two versions, such as stricter validation or slower endpoints, this could also contribute to the slowdown. The need to fetch and validate validated_value repeatedly suggests that the newer version is designed to handle a more restrictive or rate-limited API.


8. Cache Mechanism vs. Real-Time Validation

In version 0.3.9.7, the validated_value may have been cached and reused across requests, reducing overhead. In version 0.4.0.0, the _load_cached_value and _save_cached_value methods are still present, but more emphasis is placed on real-time validation through fetch_validated, which is inherently slower.


Summary of Reasons for Slowdown

  1. More Robust fetch_validated Method: Repeatedly fetching and parsing JavaScript files for UUID validation adds significant overhead.
  2. Increased Error Handling and Retries: New retry mechanisms and delays increase resilience but add latency.
  3. Optional Web Search: If enabled, it introduces additional network calls and processing time.
  4. Enhanced Payload Preparation: More detailed request data adds preprocessing time.
  5. Reliance on Real-Time Validation: Real-time validation of the validated_value is slower than relying on cached values.
  6. API Changes on Blackbox's Side: Stricter validation or rate-limiting by the Blackbox API could also be contributing.

Recommendations to Improve Speed

  1. Cache the validated_value: Implement more aggressive caching and reduce reliance on real-time validation.
  2. Optimize the fetch_validated process: Limit retries or check for updated values less frequently.
  3. Disable features like web_search: Only enable it if explicitly needed.
  4. Reduce retry attempts and exponential backoff delays: Optimize for faster error recovery in non-critical use cases.

Retriever Control

Just like style, the retriever also affects the model's answer: a model given a more complete or relevant retrieval will likely produce a better answer. Similar to style control, the retriever-controlled leaderboard separates the effect of the retrievers from the answer-model ranking.

| Rank | Model | Arena Score | 95% CI | Org. |
|---|---|---|---|---|
| 1 | GPT 4o (20241120) | 1088 | +12 / -10 | OpenAI |
| 2 | DeepSeek V3 | 1032 | +24 / -21 | DeepSeek |
| 3 | Claude 3.5 Sonnet (20241022) | 1015 | +16 / -10 | Anthropic |
| 4 | Gemini 2.0 Flash Exp | 1005 | +15 / -13 | Google |
| 5 | Gemini 1.5 Pro | 973 | +20 / -16 | Google |
| 6 | GPT 4o mini (20240718) | 971 | +23 / -17 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 914 | +14 / -22 | Meta |

Table 5. Retriever-controlled arena ratings of the answer models.

The retriever-controlled leaderboard shows only slight differences from the original, as the two chosen retrievers perform similarly and have little influence on the rankings.

Style + Retriever Control

| Rank | Model | Arena Score | 95% CI | Org. |
|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (20241022) | 1031 | +28 / -17 | Anthropic |
| 2 | GPT 4o (20241120) | 1028 | +23 / -24 | OpenAI |
| 3 | DeepSeek V3 | 1016 | +27 / -25 | DeepSeek |
| 4 | Gemini 1.5 Pro | 1007 | +21 / -25 | Google |
| 5 | Gemini 2.0 Flash Exp | 997 | +20 / -23 | Google |
| 6 | GPT 4o mini (20240718) | 963 | +26 / -24 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 956 | +23 / -23 | Meta |

Table 6. Style-and-retriever-controlled arena ratings of the answer models.

Leaderboard Calculation

How do we get separate leaderboards for the retrievers and the answer models from one set of battles? Extra features!

Chatbot Arena leverages the Bradley-Terry model to score model strength from pairwise battles. We adopt its statistical extension to evaluate the additional subparts, the retrievers, by integrating them as extra features alongside the answer models. Each retriever feature takes values in {-1, 0, 1}, indicating whether that retriever was active in the file retrieval for model_a, for neither/both, or for model_b, respectively. By performing logistic regression on these retriever features along with the original model features, we obtain coefficients that are then scaled to become the leaderboard scores.

Model Pairwise Winrates

Figure 8. Fraction of model A wins for all non-tied battles.
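The sketch below shows this extra-features fit on non-tied battles: answer models are encoded as +1/-1 indicator features, the retrievers as the {-1, 0, 1} features described above, and a single logistic regression yields coefficients for both, which are then rescaled into Elo-like scores. It is a simplified approximation assuming hypothetical input records; tie handling and bootstrap confidence intervals are omitted.

```
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_leaderboards(battles, models, retrievers):
    """battles: dicts with model_a, model_b, retriever_a, retriever_b, winner ('model_a'/'model_b')."""
    m_idx = {m: i for i, m in enumerate(models)}
    r_idx = {r: len(models) + i for i, r in enumerate(retrievers)}
    X = np.zeros((len(battles), len(models) + len(retrievers)))
    y = np.zeros(len(battles))
    for i, b in enumerate(battles):
        X[i, m_idx[b["model_a"]]] += 1.0      # answer model on side A
        X[i, m_idx[b["model_b"]]] -= 1.0      # answer model on side B
        X[i, r_idx[b["retriever_a"]]] += 1.0  # retriever features cancel to 0
        X[i, r_idx[b["retriever_b"]]] -= 1.0  # when both sides used the same retriever
        y[i] = 1.0 if b["winner"] == "model_a" else 0.0
    coef = LogisticRegression(fit_intercept=False).fit(X, y).coef_[0]

    def scale(c):
        return 400.0 * c / np.log(10) + 1000.0  # Elo-like scaling convention

    model_scores = {m: scale(coef[i]) for m, i in m_idx.items()}
    retriever_scores = {r: scale(coef[i]) for r, i in r_idx.items()}
    return model_scores, retriever_scores
```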

What’s Next?

We are actively collecting more votes and integrating new models, with plans for more comprehensive analysis down the line. Additionally, we are exploring ways to enhance RepoChat by incorporating features such as support for private repositories, GitLab integration, and improvements to our retrieval process. Community contributions are welcome—feel free to ping us if you’d like to get involved!

Appendix