How Many User Prompts are New?

Analysis of prompt freshness and benchmark contamination

Intro

One of the key reasons why Chatbot Arena is such an enticing benchmark is that it’s live: thousands of new user conversations and votes are collected every day. This constant stream of new data helps prevent benchmark “gaming” - training on the benchmark to get a high score. But how fresh is this data really?

We investigate 355,575 LLM battles from May 2024 to Dec 2024 to answer the following questions:

1. What proportion of prompts have never been seen before (aka “fresh”)?
2. What are common duplicate prompts?
3. How many prompts appear in widely used benchmarks?

We find that:

1. Roughly 75% of the prompts collected each day are significantly different from any prompt on a previous day.
2. Duplicate prompts are largely greetings (e.g., “hi” and “hello”), the same user submitting the same prompt on the same day to multiple models, or common tester prompts like “how many r’s are in strawberry?”
3. Less than 1% of user prompts appear in popular benchmarks.

How do we measure prompt duplicates?

Prompt duplicates are measured by the cosine similarity of text embeddings (OpenAI’s text-embedding-3-small). If the similarity between the embeddings of prompt \(a\) and prompt \(b\) is greater than or equal to 0.7, we consider them duplicates. This threshold was set by manually looking through examples to determine when two prompts are asking the same thing. A random sample of prompt pairs with their similarities is provided on our Hugging Face.
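As a concrete illustration, here is a minimal sketch of this duplicate check in Python using the standard openai client. The embedding model name and the 0.7 threshold come from above; the function names and example prompts are illustrative, not our actual pipeline.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of prompts with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def is_duplicate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag prompts a and b as duplicates if their cosine similarity >= threshold."""
    ea, eb = embed([a, b])
    sim = ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb))
    return sim >= threshold

print(is_duplicate("how many r's are in strawberry?",
                   "count the number of r's in strawberry"))
```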

Given a prompt submitted at time \(t\), we examine the following:

How many duplicate prompts are there?

For roughly 75% of the prompts collected each day, there is no similar prompt submitted on any previous day; in other words, roughly 75% of each day's prompts are fresh.

Prompt Freshness per Day.

Prompt Freshness per Week.
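For reference, here is a brute-force sketch of the per-day freshness statistic plotted above, assuming prompts have already been embedded (e.g., with the snippet earlier). Variable names are illustrative; a production pipeline would likely use an approximate nearest-neighbor index rather than dense matrix products.

```python
import numpy as np

def daily_freshness(days, embs: np.ndarray, threshold: float = 0.7) -> dict:
    """days: ISO date string per prompt; embs: matching (n, d) embedding matrix.
    Returns, for each day, the fraction of that day's prompts with no
    neighbor at similarity >= threshold on any earlier day."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit-normalize
    days = np.asarray(days)
    fresh = {}
    for day in sorted(set(days)):
        today, past = embs[days == day], embs[days < day]
        if len(past) == 0:
            fresh[day] = 1.0  # the first day is trivially 100% fresh
            continue
        sims = today @ past.T  # cosine similarities of unit vectors
        fresh[day] = float((sims.max(axis=1) < threshold).mean())
    return fresh
```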

In the analysis above, the proportion of fresh prompts decreases as a function of \(t\). This is expected: as \(t\) grows, we compare new prompts against an ever-larger set of past prompts. For example, when \(t=1\), there are no previous prompts, so the proportion of unique prompts is trivially \(1/1=100\%\).

However, as \(t\) grows, this proportion stabilizes at around 70-80% fresh prompts at a similarity threshold of 0.7. This equilibrium represents the fraction of fresh prompts we expect Chatbot Arena to generate in the long run.

Interestingly, we also see certain dates where prompt freshness is significantly lower than on neighboring dates; we explain why in the next section.

What are the sources of duplicates?

We find that many of the duplicates can be attributed to three sources: “tester” prompts, hi/hello greetings, and prompts asked back to back by the same user.

Hi’s and Hello’s. We see that 2.1% of our data is some variation of “hi” in various languages. As per our deduplication policy, these are deduplicated when calculating the final rankings.
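As a rough illustration, greeting prompts like these can be flagged with a simple normalize-and-lookup pass. The greeting set below is a toy stand-in, not our actual filter:

```python
# Illustrative only: a toy greeting list, not the filter used in the analysis.
GREETINGS = {"hi", "hello", "hey", "hola", "bonjour", "ciao", "你好"}

def is_greeting(prompt: str) -> bool:
    """Flag prompts that are just a greeting after trivial normalization."""
    return prompt.strip().lower().rstrip("!.?, ") in GREETINGS
```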

Tester prompts. There are certain prompts that users have found to stump most LLMs, like “how many r’s are in strawberry” or “what is bigger, 9.11 or 9.8”. When a new model comes out, these prompts are commonly asked to gauge performance, which can be a source of days with a low proportion of fresh prompts. For instance, the week of August 8th saw a large decrease in prompt freshness, which coincides with the release of an update to GPT-4o. Looking at the top prompts on those days, we see that many are a version of “how many r’s are in strawberry”.

Strawberry and Nearest Neighbor Matches.

Most Common Prompts by Week (excluding "hi" prompts).

Repeated prompts by the same user. Many duplicate prompts are submitted by the same person on the same day. When comparing prompts against every prompt seen at any previous timestep (rather than on a previous day or week), 65% of prompts at a given time have been seen before. However, most of these duplicates occur on the same day, with 60% of them submitted by the same user. This indicates that users are asking a prompt, voting, then starting a new battle with two new models and asking the same prompt again. This is encouraging: because the models used in each conversation vary, this behavior helps maintain diversity in prompts across different model pairs and results in more consistent voting from the same user. Removing duplicate prompts submitted by the same user on the same day raises the percentage of unique prompts from 65% to 80%.
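A sketch of this same-user, same-day deduplication, assuming a battles table with user_id, date, and prompt columns (the column names are illustrative, not the actual schema). For simplicity this version uses exact string matching; the analysis itself uses the embedding-similarity criterion described earlier, so the exact-match version undercounts near-duplicates.

```python
import pandas as pd

def dedup_same_user_same_day(battles: pd.DataFrame) -> pd.DataFrame:
    """Keep the first battle for each (user, day, prompt) triple; later battles
    with the same prompt from the same user on the same day are dropped."""
    return battles.drop_duplicates(subset=["user_id", "date", "prompt"], keep="first")
```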

Days before nearest neighbor is seen.

How many prompts are seen in existing datasets?

Lastly, we wanted to ensure that user prompts are not contained in commonly used benchmarks. Using the same similarity measure, we find that a very low percentage of user prompts appear in existing datasets, reducing the likelihood that models which overfit to these benchmarks gain an advantage in the Arena.
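As an illustration, this contamination rate can be computed with the same nearest-neighbor machinery, assuming benchmark prompts are embedded with the same model and both matrices are unit-normalized (variable and function names are illustrative):

```python
import numpy as np

def contamination_rate(user_embs: np.ndarray, bench_embs: np.ndarray,
                       threshold: float = 0.7) -> float:
    """Fraction of user prompts whose nearest benchmark prompt has
    cosine similarity >= threshold (both matrices unit-normalized)."""
    sims = user_embs @ bench_embs.T  # (n_user, n_bench) cosine similarities
    return float((sims.max(axis=1) >= threshold).mean())
```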

Contamination of Prompts in Existing Datasets.

Conclusion

The majority of our user prompts are fresh (roughly 75%), and the data is not contaminated by existing benchmarks. A sample of prompts with their nearest neighbors can be found in this space. We are excited to see how this data evolves over time!

Citation

@misc{dunlap2025freshness,
    title={How Many User Prompts are New?},
    author={Lisa Dunlap and Elva Lu and Joseph E. Gonzalez and Anastasios N. Angelopoulos and Wei-Lin Chiang and Ion Stoica},
    year={2025},
}

Prompt Similarity Examples

To better understand how our similarity threshold works in practice, we provide examples of prompt pairs at different similarity levels.

Table of example prompt pairs: Prompt · Nearest Neighbor · Similarity.