Nobody knows which AI model will give the best result for your prompt before you prompt them all
- Current LLM performance rankings, such as the Elo rankings from lmsys.org or prompt-based benchmarks like livebench.ai, estimate the probability that one model will outperform another in each category. At the top of the ranking, the probability that the #1 LLM outperforms the #2 is only slightly above 50%. So if we expand to the top 3 LLMs, you are only about 50% likely to get your best answer from the #1 model (see the short calculation after this list).
- Even a smart algorithm that takes into account each LLM's ranking per category (e.g. reasoning, coding, writing, languages, ...) won't provide a significant improvement.
- So the best solution remains to prompt the models and compare their answers.
- When in doubt about an AI assertion, prompting a second LLM is a good way to check whether the answers converge.
- When writing text, generating images, or generating code, getting a second proposal from a different AI model is a good way to pick a favorite result.
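To make the "slightly above 50%" figure concrete, here is a minimal sketch of the standard Elo win-probability formula that arena-style leaderboards are built on. The ratings in the example are hypothetical, chosen only to illustrate how small the gap between top models typically is.

```python
# Minimal sketch: Elo win probability between two closely ranked models.
# The ratings below are hypothetical examples, not real leaderboard values.

def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical ratings roughly 15 points apart, as often seen at the top of a leaderboard.
top_model, runner_up = 1300.0, 1285.0
print(f"P(#1 preferred over #2) = {elo_win_probability(top_model, runner_up):.3f}")  # ~0.522
```

With a 15-point rating gap, the expected preference rate for the #1 model is only about 52%, which is why the top-ranked model is far from guaranteed to give the best answer to any given prompt.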
Empirical data from Mammouth
Reprompts on LLMs (GPT, Claude, Llama, Mistral & Gemini)

| Number of LLMs solicited per prompt | % of total prompts |
| --- | --- |
| >= 4 | 7% |
| >= 3 | 12% |
| >= 2 | 34% |
| = 1 | 66% |

Reprompts on Image Models (Midjourney, DALL-E 3 & Stable Diffusion)

| Number of image models solicited per prompt | % of total prompts |
| --- | --- |
| >= 3 | 19% |
| >= 2 | 41% |
| = 1 | 59% |
For 66% of daily LLM queries, users solicit only one model
- 66% of users' LLM queries are simple enough not to need a second model. For this majority of queries, the top LLMs provide very similar or sufficient answers that don't justify multi-prompting.
- This figure is based on data from Mammouth.ai, which offers easy prompting of the top 5 LLMs.
For 34% of daily LLM queries, users solicit two or more LLMs
- Conversely, 34% of total queries benefit from multi-model prompting. These are the high-value queries: more creative and more complex.
- 12% of total prompts are even reprompted to 3 LLMs or more.
- 7% of total prompts are reprompted to 4 LLMs or more.
Multi-model prompting is even more popular with image generation tools than with LLMs
- Indeed, 41% of image prompts are sent to at least two models among Midjourney, DALL-E, and Stable Diffusion (the models available on mammouth.ai).
- 19% of image prompts are sent to all three available models.
As AI models become more capable, the definition of the best result is becoming more subjective and less objective
- There are two different reasons to favor one model's result over another's:
- The objective reason: users favor the model that respects the rules of the prompt and provides the correct answer.
- The subjective reason: when both LLMs respect the prompt guidelines and give an objectively correct answer, one model can still be favored by the user for subjective reasons.
- As LLM performance improves, the differentiation will progressively move from objective to subjective, making multi-LLM prompting even more relevant. That's why we released the LLM Popularity Index at Mammouth.