How are LLMs ranked today: performance
The current way of evaluating LLM performance is to create a set of prompts, submit them to every LLM, and judge which model gives the best answer. Evaluation can be automated when the correct answer is known in advance (as in mathematics benchmarks) or done manually (as in the Chatbot Arena).
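To make the automated case concrete, here is a minimal sketch in Python of scoring models against an answer key. The prompts, answers, and the `ask_model` stand-in are hypothetical placeholders, not a real benchmark harness.

```python
# Minimal sketch of automated benchmark scoring: each model's answer
# is compared to a known ground truth. All data here is made up.
benchmark = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

def ask_model(model_name: str, prompt: str) -> str:
    # Stand-in for a real API call; a real harness would query the model.
    canned = {"What is 17 * 24?": "408", "Capital of France?": "Paris"}
    return canned[prompt]

def accuracy(model_name: str) -> float:
    """Share of benchmark items the model answers exactly right."""
    hits = sum(
        ask_model(model_name, item["prompt"]).strip() == item["answer"]
        for item in benchmark
    )
    return hits / len(benchmark)

print(accuracy("some-model"))  # 1.0 with the canned stand-in
```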
- The Elo models of the Chatbot Arena (see the Elo sketch after this list)
LMSYS Chatbot Arena and Leaderboard
- The scored benchmarks
LiveBench
- The combined leaderboards
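For reference, Arena-style Elo ranking reduces to a simple update rule: after each human vote between two models, the winner takes rating points from the loser in proportion to how surprising the outcome was. Below is a minimal sketch of the standard Elo update; the K-factor and starting ratings are illustrative defaults, not necessarily LMSYS's exact parameters.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - exp_a)
    return rating_a + delta, rating_b - delta

# An upset moves ratings more than an expected result: here the
# favored model (1300) loses and gives up about 20 points.
print(elo_update(1300, 1200, a_won=False))
```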
Why is a popularity index now needed?
- LLM ranking is going to be less objective and more personal.
- It’s going to be about whose advice I want to get, as we delegate tasks to LLMs at a higher and higher level.
- LLM selection is going to be less about which model gives the correct answer and more about which model gives me a relevant answer.
- It’s not only about the right answer; the best LLM is now also a matter of personal taste.
- It’s about whose opinion you want on your problem: not following your advisor’s recommendation doesn’t mean you don’t want or trust it.
- It’s a percentage and therefore an empirically estimated value (see the sketch below).
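To illustrate what "empirically estimated" means here, this is a minimal sketch of estimating a model's prompt share with a normal-approximation 95% confidence interval. The counts below are hypothetical, not the index's real data.

```python
import math

def share_with_ci(model_prompts: int, total_prompts: int, z: float = 1.96):
    """Estimate a model's prompt share and a 95% normal-approximation CI."""
    p = model_prompts / total_prompts
    margin = z * math.sqrt(p * (1 - p) / total_prompts)
    return p, (p - margin, p + margin)

# Hypothetical counts chosen to mirror a 27.2% share.
share, (lo, hi) = share_with_ci(2720, 10000)
print(f"{share:.1%} (95% CI: {lo:.1%} to {hi:.1%})")
```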
LLM Popularity Index: June 2024 edition
GPT leads Claude & Gemini by a fair margin
- With 27.2% of total prompts, GPT-4o leads its main competitors.
- Claude received 14% of total prompts in June. However, as we demonstrated in a dedicated study, Claude’s popularity rose sharply after the release of Claude 3.5 Sonnet on June 20th.
- Gemini 1.5 Pro accounts for 10% of total prompts; its popularity was overtaken by Claude’s after the release of Claude 3.5 Sonnet.
- Mistral Large represents 4.7% of total prompts.
How should we interpret Llama 3's score?