How are LLMs ranked today: performance
The current way of evaluating LLM performance is to create a set of prompts, submit them to every LLM, and judge which model gives the best answer. Evaluation can be automated when the correct answer is known in advance (as in mathematics benchmarks) or done manually (as in the Chatbot Arena).
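To make the automated case concrete, here is a minimal sketch in Python of scoring models against an answer key. The prompts, answers, and the `ask_model` stand-in are hypothetical placeholders, not a real benchmark harness.

```python
# Minimal sketch of automated benchmark scoring: each model's answer
# is compared to a known ground truth. All data here is made up.
benchmark = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

def ask_model(model_name: str, prompt: str) -> str:
    # Stand-in for a real API call; a real harness would query the model.
    canned = {"What is 17 * 24?": "408", "Capital of France?": "Paris"}
    return canned[prompt]

def accuracy(model_name: str) -> float:
    """Share of benchmark items the model answers exactly right."""
    hits = sum(
        ask_model(model_name, item["prompt"]).strip() == item["answer"]
        for item in benchmark
    )
    return hits / len(benchmark)

print(accuracy("some-model"))  # 1.0 with the canned stand-in
```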
- The Elo models of the Chatbot Arena (see the Elo sketch after this list)
LMSYS Chatbot Arena and Leaderboard
- The scored benchmarks
LiveBench
- The combined leaderboards
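For reference, Arena-style Elo ranking reduces to a simple update rule: after each human vote between two models, the winner takes rating points from the loser in proportion to how surprising the outcome was. Below is a minimal sketch of the standard Elo update; the K-factor and starting ratings are illustrative defaults, not necessarily LMSYS's exact parameters.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - exp_a)
    return rating_a + delta, rating_b - delta

# An upset moves ratings more than an expected result: here the
# favored model (1300) loses and gives up about 20 points.
print(elo_update(1300, 1200, a_won=False))
```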
Why is a popularity index now needed?
- LLM ranking is going to be less objective and more personal.
- It’s going to be about whose advice I want to get, as we delegate tasks to LLMs at a higher and higher level.
- LLM selection is going to be less about which model gives the correct answer and more about which model gives me a relevant answer.
- It’s not only about the right answer; the best LLM is now also a matter of personal taste.
- It’s about whose opinion you want on your problem: not following your advisor’s recommendation doesn’t mean you don’t want or trust it.
- It’s a percentage and therefore an empirically estimated value (see the sketch below).
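To illustrate what "empirically estimated" means here, this is a minimal sketch of estimating a model's prompt share with a normal-approximation 95% confidence interval. The counts below are hypothetical, not the index's real data.

```python
import math

def share_with_ci(model_prompts: int, total_prompts: int, z: float = 1.96):
    """Estimate a model's prompt share and a 95% normal-approximation CI."""
    p = model_prompts / total_prompts
    margin = z * math.sqrt(p * (1 - p) / total_prompts)
    return p, (p - margin, p + margin)

# Hypothetical counts chosen to mirror a 27.2% share.
share, (lo, hi) = share_with_ci(2720, 10000)
print(f"{share:.1%} (95% CI: {lo:.1%} to {hi:.1%})")
```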
LLM Popularity Index: June 2024 edition
GPT leads Claude & Gemini by a fair margin
- With 27.2% of total prompts, GPT-4o leads its main competitors.
- Claude received 14% of total prompts in June. However, as we demonstrated in a dedicated study, Claude’s popularity rose sharply after the release of Claude 3.5 Sonnet on June 20th.
- Gemini 1.5 Pro accounts for 10% of total prompts; its popularity was overtaken by Claude’s after the release of Claude 3.5 Sonnet.
- Mistral Large represents 4.7% of total prompts.
How should we interpret Llama 3's score?