Evaluating LLMs with Semantic Similarity
A novel approach to evaluating LLM output based on its semantic meaning.
Current evaluation methods come with significant limitations:
Human evaluation, like in the LMSYS Arena, is the gold standard but slow
Benchmarks like MMLU can be gamed
Using another LLM such as GPT-4 as a judge is expensive and might be biased
In this article, we cover SemScore, a recently proposed method that evaluates LLMs by looking at the semantics of their answers.
What is SemScore?
SemScore was proposed in a recent publication and focuses on the semantic content of a model’s output using embeddings. Embeddings are numerical representations of text that carry semantic meaning; the transformation from text to embedding vectors is done by embedding models. As an illustration, following the paper, we embed the words orange, lemon, and money with sentence-transformers/all-mpnet-base-v2, which yields a 768-dimensional vector for each word. If we then break these 768 dimensions down to the two most important ones using Principal Component Analysis (PCA), the words can be visualised on a 2D plot:
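Here is a minimal sketch of this step, assuming the sentence-transformers, scikit-learn, and matplotlib packages are available; the word list and plotting details are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

words = ["orange", "lemon", "car", "money"]

# Embed each word into a 768-dimensional vector
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embeddings = model.encode(words)  # shape: (4, 768)

# Project the 768 dimensions down to the two principal components
coords = PCA(n_components=2).fit_transform(embeddings)

# Plot each word at its 2D position
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```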
We can see that distances on the plot reflect the differences in meaning between these words. Embeddings not only allow us to turn simple words into interesting plots but also to quantify the similarity of entire sentences or paragraphs using cosine similarity.
Cosine Similarity
It is a metric that measures how similar two vectors are, regardless of their magnitude.
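Concretely, for two embedding vectors $\mathbf{a}$ and $\mathbf{b}$, it is the cosine of the angle between them:

$$\text{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

A value close to 1 means the vectors point in almost the same direction (similar meaning), while values near 0 mean they are largely unrelated.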
For example:
Lemon vs Orange: 0.534
Lemon vs Car: 0.291
Lemon vs Money: 0.228
Car vs Money: 0.341
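Here is a small sketch of how such values can be computed with sentence-transformers; the exact numbers depend on the embedding model used:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

words = ["lemon", "orange", "car", "money"]
embeddings = model.encode(words, convert_to_tensor=True)

# Pairwise cosine similarities between all word embeddings
similarities = util.cos_sim(embeddings, embeddings)

print(f"lemon vs orange: {similarities[0][1].item():.3f}")
print(f"lemon vs car:    {similarities[0][2].item():.3f}")
print(f"lemon vs money:  {similarities[0][3].item():.3f}")
print(f"car   vs money:  {similarities[2][3].item():.3f}")
```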
Applying this concept to entire LLM responses (instead of simple words) is what SemScore is all about.
Embedding Conversational Data
Here is a visualization of all the questions of the OpenAssistant 2 dataset, again embedded with sentence-transformers/all-mpnet-base-v2 and broken down from high-dimensional space to two dimensions using PCA.
To validate the approach of embedding text passages (questions in this case), let’s look for the pairs of questions which are most similar to each other (highest cosine similarity) and most dissimilar to each other (lowest cosine similarity).
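A rough sketch of this pair search, assuming the questions have already been collected into a list of strings (the short `questions` list below is a hypothetical placeholder for the actual dataset):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Hypothetical placeholder: in practice this list holds all dataset questions
questions = [
    "How do I bake bread at home?",
    "What is a simple recipe for homemade bread?",
    "Explain quantum entanglement in simple terms.",
]

embeddings = model.encode(questions, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

# Each question is identical to itself, so mask the diagonal before searching
similarities.fill_diagonal_(-1.0)
i, j = divmod(int(similarities.argmax()), len(questions))
print("Most similar pair:\n ", questions[i], "\n ", questions[j])

# For the most dissimilar pair, mask the diagonal with a high value instead
similarities.fill_diagonal_(2.0)
i, j = divmod(int(similarities.argmin()), len(questions))
print("Most dissimilar pair:\n ", questions[i], "\n ", questions[j])
```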
Using SemScore for Benchmarking
In the SemScore paper, the authors recreated an LLM ranking by calculating the similarity between an LLM’s answer (the prediction) and a human-written answer (the reference).
We will do something similar and apply SemScore to a corpus of conversations with human judgement: the LMSYS arena conversations. The arena assesses LLM-generated answers through direct comparison: users ask a question and receive two responses from distinct LLMs. Without knowing which LLM provided which answer, the user specifies which response they liked better. Based on these human ratings, Elo scores were calculated for each model, and the resulting ranking gives a global leaderboard reflecting human preference.
To apply SemScore, we need to compare each model’s generated answer to a reference answer. Under the assumption that GPT-4 provides the best answers, we compare each model’s answer to GPT-4’s answer for the same prompt. The more similar a model’s answers are to GPT-4’s answers, the higher it will rank.
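A minimal sketch of this ranking step, assuming each model’s answers have already been aligned with the GPT-4 answers per prompt (the `model_answers` and `gpt4_answers` variables below are hypothetical placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Hypothetical placeholders: each model's answers and the GPT-4 reference
# answers are aligned by prompt (same index = same prompt).
model_answers = {
    "model-a": ["Answer of model A to prompt 1", "Answer of model A to prompt 2"],
    "model-b": ["Answer of model B to prompt 1", "Answer of model B to prompt 2"],
}
gpt4_answers = ["GPT-4 answer to prompt 1", "GPT-4 answer to prompt 2"]

reference_emb = model.encode(gpt4_answers, convert_to_tensor=True)

scores = {}
for name, answers in model_answers.items():
    prediction_emb = model.encode(answers, convert_to_tensor=True)
    # Cosine similarity between each prediction and its reference,
    # averaged over all prompts to get one score per model
    per_prompt = util.cos_sim(prediction_emb, reference_emb).diagonal()
    scores[name] = per_prompt.mean().item()

# Higher average similarity to GPT-4 answers = higher ranking
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```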
Arena Leaderboard with SemScore
See the code in the Kaggle notebook Arena Leaderboard with SemScore.
Evaluating LLMs on Any Dataset Using SemScore
See the code example in the Kaggle notebook Evaluating LLMs on Any Dataset using SemScore.
Acknowledgements
https://medium.com/@geronimo7/semscore-evaluating-llms-with-semantic-similarity-2abf5c2fadb9
https://github.com/Aisuko/notebooks/tree/main/nlp/sentence-similarity
https://huggingface.co/tasks/sentence-similarity