Evaluation of RAG pipeline using LLMs — RAG (part 2)
Overview:
In this article, we look at how to address the challenges in optimizing the Retrieval-Augmented Generation (RAG) pipeline raised in the first part of our series. We emphasize the importance of measuring the performance of the RAG pipeline as a precursor to any optimization effort, and outline effective strategies for evaluating and enhancing each component of the RAG pipeline.
What can be done about the challenges?
- Data Management: Improving how data is chunked and stored is crucial. Rather than merely extracting raw text, storing more context is beneficial.
- Embeddings: Enhancing the representation of stored data chunks through optimized embeddings.
- Retrieval Methodologies: Advancing beyond basic top-k embedding lookups for more effective data retrieval.
- Synthesis Enhancement: Utilizing Large Language Models (LLMs) for more than just generating responses.
- Measurement and Evaluation: Establishing robust methods to measure performance is a fundamental step before proceeding with any optimizations.
Evaluation of the RAG Pipeline
The evaluation process is twofold:
Evaluating each component in isolation:
- Retrieval: Ensuring the relevance of retrieved chunks to the input query.
- Synthesis: Verifying if the response generated aligns with the retrieved chunks.
Evaluating the pipeline end-to-end:
This involves assessing the final response to a given input by:
- Creating a dataset of ‘user queries’ and their corresponding ground-truth answers.
- Running the RAG pipeline on those queries and collecting the evaluation metrics (a minimal sketch of such an evaluation set follows below).
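As an illustration, such an evaluation set can simply be a list of records pairing each user query with a reference answer, which the pipeline then fills in with its generated answer and retrieved chunks. The `rag_pipeline` function below is a hypothetical stand-in for your own pipeline, assumed to return the answer together with the retrieved context chunks.

```python
# A minimal sketch of an end-to-end evaluation set. `rag_pipeline` is a
# hypothetical placeholder for your own pipeline.
eval_set = [
    {"question": "What is the capital of France?", "ground_truth": "Paris"},
    {"question": "Who wrote 'The Old Man and the Sea'?", "ground_truth": "Ernest Hemingway"},
]

for record in eval_set:
    answer, retrieved_chunks = rag_pipeline(record["question"])  # hypothetical call
    record["answer"] = answer               # used by the generation-stage metrics
    record["contexts"] = retrieved_chunks   # used by the retrieval-stage metrics
```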
The field is evolving rapidly, and a variety of RAG evaluation frameworks and metrics are emerging, such as the RAG Triad of metrics, ROUGE, ARES, BLEU, and RAGAs. In this article, we briefly discuss the RAG Triad and RAGAs. Both evaluate RAG pipelines using LLMs as judges rather than relying on human evaluations or ground-truth comparisons for most of their metrics.
RAG Triad of Metrics
The RAG Triad involves three tests: context relevance, groundedness, and answer relevance.
- Context Relevance: Ensuring the retrieved context is pertinent to the user query, utilizing LLMs for context relevance scoring.
- Groundedness: Separating the response into statements and verifying each against the retrieved context.
- Answer Relevance: Checking if the response aptly addresses the original question.
Context Relevance:
The first step in any Retrieval-Augmented Generation (RAG) application is content retrieval, so it is vital that each retrieved context chunk is relevant to the input query; any irrelevant context risks being woven into an inaccurate answer. To assess this, we use the LLM to generate a context relevance score for each chunk relative to the user’s query, applying a chain-of-thought approach for more transparent reasoning. We rely on the same LLM reasoning capability for the groundedness and answer relevance metrics as well.
Example:
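A minimal sketch of prompt-based context relevance scoring is shown below. It assumes the OpenAI Python SDK (v1) with an API key in the environment; the model name and prompt wording are illustrative assumptions, not the exact prompts used by TruLens or any other framework.

```python
# Prompt-based context relevance scoring: ask the LLM to reason step by step,
# then emit a numeric score that we parse and normalise to 0-1.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

RELEVANCE_PROMPT = """Rate how relevant the CONTEXT is to the QUERY.
Explain your reasoning step by step, then finish with a line of the form
"SCORE: <integer between 0 and 10>".

QUERY: {query}

CONTEXT: {context}
"""

def context_relevance(query: str, context: str) -> float:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": RELEVANCE_PROMPT.format(query=query, context=context)}],
        temperature=0,
    ).choices[0].message.content
    # Take the last "SCORE: x" line and scale it to 0-1.
    score_lines = [line for line in reply.splitlines() if line.strip().upper().startswith("SCORE")]
    return float(score_lines[-1].split(":", 1)[1].strip()) / 10.0

# Usage: score every retrieved chunk against the user query.
# scores = [context_relevance(query, chunk) for chunk in retrieved_chunks]
```

Chunks that score below a chosen threshold can be flagged or filtered before synthesis.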
Groundedness:
Once the context is retrieved, an LLM crafts it into an answer. However, LLMs can deviate from the given facts, producing embellished or over-extended responses that sound correct but are not. To check our application’s groundedness, we split the response into distinct statements and then independently verify that each one is supported by the retrieved context.
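Below is a sketch of this statement-level check, under the same assumptions as the previous snippet (OpenAI SDK, illustrative model and prompt); the naive sentence split stands in for a more careful statement decomposition.

```python
# A minimal groundedness check: split the answer into statements and ask the
# LLM whether each statement is supported by the retrieved context.
from openai import OpenAI

client = OpenAI()

VERIFY_PROMPT = """CONTEXT:
{context}

STATEMENT:
{statement}

Is the STATEMENT fully supported by the CONTEXT? Answer "yes" or "no"."""

def groundedness(answer: str, context: str) -> float:
    # Naive split on sentence boundaries; an LLM can also be prompted to do
    # this decomposition.
    statements = [s.strip() for s in answer.split(".") if s.strip()]
    if not statements:
        return 0.0
    supported = 0
    for statement in statements:
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": VERIFY_PROMPT.format(context=context, statement=statement)}],
            temperature=0,
        ).choices[0].message.content.strip().lower()
        supported += verdict.startswith("yes")
    # Fraction of statements backed by the retrieved context.
    return supported / len(statements)
```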
Answer Relevance:
Finally, our response must effectively address the user’s original query. We assess this by examining how relevant the final response is to the user input, ensuring that it not only answers the question but does so in a manner that is directly applicable to the user’s needs.
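Answer relevance can be scored with the same prompt pattern; only the prompt changes. Again, the model name and prompt below are assumptions rather than any framework’s exact implementation.

```python
# Score how well the final response addresses the original query (0-1).
from openai import OpenAI

client = OpenAI()

def answer_relevance(query: str, response: str) -> float:
    prompt = (
        "On a scale of 0 to 10, how well does the RESPONSE address the QUERY? "
        "Reply with the number only.\n\n"
        f"QUERY: {query}\n\nRESPONSE: {response}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip()
    return float(reply) / 10.0
```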
If the application scores well on all three checks, we can make a nuanced statement about its correctness: the application is verified to be hallucination-free up to the limit of its knowledge base.
RAGAs
RAGAs (Retrieval Augmented Generation Assessment) is a framework developed specifically for evaluating RAG pipelines.
It computes metrics for each stage in isolation as well as overall metrics. The following metrics are used in RAGAs to evaluate RAG:
Retrieval Stage Metrics:
- Context recall: Recall measures the ability of the retrieval system to find all relevant items in the database. It answers the question, “Of all the relevant items that exist, how many has the system successfully retrieved?”. Recall = (Number of Relevant Items Retrieved) / (Total Number of Relevant Items)
- Context Precision: Measures the accuracy of the retrieval system in terms of retrieving only relevant items. In other words, it answers the question, “Of all the items the system has retrieved, how many are actually relevant?”. Precision = (Number of Relevant Items Retrieved) / (Total Number of Items Retrieved)
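To make the two formulas concrete, here is a worked set-based example. Note that RAGAs itself estimates these values with an LLM judge over the retrieved contexts (and, for recall, the ground-truth answer) rather than from labelled ID sets; the chunk IDs below are purely illustrative.

```python
# Worked example of the precision/recall formulas above using labelled chunk IDs.
retrieved = {"chunk_1", "chunk_2", "chunk_3", "chunk_4", "chunk_5"}   # what the retriever returned
relevant = {"chunk_1", "chunk_2", "chunk_4", "chunk_7", "chunk_8",
            "chunk_9", "chunk_10", "chunk_11"}                        # all relevant chunks that exist

hits = retrieved & relevant                       # relevant items that were actually retrieved -> 3
context_precision = len(hits) / len(retrieved)    # 3 / 5 = 0.6
context_recall = len(hits) / len(relevant)        # 3 / 8 = 0.375
print(context_precision, context_recall)
```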
Generation or Synthesis Stage Metrics:
- Faithfulness: Measures the factual consistency of the generated answer with respect to the retrieved contexts. The number of statements in the answer that can be inferred from the given contexts is divided by the total number of statements in the generated answer.
- Answer relevancy: Evaluates how pertinent the generated answer is to the question (see the sketch after this list).
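One way to compute answer relevancy, and the approach described in the RAGAs documentation, is to ask an LLM to generate questions that the answer could be answering and measure their embedding similarity to the original question. The sketch below reproduces that idea; the prompts, model names, and number of generated questions are illustrative assumptions.

```python
# Reverse-question sketch of answer relevancy: generate candidate questions
# from the answer and compare them with the original question via cosine
# similarity of their embeddings.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    data = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(data.data[0].embedding)

def answer_relevancy(question: str, answer: str, n: int = 3) -> float:
    prompt = f"Write {n} questions, one per line, that the following answer could be answering:\n\n{answer}"
    generated = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    ).choices[0].message.content.splitlines()

    q_vec = embed(question)
    sims = []
    for g in (line.strip() for line in generated if line.strip()):
        g_vec = embed(g)
        sims.append(float(q_vec @ g_vec / (np.linalg.norm(q_vec) * np.linalg.norm(g_vec))))
    # Mean similarity between the original question and the generated ones.
    return sum(sims) / len(sims) if sims else 0.0
```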
Overall Evaluation Metrics:
The overall RAGAs score is the harmonic mean of the four metrics above: Overall score = 4 / (1/Faithfulness + 1/Answer relevancy + 1/Context precision + 1/Context recall). In addition, RAGAs offers end-to-end metrics such as answer semantic similarity and answer correctness.
All metrics are scaled from 0 to 1, with higher values indicating better performance.
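Putting it together, the snippet below sketches how an evaluation set like the one assembled earlier can be scored with the ragas library. It assumes the `ragas` and `datasets` packages and an OpenAI key in the environment; the column names follow the ragas v0.1-style API and may differ in other versions.

```python
# Scoring a small evaluation set with RAGAs (sketch; API details may vary by version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    "ground_truth": ["Paris"],
}

results = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)  # each metric is reported on a 0-1 scale, higher is better
```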
Conclusion
This article provides a comprehensive overview of evaluating and optimizing the RAG pipeline using LLMs. By effectively measuring each component and the pipeline as a whole, we can enhance the performance and reliability of RAG applications. The field is rapidly evolving, and staying abreast of new methodologies and frameworks is key to maintaining a cutting-edge RAG system. In the next part of this series, we will look into efficient parsing and chunking techniques.