From 877662344c89a814e61278f1ae58219ebbce0665 Mon Sep 17 00:00:00 2001
From: welisheva22
Date: Wed, 14 Aug 2024 01:41:39 -0400
Subject: [PATCH] Update rag_support.rst --- copy edits (grammar, consistency,
 clarity) (#1140)

Signed-off-by: welisheva22
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
---
 docs/docs/rag_support.rst | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/docs/docs/rag_support.rst b/docs/docs/rag_support.rst
index 73d215614..b81973745 100644
--- a/docs/docs/rag_support.rst
+++ b/docs/docs/rag_support.rst
@@ -72,10 +72,10 @@ By computing ``Context Relevance`` over results from different vector stores and

 * Implementation Details

-We employ a small LLM - ``google/flan-t-5-small`` - that is known to show strong results in faithfulness assessment, and prompt it with the instruction ``Generate a question based on the given content:`` followed by one retrieved text at a time. As the model generates the question iteratively, token by token, we employ a teacher forcing strategy that uses the tokens from the actual question as ground-truth. Thus, at each step, the model uses the ground-truth tokens as input rather than the output from previous steps, and predicts the probability of generating the next ground-truth token. The geometric mean over these probabilities defines the perplexity of the retrieved text.
+We employ a small LLM - ``google/flan-t5-small`` - that is known to show strong results in faithfulness assessment, and we prompt it with the instruction ``Generate a question based on the given content:`` followed by one retrieved text at a time. As the model generates the question iteratively, token by token, we employ a teacher forcing strategy that uses the tokens from the actual question as ground truth. Thus, at each step, the model uses the ground-truth tokens as input rather than the output from previous steps, and predicts the probability of generating the next ground-truth token. The geometric mean over these probabilities defines the perplexity of the retrieved text.
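+For illustration, here is a minimal sketch of this scoring procedure using the Hugging Face ``transformers`` library. It is not the Unitxt implementation, and the helper name ``context_relevance`` is ours; it only mirrors the teacher-forcing description above:
+
+.. code-block:: python
+
+    import torch
+    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+    model_name = "google/flan-t5-small"
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()
+
+    def context_relevance(question: str, context: str) -> float:
+        """Geometric mean of the probabilities the model assigns to the
+        ground-truth question tokens when teacher-forced on them."""
+        prompt = f"Generate a question based on the given content: {context}"
+        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
+        labels = tokenizer(question, return_tensors="pt", truncation=True).input_ids
+        with torch.no_grad():
+            # Passing `labels` makes the decoder consume the ground-truth
+            # tokens (teacher forcing) rather than its own previous outputs.
+            logits = model(**inputs, labels=labels).logits
+        log_probs = torch.log_softmax(logits, dim=-1)
+        # Log-probability of each ground-truth token at each decoding step.
+        token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
+        # Geometric mean of probabilities = exp(mean of log-probabilities).
+        return token_log_probs.mean().exp().item()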
 * Limitations and Future Plans

-In future releases we will add a list of complementary metrics ``Context Relevance @ K`` for $K = {1, 3, 5, ...}$ that are computed by averaging the perplexity scores of the top-K retrieved texts. This will be useful for assessing the ranking of the retrieval as normally in RAG applications only the top results from the search are passed to the LLM for generating an answer.
+In future releases we will add a list of complementary metrics ``Context Relevance @ K`` for $K = {1, 3, 5, ...}$ that are computed by averaging the perplexity scores of the top-K retrieved texts. This will be useful for assessing the ranking of the retrieval. After all, normally in RAG applications only the top results from the search are passed to the LLM for generating an answer.

 -----

@@ -95,9 +95,9 @@ We use the well known `Mean Reciprocal Rank <https://en.wikipedia.org/wiki/Mean_reciprocal_rank>`_

+Another issue with the current metric is that only the top-ranked ground truth is used in the metric score. It does not penalize the retrieval for assigning a low rank to other ground truths. In the future this will be mitigated by supplementing MRR with the `Mean Average Precision <https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)>`_ (MAP) metric.
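+A minimal reference implementation of MRR (ours, for illustration; Unitxt's version may match ground truths differently) makes the limitation above easy to see: the loop stops at the first ground-truth hit, so the ranks of the remaining ground truths never affect the score.
+
+.. code-block:: python
+
+    def reciprocal_rank(retrieved_ids: list, ground_truth_ids: set) -> float:
+        """1 / rank of the highest-ranked ground truth, 0 if none is retrieved."""
+        for rank, doc_id in enumerate(retrieved_ids, start=1):
+            if doc_id in ground_truth_ids:
+                # Only the first (top-ranked) ground truth counts.
+                return 1.0 / rank
+        return 0.0
+
+    def mean_reciprocal_rank(retrieved_per_query: list, truths_per_query: list) -> float:
+        """Average the per-query reciprocal ranks over the evaluation set."""
+        scores = [
+            reciprocal_rank(retrieved, truths)
+            for retrieved, truths in zip(retrieved_per_query, truths_per_query)
+        ]
+        return sum(scores) / len(scores)
+
+    # The first ground truth appears at rank 2, so the score is 0.5,
+    # regardless of where (or whether) "d9" was retrieved.
+    assert reciprocal_rank(["d7", "d3", "d9"], {"d3", "d9"}) == 0.5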
 ------------------

@@ -108,9 +108,9 @@ Faithfulness

 This is a reference-less metric gauging the groundedness of the generated answer in the retrieved texts. The metric range is [0, 1], where higher is better.

 * Motivation and Approach

-We based our approach on `Adlakha et. al (2023) <https://arxiv.org/abs/2307.16877>`_ - "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering", which found that fast and inexpensive lexical analysis can provide relatively high correlation with Human judgement on Faithfulness.
+We based our approach on `Adlakha et al. (2023) <https://arxiv.org/abs/2307.16877>`_ - "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering", which found that fast and inexpensive lexical analysis can provide a relatively high correlation with human judgement on faithfulness.

-Table 4 from the paper is provided below, showing that the `K-Precision` lexical approach is close to GPT-4. The main advantage of lexical strategies over the LLM as a Judge strategy is that they are easy to implement, fast to run, and inexpensive to deploy (do not require GPUs).
+Table 4 from the paper is provided below, showing that the `K-Precision` lexical approach is close to GPT-4. The main advantage of lexical strategies over the LLM as a Judge strategy is that they are easy to implement, fast to run, and inexpensive to deploy (in other words, they do not require GPUs).

 .. image:: ../../assets/rag/adlaka_table4.png
    :alt: Table 4 of Adlakha et al. (2023)

@@ -119,12 +119,12 @@ Table 4 from the paper is provided below, showing that the `K-Precision` lexical

 * Implementation Details

-The `K-Precision` ("Knowledge Precision") metric mentioned in the paper has been part of public open source projects for a long while, and now it is also adopted in the Unitxt package for computing faithfulness scores.
+The `K-Precision` ("Knowledge Precision") metric mentioned in the paper has been part of public open source projects for a long time, and it is now also adopted in the Unitxt package for computing faithfulness scores.

-The metric is essentially token precision: we count how many of the generated tokens in the system response are included in the context retrieved from the index.
+The metric is essentially token precision: we count how many of the generated tokens in the system response are included in the context retrieved from the index.
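+As an illustration, token precision can be sketched in a few lines of Python (our simplification; Unitxt's actual implementation may tokenize and normalize differently):
+
+.. code-block:: python
+
+    def k_precision(answer: str, contexts: list[str]) -> float:
+        """Fraction of answer tokens that also appear in the retrieved contexts."""
+        answer_tokens = answer.lower().split()
+        if not answer_tokens:
+            return 0.0
+        context_tokens = set(" ".join(contexts).lower().split())
+        hits = sum(token in context_tokens for token in answer_tokens)
+        return hits / len(answer_tokens)
+
+    # "paris" and "is" are grounded in the context, "beautiful" is not -> 2/3.
+    print(k_precision("Paris is beautiful", ["Paris is the capital of France"]))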
 * Limitations and Future Plans

-Lexical strategies look at words in isolation, ignoring word order and context. This is clearly a suboptimal approach that can lead to inaccurate assessment in many cases. We plan to switch to a more robust LLM as a Judge approach once we have models that can offer a better trade-off between speed, cost and quality.
+Lexical strategies look at words in isolation, ignoring word order and context. This is clearly a suboptimal approach that can lead to inaccurate assessments in many cases. We plan to switch to a more robust LLM as a Judge approach once we have models that can offer a better trade-off between speed, cost, and quality.

 ------------

@@ -136,11 +136,11 @@ This is a reference-less metric that predicts which generated answer is better j

 * Motivation and Approach

-When it comes to the assessment of answer quality, we typically see an attempt to characterize this abstract property using various, more basic and apparently well-defined, aspects, such as: factual correctness, naturalness, appropriateness, conciseness, faithfulness, relevance, clarity, among others. However, due to the convoluted inter-relations between these properties, labeling each one of them in isolation effectively and consistently by humans is a non-trivial task that is hardly practical. It requires an exhaustive and well-defined but also clear and intuitive annotation scheme, as well as long-term training and monitoring of the labelers.
+When it comes to the assessment of answer quality, we typically see an attempt to characterize this abstract property using various more basic and apparently well-defined aspects, such as factual correctness, naturalness, appropriateness, conciseness, faithfulness, relevance, and clarity, among others. However, due to the convoluted interrelationships between these properties, labeling each one of them in isolation effectively and consistently by humans is a non-trivial task that is hardly practical. It requires an exhaustive and well-defined, but also clear and intuitive, annotation scheme. It also requires long-term training and monitoring of the labelers.

-As a counter approach, the holistic view on quality aims to characterize this property using simple, direct, questions in a realistic scenario. For example, in the comparative setup, instead of asking human labelers to rate answers by various abstract properties as mentioned above and then somehow mixing all the scores together and concluding which answer is better, it directly asks the labelers to indicate which answer is better in the use-case in which the answer is to be given (e.g. a chatbot about enterprise HR policies).
+As a counter approach, the holistic view on quality aims to characterize this property using simple, direct questions in a realistic scenario. For example, in the comparative setup, instead of asking human labelers to rate answers by various abstract properties as mentioned above and then somehow mixing all the scores together to conclude which answer is better, it directly asks the labelers to indicate which answer is better in the use case in which the answer is to be given (e.g. a chatbot about enterprise HR policies).

-The underlying assumption here is that the labeler implicitly considers all fine-grained properties like naturalness, conciseness, faithfulness, etc. and reward the overall better answer.
+The underlying assumption here is that the labeler implicitly considers all fine-grained properties like naturalness, conciseness, faithfulness, etc. and favors the overall better answer.

 For completeness, in a non-comparative setup, the holistic approach could, for example, ask the labeler to indicate if he/she would recommend the answer to a friend who asks the question, appealing, as in the comparative setup, to overall judgement.

@@ -154,7 +154,7 @@ Although the model was trained in a comparative setup (one question, multiple an

 * Limitations and Future Plans

-The reward model provides a meaningful signal on the quality of answers, but in some cases pinpointing specific qualities such as relevance is desired. In future we plan to add metrics that address these qualities.
+The reward model provides a meaningful signal on the quality of answers, but in some cases pinpointing specific qualities such as relevance is desired. In the future, we plan to add metrics that address these qualities.

 ------

@@ -167,7 +167,7 @@ This is a reference-based metric gauging the similarity between the generated an

 * Motivation and Approach

-As with [Faithfulness](#Faithfulness), we based our approach on `Adlakha et. al (2023) <https://arxiv.org/abs/2307.16877>`_, who reported relatively high correlation of lexical strategies with Human judgement on answer correctness.
+As with `Faithfulness`_, we based our approach on `Adlakha et al. (2023) <https://arxiv.org/abs/2307.16877>`_, who reported a relatively high correlation of lexical strategies with human judgement on answer correctness.

 Table 2 from the paper is provided below. The results indicate that the `Recall` lexical approach is close to GPT 3.5 and GPT-4 while being easier to implement, faster to run, and less expensive to deploy.
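+For illustration, the `Recall` lexical metric can be sketched as the mirror image of `K-Precision`: it counts how many ground-truth answer tokens appear in the generated answer (again, our simplification; the exact tokenization in Unitxt may differ):
+
+.. code-block:: python
+
+    def token_recall(prediction: str, reference: str) -> float:
+        """Fraction of reference-answer tokens that appear in the generated answer."""
+        reference_tokens = reference.lower().split()
+        if not reference_tokens:
+            return 0.0
+        prediction_tokens = set(prediction.lower().split())
+        hits = sum(token in prediction_tokens for token in reference_tokens)
+        return hits / len(reference_tokens)
+
+    # All three reference tokens occur in the prediction -> recall = 1.0.
+    print(token_recall("The capital of France is Paris", "paris is capital"))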