Update llm_as_judge.rst (#970)
* Update llm_as_judge.rst

Added a section on when to use LLMs as judges.

* Update llm_as_judge.rst

* Update llm_as_judge.rst
yoavkatz authored and gitMichal committed Jul 15, 2024
1 parent 5d96206 commit 30f9dcf
Showing 1 changed file with 22 additions and 1 deletion.
23 changes: 22 additions & 1 deletion docs/docs/llm_as_judge.rst
@@ -11,6 +11,27 @@ LLM as a Judge Metrics Guide 📊
This section will walk you through harnessing the power of LLM as judge (LLMaJ) metrics using the Unitxt package. LLM as a judge
provides a method to assess the performance of a model based on the judgments of another model.

When to use LLM as Judge
------------------------

LLMs as judges are most useful when:

1. You don't have ground truth (references) to compare with.
2. You have ground truth, but comparing it to the model response is non-trivial (e.g. requires semantic understanding).
3. You want to assess specific properties of the model's output that can easily be expressed via an LLM prompt (e.g. whether the model response contains profanity).
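The third case above - checking a specific property via a prompt - can be illustrated with a minimal sketch. This is not Unitxt's API; the prompt wording, the `build_judge_prompt` helper, and the idea of parsing a Yes/No verdict are all hypothetical, shown only to make the pattern concrete (the actual LLM call is left out):

```python
def build_judge_prompt(response: str, property_check: str) -> str:
    """Compose a prompt asking a judge LLM to check one property of a response.

    Hypothetical helper for illustration; not part of Unitxt.
    """
    return (
        "You are an impartial judge.\n"
        f"Question: {property_check}\n"
        f"Model response:\n{response}\n"
        "Answer strictly 'Yes' or 'No', then explain briefly."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Interpret the judge's free-text reply as a boolean verdict."""
    return judge_reply.strip().lower().startswith("yes")


prompt = build_judge_prompt(
    response="Have a great day!",
    property_check="Does the response contain profanity?",
)
# A judge reply such as "No, the response is polite." would parse to False.
print(parse_verdict("No, the response is polite."))
```

The key design point is that the evaluation criterion lives entirely in the prompt text, so no references or scoring code specific to the task are needed.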

Disadvantages of LLM as Judge
-----------------------------

While LLMs as judges are powerful and effective in many cases, they have some drawbacks:

1. Good judge LLMs are often large models with relatively high inference latency.
2. Deploying large LLMs is difficult and may require API access to external services.
3. Not all LLMs (including large ones) make good judges - their assessments may not correlate with human judgements and can also be biased.
   This means that unless you have prior evidence that the LLM you use is a good judge for your task, you need to evaluate its judgements and verify that they match your expectations.


Using LLMs
-----------
In this guide, we'll explore three key aspects of LLMaJ:
1. Utilizing LLM as judge as a metric in Unitxt.
2. Incorporating a new LLM as a judge metric into Unitxt.
@@ -366,4 +387,4 @@ An example for the model output is:
Rating: 9
The assistant's response is engaging and provides a good balance between cultural experiences and must-see attractions in Hawaii. The description of the Polynesian Cultural Center and the Na Pali Coast are vivid and evoke a sense of wonder and excitement. The inclusion of traditional Hawaiian dishes adds depth and authenticity to the post. The response is also well-structured and easy to follow. However, the response could benefit from a few more specific details or anecdotes to make it even more engaging and memorable.
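A free-text verdict like the one above still has to be turned into a number before it can be aggregated into a metric. A minimal sketch of that step, assuming the judge's output contains a line of the form ``Rating: <n>`` as in the example (the `parse_rating` helper is hypothetical, not a Unitxt function):

```python
import re
from typing import Optional


def parse_rating(judge_output: str) -> Optional[int]:
    """Extract the numeric score from a 'Rating: <n>' marker in the judge's output.

    Returns None when no such marker is found, so callers can
    discard or retry malformed judgements.
    """
    match = re.search(r"Rating:\s*(\d+)", judge_output)
    return int(match.group(1)) if match else None


output = "Rating: 9\nThe assistant's response is engaging and well-structured."
print(parse_rating(output))  # → 9
print(parse_rating("The judge forgot to give a score."))  # → None
```

Returning ``None`` instead of raising on malformed output is a deliberate choice here: judge models occasionally ignore the requested format, and a metric pipeline usually wants to count those cases rather than crash.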
