
[SNOW-1531256] Benchmarking - meta-evaluation as app + v1 prompt param + aggregator metrics #1282

Merged: 30 commits merged into main on Aug 7, 2024

Conversation

@sfc-gh-dhuang (Contributor) commented Jul 10, 2024

jira: https://snowflakecomputing.atlassian.net/browse/SNOW-1531256

Changes made in this PR:

  1. Prompt parametrization v0 for eval criteria and output space. There will be follow-up PRs to further refactor / continue the work on the V2 feedback directory and to add few-shot examples as a prompt parameter.
  2. Evaluator calibration experiments. Note that these used the old eval_as_recommendation script; I'll have a separate PR to run more experiments using the new TruBenchmarkExperiment construct.
  3. Meta-evaluation as a TruCustom app.
  4. Aggregator in the GroundTruth class for meta-eval metrics computation (see the sketch right after this list).
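Roughly how these pieces fit together, as a hedged sketch: the names follow the notebook snippet and review discussion later in this thread, the import path for GroundTruthAggregator is an assumption, and the trace-to-score function here is a hypothetical stub. The merged code may differ.

```python
from trulens_eval import Tru
# Assumed import location for the new aggregator introduced in this PR.
from trulens_eval.feedback import GroundTruthAggregator

tru = Tru()

# Illustrative labels; the GT dataset schema is still listed as future work below.
true_labels = [1, 0, 1, 1]

# Aggregator for meta-eval metrics, e.g. mean absolute error against the labels.
mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae

def context_relevance_ff_to_score(query: str, response: str, params: dict) -> float:
    """Hypothetical trace-to-score function wrapping a context relevance feedback call."""
    return 0.5  # placeholder score

# Meta-evaluation wrapped as a benchmark experiment (TruCustom) app.
tru_benchmark = tru.BenchmarkExperiment(
    app_id="MAE",
    trace_to_score_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
)
```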

Near-term future work:

  1. V2 feedback prompt parametrization - possibly a continuation of what @sfc-gh-pmardziel previously laid out in v2/feedback.py. Immediate action items include mapping output_space to a score range (min_score_val, max_score_val).
  2. GT dataset loading and persistence in a Snowflake table.
  3. Transformation utilities for BEIR and other curated datasets (e.g., QAGS and Topical Chat).
  4. Actual benchmark runs to update context_relevance_benchmark_calibration.ipynb using the new TruBenchmark app, to verify the correctness of the various metrics.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

ReviewNB bot: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

trulens_eval/trulens_eval/feedback/provider/base.py (review thread outdated, resolved)
trulens_eval/trulens_eval/feedback/v2/feedback.py (review thread outdated, resolved)
user_prompt=user_prompt,
temperature=temperature
)

def context_relevance_verb_confidence(
Contributor:

It would be good to have both _verb_confidence and _cot_reasons as feedback parameters instead of splitting them into different functions. I can see users wanting to use both at the same time. This also lets us maintain fewer methods and gives users less paralysis over which method to use.

Contributor Author (sfc-gh-dhuang):

Will do this in the following PR, where I'm planning to continue / pick up the feedback v2 redesign work.
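Purely for illustration, one way the suggested consolidation could look. This is not the current trulens_eval API; the function name and parameters are hypothetical.

```python
from typing import Dict, Tuple

def context_relevance(
    question: str,
    context: str,
    use_verb_confidence: bool = False,  # also return a verbalized confidence score
    with_cot_reasons: bool = False,     # also return chain-of-thought reasons
) -> Tuple[float, Dict]:
    """Single entry point instead of separate *_verb_confidence / *_cot_reasons variants."""
    score: float = 0.0  # placeholder; a provider call would produce the real score
    meta: Dict = {}
    if use_verb_confidence:
        meta["confidence"] = None  # would hold the model-reported confidence
    if with_cot_reasons:
        meta["reasons"] = None  # would hold the model's reasoning text
    return score, meta
```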

@sfc-gh-jreini (Contributor) commented Aug 6, 2024:

Since this is planned to be a short-lived method, should we mark it as internal (_ prefix) or experimental?

@sfc-gh-dhuang changed the title from "Calibrating evaluators" to "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance" on Jul 19, 2024
@sfc-gh-dhuang force-pushed the eval-calibration branch 2 times, most recently from 5a889be to 74c184d, on July 22, 2024 20:26
@sfc-gh-dhuang changed the title from "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance" to "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance + meta-eval as GT eval" on Jul 25, 2024
@sfc-gh-dhuang force-pushed the eval-calibration branch 4 times, most recently from b492f39 to d8dcd0e, on July 30, 2024 02:22
@sfc-gh-dhuang changed the title from "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance + meta-eval as GT eval" to "[SNOW-1531256] Benchmarking - meta-evaluation as app + v1 prompt param + aggregator metrics" on Jul 30, 2024
@sfc-gh-dhuang force-pushed the eval-calibration branch 6 times, most recently from 89bfcc1 to 321daf5, on August 2, 2024 00:11
trulens_eval/trulens_eval/tru.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread resolved)
@@ -168,15 +172,17 @@ def agreement_measure(
            prompt, response, ground_truth_response
        )
        ret = (
-           re_0_10_rating(agreement_txt) / 10,
+           re_configured_rating(agreement_txt) / 3,
Contributor:

I'm guessing re_configured_rating is missing max_value=3 here

@sfc-gh-dhuang (Contributor Author) commented Aug 6, 2024:

re_configured_rating now defaults to likert_0_3 - but I've passed the argument 3 to max_value to be more explicit (and in a few other places where re_configured_rating is currently used as well).

Just to add: this is still in an incomplete state, and I've added near-term action items to the PR description, including mapping output_space to a score range.
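For context, the more explicit call shape described above would look roughly like this, assuming re_configured_rating lives in utils/generated.py and accepts a max_value keyword as discussed in this thread; treat it as a sketch, not the exact merged code.

```python
from trulens_eval.utils.generated import re_configured_rating

agreement_txt = "2"  # illustrative LLM output containing a 0-3 rating
# Pass max_value explicitly rather than relying on the likert_0_3 default,
# then normalize the parsed rating to [0, 1].
score = re_configured_rating(agreement_txt, max_value=3) / 3
```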

trulens_eval/trulens_eval/feedback/v2/feedback.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread resolved)
for match in matches:
try:
vals.add(
validate_rating(
Contributor:

Given that validate_rating seems to be used only here, it should probably just return a bool and use that to decide whether to add the value to vals. This try/except approach is prone to breaking if anyone later puts something in the try block that can raise its own ValueError. It's also much slower (I realize that's not a big deal here), and personally I find it harder to read (though that's subjective haha).

Contributor Author (sfc-gh-dhuang):

Sounds good - removed the try/except block and simplified.
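For reference, a minimal sketch of the bool-returning pattern the reviewer describes; the actual merged change simply drops the try/except, and the helper name here is illustrative.

```python
def is_valid_rating(rating: int, max_value: int = 3) -> bool:
    """Hypothetical helper: check that a parsed rating is within the allowed range."""
    return 0 <= rating <= max_value

matches = ["0", "2", "7"]  # e.g. regex matches pulled from an LLM response
vals = set()
for match in matches:
    rating = int(match)
    if is_valid_rating(rating):  # no try/except needed
        vals.add(rating)
```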

trulens_eval/trulens_eval/tru.py (review thread outdated, resolved)
trulens_eval/trulens_eval/tru.py (review thread outdated, resolved)
" app_id=\"MAE\",\n",
" ground_truth=golden_set,\n",
" trace_to_score_fn=context_relevance_ff_to_score,\n",
" agg_funcs=[mae_agg_func],\n",
Contributor:

I'm confused by the SDK. We know

mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae

but why do we have to pass the true labels both in mae_agg_func and in the ground_truth arg to tru.BenchmarkExperiment? It seems kind of redundant.

Contributor Author (sfc-gh-dhuang):

Removed the redundant argument in this commit.
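Roughly, the before/after call shape; this is an illustrative sketch reusing the names from the notebook snippet above, and the exact signatures may differ in the merged code.

```python
# Before: labels passed twice, via the aggregator and via ground_truth.
mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae
tru.BenchmarkExperiment(
    app_id="MAE",
    ground_truth=golden_set,  # redundant with true_labels above
    trace_to_score_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
)

# After: the labels live only in the aggregator.
tru.BenchmarkExperiment(
    app_id="MAE",
    trace_to_score_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
)
```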

)
"""Aggregate benchmarking metrics for ground-truth-based evaluation on feedback fuctions."""

true_labels: List[int]
Contributor:

Can you align this arg name with the one in the GroundTruth class? I'm okay with either (ground_truth or true_labels).

@sfc-gh-dhuang (Contributor Author) commented Aug 6, 2024:

I've addressed the redundant ground_truth attribute in TruBenchmarkExperiment. This commit handles it.

But to clarify, ground_truth has a different meaning from true_labels in our current implementation.

ground_truth is equivalent to the golden set, i.e. the ground-truth data collection, while true_labels are just the labels - often one of the columns in the GT dataset table (we don't have the schema defined yet, but we will soon).
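Illustrative only, since no GT dataset schema is defined yet per the reply above; the column name here is hypothetical, just to show the distinction in code terms.

```python
# The golden set is the whole ground-truth collection...
golden_set = [
    {"query": "q1", "response": "r1", "expected_score": 1},
    {"query": "q2", "response": "r2", "expected_score": 0},
]
# ...while true_labels is just the label column pulled out of it.
true_labels = [row["expected_score"] for row in golden_set]
```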

output_space: Optional[str] = None
# TODO: support more parameters
# "use_verb_confidence": False,
# K should not be part of benchmark params b/c each set of benchmark params could have multiple set of K values for different metric aggregators
Contributor:

What is our line length limit? This seems very high. Should this be changed in our linter?

benchmark_params_dict: dict = self.benchmark_params.model_dump()
ret = feedback_fn(row["query"], row["response"], benchmark_params_dict)

# TODO: better define the shape of arguments of feedback_fn
Contributor:

Is this TODO still valid?

Contributor Author (sfc-gh-dhuang):

Yes, still valid - this is because the line above, feedback_fn(row["query"], row["response"], benchmark_params_dict), only handles feedback functions that take two string arguments and should be made more flexible. I'll come back to this as part of the GT dataset schema work.
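In other words, the harness currently assumes feedback functions with this shape; a hedged sketch where the function and parameter names are illustrative, not the actual API.

```python
def context_relevance_feedback(query: str, response: str, params: dict) -> float:
    """Hypothetical feedback function: two string args plus the benchmark params dict."""
    output_space = params.get("output_space")  # e.g. a likert_0_3-style range
    # A real implementation would call a provider here; return a placeholder score.
    return 0.0
```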

from trulens_eval.feedback.benchmark_frameworks.tru_benchmark_experiment import (
TruBenchmarkExperiment,
)

Contributor:

Is this a common Python practice, importing inside a function?

Contributor Author (sfc-gh-dhuang):

I'm not sure how common it is, but I've seen it in other codebases. It's done here to avoid a circular dependency, and I believe the same reason applies to several other occurrences in TruLens.
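A minimal sketch of the deferred-import pattern being described; the wrapper function is hypothetical, while the import path is the one shown in the diff above.

```python
def make_benchmark_experiment(*args, **kwargs):
    # Imported inside the function rather than at module top level so that
    # tru.py and tru_benchmark_experiment.py can reference each other without
    # a circular import at load time.
    from trulens_eval.feedback.benchmark_frameworks.tru_benchmark_experiment import (
        TruBenchmarkExperiment,
    )
    return TruBenchmarkExperiment(*args, **kwargs)
```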

@dosubot added the lgtm label ("This PR has been approved by a maintainer") on Aug 7, 2024
@sfc-gh-dhuang merged commit 58bc3f4 into main on Aug 7, 2024
9 checks passed
@sfc-gh-dhuang deleted the eval-calibration branch on August 7, 2024 at 01:24
Labels: dependencies (Pull requests that update a dependency file), lgtm (This PR has been approved by a maintainer), size:XXL (This PR changes 1000+ lines, ignoring generated files)
5 participants