
[SNOW-1531256] Benchmarking - meta-evaluation as app + v1 prompt param + aggregator metrics #1282

Merged: 30 commits merged into main on Aug 7, 2024

Conversation

@sfc-gh-dhuang (Contributor) commented Jul 10, 2024

jira: https://snowflakecomputing.atlassian.net/browse/SNOW-1531256

Changes made in this PR:

  1. Prompt parametrization v0 for eval criteria and output space. There will be follow-up PRs to further refactor / continue the work on the V2 feedback directory and to add few-shot examples as a prompt parameter.
  2. Evaluator calibration experiments. Note that these used the old eval_as_recommendation script; I'll have a separate PR to run more experiments using the new TruBenchmarkExperiment construct.
  3. Meta-evaluation as a TruCustom app.
  4. Aggregator in the GroundTruth class for meta-eval metrics computation (see the sketch right after this list).
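Roughly how these pieces fit together, as a hedged sketch: the names follow the notebook snippet and review discussion later in this thread, the import path for GroundTruthAggregator is an assumption, and the trace-to-score function here is a hypothetical stub. The merged code may differ.

```python
from trulens_eval import Tru
# Assumed import location for the new aggregator introduced in this PR.
from trulens_eval.feedback import GroundTruthAggregator

tru = Tru()

# Illustrative labels; the GT dataset schema is still listed as future work below.
true_labels = [1, 0, 1, 1]

# Aggregator for meta-eval metrics, e.g. mean absolute error against the labels.
mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae

def context_relevance_ff_to_score(query: str, response: str, params: dict) -> float:
    """Hypothetical trace-to-score function wrapping a context relevance feedback call."""
    return 0.5  # placeholder score

# Meta-evaluation wrapped as a benchmark experiment (TruCustom) app.
tru_benchmark = tru.BenchmarkExperiment(
    app_id="MAE",
    trace_to_score_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
)
```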

Near-term future work:

  1. V2 feedback prompt parametrization - possibly a continuation of what @sfc-gh-pmardziel previously laid out in v2/feedback.py. Immediate action items include mapping output_space to a score range (min_score_val, max_score_val).
  2. GT dataset loading and persistence in a Snowflake table.
  3. Transformation utilities for BEIR and other curated datasets (e.g., QAGS and Topical Chat).
  4. Actual benchmark runs to update context_relevance_benchmark_calibration.ipynb using the new TruBenchmark app, to verify the correctness of the various metrics.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

ReviewNB bot: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter notebooks.

trulens_eval/trulens_eval/feedback/provider/base.py (review thread outdated, resolved)
trulens_eval/trulens_eval/feedback/v2/feedback.py (review thread outdated, resolved)
user_prompt=user_prompt,
temperature=temperature
)

def context_relevance_verb_confidence(
Contributor:

It would be good to have both _verb_confidence and _cot_reasons as feedback parameters instead of splitting them into different functions. I can see users wanting to use both at the same time. This also lets us maintain fewer methods and gives users less paralysis over which method to use.

Contributor Author (sfc-gh-dhuang):

Will do this in the following PR, where I'm planning to continue / pick up the feedback v2 redesign work.
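Purely for illustration, one way the suggested consolidation could look. This is not the current trulens_eval API; the function name and parameters are hypothetical.

```python
from typing import Dict, Tuple

def context_relevance(
    question: str,
    context: str,
    use_verb_confidence: bool = False,  # also return a verbalized confidence score
    with_cot_reasons: bool = False,     # also return chain-of-thought reasons
) -> Tuple[float, Dict]:
    """Single entry point instead of separate *_verb_confidence / *_cot_reasons variants."""
    score: float = 0.0  # placeholder; a provider call would produce the real score
    meta: Dict = {}
    if use_verb_confidence:
        meta["confidence"] = None  # would hold the model-reported confidence
    if with_cot_reasons:
        meta["reasons"] = None  # would hold the model's reasoning text
    return score, meta
```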

@sfc-gh-jreini (Contributor) commented Aug 6, 2024:

Since this is planned to be a short-lived method, should we mark it as internal (_ prefix) or experimental?

@sfc-gh-dhuang changed the title from "Calibrating evaluators" to "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance" on Jul 19, 2024
@sfc-gh-dhuang force-pushed the eval-calibration branch 2 times, most recently from 5a889be to 74c184d, on July 22, 2024 20:26
@sfc-gh-dhuang changed the title from "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance" to "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance + meta-eval as GT eval" on Jul 25, 2024
@sfc-gh-dhuang force-pushed the eval-calibration branch 4 times, most recently from b492f39 to d8dcd0e, on July 30, 2024 02:22
@sfc-gh-dhuang changed the title from "Calibrating evaluators - parametrizing feedback prompts and basic calibration experiment notebook for context relevance + meta-eval as GT eval" to "[SNOW-1531256] Benchmarking - meta-evaluation as app + v1 prompt param + aggregator metrics" on Jul 30, 2024
@sfc-gh-dhuang force-pushed the eval-calibration branch 6 times, most recently from 89bfcc1 to 321daf5, on August 2, 2024 00:11
trulens_eval/trulens_eval/tru.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread resolved)
@@ -168,15 +172,17 @@ def agreement_measure(
            prompt, response, ground_truth_response
        )
        ret = (
-           re_0_10_rating(agreement_txt) / 10,
+           re_configured_rating(agreement_txt) / 3,
Contributor:

I'm guessing re_configured_rating is missing max_value=3 here

@sfc-gh-dhuang (Contributor Author) commented Aug 6, 2024:

re_configured_rating now defaults to likert_0_3 - but I've passed the argument 3 to max_value to be more explicit (and in a few other places where re_configured_rating is currently used as well).

Just to add: this is still in an incomplete state, and I've added near-term action items to the PR description, including mapping output_space to a score range.
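For context, the more explicit call shape described above would look roughly like this, assuming re_configured_rating lives in utils/generated.py and accepts a max_value keyword as discussed in this thread; treat it as a sketch, not the exact merged code.

```python
from trulens_eval.utils.generated import re_configured_rating

agreement_txt = "2"  # illustrative LLM output containing a 0-3 rating
# Pass max_value explicitly rather than relying on the likert_0_3 default,
# then normalize the parsed rating to [0, 1].
score = re_configured_rating(agreement_txt, max_value=3) / 3
```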

trulens_eval/trulens_eval/feedback/v2/feedback.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread outdated, resolved)
trulens_eval/trulens_eval/utils/generated.py (review thread resolved)
for match in matches:
try:
vals.add(
validate_rating(
Contributor:

Given that validate_rating seems to be used only here, it should probably just return a bool and use that to decide whether to add the value to vals. This try/except approach is prone to breaking if anyone later puts something in the try block that can raise its own ValueError. It's also much slower (I realize that's not a big deal here), and personally I find it harder to read (though that's subjective haha).

Contributor Author (sfc-gh-dhuang):

Sounds good - removed the try/except block and simplified.
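For reference, a minimal sketch of the bool-returning pattern the reviewer describes; the actual merged change simply drops the try/except, and the helper name here is illustrative.

```python
def is_valid_rating(rating: int, max_value: int = 3) -> bool:
    """Hypothetical helper: check that a parsed rating is within the allowed range."""
    return 0 <= rating <= max_value

matches = ["0", "2", "7"]  # e.g. regex matches pulled from an LLM response
vals = set()
for match in matches:
    rating = int(match)
    if is_valid_rating(rating):  # no try/except needed
        vals.add(rating)
```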

trulens_eval/trulens_eval/tru.py (review thread outdated, resolved)
trulens_eval/trulens_eval/tru.py (review thread outdated, resolved)
" app_id=\"MAE\",\n",
" ground_truth=golden_set,\n",
" trace_to_score_fn=context_relevance_ff_to_score,\n",
" agg_funcs=[mae_agg_func],\n",
Contributor:

I'm confused by the SDK. We know

mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae

but why do we have to pass the true labels both in mae_agg_func and in the ground_truth arg to tru.BenchmarkExperiment? It seems kind of redundant.

Contributor Author (sfc-gh-dhuang):

Removed the redundant argument in this commit.
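Roughly, the before/after call shape; this is an illustrative sketch reusing the names from the notebook snippet above, and the exact signatures may differ in the merged code.

```python
# Before: labels passed twice, via the aggregator and via ground_truth.
mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae
tru.BenchmarkExperiment(
    app_id="MAE",
    ground_truth=golden_set,  # redundant with true_labels above
    trace_to_score_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
)

# After: the labels live only in the aggregator.
tru.BenchmarkExperiment(
    app_id="MAE",
    trace_to_score_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
)
```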

)
"""Aggregate benchmarking metrics for ground-truth-based evaluation on feedback fuctions."""

true_labels: List[int]
Contributor:

Can you align this arg name with the one in the GroundTruth class? I'm okay with either (ground_truth or true_labels).

@sfc-gh-dhuang (Contributor Author) commented Aug 6, 2024:

I've addressed the redundant ground_truth attribute in TruBenchmarkExperiment. This commit handles it.

But to clarify, ground_truth has a different meaning from true_labels in our current implementation.

ground_truth is equivalent to the golden set, i.e. the ground-truth data collection, while true_labels are just the labels - often one of the columns in the GT dataset table (we don't have the schema defined yet, but we will soon).
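Illustrative only, since no GT dataset schema is defined yet per the reply above; the column name here is hypothetical, just to show the distinction in code terms.

```python
# The golden set is the whole ground-truth collection...
golden_set = [
    {"query": "q1", "response": "r1", "expected_score": 1},
    {"query": "q2", "response": "r2", "expected_score": 0},
]
# ...while true_labels is just the label column pulled out of it.
true_labels = [row["expected_score"] for row in golden_set]
```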

output_space: Optional[str] = None
# TODO: support more parameters
# "use_verb_confidence": False,
# K should not be part of benchmark params b/c each set of benchmark params could have multiple set of K values for different metric aggregators
Contributor:

What is our line length limit? This seems very high. Should this be changed in our linter?

benchmark_params_dict: dict = self.benchmark_params.model_dump()
ret = feedback_fn(row["query"], row["response"], benchmark_params_dict)

# TODO: better define the shape of arguments of feedback_fn
Contributor:

Is this TODO still valid?

Contributor Author (sfc-gh-dhuang):

Yes, still valid - this is because the line above, feedback_fn(row["query"], row["response"], benchmark_params_dict), only handles feedback functions that take two string arguments and should be made more flexible. I'll come back to this as part of the GT dataset schema work.
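In other words, the harness currently assumes feedback functions with this shape; a hedged sketch where the function and parameter names are illustrative, not the actual API.

```python
def context_relevance_feedback(query: str, response: str, params: dict) -> float:
    """Hypothetical feedback function: two string args plus the benchmark params dict."""
    output_space = params.get("output_space")  # e.g. a likert_0_3-style range
    # A real implementation would call a provider here; return a placeholder score.
    return 0.0
```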

from trulens_eval.feedback.benchmark_frameworks.tru_benchmark_experiment import (
TruBenchmarkExperiment,
)

Contributor:

Is this a common Python practice, importing inside a function?

Contributor Author (sfc-gh-dhuang):

I'm not sure how common it is, but I've seen it in other codebases. It's done here to avoid a circular dependency, and I believe the same reason applies to several other occurrences in TruLens.
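A minimal sketch of the deferred-import pattern being described; the wrapper function is hypothetical, while the import path is the one shown in the diff above.

```python
def make_benchmark_experiment(*args, **kwargs):
    # Imported inside the function rather than at module top level so that
    # tru.py and tru_benchmark_experiment.py can reference each other without
    # a circular import at load time.
    from trulens_eval.feedback.benchmark_frameworks.tru_benchmark_experiment import (
        TruBenchmarkExperiment,
    )
    return TruBenchmarkExperiment(*args, **kwargs)
```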

@dosubot added the lgtm label ("This PR has been approved by a maintainer") on Aug 7, 2024
@sfc-gh-dhuang merged commit 58bc3f4 into main on Aug 7, 2024
9 checks passed
@sfc-gh-dhuang deleted the eval-calibration branch on August 7, 2024 at 01:24
Labels: dependencies (Pull requests that update a dependency file), lgtm (This PR has been approved by a maintainer), size:XXL (This PR changes 1000+ lines, ignoring generated files)
5 participants