[SNOW-1531256] Benchmarking - meta-evaluation as app + v1 prompt param + aggregator metrics #1282
Conversation
        user_prompt=user_prompt,
        temperature=temperature
    )


def context_relevance_verb_confidence(
Would be good to have both `_verb_confidence` and `_cot_reasons` as feedback parameters instead of split across different functions. I see users wanting to use both at the same time. Also, this lets us maintain fewer methods and reduces user paralysis over which one to use.
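For illustration, a minimal sketch of what that combined surface might look like (the function name, parameter names, and placeholder values here are assumptions, not the actual TruLens API):

```python
from typing import Dict, Tuple, Union


def context_relevance(
    question: str,
    context: str,
    use_verb_confidence: bool = False,  # would replace the _verb_confidence variant
    with_cot_reasons: bool = False,     # would replace the _cot_reasons variant
) -> Union[float, Tuple[float, Dict]]:
    """Single entry point: the variants become parameters instead of separate methods."""
    score = 0.8  # placeholder for the LLM-produced rating
    extras: Dict = {}
    if use_verb_confidence:
        extras["confidence"] = 0.9  # placeholder verbalized confidence
    if with_cot_reasons:
        extras["reasons"] = "..."  # placeholder chain-of-thought explanation
    return (score, extras) if extras else score
```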
Will do this in the following PR, where I'm planning to continue / pick up the feedback v2 re-design work.
Since this is planned to be a short-lived method, should we mark it as internal (_ prefix) or experimental?
@@ -168,15 +172,17 @@ def agreement_measure(
         prompt, response, ground_truth_response
     )
     ret = (
-        re_0_10_rating(agreement_txt) / 10,
+        re_configured_rating(agreement_txt) / 3,
I'm guessing `re_configured_rating` is missing `max_value=3` here.
`re_configured_rating` defaults to likert_0_3 now, but I've added the explicit `max_value=3` argument to be clearer (and in a few other places where `re_configured_rating` is currently used as well).
And just to add that this is in an incomplete state; I've added near-term action items in the PR description, including mapping `output_space` to the score range.
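A minimal stand-in sketch of the explicit-call pattern described above (the parser body and exact parameter names are assumptions; the real `re_configured_rating` in trulens_eval may differ):

```python
import re


def re_configured_rating(text: str, min_value: int = 0, max_value: int = 3) -> int:
    """Stand-in parser: return the first integer in [min_value, max_value] found in text."""
    for token in re.findall(r"\d+", text):
        val = int(token)
        if min_value <= val <= max_value:
            return val
    raise ValueError(f"no rating in [{min_value}, {max_value}] found in {text!r}")


# Passing max_value=3 explicitly, then normalizing to [0, 1] as in the diff above.
agreement_txt = "Agreement score: 2 out of 3"
score = re_configured_rating(agreement_txt, max_value=3) / 3
print(score)  # ~0.667
```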
    for match in matches:
        try:
            vals.add(
                validate_rating(
Given that `validate_rating` seems to be used only here, it should probably just return a bool that you check before adding to `vals`. This `try`/`except` approach is prone to breaking if anyone puts code in the `try` block that can throw its own `ValueError`. It's also much slower (I realize that's not a big deal here), and personally I find it harder to read (though that's subjective haha).
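Roughly the shape of the suggested refactor (a sketch with assumed names like `is_valid_rating`, not the actual code):

```python
from typing import Iterable, Set


def is_valid_rating(val: int, min_value: int = 0, max_value: int = 3) -> bool:
    """Return a bool instead of raising, so callers can branch without try/except."""
    return min_value <= val <= max_value


def collect_ratings(matches: Iterable[str]) -> Set[int]:
    vals: Set[int] = set()
    for match in matches:
        val = int(match)
        if is_valid_rating(val):  # plain branch instead of catching ValueError
            vals.add(val)
    return vals


print(collect_ratings(["2", "7", "3"]))  # {2, 3}
```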
sg - removed the try / except block and simplified
" app_id=\"MAE\",\n", | ||
" ground_truth=golden_set,\n", | ||
" trace_to_score_fn=context_relevance_ff_to_score,\n", | ||
" agg_funcs=[mae_agg_func],\n", |
I'm confused by the SDK. We know `mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae`, but why do we have to pass the true labels both in `mae_agg_func` and in the `ground_truth` arg to `tru.BenchmarkExperiment`? It seems kinda redundant.
removed redundant argument in this commit
    )
    """Aggregate benchmarking metrics for ground-truth-based evaluation on feedback functions."""

    true_labels: List[int]
Can you align this arg name with the one in the GroundTruth class? I'm okay with either (`ground_truth` or `true_labels`).
I've addressed the redundant `ground_truth` attribute in TruBenchmarkExperiment; this commit handles it.
But to clarify, `ground_truth` has a different meaning than `true_labels` in our current implementation: `ground_truth` is equivalent to the golden set or ground-truth data collection, while `true_labels` are just the labels, often one of the columns in the GT dataset table (we don't have the schema defined yet, but we will soon).
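To make the distinction concrete, a small sketch (field names are assumptions, since the GT dataset schema isn't defined yet):

```python
from typing import Dict, List

# ground_truth: the whole golden-set collection (queries, expected responses, labels, ...).
ground_truth: List[Dict] = [
    {"query": "q1", "expected_response": "a1", "expected_score": 1},
    {"query": "q2", "expected_response": "a2", "expected_score": 0},
]

# true_labels: a single column of that collection - the values an aggregator compares against.
true_labels: List[int] = [row["expected_score"] for row in ground_truth]
print(true_labels)  # [1, 0]
```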
    output_space: Optional[str] = None
    # TODO: support more parameters
    # "use_verb_confidence": False,
    # K should not be part of benchmark params b/c each set of benchmark params could have multiple sets of K values for different metric aggregators
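The comment about K in the snippet above amounts to something like this (a hedged sketch with made-up class names, not the PR's actual data model): one set of benchmark params can be paired with several aggregators, each carrying its own K.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkParams:
    """Parameters describing how feedback scores are produced -- no K in here."""
    temperature: float = 0.0
    output_space: Optional[str] = None


@dataclass
class MetricAggregator:
    """Each aggregator carries its own K, so one set of benchmark params can be
    evaluated against several K values."""
    name: str
    k: Optional[int] = None


params = BenchmarkParams(temperature=0.0, output_space="likert_0_3")
aggregators = [
    MetricAggregator("mae"),
    MetricAggregator("ndcg", k=5),
    MetricAggregator("ndcg", k=10),  # same params, different K
]
```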
What is our line length limit? This seems very high. Should this be changed in our linter?
        benchmark_params_dict: dict = self.benchmark_params.model_dump()
        ret = feedback_fn(row["query"], row["response"], benchmark_params_dict)

        # TODO: better define the shape of arguments of feedback_fn
Is this TODO still valid?
Yes, still valid. This is because the line above, `feedback_fn(row["query"], row["response"], benchmark_params_dict)`, only handles feedback functions that take two string arguments and should be made more flexible. Will come back to this in the GT dataset schema work.
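One hedged sketch of how the call site could become more flexible (none of this is in the PR; `call_feedback_fn` and the field names are illustrative):

```python
import inspect
from typing import Any, Callable, Dict


def call_feedback_fn(
    feedback_fn: Callable[..., Any],
    row: Dict[str, Any],
    benchmark_params: Dict[str, Any],
) -> Any:
    """Pass only the row fields the feedback function declares, instead of
    hard-coding the (query, response, params) positional call."""
    sig = inspect.signature(feedback_fn)
    kwargs = {name: row[name] for name in sig.parameters if name in row}
    if "benchmark_params" in sig.parameters:
        kwargs["benchmark_params"] = benchmark_params
    return feedback_fn(**kwargs)


# Still works for the current two-string case.
def context_relevance(query: str, response: str, benchmark_params: Dict) -> float:
    return 0.5  # placeholder score


row = {"query": "q", "response": "r", "expected_score": 1.0}
print(call_feedback_fn(context_relevance, row, {"temperature": 0.0}))  # 0.5
```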
        from trulens_eval.feedback.benchmark_frameworks.tru_benchmark_experiment import (
            TruBenchmarkExperiment,
        )
Is this a common Python practice, to import inside a function?
I'm not sure how common it is, but I've seen it in other code bases. It's done here to avoid a circular dependency, and I believe that's the same reason for several other occurrences in TruLens.
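For reference, the pattern being discussed, as a two-module sketch (module paths and class wiring here are made up to show the shape of the fix, not the actual TruLens layout):

```python
# --- pkg/tru.py (illustrative) ----------------------------------------------
# A top-level `from pkg.benchmark import TruBenchmarkExperiment` here would
# complete an import cycle, because pkg/benchmark.py imports from pkg/tru.py.

class Tru:
    def BenchmarkExperiment(self, **kwargs):
        # Deferred import: resolved only when the method runs, after both
        # modules have finished loading, so the cycle never bites at import time.
        from pkg.benchmark import TruBenchmarkExperiment
        return TruBenchmarkExperiment(tru=self, **kwargs)


# --- pkg/benchmark.py (illustrative) ----------------------------------------
# from pkg.tru import Tru  # this top-level import is the other edge of the cycle

class TruBenchmarkExperiment:
    def __init__(self, tru, **kwargs):
        self.tru = tru
        self.kwargs = kwargs
```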
jira: https://snowflakecomputing.atlassian.net/browse/SNOW-1531256
Changes made in this PR:

- Added `eval criteria` and `output space` as v1 prompt params. There'll be followup PRs to further refactor / continue the work on the V2 feedback directory and add few-shot examples as a param to the prompt.
- Note: this was using the old `eval_as_recommendation` script, and I'll have a separate PR to run more experiments using the new TruBenchmarkExperiment construct.

Near-term future work:

- Map `output_space` to the score range (`min_score_val`, `max_score_val`).
- Update `context_relevance_benchmark_calibration.ipynb` using the new TruBenchmark app to test the correctness of various metrics.

Type of change