Enable regex to extract floats in score generation #1223
Conversation
092ea37 to 6d55f0a Compare
```diff
 vals = set()
 for match in matches:
     try:
-        vals.add(validate_rating(int(match)))
+        vals.add(
+            validate_rating(int(float(match)))
```
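For context, a minimal sketch of why the conversion goes through `float()` first; `parse_score` is a hypothetical helper for illustration, not the library's code:

```python
# Hypothetical helper illustrating the change above, not the trulens implementation.
# With a float-capable regex, a match like "10.0" can reach this code, and
# int("10.0") raises ValueError; converting through float() first accepts
# both "10" and "10.0".
def parse_score(match: str) -> int:
    return int(float(match))

assert parse_score("10") == 10
assert parse_score("10.0") == 10
```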
kinda late to the game, but:
- Why are we constrained to ints?
- If we are constrained to ints, shouldn't we round it instead of flooring it?
- The doc on L54 says "If the string does not match an integer ... raises an error ...", but this won't do that.
- No particular reason beyond easier interpretability for the end users, AFAIK.
- Makes sense - I can make a change to this (see the sketch below).
- Same as 1. - I do recall we went back and forth a bit on this and the doc just became outdated because of my change. Will update.
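A quick sketch of the floor-versus-round difference being discussed (the "7.6" value is only an illustration):

```python
match = "7.6"               # hypothetical model output before normalization
print(int(float(match)))    # 7 - int() truncates toward zero
print(round(float(match)))  # 8 - round() picks the nearest integer
```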
#1244 fix PR
* match floats and integers for score generation
* bb
* updated expected test cases
* minor fix
Items to add to release announcement:
During benchmarking of various feedback providers, I noticed there are models (e.g. a finetuned `mixtral-8x7b`) that tend to give `10.0` instead of `10` in their feedback scoring before normalization. In the current implementation, `PATTERN_INTEGER` will extract `0` and `10` from `10.0` and eventually pick the lesser value.

Failing example when testing groundedness feedback functions - where the score accompanying the CoT was interpreted as `0` instead of the expected `10`:

I'm switching to `PATTERN_NUMBER` to unblock for now.

Other details that are good to know but need not be announced:

`PATTERN_INTEGER` was used over `PATTERN_NUMBER` in the previous PR. cc @sfc-gh-pmardziel to add more background if I'm missing something obvious.

I do believe we should move toward structured and systematic feedback score generation mechanisms with some self-refining prompt iterations (e.g. via DSPy) ASAP for more robust score generation, ideally before integrating with the monitoring stack, even at the cost of slightly higher token usage/cost/latency (which can also be alleviated via better prompts and instruction tuning).
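A minimal sketch of the behavior described above, with illustrative patterns only (the real `PATTERN_INTEGER` / `PATTERN_NUMBER` definitions in trulens may differ):

```python
import re

# Illustrative patterns, not the library's exact definitions.
PATTERN_INTEGER = re.compile(r"\d+")
PATTERN_NUMBER = re.compile(r"\d+(?:\.\d+)?")

text = "Score: 10.0"
print(PATTERN_INTEGER.findall(text))  # ['10', '0'] -> the lesser value, 0, can win
print(PATTERN_NUMBER.findall(text))   # ['10.0']    -> parsed as the intended 10
```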