Add to_list_by_hyphen_space processor #872
Signed-off-by: Kazuki Matsumaru <marukaz.jh@gmail.com>
I'm not sure about this, but I think it's easier to have all the logic defined in the prepare file.
add_to_catalog(
    SequentialOperator(
        steps=[
            ToListByHyphenSpace(field="prediction", process_every_value=False),
- ToListByHyphenSpace(field="prediction", process_every_value=False),
+ SplitStrip(delimiter="- ", strip_every_element=True, process_every_value=False),
I think it is cleaner to have it like this rather than define a new class.
Thank you for the review! Yes, I agree, and in fact, when I personally used this processor, I implemented it in the same file. However, since the implementation of to_list_by_comma defined a class and separated the files, I followed that approach. I don't have a strong preference, so we could either consolidate this processor into one file, refactor to_list_by_comma, or keep it as is following the current implementation of to_list_by_comma. Which option do you prefer?
Also, wouldn't it be safer to split by "\n-"? What do you think, @marukaz?
@elronbandel I think the leading hyphen will remain when splitting by '\n-'.
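A quick way to check this point, using plain `str.split` from the standard library rather than any unitxt operator; the `prediction` string is a made-up example:

```python
# Hypothetical model output formatted as a hyphen list.
prediction = "- apple\n- banana\n- cherry"

# Splitting on "\n-" only consumes hyphens that follow a newline,
# so the hyphen at the very start of the string stays in the first item.
parts = prediction.split("\n-")
print(parts)  # ['- apple', ' banana', ' cherry']
```

The first element keeps its "- " prefix, and the remaining elements still need stripping.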
So just to be on the safe side, how about:
from unitxt.string_operators import RegexSplit
RegexSplit(by="(?<=^|\n)- ")
Which will split by either
That sounds better! I have updated the code accordingly. Also, I followed your comment and removed the class, consolidating everything into the same file.
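For reference, a minimal sketch of the splitting behaviour discussed here, written against the standard `re` module only. It is not the code merged in this PR and does not use `RegexSplit`; the helper name `split_hyphen_list` is hypothetical. Python's `re` accepts only fixed-width look-behind, so the sketch uses a non-capturing group `(?:^|\n)` instead of a look-behind:

```python
import re


def split_hyphen_list(text: str) -> list[str]:
    """Split a hyphen-bulleted string into items (illustrative helper)."""
    # Match "- " only at the start of the string or right after a newline,
    # so a "- " in the middle of an item is left alone.
    parts = re.split(r"(?:^|\n)- ", text)
    # Strip whitespace and drop the empty piece before the first bullet.
    return [p.strip() for p in parts if p.strip()]


prediction = "- apple\n- state - of the art\n- well-known item"

print(split_hyphen_list(prediction))
# ['apple', 'state - of the art', 'well-known item']

# For comparison, a plain split on "- " also breaks inside the second item.
print([p.strip() for p in prediction.split("- ") if p.strip()])
# ['apple', 'state', 'of the art', 'well-known item']
```

Anchoring the delimiter to line starts is what keeps hyphens inside an item from being treated as separators.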
Neat!
close #871
I ran
python -m unittest tests/library/test_postprocessors.py
and it passed.