Add to_list_by_hyphen_space processor #872

Merged: 3 commits into IBM:main on Jun 5, 2024

Conversation

marukaz (Contributor) commented May 30, 2024

close #871

I ran python -m unittest tests/library/test_postprocessors.py and it passed.

Signed-off-by: Kazuki Matsumaru <marukaz.jh@gmail.com>

elronbandel (Member) left a comment

I'm not sure about it, but I think it's easier to have all the logic defined in the prepare file.

add_to_catalog(
SequentialOperator(
steps=[
ToListByHyphenSpace(field="prediction", process_every_value=False),
Member

Suggested change:
- ToListByHyphenSpace(field="prediction", process_every_value=False),
+ SplitStrip(delimiter="- ", strip_every_element=True, process_every_value=False),

I think it is cleaner to have it like this rather than define a new class.
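For reference, here is a plain-Python sketch of what splitting on "- " and stripping each element yields. The helper split_strip is hypothetical and only mimics the idea of the suggestion; it is not the unitxt SplitStrip operator, whose handling of empty elements may differ.

```python
def split_strip(text, delimiter="- "):
    # Hypothetical stand-in for "split, then strip every element".
    # This is NOT the unitxt SplitStrip operator.
    return [part.strip() for part in text.split(delimiter)]


prediction = "- apple\n- banana\n- cherry"
items = split_strip(prediction)

# The text starts with the delimiter, so the first element is empty.
print(items)

# A "- " inside an item is also treated as a delimiter.
inner = split_strip("- red - green\n- blue")
print(inner)
```

Note that a plain split on "- " cannot tell a list marker from a hyphen followed by a space inside an item.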

Contributor Author

Thank you for the review! Yes, I agree, and in fact, when I personally used this processor, I implemented it in the same file. However, since the implementation of to_list_by_comma defined a class and separated the files, I followed that approach. I don't have a strong preference, so we could either consolidate this processor into one file, refactor to_list_by_comma, or keep it as is following the current implementation of to_list_by_comma. Which option do you prefer?

elronbandel (Member) commented

Also, wouldn't it be safer to split by "\n-"? What do you think, @marukaz?

marukaz (Contributor, Author) commented Jun 4, 2024

@elronbandel I think the leading hyphen will remain when splitting by '\n-'
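A quick check of this point with plain str.split (not the unitxt operator): the first item never follows a newline, so its hyphen survives, and the remaining items keep a leading space until they are stripped.

```python
prediction = "- apple\n- banana\n- cherry"

# Split on newline + hyphen, as proposed.
parts = prediction.split("\n-")
print(parts)

# Stripping whitespace does not remove the hyphen on the first item.
stripped = [p.strip() for p in parts]
print(stripped)
```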

elronbandel (Member) commented Jun 5, 2024

So just to be on the safe side how about:

from unitxt.string_operators import RegexSplit

RegexSplit(by="(?<=^|\n)- ")

This splits on "- " only when it appears at the beginning of the string or right after "\n".
@marukaz, if you don't want to go there, we can merge it as is.
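A sketch of the intended behavior using only Python's standard re module. Two assumptions: that the splitting ultimately goes through Python's re engine, and that a non-capturing group is an acceptable substitute for the lookbehind. Python's re rejects a lookbehind whose alternatives differ in width (^ is zero-width, "\n" is one character), so the sketch uses (?:^|\n) instead. The function name split_hyphen_list is hypothetical and this is not necessarily what the PR merged.

```python
import re

# A lookbehind with alternatives of different widths does not compile in re.
try:
    re.compile("(?<=^|\n)- ")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False


def split_hyphen_list(text):
    # Hypothetical helper: split on "- " only at the start of the
    # string or right after a newline, then drop empty pieces.
    pattern = r"(?:^|\n)- "
    return [piece for piece in re.split(pattern, text) if piece]


items = split_hyphen_list("- apple\n- banana\n- cherry")
print(items)

# A "- " inside an item is left alone.
inner = split_hyphen_list("- red - green\n- blue")
print(inner)
```

The non-capturing group consumes the newline together with the marker, so the resulting items need no extra stripping in this simple case.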

marukaz (Contributor, Author) commented Jun 5, 2024

That sounds better! I have updated the code accordingly. Also, I followed your comment and removed the class, consolidating everything into the same file.

marukaz requested a review from elronbandel June 5, 2024 09:37

elronbandel (Member) left a comment

Neat!

elronbandel enabled auto-merge (squash) June 5, 2024 18:50
elronbandel merged commit b3b0187 into IBM:main Jun 5, 2024
7 checks passed
Development

Successfully merging this pull request may close these issues.

Convert outputs to list by hyphen with a space ("- ")
2 participants