Add to_list_by_hyphen_space processor #872

marukaz · 2024-05-30T06:42:36Z

close #871

I ran python -m unittest tests/library/test_postprocessors.py and it passed.

Signed-off-by: Kazuki Matsumaru <marukaz.jh@gmail.com>

elronbandel

I'm not sure about it, but I think it's easier to have all the logic defined in the prepare file.

elronbandel · 2024-06-03T09:26:50Z

prepare/processors/to_list_by_hyphen.py

+add_to_catalog(
+    SequentialOperator(
+        steps=[
+            ToListByHyphenSpace(field="prediction", process_every_value=False),


Suggested change

-            ToListByHyphenSpace(field="prediction", process_every_value=False),
+            SplitStrip(delimiter="- ", strip_every_element=True, process_every_value=False),

I think it is cleaner to have it like this rather than defining a new class.
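
To illustrate what the suggested step would do to a hyphen-bulleted prediction, here is a minimal sketch in plain Python. It assumes SplitStrip behaves like str.split followed by str.strip on each element; the helper name split_strip is hypothetical and is not part of unitxt.

```python
from typing import List


def split_strip(text: str, delimiter: str = "- ") -> List[str]:
    # Hypothetical stand-in for the suggested step (an assumption, not unitxt code):
    # split on the delimiter, then strip whitespace from every piece.
    return [piece.strip() for piece in text.split(delimiter)]


prediction = "- apple\n- banana\n- cherry"
items = split_strip(prediction)
print(items)  # ['', 'apple', 'banana', 'cherry']
```

Under this assumption the result starts with an empty string, because the text begins with the delimiter itself.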

marukaz · Jun 3, 2024

Thank you for the review! Yes, I agree, and in fact, when I personally used this processor, I implemented it in the same file. However, since the implementation of to_list_by_comma defined a class and separated the files, I followed that approach. I don't have a strong preference, so we could consolidate this processor into one file, refactor to_list_by_comma, or keep it as is, following the current implementation of to_list_by_comma. Which option do you prefer?

elronbandel · 2024-06-04T07:12:43Z

Also, wouldn't it be safer to split by "\n-"? What do you think @marukaz?

marukaz · 2024-06-04T07:32:22Z

@elronbandel I think the leading hyphen will remain when splitting by '\n-'
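
The point about the leading hyphen can be checked with plain str.split. This is only an illustrative sketch on a made-up prediction string, not code from the PR.

```python
prediction = "- apple\n- banana\n- cherry"

# Splitting on "\n-" only removes hyphens that follow a newline,
# so the hyphen at the very start of the string is left in place.
parts = prediction.split("\n-")
print(parts)  # ['- apple', ' banana', ' cherry']
```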

elronbandel · 2024-06-05T07:03:34Z

So just to be on the safe side how about:

from unitxt.string_operators import RegexSplit

RegexSplit(by="(?<=^|\n)- ")

This will split by "- " only when it is at the beginning of the string or right after "\n".
@marukaz if you don't want to go there, we can merge it as it is.
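
The intended behavior can be sketched with Python's standard re module, assuming RegexSplit does something similar to re.split (an assumption, not confirmed here). One caveat: Python's re requires lookbehind assertions to be fixed-width, so the lookbehind form quoted above may be rejected at compile time. The sketch checks this rather than assuming it, and uses a non-capturing group as one possible alternative. The sample prediction is made up.

```python
import re

prediction = "- well- known fruit\n- banana\n- cherry"

# Check whether the lookbehind form compiles in this Python version.
try:
    re.compile("(?<=^|\n)- ")
    lookbehind_compiles = True
except re.error:
    lookbehind_compiles = False

# Alternative sketch: a non-capturing group that matches "- " only at the
# start of the string or right after a newline, consuming the newline too.
pattern = re.compile("(?:^|\n)- ")
items = pattern.split(prediction)
print(items)  # ['', 'well- known fruit', 'banana', 'cherry']
```

The "- " inside "well- known" is left alone because it is neither at the start of the string nor preceded by a newline. The leading empty string would still need to be filtered out in a later step.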

marukaz · 2024-06-05T09:37:37Z

That sounds better! I have updated the code accordingly. Also, I followed your comment and removed the class, consolidating everything into the same file.

elronbandel

Neat!

Add to_list_by_hyphen_space processor

ddd87ff

Signed-off-by: Kazuki Matsumaru <marukaz.jh@gmail.com>

elronbandel reviewed Jun 3, 2024

Use RegexSplit to make it robust

6762b35

marukaz requested a review from elronbandel June 5, 2024 09:37

elronbandel approved these changes Jun 5, 2024

elronbandel enabled auto-merge (squash) June 5, 2024 18:50

Merge branch 'main' into main

57f19f3

elronbandel merged commit b3b0187 into IBM:main Jun 5, 2024
7 checks passed
