initial creation of ManipulatePreds class + tests #184

franknovak · 2024-01-22T21:51:19Z

No description provided.

Scott771 · 2024-01-31T15:59:38Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+
+class ManipulatePreds:
+    """
+    Class to help expand prediction text and indexing via OCR data.


Can you add an example usage here?

Scott771 · 2024-01-31T16:11:11Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+
+    def expand_predictions(self, pred_start: int, pred_end: int) -> dict:
+        """
+        Expand predictions and boundaries to match that of OCR data.


Provide an example here of what you mean, e.g. 100 -> $100.00 if $100.00 is the full token.

Scott771 · 2024-01-31T16:11:42Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+
+        # Find index value of bounded prediction
+        for index, pred in enumerate(self.predictions):
+            if pred["start"] == pred_start or pred["end"] == pred_end:


I wonder whether you should be looking for any overlap here rather than an exact match on start/end. I believe there is already a helper function for this somewhere.
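To illustrate the overlap idea, here is a minimal sketch. It is not the existing helper mentioned above; `spans_overlap` and `find_overlapping_pred_indexes` are hypothetical names, and it assumes predictions are dicts with integer `start`/`end` keys where `end` is exclusive.

```python
from typing import List


def spans_overlap(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    """Return True if half-open spans [start_a, end_a) and [start_b, end_b) share a character."""
    return start_a < end_b and start_b < end_a


def find_overlapping_pred_indexes(predictions: List[dict], start: int, end: int) -> List[int]:
    """Return the indexes of all predictions that overlap [start, end).

    Hypothetical helper: assumes each prediction has "start" and "end" keys.
    """
    return [
        index
        for index, pred in enumerate(predictions)
        if spans_overlap(pred["start"], pred["end"], start, end)
    ]


preds = [{"start": 0, "end": 5}, {"start": 10, "end": 20}]
matches = find_overlapping_pred_indexes(preds, 12, 15)
```

Unlike the exact match on `start` or `end`, this also finds a prediction when the supplied boundaries fall strictly inside it.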

Scott771 · 2024-01-31T16:14:03Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+        self.ocr_tokens = ocr_tokens
+        self.predictions = preds
+
+    def expand_predictions(self, pred_start: int, pred_end: int) -> dict:


I'm not sure this is how I'd imagine using this function. Wouldn't you have a specific prediction in mind where you want this to run? The function could be static: it takes a list of ocr_tokens and a pred, expands the pred, and returns it (expanded if that's relevant, updating the start/end indexes as well as the text value). Let me know what your rationale was for passing pred start/end instead; I might be missing something.

"alks Sco" -> ["talks", "Scott"] -> "talks Scott"

def expand_pred(some_pred: dict, tokens: List[dict]) -> expanded_pred (dict)

-> dict, bool (bool = True if updated)
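A rough sketch of what that static version could look like, under several assumptions: tokens are simplified to dicts with `text`, `start`, and `end` keys (the real OCR token layout may differ), `end` is exclusive, and the expanded text is rebuilt by joining token texts with a single space, as in the "talks Scott" example. None of this is taken from the PR's implementation.

```python
from typing import List, Tuple


def expand_pred(some_pred: dict, tokens: List[dict]) -> Tuple[dict, bool]:
    """Snap a prediction to the boundaries of the OCR tokens it overlaps.

    Returns a new prediction dict and a flag that is True if anything changed.
    Assumed (hypothetical) shapes: pred and tokens carry "text", "start", "end".
    """
    overlapping = sorted(
        (
            token
            for token in tokens
            if token["start"] < some_pred["end"] and some_pred["start"] < token["end"]
        ),
        key=lambda token: token["start"],
    )
    if not overlapping:
        # Nothing to snap to, so return an unchanged copy.
        return dict(some_pred), False

    expanded = dict(some_pred)
    expanded["start"] = overlapping[0]["start"]
    expanded["end"] = max(token["end"] for token in overlapping)
    # Assumption: tokens are separated by a single space in the output text.
    expanded["text"] = " ".join(token["text"] for token in overlapping)
    return expanded, expanded != some_pred


tokens = [
    {"text": "he", "start": 0, "end": 2},
    {"text": "talks", "start": 3, "end": 8},
    {"text": "Scott", "start": 9, "end": 14},
]
pred = {"label": "Name", "text": "alks Sco", "start": 4, "end": 12}
expanded_pred, updated = expand_pred(pred, tokens)
```

Returning a copy rather than mutating the input keeps the function usable without instantiating a class around all predictions.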

Scott771 · 2024-01-31T16:15:18Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+        pred_index = None
+
+        # Find index value of bounded prediction
+        for index, pred in enumerate(self.predictions):


I think you could eliminate this. I'm not sure how you'd have the pred start/end in the first place without already knowing which pred you want to operate on.

Scott771 · 2024-01-31T16:18:51Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+            return self.predictions[pred_index]
+
+        # Use overlapping boundaries and expand text / boundaries to match OCR data
+        for token in self.ocr_tokens:


You might be able to reuse some token matching functionality from other classes and then expand from there.

E.g. see if you can reuse this class (or, if it needs some small tweaks, include those): https://github.com/IndicoDataSolutions/Indico-Solutions-Toolkit/blob/main/indico_toolkit/association/extracted_tokens.py

Scott771 · 2024-01-31T16:22:48Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+
+        expanded_text = self._get_ocr_text(expanded_start, expanded_end)
+        if expanded_text != ocr_text_initial:
+            raise ValueError("Expanded text does not match the OCR text.")


How could this be the case?

I would think that if you match, then you're good.

Scott771 · 2024-01-31T16:27:15Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+
+        if expanded_text == ocr_text_initial:
+            # Update prediction
+            self.predictions[pred_index]["start"] = expanded_start


If we instantiate this class with all of the preds / tokens, then maybe this method would make more sense operating against a particular label (i.e. expand for all "Insured Name")? To me, it probably makes more sense to have this be a static method as described above.

Scott771 · 2024-01-31T16:28:14Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+        if expanded_text != ocr_text_initial:
+            raise ValueError("Expanded text does not match the OCR text.")
+
+        if expanded_text == ocr_text_initial:


You don't need this check given the condition above (I would always assume that these have to be equal given that you set them).

Scott771 · 2024-01-31T16:29:36Z

indico_toolkit/indico_wrapper/pred_manipulation.py

+        return self.predictions[pred_index]
+
+    def is_token_nearby(
+        self, ocr_start: int, ocr_end: int, search_tokens: List[str], distance: int


I'm not sure I understand what distance means here or when this function would be useful. I also don't like the idea of entering ocr_start and ocr_end (how would you know what those values should be?).

I think the more relevant functionality would be "is SOME_TEXT contained within a token within X distance".
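One possible reading of that, sketched under assumptions: "distance" is taken to mean the character-offset gap between the prediction and a token, tokens are simplified to dicts with `text`, `start`, and `end` keys, and the match is a case-insensitive substring check. `is_text_nearby` is a hypothetical name, not the method in the PR.

```python
from typing import List


def is_text_nearby(pred: dict, tokens: List[dict], search_text: str, distance: int) -> bool:
    """Return True if search_text appears inside any token within `distance` characters of pred.

    Assumption: distance is measured in document character offsets, and is 0
    for tokens that overlap the prediction.
    """
    needle = search_text.lower()
    for token in tokens:
        # Gap between the two spans; negative values mean overlap, clamped to 0.
        gap = max(token["start"] - pred["end"], pred["start"] - token["end"], 0)
        if gap <= distance and needle in token["text"].lower():
            return True
    return False


tokens = [
    {"text": "Total:", "start": 0, "end": 6},
    {"text": "$100.00", "start": 7, "end": 14},
    {"text": "Date", "start": 50, "end": 54},
]
pred = {"text": "$100.00", "start": 7, "end": 14}
total_is_close = is_text_nearby(pred, tokens, "total", distance=5)
date_is_close = is_text_nearby(pred, tokens, "date", distance=5)
```

Taking the prediction itself as input avoids asking the caller for raw ocr_start/ocr_end values. If distance should instead mean a number of tokens or a physical position on the page, the gap calculation would need to change accordingly.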

initial creation of ManipulatePreds class + tests

f536af5

franknovak requested review from mawelborn, nickesparza and Scott771 January 22, 2024 21:51

Scott771 reviewed Jan 31, 2024
