Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial creation of ManipulatePreds class + tests #184

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

franknovak
Copy link
Contributor

No description provided.


class ManipulatePreds:
"""
Class to help expand prediction text and indexing via OCR data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example usage here?


def expand_predictions(self, pred_start: int, pred_end: int) -> dict:
"""
Expand predictions and boundaries to match that of OCR data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provide an example here of what you mean 100 -> $100.00 if $100.00 is the full token


# Find index value of bounded prediction
for index, pred in enumerate(self.predictions):
if pred["start"] == pred_start or pred["end"] == pred_end:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder here whether you should be looking for any overlap rather than an exact match on start/end-- I believe there is already a helper function for this somewhere

self.ocr_tokens = ocr_tokens
self.predictions = preds

def expand_predictions(self, pred_start: int, pred_end: int) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is how I'd imagine using this function-- wouldn't you have a specific prediction in mind where you want this to run? So the function could be static and take list of ocr_tokens and pred and then expands then returns the pred (expanded if that's relevant-- updating the start/end indexes as well as the text value)? Let me know what your rationale was for pred start/end instead I might be missing something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"alks Sco"-- > ["talks", "Scott"] - > "talks Scott"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def expand_pred(some_pred: dict, tokens: List[dict]) -> expanded_pred (dict)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> dict, bool (bool = True if updated)

pred_index = None

# Find index value of bounded prediction
for index, pred in enumerate(self.predictions):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could eliminate this? I'm not sure how you'd have pred start/end in the first place without already knowing the pred you want to operate on

return self.predictions[pred_index]

# Use overlapping boundaries and expand text / boundaries to match OCR data
for token in self.ocr_tokens:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might be able to reuse some token matching functionality from other classes-- and then expand from there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. see if you can reuse this class (or if it needs some small tweaks, could include that?) https://github.com/IndicoDataSolutions/Indico-Solutions-Toolkit/blob/main/indico_toolkit/association/extracted_tokens.py


expanded_text = self._get_ocr_text(expanded_start, expanded_end)
if expanded_text != ocr_text_initial:
raise ValueError("Expanded text does not match the OCR text.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how could this be the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would feel like if you match, then you're good


if expanded_text == ocr_text_initial:
# Update prediction
self.predictions[pred_index]["start"] = expanded_start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we instantiate this class with all of the preds / tokens, then maybe this method woul dmake more sense operating against a particular label? (i.e. expand for all "Insured Name")-- to me, probably makes more sense to have this be a static method as described above-

if expanded_text != ocr_text_initial:
raise ValueError("Expanded text does not match the OCR text.")

if expanded_text == ocr_text_initial:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need this check given condition above (I would always assume that these have to be equal given that you set them)

return self.predictions[pred_index]

def is_token_nearby(
self, ocr_start: int, ocr_end: int, search_tokens: List[str], distance: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what distance means here or when this function would be useful (also don't like the idea of entering in the ocr_start ocr_end (how would you know what those values should be?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the more relevant functionality would be "is SOME_TEXT contained within a token within X distance"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants