Exception thrown by coref processor #1397

Open
tseanard opened this issue Jun 21, 2024 · 1 comment

Comments

tseanard commented Jun 21, 2024

end_word = word_pos[span[1]]

In certain cases, the line linked above throws an exception and crashes the coreference processor. Because the exception is unhandled, the method returns no document object. It took me a while to find the root of this issue, and I don't know the inner workings of stanza well enough to write a robust fix that doesn't create issues somewhere else.

I discovered the issue while doing a naïve character-based split of a long section of text (130,000 characters), breaking it into chunks of 2k to 10k characters. I understand that blocks of text split within sentences, and sometimes even in the middle of a word, are not a use case that coreference resolution should necessarily be able to handle, but being new to stanza, it was not clear to me that this was what caused the issue.
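
For reference, a minimal sketch of a slightly less naïve split that breaks only at whitespace, so that no chunk ends in the middle of a word. The function name `chunk_on_whitespace` and the `max_chars` parameter are hypothetical and not part of stanza. This does not avoid the crash on its own, since a chunk can still end without punctuation.

```python
def chunk_on_whitespace(text: str, max_chars: int = 10_000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at whitespace.

    Hypothetical helper for illustration; not part of stanza.
    A single word longer than max_chars is still cut mid-word.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    n = len(text)
    start = 0
    while True:
        # Skip whitespace so every chunk starts on a word.
        while start < n and text[start].isspace():
            start += 1
        if start >= n:
            break
        end = min(start + max_chars, n)
        # If the window would end inside a word, back up to the last whitespace.
        if end < n and not text[end].isspace():
            cut = end
            while cut > start and not text[cut - 1].isspace():
                cut -= 1
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        start = end
    return chunks
```
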

This is the for loop that crashes (specifically end_word = word_pos[span[1]])

            for span_idx, span in enumerate(span_cluster):
                sent_id = sent_ids[span[0]]
                sentence = sentences[sent_id]
                start_word = word_pos[span[0]]
                end_word = word_pos[span[1]]
                # very UD specific test for most number of proper nouns in a mention
                # will do nothing if POS is not active (they will all be None)
                num_propn = sum(word.pos == 'PROPN' for word in sentence.words[start_word:end_word])

                if ((span[1] - span[0] > max_len) or
                    span[1] - span[0] == max_len and num_propn > max_propn):
                    max_len = span[1] - span[0]
                    best_span = span_idx
                    max_propn = num_propn

Conditions to reproduce the issue:

  • The block of text ends with a word that is part of a coreference span
  • The text has no punctuation at the end

Workaround:

  • Add any punctuation to the end of the text if the error is thrown
  • I tested period, comma, exclamation mark, question mark, and colon, and all of them worked
  • Appending a newline (\n) or an extra space did not help as a workaround.
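
The workaround above can be sketched as a small retry wrapper. This is only an illustration: `process_with_punctuation_retry` is a hypothetical helper, `process` stands for any callable that takes text and returns a document (for example a stanza pipeline object), and the broad `except Exception` is an assumption because the exact exception type is not stated here.

```python
import string
from typing import Callable, TypeVar

T = TypeVar("T")


def process_with_punctuation_retry(
    process: Callable[[str], T], text: str, fallback: str = "."
) -> T:
    """Run process(text); on failure, retry once with punctuation appended.

    Hypothetical helper, not part of stanza. The exception type raised by
    the coref processor is an assumption, so any Exception triggers a retry.
    """
    try:
        return process(text)
    except Exception:
        stripped = text.rstrip()
        # Nothing to fix if the text is empty or already ends in punctuation.
        if not stripped or stripped[-1] in string.punctuation:
            raise
        return process(stripped + fallback)
```

Note that the retry changes the input text, so the returned document contains one extra token and its character offsets cover the modified string.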

Use case where this is relevant:

  • I am processing massive amounts of text collected using OCR, so punctuation is sometimes missed or misread by the OCR.

Example text that causes the issue: "Sometimes people are part of the problem, and sometimes they are the solution to it"
Updated text that resolves the issue: "Sometimes people are part of the problem, and sometimes they are the solution to it."
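
A minimal reproduction sketch using the two strings above. The pipeline configuration (`"en"` with `processors="tokenize,coref"`) is an assumption, since the exact setup is not given here, and the models must already be downloaded. The names `build_pipeline` and `reproduce` are hypothetical.

```python
FAILING_TEXT = (
    "Sometimes people are part of the problem, "
    "and sometimes they are the solution to it"
)
WORKING_TEXT = FAILING_TEXT + "."


def build_pipeline():
    # Imported lazily so the constants can be used without stanza installed.
    import stanza

    # Assumed configuration; the original report does not state the processors.
    return stanza.Pipeline("en", processors="tokenize,coref")


def reproduce():
    nlp = build_pipeline()
    doc = nlp(WORKING_TEXT)
    print("with final period: ok,", len(doc.sentences), "sentence(s)")
    try:
        nlp(FAILING_TEXT)
    except Exception as exc:  # exact exception type is not stated in the report
        print("without final period:", type(exc).__name__, exc)
    else:
        print("without final period: no exception (possibly already fixed)")
```

Calling `reproduce()` on an affected stanza version should print an exception for the unpunctuated text; on a version that includes the referenced commits it may report no exception.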

@AngledLuffa
Collaborator

Can reproduce. Thank you for calling this to our attention.

AngledLuffa added a commit that referenced this issue Jun 21, 2024
Jemoka pushed a commit that referenced this issue Jul 16, 2024