Exception thrown by coref processor #1397

tseanard · 2024-06-21T16:05:35Z

stanza/stanza/pipeline/coref_processor.py

Line 127 in 6e442a6

end_word = word_pos[span[1]]

In certain cases, the line linked above throws an error and crashes the coreference processor. Since the exception is unhandled, the method returns no document object. It took me a while to find the root of this issue, and I'm not sure of all of the inner workings of stanza, so I don't know that I can create a robust fix that doesn't cause problems somewhere else.

I discovered the issue while doing a naïve character-based split of a long section of text (130,000 characters), breaking it into chunks of 2k to 10k characters. I understand that passing blocks of text that are split within sentences, and sometimes even in the middle of a word, is not a use case that coreference resolution should necessarily be able to handle, but being new to stanza, it was not clear to me that this was what was causing the issue.
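For comparison, here is a rough sketch of a splitter that prefers to cut after sentence-final punctuation, so that chunks are less likely to end mid-sentence. The function name `chunk_text` and the `max_chars` parameter are hypothetical, and the regular expression is only a crude approximation of sentence boundaries; it is not how stanza segments text.

```python
import re

def chunk_text(text, max_chars=10000):
    """Split text into chunks of at most max_chars characters.

    Where possible, each cut is placed just after the last '.', '!' or '?'
    that is followed by whitespace inside the current window. If no such
    position exists, the cut falls at max_chars (a naive character split).
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            window = text[start:end]
            # Candidate cut points: sentence-final punctuation plus one whitespace char.
            matches = list(re.finditer(r"[.!?]\s", window))
            if matches:
                end = start + matches[-1].end()
        chunks.append(text[start:end])
        start = end
    return chunks

sample = "One two. Three four. Five six."
pieces = chunk_text(sample, max_chars=12)
```

Because every character ends up in exactly one chunk, joining the pieces reproduces the original text. OCR output with missing punctuation can still produce chunks that end without a terminal mark.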

This is the for loop that crashes (specifically at end_word = word_pos[span[1]]):

            for span_idx, span in enumerate(span_cluster):
                sent_id = sent_ids[span[0]]
                sentence = sentences[sent_id]
                start_word = word_pos[span[0]]
                end_word = word_pos[span[1]]
                # very UD specific test for most number of proper nouns in a mention
                # will do nothing if POS is not active (they will all be None)
                num_propn = sum(word.pos == 'PROPN' for word in sentence.words[start_word:end_word])

                if ((span[1] - span[0] > max_len) or
                    span[1] - span[0] == max_len and num_propn > max_propn):
                    max_len = span[1] - span[0]
                    best_span = span_idx
                    max_propn = num_propn
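A minimal sketch of how this line can run off the end of its lookup list, assuming spans are half-open (start, end) pairs of document-wide word indices and that word_pos holds one entry per word. The data and both helper functions are hypothetical and are not taken from stanza; the guarded variant is only one possible way to avoid the out-of-range index.

```python
# Toy stand-in for the per-word lookup table (hypothetical data):
# two sentences, three words and two words long.
word_pos = [0, 1, 2, 0, 1]  # position of each document word within its sentence

def end_word_unsafe(word_pos, span):
    # Mirrors the indexing in the loop above: fails when span[1] == len(word_pos),
    # i.e. when the span ends exactly at the end of the document.
    return word_pos[span[1]]

def end_word_guarded(word_pos, span):
    # Look up the last word inside the span, then step one past it.
    return word_pos[span[1] - 1] + 1

final_span = (3, 5)  # covers the last two words of the document

try:
    end_word_unsafe(word_pos, final_span)
    raised = False
except IndexError:
    raised = True

guarded_end = end_word_guarded(word_pos, final_span)
```

For a span that ends inside a sentence, such as (0, 2), both helpers return the same value in this toy setup.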

Conditions to reproduce the issue:

Provide a block of text that ends with a word that is part of a coreference span
Leave out punctuation at the end of the text

Workaround:

Add any punctuation to the end of the text if the error is thrown
I tested period, comma, exclamation mark, question mark, and colon, and those all worked
Adding a newline (\n) or an extra space did not help as workaround attempts
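The workaround above can be sketched as a small helper that runs before the text is handed to the pipeline. The name `ensure_final_punctuation` and the choice of a period as the appended mark are assumptions for illustration. Unlike the workaround as described, this version appends the mark up front instead of waiting for the error.

```python
# Marks reported above as working: period, comma, exclamation mark, question mark, colon.
TERMINAL_MARKS = ".,!?:"

def ensure_final_punctuation(text, mark="."):
    """Return text with `mark` appended if it does not already end in punctuation.

    Trailing whitespace is ignored when checking, because a trailing newline
    or space alone did not avoid the error.
    """
    stripped = text.rstrip()
    if stripped and stripped[-1] not in TERMINAL_MARKS:
        return stripped + mark
    return text

broken = "Sometimes people are part of the problem, and sometimes they are the solution to it"
patched = ensure_final_punctuation(broken)
```

Text that already ends in one of the listed marks is returned unchanged, so the helper should be safe to apply to every chunk.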

Use case where this is relevant:

I am processing massive amounts of text collected using OCR, so punctuation is sometimes missed or misread.

Example text that causes issue: "Sometimes people are part of the problem, and sometimes they are the solution to it"
Update to text that resolves the issue: "Sometimes people are part of the problem, and sometimes they are the solution to it."


AngledLuffa · 2024-06-21T23:25:12Z

Can reproduce. Thank you for calling this to our attention.


AngledLuffa added a commit that referenced this issue Jun 21, 2024

Avoid going off the end of the document in the event that a coref cluster ends at exactly the end of the document. #1397

035364b

AngledLuffa added the fixed on dev label Jun 21, 2024

Jemoka pushed a commit that referenced this issue Jul 16, 2024

Avoid going off the end of the document in the event that a coref cluster ends at exactly the end of the document. #1397

e2a93c2
