Apostrophes at start or end of word seem to mess up the segmenter #9

Open
Telavian opened this issue May 23, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@Telavian

Given this text
When it first arrived, I thought it was huge, and was thinkin' 'bout returning it, even though it is the size they say it is, It just seemed really large in person. I kept it and started using it. It is very easy to use with the instruction manual in hand, and I don't need that anymore for the things I do. I've scanned, copied, enlarged and printed double sided. All verry intuitive now. Prints clean and clear, bought a two pack of extra capacity black ink cartridges from Epson, delivered they were only $37, which I thought was reasonable, and it doesn't even look that big anymore. I am likin' it more all the time, and real happy with my choice.

If I convert the last "likin'" to "likin" then it segments into 2 phrases.
If I convert the first "thinkin'" to "thinkin" it segments to 1 phrase.
If I convert the first "'bout" to "bout" then it segments to 7 phrases.

@EliotJones
Member

Ooh, this is a tricky one 🙈 The library assumes the apostrophes form a quoted section, and it doesn't do sentence boundary detection inside quoted sections. I've been mulling over rules to detect this situation, but it's a hard one.

@EliotJones added the bug label on Jun 16, 2021
@Telavian
Author

I am honestly not sure either.

A ' at the end of a word, as in thinkin', could be treated as a normal character.
A ' in the middle of a word, as in I've, could be treated as a normal character.

However, it is probably not realistic to tell a quoted section apart from heavy ' usage.
An option might help: AllowQuotes or something similar.
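
To make the idea concrete, here is a rough sketch in Python (not the library's own language or API). It hides word-internal apostrophes from a segmenter by swapping them for a private-use placeholder, and does the same for word-final ones only when quotes are switched off. The names `mask_word_apostrophes`, `unmask` and `allow_quotes` are made up for illustration; `allow_quotes` stands in for the AllowQuotes option.

```python
import re

# Private-use code point used as a stand-in so a segmenter never sees the apostrophe.
PLACEHOLDER = "\ue000"

# Apostrophe with a letter on both sides, as in I've or don't.
INTERNAL = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")
# Apostrophe with a letter before it and no letter after it, as in thinkin'.
TRAILING = re.compile(r"(?<=[A-Za-z])'(?![A-Za-z])")


def mask_word_apostrophes(text: str, allow_quotes: bool = True) -> str:
    """Hide apostrophes that belong to words.

    Word-internal apostrophes are always masked. Word-final ones are masked
    only when allow_quotes is False, because they may be closing quotes.
    """
    text = INTERNAL.sub(PLACEHOLDER, text)
    if not allow_quotes:
        text = TRAILING.sub(PLACEHOLDER, text)
    return text


def unmask(text: str) -> str:
    """Restore the original apostrophes after segmentation."""
    return text.replace(PLACEHOLDER, "'")
```

A leading apostrophe such as the one in 'bout is left alone either way, so this only covers the two cases listed above.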

@EliotJones
Member

The trouble is that a ' at the end of almost all words will be a closing single quote, which 'might signify the presence of a quoted pair', but without hardcoding every possible quoted word ending it's highly likely to give false positives.

Currently the library doesn't do quote pair detection; it just has a set of regexes for probable quoted pairs and naively ignores sentence breaks between pairs, I think. By distinguishing between open and close quotes we'd at least detect the first thinkin' as being unrelated. For reasons I need to drill into, the library doesn't treat 'bout as an opening single quote, as far as I can tell. So quote pair detection might work here; I'm just worried about what it might break, since text like this is very unusual.
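
As a sketch of what distinguishing open and close quotes could look like (Python for illustration only; `classify_quotes` and `pair_quotes` are hypothetical helpers, not anything in the library), each apostrophe is labelled by its neighbours and an opener is paired with the next closer:

```python
import re


def classify_quotes(text):
    """Return (index, kind) per apostrophe: 'inner', 'open', 'close' or 'other'."""
    result = []
    for m in re.finditer("'", text):
        i = m.start()
        # Treat the start and end of the text like whitespace.
        before = text[i - 1] if i > 0 else " "
        after = text[i + 1] if i + 1 < len(text) else " "
        if before.isalnum() and after.isalnum():
            kind = "inner"   # I've, don't
        elif before.isspace() and not after.isspace():
            kind = "open"    # 'bout, 'Hello
        elif not before.isspace() and not after.isalnum():
            kind = "close"   # thinkin', there.'
        else:
            kind = "other"
        result.append((i, kind))
    return result


def pair_quotes(text):
    """Pair each opening quote with the next closing quote; skip stray closers."""
    pairs, open_at = [], None
    for i, kind in classify_quotes(text):
        if kind == "open" and open_at is None:
            open_at = i
        elif kind == "close" and open_at is not None:
            pairs.append((open_at, i))
            open_at = None
    return pairs
```

On text shaped like the example, the first thinkin' is a closer with no opener and gets dropped, but 'bout still pairs with the later likin', so pairing alone would not fix the issue.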

@EliotJones
Member

An alternative that just occurred to me: if the initial character following the quote is lowercase, as in 'bout, and the quoted pair crosses several sentence boundaries (checked by running the segmenter against the inner text), then the pair is invalid. I'll have more of a think about it.
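
Roughly, in Python for illustration (the helper names and the `max_breaks` threshold are assumptions, and `count_sentence_breaks` is a naive stand-in for running the real segmenter on the inner text):

```python
import re


def count_sentence_breaks(inner: str) -> int:
    """Naive stand-in for the segmenter: terminator, whitespace, then a capital."""
    return len(re.findall(r"[.!?]\s+(?=[A-Z])", inner))


def is_valid_pair(text: str, start: int, end: int, max_breaks: int = 1) -> bool:
    """Reject a candidate pair that opens on a lowercase letter and spans
    more than max_breaks sentence boundaries.

    start and end are the indices of the opening and closing quote characters.
    """
    inner = text[start + 1:end]
    starts_lower = inner[:1].islower()
    return not (starts_lower and count_sentence_breaks(inner) > max_breaks)
```

A quoted section that opens with a capital letter stays valid however many sentences it contains, as does a short lowercase one like 'quick'.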

@Telavian
Author

If the first quote seen does not have a preceding whitespace character, we could assume it is not the start of a quote. For the end of a quote, we could use the lowercase solution you mentioned.
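
A minimal sketch of the opening-quote check, in Python for illustration (both function names are hypothetical, and treating the very start of the text as a valid opening position is my own assumption):

```python
def could_open_quote(text: str, i: int) -> bool:
    """True if the apostrophe at index i sits at the start of the text
    or directly after whitespace."""
    return text[i] == "'" and (i == 0 or text[i - 1].isspace())


def first_opening_quote(text: str):
    """Index of the first apostrophe that could open a quote, or None."""
    for i, ch in enumerate(text):
        if ch == "'" and could_open_quote(text, i):
            return i
    return None
```

With this rule thinkin' is skipped as an opener, while 'bout is still accepted, so the lowercase check is still needed for that case.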

2 participants