Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support destructive splitting #10

Open
waldyrious opened this issue Nov 5, 2023 · 2 comments
Open

Support destructive splitting #10

waldyrious opened this issue Nov 5, 2023 · 2 comments

Comments

@waldyrious
Copy link
Contributor

It's understood that the library performs non-destructive splitting by default, but would it be possible to add an option to allow "destructive" splitting? In other words, trimming of whitespace around each sentence, and removal of whitespace-only splitted parts.

@santhoshtr
Copy link
Member

I think I understood what you meant, but could you please provide an example to confirm?

There is a plan to provide an advanced API other than segment- may be get_boundaries which returns more information about the nature of splitting and downstream applications can augment the behavior. See wikimedia/sentencex-js#1

This means, the default API can be opinionated, but the new API can defer such decisions to a caller. The exact API is not finalized. Still collecting requirements for that API.

@waldyrious
Copy link
Contributor Author

I think I understood what you meant, but could you please provide an example to confirm?

Yes, sure. Here's what I currently get with a file based on the README example in sentencex-js and using the content from this comment as a test:

import segment from 'sentencex'

const text = `This should? Produce a violation.

So should! This example.

Abbreviations (e.g. like) these (i.e. should) not.

"Sentences in." Quotes should, too.

Pausing... for... thought... should not?

This rule does weird things if a sentence is
already wrapped. It should maybe unwrap in
cases like this?`;

console.log(segment("en", text))

I saved this as sentencex-test.mjs and ran npm install --save sentencex followed by node sentencex-test.mjs. The result was this:

[
  'This should?',
  'Produce a violation.',
  '\n\n',
  'So should!',
  'This example.',
  '\n\n',
  'Abbreviations (e.g. like) these (i.e. should) not.',
  '\n\n',
  '"Sentences in." Quotes should, too.',
  '\n\n',
  'Pausing...',
  'for...',
  'thought...',
  'should not?',
  '\n\n',
  'This rule does weird things if a sentence is\nalready wrapped.',
  'It should maybe unwrap in\ncases like this?'
]

I would have expected the whitespace-only entries to not be present:

[
  'This should?',
  'Produce a violation.',
  'So should!',
  'This example.',
  'Abbreviations (e.g. like) these (i.e. should) not.',
  '"Sentences in." Quotes should, too.',
  'Pausing...',
  'for...',
  'thought...',
  'should not?',
  'This rule does weird things if a sentence is\nalready wrapped.',
  'It should maybe unwrap in\ncases like this?'
]

...but I understand that this is by design, hence the suggestion of an option.

That said, in further tests I noticed that, contrary to my assumption above, whitespace around sentences is not preserved (internal whitespace is preserved). Shouldn't that also be kept for full "reconstructability" of the original text?

p.s. - I realize I'm using sentencex-js above even though I am commenting in the Python-based sentencex repo, but I assumed the two projects to produce identical results, and this one to be the canonical one for decisions about the API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants