Support destructive splitting #10

waldyrious · 2023-11-05T23:12:46Z

It's understood that the library performs non-destructive splitting by default, but would it be possible to add an option to allow "destructive" splitting? In other words, trimming of whitespace around each sentence, and removal of whitespace-only splitted parts.

santhoshtr · 2023-11-06T04:59:51Z

I think I understood what you meant, but could you please provide an example to confirm?

There is a plan to provide an advanced API other than segment- may be get_boundaries which returns more information about the nature of splitting and downstream applications can augment the behavior. See wikimedia/sentencex-js#1

This means, the default API can be opinionated, but the new API can defer such decisions to a caller. The exact API is not finalized. Still collecting requirements for that API.

waldyrious · 2023-11-11T23:45:11Z

I think I understood what you meant, but could you please provide an example to confirm?

Yes, sure. Here's what I currently get with a file based on the README example in sentencex-js and using the content from this comment as a test:

import segment from 'sentencex'

const text = `This should? Produce a violation.

So should! This example.

Abbreviations (e.g. like) these (i.e. should) not.

"Sentences in." Quotes should, too.

Pausing... for... thought... should not?

This rule does weird things if a sentence is
already wrapped. It should maybe unwrap in
cases like this?`;

console.log(segment("en", text))

I saved this as sentencex-test.mjs and ran npm install --save sentencex followed by node sentencex-test.mjs. The result was this:

[
  'This should?',
  'Produce a violation.',
  '\n\n',
  'So should!',
  'This example.',
  '\n\n',
  'Abbreviations (e.g. like) these (i.e. should) not.',
  '\n\n',
  '"Sentences in." Quotes should, too.',
  '\n\n',
  'Pausing...',
  'for...',
  'thought...',
  'should not?',
  '\n\n',
  'This rule does weird things if a sentence is\nalready wrapped.',
  'It should maybe unwrap in\ncases like this?'
]

I would have expected the whitespace-only entries to not be present:

[
  'This should?',
  'Produce a violation.',
  'So should!',
  'This example.',
  'Abbreviations (e.g. like) these (i.e. should) not.',
  '"Sentences in." Quotes should, too.',
  'Pausing...',
  'for...',
  'thought...',
  'should not?',
  'This rule does weird things if a sentence is\nalready wrapped.',
  'It should maybe unwrap in\ncases like this?'
]

...but I understand that this is by design, hence the suggestion of an option.

That said, in further tests I noticed that, contrary to my assumption above, whitespace around sentences is not preserved (internal whitespace is preserved). Shouldn't that also be kept for full "reconstructability" of the original text?

p.s. - I realize I'm using sentencex-js above even though I am commenting in the Python-based sentencex repo, but I assumed the two projects to produce identical results, and this one to be the canonical one for decisions about the API.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support destructive splitting #10

Support destructive splitting #10

waldyrious commented Nov 5, 2023

santhoshtr commented Nov 6, 2023

waldyrious commented Nov 11, 2023

Support destructive splitting #10

Support destructive splitting #10

Comments

waldyrious commented Nov 5, 2023

santhoshtr commented Nov 6, 2023

waldyrious commented Nov 11, 2023