add paragraph_threshold into paragraph_tokenize function #806
This PR adds a `paragraph_threshold` argument. Following the original paper, "Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation", the paragraph threshold can be adjusted through the `paragraph_threshold` argument. This threshold corresponds to the alpha value mentioned in the paper's method section. By default, the paragraph threshold is set to 0.5.

Here is a usage:
- `paragraph_threshold=0.5`: the default
- `paragraph_threshold=0.8` -> more conservative segmentation
- `paragraph_threshold=0.05` -> less conservative segmentation
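
To illustrate why a higher threshold gives more conservative segmentation, here is a minimal, self-contained sketch. It is not the code from this PR or from the underlying model: `group_into_paragraphs`, the sample sentences, and the `break_probs` scores are all hypothetical, and the sketch assumes that a paragraph break is placed wherever a predicted break probability exceeds the threshold.

```python
from typing import List


def group_into_paragraphs(
    sentences: List[str],
    break_probs: List[float],
    paragraph_threshold: float = 0.5,
) -> List[List[str]]:
    """Toy illustration of thresholding (hypothetical helper, not a library API).

    break_probs[i] is an assumed probability that a paragraph ends
    right after sentences[i]. A break is placed only when that
    probability is strictly greater than paragraph_threshold.
    """
    if len(sentences) != len(break_probs):
        raise ValueError("sentences and break_probs must have the same length")

    paragraphs: List[List[str]] = []
    current: List[str] = []
    for sentence, prob in zip(sentences, break_probs):
        current.append(sentence)
        if prob > paragraph_threshold:
            # Confident enough: close the current paragraph here.
            paragraphs.append(current)
            current = []
    if current:
        # Whatever remains forms the final paragraph.
        paragraphs.append(current)
    return paragraphs


# Made-up sentences and scores, for illustration only.
sentences = ["s1", "s2", "s3", "s4"]
break_probs = [0.1, 0.6, 0.9, 0.3]

default_split = group_into_paragraphs(sentences, break_probs, 0.5)
conservative_split = group_into_paragraphs(sentences, break_probs, 0.8)
aggressive_split = group_into_paragraphs(sentences, break_probs, 0.05)
```

With these made-up scores, the 0.8 threshold yields fewer paragraphs than the default and the 0.05 threshold yields more, which matches the behaviour described above.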