Infinite loop for certain text #11

Open
Telavian opened this issue Apr 26, 2022 · 4 comments

Comments

@Telavian

Telavian commented Apr 26, 2022

Certain text causes the segmenter to enter an infinite loop.

var text = "o idioma estao errados.000000000000000000000000000000000000000000000000000000000000000000000000";
Segmenter.Segment(text);
@Telavian

I am not sure what this is supposed to do, but the problem seems to be in
ReferenceSeparator.SeparateReferences.

The regex ReferenceRegex is run against the text, and the two seem to be a fatal combination.

@Telavian

Telavian commented Apr 26, 2022

It seems that the sequence character, '.', number is not handled well by the regex ReferenceRegex.

When testing the regex at https://regexr.com, if I type c.#### and then keep typing digits, the execution time gets slower and slower until it eventually times out at 250ms. For a very long run of digits I would therefore expect the execution time to grow exponentially.

I am not sure how to test the original Ruby version, but since it uses the exact same regex it likely has the same issue.
https://github.com/diasks2/pragmatic_segmenter/blob/1ade491c81f9d1d7fb3abd4c1e2e266fa5b34c42/lib/pragmatic_segmenter/languages/common/numbers.rb#L50
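
The slowdown described above looks like catastrophic backtracking. As a rough sketch of how a caller could at least avoid hanging, .NET lets a Regex be constructed with a match timeout, after which it throws RegexMatchTimeoutException. The pattern below is a hypothetical stand-in with nested quantifiers, not the library's ReferenceRegex, and the class and method names (RegexTimeoutDemo, TryIsMatch) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTimeoutDemo
{
    // Hypothetical stand-in pattern: nested quantifiers that backtrack heavily.
    // This is NOT the library's ReferenceRegex.
    private static readonly Regex Pathological = new Regex(
        @"^(\d+)+$",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250));

    // Returns true if the match attempt finished, false if it hit the timeout.
    public static bool TryIsMatch(Regex regex, string input, out bool matched)
    {
        try
        {
            matched = regex.IsMatch(input);
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            matched = false;
            return false;
        }
    }

    public static void Main()
    {
        // A long run of digits followed by a non-digit forces the engine to
        // try many ways of splitting the digits before it can give up.
        var input = new string('0', 60) + "x";
        var finished = TryIsMatch(Pathological, input, out var matched);
        Console.WriteLine(finished
            ? $"Finished, matched = {matched}"
            : "Gave up after the 250 ms timeout");
    }
}
```

A timeout only turns a hang into an exception. It does not make the pattern itself any cheaper, so the regex would still need to be rewritten to segment such text.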

@Telavian

Telavian commented Apr 26, 2022

I am not sure if this is a good solution, or even "correct" in general, but it does solve my problem.

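// Requires: using System.Linq; using System.Text.RegularExpressions;

// Matches any character, a literal '.', then a digit (for example "s.0").
private static readonly Regex _numericSeparator = new Regex(@"(.\.\d)", RegexOptions.Compiled);

// Inserts a space after the '.' in every such sequence before segmenting.
private string PreprocessText(string text)
{
    var matches = _numericSeparator.Matches(text);

    // Collect the distinct matched strings.
    var groups = matches
        .AsEnumerable()
        .SelectMany(x => x.Groups.Values)
        .Select(x => x.Value)
        .Distinct();

    foreach (var group in groups)
    {
        // "s.0" becomes "s. 0"
        var replacement = group
            .Replace(".", ". ");

        // Replaces every occurrence of the matched string in the text.
        text = text.Replace(group, replacement);
    }

    return text;
}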

@AndrewLamWARC

The suggested fix may not work in the general case. For example,
"0.5 ml of milk" will be pre-processed to "0. 5 ml of milk".
Further segmentation may then separate the "0." from "5 ml of milk".
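
To make the counterexample concrete, here is a small sketch. SplitAfterAnyChar is a condensed approximation of the workaround above (not the exact code), and SplitAfterNonDigit is a hypothetical variant that only inserts the space when the character before the dot is neither a digit nor whitespace. Whether that variant is safe for every input the segmenter sees is an open question, and all names here are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreprocessSketch
{
    // Condensed approximation of the workaround: any character, a dot, a digit.
    private static readonly Regex AnyCharDotDigit = new Regex(@"(.)\.(\d)");

    // Hypothetical variant: only split when the character before the dot
    // is neither a digit nor whitespace, so decimals are left alone.
    private static readonly Regex NonDigitDotDigit = new Regex(@"(?<=[^\d\s])\.(?=\d)");

    public static string SplitAfterAnyChar(string text) =>
        AnyCharDotDigit.Replace(text, "$1. $2");

    public static string SplitAfterNonDigit(string text) =>
        NonDigitDotDigit.Replace(text, ". ");

    public static void Main()
    {
        Console.WriteLine(SplitAfterAnyChar("0.5 ml of milk"));   // 0. 5 ml of milk
        Console.WriteLine(SplitAfterNonDigit("0.5 ml of milk"));  // 0.5 ml of milk
        Console.WriteLine(SplitAfterNonDigit("errados.000"));     // errados. 000
    }
}
```

The variant still splits inputs such as "section.2", so it narrows the problem rather than removing it.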
