Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove redundant check of 128-bit bitmap of zeros in regex #72312

Closed
scrappyCoco opened this issue Jul 16, 2022 · 6 comments · Fixed by #72317
Closed

Remove redundant check of 128-bit bitmap of zeros in regex #72312

scrappyCoco opened this issue Jul 16, 2022 · 6 comments · Fixed by #72317
Assignees

Comments

@scrappyCoco
Copy link

I'm very glad for RegexGeneratorAttribute.

For an example, I have to find all consonant in russian language. I write:

[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();

This Regex emits the following generated code:

/// ...
/// I display only interesting part. 

/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    char ch;
    
    // Empty matches aren't possible.
    if ((uint)pos < (uint)inputSpan.Length)
    {
        // The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length; i++)
        {
            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }
}

There we can see the condition that will be always false: ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0

Is it possible to replace this line with: if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))?

I've tried to compare generated regex with suggested change:

public class DotnetRegexBenchmark
{
    // This is the pure copy from generated regex. 
    private readonly Regex _originalRegex = OriginalRegex.Instance;
    
    // This is the copy from generated regex with suggested change. 
    private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
    
    // Text with Latin symbols.
    private readonly string? _text;

    public DotnetRegexBenchmark()
    {
        _text = new string(Enumerable
            .Range(1, 5000)
            .Select(i => (char)('a' + i % ('z' - 'a')))
            .ToArray());
    }

    [Benchmark(Baseline = true)]
    public int Original() => CountMatches(_originalRegex);

    [Benchmark]
    public int Suggested() => CountMatches(_suggestedRegex);
    
    private int CountMatches(Regex regex)
    {
        int matchesCount = 0;
        Match match = regex.Match(_text!);
        while (match.Success)
        {
            ++matchesCount;
            match = match.NextMatch();
        }
        return matchesCount;
    }
}

With configuration:

BenchmarkDotNet=v0.13.1.20220709-develop, OS=Windows 10 (10.0.19044.1766/21H2/November2021Update)  
Intel Xeon CPU E3-1225 V2 3.20GHz, 1 CPU, 4 logical and 4 physical cores  
.NET SDK=7.0.100-preview.5.22307.18  
[Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT  
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

I've got:

Method Mean Error StdDev Ratio
Original 5.711 us 0.0049 us 0.0038 us 1.00
Suggested 2.879 us 0.0034 us 0.0029 us 0.50

As we can notice, it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

Thanks!

@scrappyCoco scrappyCoco added the tenet-performance Performance related issue label Jul 16, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 16, 2022
@ghost
Copy link

ghost commented Jul 16, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

I'm very glad for RegexGeneratorAttribute.

For an example, I have to find all consonant in russian language. I write:

[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();

This Regex emits the following generated code:

/// ...
/// I display only interesting part. 

/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    char ch;
    
    // Empty matches aren't possible.
    if ((uint)pos < (uint)inputSpan.Length)
    {
        // The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length; i++)
        {
            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }
}

There we can see the condition that will be always false: ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0

Is it possible to replace this line with: if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))?

I've tried to compare generated regex with suggested change:

public class DotnetRegexBenchmark
{
    // This is the pure copy from generated regex. 
    private readonly Regex _originalRegex = OriginalRegex.Instance;
    
    // This is the copy from generated regex with suggested change. 
    private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
    
    // Text with Latin symbols.
    private readonly string? _text;

    public DotnetRegexBenchmark()
    {
        _text = new string(Enumerable
            .Range(1, 5000)
            .Select(i => (char)('a' + i % ('z' - 'a')))
            .ToArray());
    }

    [Benchmark(Baseline = true)]
    public int Original() => CountMatches(_originalRegex);

    [Benchmark]
    public int Suggested() => CountMatches(_suggestedRegex);
    
    private int CountMatches(Regex regex)
    {
        int matchesCount = 0;
        Match match = regex.Match(_text!);
        while (match.Success)
        {
            ++matchesCount;
            match = match.NextMatch();
        }
        return matchesCount;
    }
}

With configuration:

BenchmarkDotNet=v0.13.1.20220709-develop, OS=Windows 10 (10.0.19044.1766/21H2/November2021Update)  
Intel Xeon CPU E3-1225 V2 3.20GHz, 1 CPU, 4 logical and 4 physical cores  
.NET SDK=7.0.100-preview.5.22307.18  
[Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT  
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

I've got:

Method Mean Error StdDev Ratio
Original 5.711 us 0.0049 us 0.0038 us 1.00
Suggested 2.879 us 0.0034 us 0.0029 us 0.50

As we can notice, it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

Thanks!

Author: scrappyCoco
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance, untriaged

Milestone: -

@teo-tsirpanis
Copy link
Contributor

I'm working on this.

@teo-tsirpanis teo-tsirpanis self-assigned this Jul 16, 2022
@stephentoub
Copy link
Member

stephentoub commented Jul 16, 2022

The intent was for that to be handled by

if (analysis.ContainsNoAscii)
{
// We determined that the character class contains only non-ASCII,
// for example if the class were [\u1000-\u2000\u3000-\u4000\u5000-\u6000].
// (In the future, we could possibly extend the rm.Analysis to produce a known
// lower-bound and compare against that rather than always using 128 as the
// pivot point.)
return negate ?
$"((ch = {chExpr}) < 128 || !RegexRunner.CharInClass((char)ch, {Literal(charClass)}))" :
$"((ch = {chExpr}) >= 128 && RegexRunner.CharInClass((char)ch, {Literal(charClass)}))";
}

We likely just need to update that flag based on whether
for (int i = 0; i < 128; i++)
{
char c = (char)i;
if (RegexCharClass.CharInClass(c, charClass))
{
dest[i >> 4] |= (char)(1 << (i & 0xF));
}
sees any bits set, which will mean moving that up a bit. Alternatively, the analysis could be improved to recognize that subtraction in char classes can't ever increase what's accepted, only narrow.
internal static CharClassAnalysisResults Analyze(string set)

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 16, 2022
@stephentoub
Copy link
Member

it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

@scrappyCoco, is this a common situation for you, and if so, can you tell me more about them? If a pattern is known to contain / start with only non-ASCII and input is expected to be mostly ASCII, there are things we could do to make it way faster, but they'd make it more expensive to process inputs that were mostly non-ASCII.

@ghost ghost removed untriaged New issue has not been triaged by the area owner in-pr There is an active PR which will close this issue when it is merged labels Jul 17, 2022
@scrappyCoco
Copy link
Author

@teo-tsirpanis, thanks for the super-fast reaction.
@stephentoub, this regex has been written only for an academic purpose. This is not from a real life. In this experiment, I wanted to compare "Character class subtraction" with a regular sequence of "Positive character group". I want to write for me the best practices for regexes with non-ASCII alphabets. For an example, '[0-9]' is better than '\d' for Russian text. If the compared character was Cyrillic, it would not match ASCII-range 0x30 - 0x39, then it would get Unicode Category, that was redundant (because we expected only ‘[0-9]’). And other similar experiments intended for non-ASCII texts.

@stephentoub
Copy link
Member

Thanks.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
3 participants