Remove redundant check of 128-bit bitmap of zeros in regex #72312

scrappyCoco · 2022-07-16T09:17:58Z

I'm very glad for RegexGeneratorAttribute.

For an example, I have to find all consonant in russian language. I write:

[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();

This Regex emits the following generated code:

/// ...
/// I display only interesting part. 

/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    char ch;
    
    // Empty matches aren't possible.
    if ((uint)pos < (uint)inputSpan.Length)
    {
        // The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length; i++)
        {
            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }
}

There we can see the condition that will be always false: ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0

Is it possible to replace this line with: if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))?

I've tried to compare generated regex with suggested change:

public class DotnetRegexBenchmark
{
    // This is the pure copy from generated regex. 
    private readonly Regex _originalRegex = OriginalRegex.Instance;
    
    // This is the copy from generated regex with suggested change. 
    private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
    
    // Text with Latin symbols.
    private readonly string? _text;

    public DotnetRegexBenchmark()
    {
        _text = new string(Enumerable
            .Range(1, 5000)
            .Select(i => (char)('a' + i % ('z' - 'a')))
            .ToArray());
    }

    [Benchmark(Baseline = true)]
    public int Original() => CountMatches(_originalRegex);

    [Benchmark]
    public int Suggested() => CountMatches(_suggestedRegex);
    
    private int CountMatches(Regex regex)
    {
        int matchesCount = 0;
        Match match = regex.Match(_text!);
        while (match.Success)
        {
            ++matchesCount;
            match = match.NextMatch();
        }
        return matchesCount;
    }
}

With configuration:

BenchmarkDotNet=v0.13.1.20220709-develop, OS=Windows 10 (10.0.19044.1766/21H2/November2021Update)  
Intel Xeon CPU E3-1225 V2 3.20GHz, 1 CPU, 4 logical and 4 physical cores  
.NET SDK=7.0.100-preview.5.22307.18  
[Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT  
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

I've got:

Method	Mean	Error	StdDev	Ratio
Original	5.711 us	0.0049 us	0.0038 us	1.00
Suggested	2.879 us	0.0034 us	0.0029 us	0.50

As we can notice, it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

Thanks!

The text was updated successfully, but these errors were encountered:

ghost · 2022-07-16T09:18:10Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

I'm very glad for RegexGeneratorAttribute.

For an example, I have to find all consonant in russian language. I write:

[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();

This Regex emits the following generated code:

/// ...
/// I display only interesting part. 

/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    char ch;
    
    // Empty matches aren't possible.
    if ((uint)pos < (uint)inputSpan.Length)
    {
        // The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length; i++)
        {
            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }
}

There we can see the condition that will be always false: ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0

Is it possible to replace this line with: if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))?

I've tried to compare generated regex with suggested change:

public class DotnetRegexBenchmark
{
    // This is the pure copy from generated regex. 
    private readonly Regex _originalRegex = OriginalRegex.Instance;
    
    // This is the copy from generated regex with suggested change. 
    private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
    
    // Text with Latin symbols.
    private readonly string? _text;

    public DotnetRegexBenchmark()
    {
        _text = new string(Enumerable
            .Range(1, 5000)
            .Select(i => (char)('a' + i % ('z' - 'a')))
            .ToArray());
    }

    [Benchmark(Baseline = true)]
    public int Original() => CountMatches(_originalRegex);

    [Benchmark]
    public int Suggested() => CountMatches(_suggestedRegex);
    
    private int CountMatches(Regex regex)
    {
        int matchesCount = 0;
        Match match = regex.Match(_text!);
        while (match.Success)
        {
            ++matchesCount;
            match = match.NextMatch();
        }
        return matchesCount;
    }
}

With configuration:

BenchmarkDotNet=v0.13.1.20220709-develop, OS=Windows 10 (10.0.19044.1766/21H2/November2021Update)  
Intel Xeon CPU E3-1225 V2 3.20GHz, 1 CPU, 4 logical and 4 physical cores  
.NET SDK=7.0.100-preview.5.22307.18  
[Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT  
DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT

I've got:

Method	Mean	Error	StdDev	Ratio
Original	5.711 us	0.0049 us	0.0038 us	1.00
Suggested	2.879 us	0.0034 us	0.0029 us	0.50

As we can notice, it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

Thanks!

Author:	scrappyCoco
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`, `untriaged`
Milestone:	-

teo-tsirpanis · 2022-07-16T09:56:31Z

I'm working on this.

stephentoub · 2022-07-16T11:39:26Z

The intent was for that to be handled by

runtime/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

Lines 4508 to 4518 in 50fa756

    
           if (analysis.ContainsNoAscii) 
        
           { 
        
               // We determined that the character class contains only non-ASCII, 
        
               // for example if the class were [\u1000-\u2000\u3000-\u4000\u5000-\u6000]. 
        
               // (In the future, we could possibly extend the rm.Analysis to produce a known 
        
               // lower-bound and compare against that rather than always using 128 as the 
        
               // pivot point.) 
        
               return negate ? 
        
                   $"((ch = {chExpr}) < 128 || !RegexRunner.CharInClass((char)ch, {Literal(charClass)}))" : 
        
                   $"((ch = {chExpr}) >= 128 && RegexRunner.CharInClass((char)ch, {Literal(charClass)}))"; 
        
           }

We likely just need to update that flag based on whether

runtime/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

Lines 4547 to 4553 in 50fa756

    
                           for (int i = 0; i < 128; i++) 
        
                           { 
        
                               char c = (char)i; 
        
                               if (RegexCharClass.CharInClass(c, charClass)) 
        
                               { 
        
                                   dest[i >> 4] |= (char)(1 << (i & 0xF)); 
        
                               }

sees any bits set, which will mean moving that up a bit. Alternatively, the analysis could be improved to recognize that subtraction in char classes can't ever increase what's accepted, only narrow.

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

Line 1032 in 50fa756

internal static CharClassAnalysisResults Analyze(string set)

stephentoub · 2022-07-16T18:44:29Z

it could be faster for a regex, that is looking for non ASCII symbols and for a text, that contains at most ASCII symbols.

@scrappyCoco, is this a common situation for you, and if so, can you tell me more about them? If a pattern is known to contain / start with only non-ASCII and input is expected to be mostly ASCII, there are things we could do to make it way faster, but they'd make it more expensive to process inputs that were mostly non-ASCII.

scrappyCoco · 2022-07-17T08:32:19Z

@teo-tsirpanis, thanks for the super-fast reaction.
@stephentoub, this regex has been written only for an academic purpose. This is not from a real life. In this experiment, I wanted to compare "Character class subtraction" with a regular sequence of "Positive character group". I want to write for me the best practices for regexes with non-ASCII alphabets. For an example, '[0-9]' is better than '\d' for Russian text. If the compared character was Cyrillic, it would not match ASCII-range 0x30 - 0x39, then it would get Unicode Category, that was redundant (because we expected only ‘[0-9]’). And other similar experiments intended for non-ASCII texts.

stephentoub · 2022-07-17T10:56:44Z

Thanks.

scrappyCoco added the tenet-performance Performance related issue label Jul 16, 2022

ghost added the untriaged New issue has not been triaged by the area owner label Jul 16, 2022

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Jul 16, 2022

teo-tsirpanis self-assigned this Jul 16, 2022

teo-tsirpanis mentioned this issue Jul 16, 2022

Do not use the string bit vector in regexes, if it's all ones or all zeroes. #72317

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jul 16, 2022

stephentoub closed this as completed in #72317 Jul 17, 2022

ghost removed untriaged New issue has not been triaged by the area owner in-pr There is an active PR which will close this issue when it is merged labels Jul 17, 2022

ghost locked as resolved and limited conversation to collaborators Aug 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove redundant check of 128-bit bitmap of zeros in regex #72312

Remove redundant check of 128-bit bitmap of zeros in regex #72312

scrappyCoco commented Jul 16, 2022

ghost commented Jul 16, 2022

teo-tsirpanis commented Jul 16, 2022

stephentoub commented Jul 16, 2022 •

edited

Loading

stephentoub commented Jul 16, 2022

scrappyCoco commented Jul 17, 2022

stephentoub commented Jul 17, 2022

Remove redundant check of 128-bit bitmap of zeros in regex #72312

Remove redundant check of 128-bit bitmap of zeros in regex #72312

Comments

scrappyCoco commented Jul 16, 2022

ghost commented Jul 16, 2022

teo-tsirpanis commented Jul 16, 2022

stephentoub commented Jul 16, 2022 • edited Loading

stephentoub commented Jul 16, 2022

scrappyCoco commented Jul 17, 2022

stephentoub commented Jul 17, 2022

stephentoub commented Jul 16, 2022 •

edited

Loading