Remove redundant check of 128-bit bitmap of zeros in regex #72312
Comments
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Issue Details
I'm very glad for RegexGeneratorAttribute. For example, I have to find all consonants in the Russian language. I write:
[RegexGenerator(@"[а-я-[аeиоуыэюя]]", RegexOptions.None, -1)]
private static partial Regex ConsonantRegex();
This Regex emits the following generated code:
/// ...
/// Only the interesting part is shown.
/// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
/// <param name="inputSpan">The text being scanned by the regular expression.</param>
/// <returns>true if a possible match was found; false if no more matches are possible.</returns>
private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    char ch;
    // Empty matches aren't possible.
    if ((uint)pos < (uint)inputSpan.Length)
    {
        // The pattern begins with a character in the set [\u0430-\u044F-[e\u0430\u0438\u043E\u0443\u044B\u044D-\u044F]].
        // Find the next occurrence. If it can't be found, there's no match.
        ReadOnlySpan<char> span = inputSpan.Slice(pos);
        for (int i = 0; i < span.Length; i++)
        {
            if (((ch = span[i]) < 128 ? ("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
            {
                base.runtextpos = pos + i;
                return true;
            }
        }
    }
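}
There we can see a condition that will always be false, because every character of the ASCII bitmap is \0:
("\0\0\0\0\0\0\0\0"[ch >> 4] & (1 << (ch & 0xF))) != 0
Is it possible to replace this line with the following?
if (((ch = span[i]) >= 128 && RegexRunner.CharInClass((char)ch, "\0\u0002\0аѐ\0\u000e\0efабийопуфыьэѐ")))
I've tried to compare the generated regex with the suggested change:
public class DotnetRegexBenchmark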
{
    // This is the pure copy from generated regex.
    private readonly Regex _originalRegex = OriginalRegex.Instance;
    // This is the copy from generated regex with suggested change.
    private readonly Regex _suggestedRegex = SuggestedRegex.Instance;
    // Text with Latin symbols.
    private readonly string? _text;

    public DotnetRegexBenchmark()
    {
        _text = new string(Enumerable
            .Range(1, 5000)
            .Select(i => (char)('a' + i % ('z' - 'a')))
            .ToArray());
    }

    [Benchmark(Baseline = true)]
    public int Original() => CountMatches(_originalRegex);

    [Benchmark]
    public int Suggested() => CountMatches(_suggestedRegex);

    private int CountMatches(Regex regex)
    {
        int matchesCount = 0;
        Match match = regex.Match(_text!);
        while (match.Success)
        {
            ++matchesCount;
            match = match.NextMatch();
        }
        return matchesCount;
    }
}
With configuration:
I've got:
As we can see, the suggested change could be faster when the regex looks only for non-ASCII characters and the text consists mostly of ASCII characters. Thanks!
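To illustrate why the ASCII branch is dead code for this pattern, here is a minimal sketch of how a 128-bit ASCII bitmap stored as eight 16-bit chars behaves. The class and method names are hypothetical, and the builder is an assumption inferred from the lookup expression in the generated code, not the generator's actual implementation.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class AsciiBitmapSketch
{
    // Packs membership of the 128 ASCII characters into 8 chars (8 x 16 bits).
    public static string BuildAsciiBitmap(Func<char, bool> isInSet)
    {
        char[] bitmap = new char[8];
        for (int ch = 0; ch < 128; ch++)
        {
            if (isInSet((char)ch))
            {
                // Same indexing as the generated lookup: word ch >> 4, bit ch & 0xF.
                bitmap[ch >> 4] = (char)(bitmap[ch >> 4] | (1 << (ch & 0xF)));
            }
        }
        return new string(bitmap);
    }

    // Mirrors the lookup expression from the generated code.
    public static bool Lookup(string bitmap, char ch) =>
        (bitmap[ch >> 4] & (1 << (ch & 0xF))) != 0;

    // Approximation of the set [а-я-[аeиоуыэюя]] from the issue.
    public static bool IsInConsonantSet(char c) =>
        c >= 'а' && c <= 'я' && "аeиоуыэюя".IndexOf(c) < 0;
}
```

For the consonant set, `BuildAsciiBitmap(AsciiBitmapSketch.IsInConsonantSet)` produces a string of eight `\0` characters, because no ASCII character is in the set. `Lookup` on that bitmap therefore returns false for every `ch < 128`, which is why the ASCII branch of the ternary can be dropped.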
I'm working on this.
The intent was for that to be handled by runtime/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs Lines 4508 to 4518 in 50fa756
We likely just need to update that flag based on whether runtime/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs Lines 4547 to 4553 in 50fa756
Line 1032 in 50fa756
@scrappyCoco, is this a common situation for you, and if so, can you tell me more about it? If a pattern is known to contain / start with only non-ASCII characters and the input is expected to be mostly ASCII, there are things we could do to make it way faster, but they'd make it more expensive to process inputs that are mostly non-ASCII.
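One possible shape of such an optimization, sketched under assumptions: the set is known to contain no ASCII characters, and the code targets .NET 8 or later, where `MemoryExtensions.IndexOfAnyExceptInRange` is available. The class and method names are hypothetical, and this is not what the generator emits at the commit discussed here.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class NonAsciiScanSketch
{
    // Returns the index of the first char accepted by isInSet, or -1 if none.
    // Assumes isInSet never accepts an ASCII character.
    public static int IndexOfFirstMatch(ReadOnlySpan<char> span, Func<char, bool> isInSet)
    {
        int pos = 0;
        while (pos < span.Length)
        {
            // Jump over a run of ASCII characters in one vectorizable call.
            int skip = span.Slice(pos).IndexOfAnyExceptInRange('\0', '\u007F');
            if (skip < 0)
            {
                return -1; // Only ASCII remains, so no match is possible.
            }
            pos += skip;
            if (isInSet(span[pos]))
            {
                return pos;
            }
            pos++; // Non-ASCII but not in the set; keep scanning.
        }
        return -1;
    }
}
```

On mostly ASCII input the search call skips long runs at once. On mostly non-ASCII input each call returns almost immediately, so the per-call overhead is paid for nearly every character, which is the trade-off described in the comment.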
@teo-tsirpanis, thanks for the super-fast reaction.
Thanks.