Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use the string bit vector in regexes, if it's all ones or all zeroes. #72317

Merged
merged 2 commits into from
Jul 17, 2022

Conversation

teo-tsirpanis
Copy link
Contributor

Fixes #72312.

When emitting code that matches regex character classes, if all other shortcuts fail, we resort to hardcoding whether ASCII characters belong in the set using a string as a bit vector, and calling CharInClass for the other characters. Sometimes, based on RegexCharClass.Analyze we determine that the character class contains either all ASCII characters or none of them, and skip the bit vector search.

However these analysis results are not perfect and cannot produce reliable results if for example the character class contains a subtraction. This PR checks the string bit vector after its construction to see if it contains only ones or zeroes and if so, emits code that does not use it.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jul 16, 2022
@ghost
Copy link

ghost commented Jul 16, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #72312.

When emitting code that matches regex character classes, if all other shortcuts fail, we resort to hardcoding whether ASCII characters belong in the set using a string as a bit vector, and calling CharInClass for the other characters. Sometimes, based on RegexCharClass.Analyze we determine that the character class contains either all ASCII characters or none of them, and skip the bit vector search.

However these analysis results are not perfect and cannot produce reliable results if for example the character class contains a subtraction. This PR checks the string bit vector after its construction to see if it contains only ones or zeroes and if so, emits code that does not use it.

Author: teo-tsirpanis
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@stephentoub
Copy link
Member

Thanks. Are these new post-string-building cases exercised by our existing tests? i.e. does code coverage show all the new branches hit?

@stephentoub stephentoub merged commit 8a709bc into dotnet:main Jul 17, 2022
@teo-tsirpanis teo-tsirpanis deleted the fix-72312 branch July 17, 2022 09:38
@ghost ghost locked as resolved and limited conversation to collaborators Aug 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove redundant check of 128-bit bitmap of zeros in regex
2 participants