-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix RegexOptions.NonBacktracking matching end anchors at timeout check boundaries #74525
Fix RegexOptions.NonBacktracking matching end anchors at timeout check boundaries #74525
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsThis PR adds a test for and fixes #74467. The test is the more complicated part, as it needs to find how many characters the innermost matching loop is processing at a time. Since that value is not public it first does a binary search over different input lengths with a 1-tick timeout to find the minimum input length to trigger the timeout. Then with that a pattern of Including the test in the unit test project would have been another option, but the matcher's sources aren't built into that currently. The fix itself is quite simple, with an integer bound passed in instead of slicing the input (since the slice length is used to indicate where the whole input ends for anchors).
|
I'm missing where exactly the problem is. To save me the trouble of debugging, can you point me to where it is and why passing in the length separately rather than slicing fixes it? |
@@ -561,7 +561,7 @@ public SymbolicMatch FindMatch(RegexRunnerMode mode, ReadOnlySpan<char> input, i | |||
} | |||
|
|||
// If there is more input available try to transition with the next character. | |||
if (!IsMintermId(positionId) || !TStateHandler.TryTakeTransition(this, ref state, positionId)) | |||
if (pos >= length || !TStateHandler.TryTakeTransition(this, ref state, positionId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understood correctly the bug was here where we were not checking if we are at the end of the input, right? I'm still not understanding why the problem is only happening for anchors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem was specifically with end anchors, as they rely on the position ID of the next character being -1 exactly when it is the end of the whole input. When they inadvertently matched, they would cause the whole search to end prematurely instead of the outer loop just checking the timeout and going back in for more input.
return 1; | ||
} | ||
} | ||
int charsPerTimeoutCheck = Array.BinarySearch(Enumerable.Range(0, 16_000).ToArray(), -1, Comparer<int>.Create(IsCharsPerTimeoutCheck)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem sound. BinarySearch expects a consistent comparer, but this comparer can change answer from call to call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use your test from the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll simplify the test by hardcoding the charsPerTimeoutCheck
and including asserts that validate it's indeed the right value before the actual test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the test even further, now there's just a constant that has to be large enough to trigger timeout checks. I realized it doesn't have to be exact for the test pattern ^a*$
, as treating a timeout check in any point of a string "aaaaaaa...aaab" as an end-of-input would cause a match. I verified that the test indeed does fail if I reintroduce the slicing.
@olsaarik, where are we with this? I assume it's important this make it into .NET 7. |
@stephentoub The source of the bug was an interaction with the two Lines 1218 to 1239 in 8a3a702
Anchors sensitive to the end-of-input use the -1 position ID coming out of that check to decide that they match. |
If the simplified test looks good I think this PR should be good to merge as soon as the tests pass. I reran that failing test, which was due to something unrelated to regex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/backport to release/7.0 |
Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3018567091 |
This PR adds a test for and fixes #74467.
The test is the more complicated part, as it needs to find how many characters the innermost matching loop is processing at a time. Since that value is not public it first does a binary search over different input lengths with a 1-tick timeout to find the minimum input length to trigger the timeout. Then with that a pattern of
^a*$
needs to not match an input of form"aaa...ab"
where there are exactly enougha
's to pop out of the loop right before theb
.Including the test in the unit test project would have been another option, but the matcher's sources aren't built into that currently.
The fix itself is quite simple, with an integer bound passed in instead of slicing the input (since the slice length is used to indicate where the whole input ends for anchors).