Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix RegexOptions.NonBacktracking matching end anchors at timeout check boundaries #74525

Merged
merged 6 commits into from
Sep 8, 2022

Conversation

olsaarik
Copy link
Contributor

This PR adds a test for and fixes #74467.

The test is the more complicated part, as it needs to find how many characters the innermost matching loop is processing at a time. Since that value is not public it first does a binary search over different input lengths with a 1-tick timeout to find the minimum input length to trigger the timeout. Then with that a pattern of ^a*$ needs to not match an input of form "aaa...ab" where there are exactly enough a's to pop out of the loop right before the b.

Including the test in the unit test project would have been another option, but the matcher's sources aren't built into that currently.

The fix itself is quite simple, with an integer bound passed in instead of slicing the input (since the slice length is used to indicate where the whole input ends for anchors).

@ghost
Copy link

ghost commented Aug 24, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR adds a test for and fixes #74467.

The test is the more complicated part, as it needs to find how many characters the innermost matching loop is processing at a time. Since that value is not public it first does a binary search over different input lengths with a 1-tick timeout to find the minimum input length to trigger the timeout. Then with that a pattern of ^a*$ needs to not match an input of form "aaa...ab" where there are exactly enough a's to pop out of the loop right before the b.

Including the test in the unit test project would have been another option, but the matcher's sources aren't built into that currently.

The fix itself is quite simple, with an integer bound passed in instead of slicing the input (since the slice length is used to indicate where the whole input ends for anchors).

Author: olsaarik
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@stephentoub
Copy link
Member

stephentoub commented Aug 28, 2022

I'm missing where exactly the problem is. To save me the trouble of debugging, can you point me to where it is and why passing in the length separately rather than slicing fixes it?

@@ -561,7 +561,7 @@ public SymbolicMatch FindMatch(RegexRunnerMode mode, ReadOnlySpan<char> input, i
}

// If there is more input available try to transition with the next character.
if (!IsMintermId(positionId) || !TStateHandler.TryTakeTransition(this, ref state, positionId))
if (pos >= length || !TStateHandler.TryTakeTransition(this, ref state, positionId))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understood correctly the bug was here where we were not checking if we are at the end of the input, right? I'm still not understanding why the problem is only happening for anchors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was specifically with end anchors, as they rely on the position ID of the next character being -1 exactly when it is the end of the whole input. When they inadvertently matched, they would cause the whole search to end prematurely instead of the outer loop just checking the timeout and going back in for more input.

return 1;
}
}
int charsPerTimeoutCheck = Array.BinarySearch(Enumerable.Range(0, 16_000).ToArray(), -1, Comparer<int>.Create(IsCharsPerTimeoutCheck));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem sound. BinarySearch expects a consistent comparer, but this comparer can change answer from call to call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just use your test from the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll simplify the test by hardcoding the charsPerTimeoutCheck and including asserts that validate it's indeed the right value before the actual test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the test even further, now there's just a constant that has to be large enough to trigger timeout checks. I realized it doesn't have to be exact for the test pattern ^a*$, as treating a timeout check in any point of a string "aaaaaaa...aaab" as an end-of-input would cause a match. I verified that the test indeed does fail if I reintroduce the slicing.

@stephentoub
Copy link
Member

@olsaarik, where are we with this? I assume it's important this make it into .NET 7.

@stephentoub stephentoub added this to the 7.0.0 milestone Sep 6, 2022
@olsaarik
Copy link
Contributor Author

olsaarik commented Sep 8, 2022

@stephentoub The source of the bug was an interaction with the two (uint)pos >= (uint)input.Length checks in the IInputReader implementations here:

private readonly struct NoZAnchorInputReader : IInputReader
{
public static int GetPositionId(SymbolicRegexMatcher<TSet> matcher, ReadOnlySpan<char> input, int pos) =>
(uint)pos >= (uint)input.Length ? -1 : matcher._mintermClassifier.GetMintermID(input[pos]);
}
/// <summary>This reader includes full handling of an \n as the last character of input for the \Z anchor.</summary>
private readonly struct FullInputReader : IInputReader
{
public static int GetPositionId(SymbolicRegexMatcher<TSet> matcher, ReadOnlySpan<char> input, int pos)
{
if ((uint)pos >= (uint)input.Length)
return -1;
int c = input[pos];
// Find the minterm, handling the special case for the last \n for states that start with a relevant anchor
return c == '\n' && pos == input.Length - 1 ?
matcher._minterms.Length : // mintermId = minterms.Length represents an \n at the very end of input
matcher._mintermClassifier.GetMintermID(c);
}
}

Anchors sensitive to the end-of-input use the -1 position ID coming out of that check to decide that they match.

@olsaarik
Copy link
Contributor Author

olsaarik commented Sep 8, 2022

If the simplified test looks good I think this PR should be good to merge as soon as the tests pass. I reran that failing test, which was due to something unrelated to regex.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@olsaarik olsaarik merged commit 239f550 into dotnet:main Sep 8, 2022
@olsaarik olsaarik deleted the nonbacktracking-timeout-anchors-fix branch September 8, 2022 22:00
@stephentoub
Copy link
Member

/backport to release/7.0

@github-actions
Copy link
Contributor

github-actions bot commented Sep 8, 2022

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3018567091

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NonBacktracking treats timeout check boundaries as end-of-input for anchors
3 participants