
NonBacktracking Regex optimizations #102655

Merged: 63 commits, Jul 11, 2024
Conversation

@ieviev (Contributor) commented May 24, 2024

@stephentoub @veanes

Here's the initial list of changes I'd make to the nonbacktracking regex engine. I haven't documented much yet because I made some design choices that need discussion.

Rebar baseline before changes:
[image: rebar baseline results, net9]

After changes:
[image: rebar results after changes]

Update: after more optimization:
[image: rebar results after further optimization]

Essentially, I fixed the things that hurt performance the most; there are already many wins of 300% or more.

Some fundamental design changes:

  • There is no BDD.Find(); instead there is a 64 KB array lookup, which alone increases performance on Unicode by ~3x. Initialization is cheap since the array is mostly zeroes, but having it matters. It could also be a span in the hot loop, since the reference does not change (see the sketch after this list).
  • NFA mode should not be thought of as "switching to second gear" but rather "stopping a tumor". It should be a last-resort option that drops throughput from ~500 MB/s to ~5 MB/s to avoid blowing up (search-time performance goes from O(n) to O(m*n)). The symbolic automata framework here already provides an extremely small automaton and an unfair advantage over other engines. The one-time cost of potentially a few megabytes allocated for state space is what you pay for this speed.
  • RegexFindOptimizations is often a lot slower than the DFA; the only cases where AVX brings a benefit are small character sets, ranges, or string literals. Optimizing scenarios like [A-Z\n] or [A-Z0-9] with SearchValues would help a lot too, since they're very common predicates to skip to in the inner loop. For now I disabled RegexFindOptimizations in the DFA engine for any character set over 4 chars that isn't a single range.
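
To make the first bullet concrete, here is a minimal sketch of the flat lookup idea, assuming a byte-sized minterm id per UTF-16 code unit; the type and member names are illustrative, not the PR's actual identifiers:

// Hypothetical sketch: a flat 64 KB table mapping each UTF-16 code unit to its
// minterm id, replacing a BDD.Find() walk with a single indexed load.
internal sealed class MintermLookupSketch
{
    // char.MaxValue + 1 == 65536 entries; mostly zeroes, so initialization is cheap.
    private readonly byte[] _lookup = new byte[char.MaxValue + 1];

    public int GetMintermId(char c) => _lookup[c];

    public int CountMinterm(ReadOnlySpan<char> input, byte targetMintermId)
    {
        // The array reference never changes, so it can be hoisted
        // (or taken as a span) outside the hot loop.
        ReadOnlySpan<byte> lookup = _lookup;
        int count = 0;
        foreach (char c in input)
        {
            if (lookup[c] == targetMintermId)
                count++;
        }
        return count;
    }
}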

There's low-hanging fruit in the hot loop as well. I removed the IsDeadend flag entirely: since there is just one dead-end state, I stored its state ID instead. The nullability and skipping checks got dedicated short-circuit lookups, as having a nullable position at all (a valid match) is usually rare; something like beq or bgt could be used as well. Some if-else branches in the hot loop should be removed and resolved somewhere above (a sketch follows below).
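
Here is a minimal sketch of those two short-circuits, assuming a flat transition table; all names here (transitions, mintermLookup, isNullable, deadEndStateId) are illustrative:

// Hypothetical sketch of the hot-loop short-circuits described above.
static int FindLastMatchEnd(
    ReadOnlySpan<char> input,
    int[] transitions,       // [stateId * mintermCount + mintermId] => next stateId
    byte[] mintermLookup,    // char => minterm id
    bool[] isNullable,       // stateId => position is a valid match end
    int mintermCount,
    int initialStateId,
    int deadEndStateId)
{
    int stateId = initialStateId;
    int lastMatchEnd = -1;
    for (int pos = 0; pos < input.Length; pos++)
    {
        // One known dead-end state: a single id comparison replaces an IsDeadend flag.
        if (stateId == deadEndStateId)
            return lastMatchEnd;

        // Nullable positions (valid matches) are rare, so this almost always falls through.
        if (isNullable[stateId])
            lastMatchEnd = pos;

        stateId = transitions[stateId * mintermCount + mintermLookup[input[pos]]];
    }
    return isNullable[stateId] ? input.Length : lastMatchEnd;
}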

There's still a lot of performance on the table; compared to resharp, dedicated DFA prefilter optimizations could help a lot as well.

[image: resharp comparison]

Let me know what you think of the changes. I think the nonbacktracking engine could be made a lot faster than Compiled in most scenarios, and the symbolic automata benefits should be used a lot more aggressively.

@dotnet-policy-service bot added the community-contribution label May 24, 2024
@ieviev (Contributor, Author) commented May 24, 2024

@dotnet-policy-service agree

@ieviev (Contributor, Author) commented May 26, 2024

More overhead elimination gained another 20% or so in some cases:
[image: benchmark results]

The all-english and all-russian benchmarks involve too many one-off edge cases related to supporting all anchors for me to really change them; higher performance there would require breaking something or a bigger redesign. But in all other cases the engine is already very fast.

One cheap option for the two slower ones is to remove \b, ^, and $ anchors where they change nothing semantically: \b\w+\b matches exactly the same thing as \w+, and skipping the anchor checks would give about 30% more performance. I'll see what else can be done about match-count overhead.

@ieviev (Contributor, Author) commented May 28, 2024

\b\w+\b matches exactly the same thing as \w+

I forgot about the zero-width non-joiner, so this is not correct; it would otherwise have almost doubled the performance.

[image: benchmark results]

But making the conditional nullability checks cheaper did help by around 20%. I think this is about as far as I can get without sinking months into it.

One thing I didn't figure out: where should tests for SymbolicRegexMatcher go? It requires adding a ton of references to the unit test project.

@danmoseley (Member) commented:

IIRC, essentially all the tests for the non-backtracking engine are also run against the other engines, and involve passing the pattern and text in (i.e., they aren't testing the internals directly). Is that what you're looking to do?

BTW, is any of this work potentially relevant to the other engines? (I don't own regex, I'm just an observer here.)

@ieviev (Contributor, Author) commented May 28, 2024

@danmoseley
I meant to write a couple of tests for reversal-optimization correctness, but I suppose the end-to-end tests will immediately show when something is wrong as well. Alternatively, I could refactor it a bit and move it out of the matcher.

Perhaps Compiled might benefit from the same alphabet compression and character-kind lookup. A source-generated variant of the DFA would be very impressive, though; it'd run circles around all the other engines.

@kasperk81 (Contributor) commented:

[03:29:12] info: [FAIL] System.Text.RegularExpressions.Tests.AttRegexTests.Test
[03:29:12] info: System.OutOfMemoryException : Out of memory
[03:29:12] info:    at System.Text.RegularExpressions.Symbolic.MintermClassifier..ctor(BDD[] )
[03:29:12] info:    at System.Text.RegularExpressions.Symbolic.UInt64Solver..ctor(BDD[] )

@ieviev (Contributor, Author) commented May 28, 2024

System.OutOfMemoryException : Out of memory in wasm tests

(RegexOptions Options, string Pattern, string Input, string Expected)[] cases = Match_MemberData_Cases(engine).ToArray();
Regex[] regexes = RegexHelpers.GetRegexes(engine, cases.Select(c => (c.Pattern, (CultureInfo?)null, (RegexOptions?)c.Options, (TimeSpan?)null)).ToArray());

Running thousands of regex instances with no GC in between, while still staying within the low memory constraints of the wasm tests, is not easily solved; maybe one of the high-memory annotations like <HighAotMemoryUsageAssembly> would do it. Or some sort of pooling / static dictionary that shares array instances based on ranges. Some memory could be shared, but overall I think it's cleaner to have the memory associated with the engine instance itself.

@ieviev (Contributor, Author) commented May 29, 2024

Now the engine has less per-match overhead as well: all-english got about 50% faster and all-russian about 100% faster compared to baseline.

The ending \Z anchor is a major pain that prevents lots of optimizations, which also means paying for something you don't use. There's a lot of temptation to just throw an exception and not support \Z at all.

I added a slight indirection for ASCII-only cases to save memory (128 B instead of 65536 B); the performance difference is insignificant (< 5%), and it at least passes the tests (a sketch of the idea follows).
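
The indirection could look something like the following sketch, assuming every non-ASCII character maps to one shared minterm; the names are illustrative:

// Hypothetical sketch: ASCII-only patterns need just 128 entries, with every
// non-ASCII character falling into one shared "everything else" minterm.
internal readonly struct AsciiMintermLookup
{
    private readonly byte[] _ascii;   // 128 B instead of 65536 B

    public AsciiMintermLookup(byte[] asciiTable) => _ascii = asciiTable;

    // One extra branch per character; per the measurements above, < 5% difference.
    public int GetMintermId(char c) => c < 128 ? _ascii[c] : 0;
}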

Other than that, the results look pretty good. It's not yet clear to me which abstractions are free and which are not, so I've been avoiding interfaces.

@stephentoub (Member) commented:

Thanks, @ieviev! I will take a look soon.

@steveharter added the tenet-performance label Jun 5, 2024
@ieviev (Contributor, Author) commented Jun 5, 2024

There are still some things to do here:

  • I'd still like to look at the INullabilityHandler and remove the dependency on the StateFlags lookup and the state struct; these currently cause some redundant work when nullability is stored in an array.
  • NFA mode should be decoupled from the high-performance parts and kept only as a fallback mechanism. Code sharing between DFA mode and NFA mode makes it harder to optimize things downstream for the DFA (in the sense that there are no safety guarantees attached to taking shortcuts).
  • Another thing: perhaps it's possible to handle the \Z anchor before even starting the match, e.g. by making the input span one character shorter if it ends with \n and then starting the match as usual. This could remove a ton of complexity elsewhere in the code. I'm not certain it's always possible, but there are many trivial cases where it would work (see the sketch after this list).
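
For the third item, a sketch of the trivial case, valid only when the pattern itself cannot consume that final newline; the helper name is hypothetical:

// Hypothetical sketch: pre-handle \Z by trimming one trailing '\n' so that the
// matcher can treat \Z as a plain end-of-input anchor. Only applicable when the
// pattern cannot itself match the final newline.
static ReadOnlySpan<char> TrimFinalNewlineForEndZ(ReadOnlySpan<char> input) =>
    input.Length > 0 && input[input.Length - 1] == '\n'
        ? input.Slice(0, input.Length - 1)
        : input;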

@stephentoub (Member) left a comment:

Sorry for the delay in getting to this. I much appreciate your submitting it.

A couple of comments/questions...

  1. In our previous conversations, you mentioned you thought inner vectorization would be possible, where we could vectorize within a match rather than just finding the next starting place. I don't see that in this PR. Is that possible? Seems like that'd be an opportunity for some significant wins.

  2. I understand from your comments that these changes brought significant wins on some patterns, in particular those involving non-ASCII, which is great. I'm a bit concerned, though, that when running this on our own perf test suite, I'm seeing regressions in various places. You can find the tests in https://github.com/dotnet/performance/tree/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions. Some of the more concerning ones were \p{Sm} and .{2,4}(Tom, which regressed throughput by ~50%, and Holmes.{0,25}(...).{0,25}Holmes..., which regressed throughput by ~25%. Thoughts?

Thanks!

@stephentoub added this to the 9.0.0 milestone Jun 14, 2024
@ieviev (Contributor, Author) commented Jun 15, 2024

In our previous conversations, you mentioned you thought inner vectorization would be possible, where we could vectorize within a match rather than just finding the next starting place. I don't see that in this PR. Is that possible? Seems like that'd be an opportunity for some significant wins.

Yes, this is possible. Any pattern that contains .* could be a lot faster on longer matches. It'd be best to start with inner vectorization without anchors; the presence of anchors makes it more complicated and expensive, but I still think it's possible with anchors as well when followed by an anchor-context lookup. It also needs a bit of testing to see where the line lies between match-time speedup and unwanted compile/construction-time overhead. (A minimal sketch of the idea follows.)
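
As one concrete shape this could take (a sketch, not code from this PR): a DFA state whose self-loop accepts every character except '\n', i.e. the inside of a .* without RegexOptions.Singleline, can be skipped with a single vectorized search instead of stepping one character at a time:

// Hypothetical sketch: while the DFA sits in a state that self-loops on [^\n],
// everything up to the next '\n' can be skipped with a vectorized IndexOf.
static int SkipDotStarRun(ReadOnlySpan<char> input, int pos)
{
    int newline = input.Slice(pos).IndexOf('\n');
    return newline < 0 ? input.Length : pos + newline;
}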

I understand from your comments that these changes brought significant wins on some patterns, in particular those involving non-ASCII, which is great. I'm a bit concerned, though, that when running this on our own perf test suite, I'm seeing regressions in various places. You can find the tests in https://github.com/dotnet/performance/tree/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions. Some of the more concerning ones were \p{Sm} and .{2,4}(Tom, which regressed throughput by ~50%, and Holmes.{0,25}(...).{0,25}Holmes..., which regressed throughput by ~25%. Thoughts?

I'll definitely profile these as well. There is some overhead from the way edge cases are currently handled. \p{Sm} in particular could be made to skip the reversal part entirely, along with other fixed-length patterns. I'll follow up about this once I've done some more testing.

@ieviev (Contributor, Author) commented Jun 18, 2024

Now I've had a better look at the dotnet/performance benchmarks as well. It turns out I was wrong about the \p{Sm} benchmark: System.Buffers.ProbabilisticWithAsciiCharSearchValues is actually faster there, even with 936 characters in \p{Sm}. But there is a trade-off depending on how frequently the characters occur in the input: with <70 occurrences in 20 MB of text it was about 2x faster, but it can be either slower or faster depending on frequency. I re-enabled RegexFindOptimizations there for now, since it's more of a statistics question when you can safely assume all the characters in the set are rare.

Try running the performance tests again; I think it should be faster across the board now.

I thought about inner vectorization as well, but there don't seem to be many benchmarks involving .* or other star loops, and I think creating a separate implementation of RegexFindOptimizations for the automata engine would be far more beneficial. Derivatives offer a very concise and useful way to infer prefixes that works for any pattern, so this would replace a lot of custom logic in RegexPrefixAnalyzer.cs with something similar to this:

// ... prefix start; looping back to this or to any visited state is redundant
SymbolicRegexNode currentState = // ..
var mergedPrefix = solver.Empty;
foreach (var minterm in minterms)
{
    var derivative = createDerivative(currentState, minterm);
    if (isRedundant(derivative))
        continue;
    // O(1) bitwise merge of the relevant transitions into the prefix
    mergedPrefix = solver.Or(mergedPrefix, minterm);
}
// ... continue breadth-first search for longer prefixes
return mergedPrefix;

There are many optimizations that currently go undetected/unused which could enable AVX in more cases. The AVX and the transitions could be combined as well: e.g., when optimizing for a longer prefix like Tom.*River, the engine currently skips to the location of |Tom.*River, when it could just skip the transitions and continue immediately at |.*River instead.
I think most of the regexes used in benchmarks spend their time in RegexFindOptimizations territory, so this would be immediately noticeable there as well. I'm unsure how this compares to the prefixes used for the Compiled engine; maybe just opting in to the same prefix optimizations Compiled gets and adjusting them for NonBacktracking would work as well, but I'd need to try that in practice.

I wonder if replacing the algorithm choices in the match with something like a Func<..,int> created during construction would be an option. Essentially, what I'd want to do is handle as many if-else choices during construction as possible. The ideal DFA loop would look something like this, without any if-else conditions in between (a dispatch sketch follows the loop):

while (position != end) {
	// check for success
	// get next state
	position++;
}
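
A sketch of what that construction-time dispatch could look like; the delegate, the predicates, and the placeholder loop bodies are illustrative, not the PR's actual code:

// Hypothetical sketch: resolve the if-else choices once at construction time
// and store the chosen specialized loop as a delegate, keeping the hot loop branch-free.
internal static class MatcherDispatchSketch
{
    internal delegate int MatchLoop(ReadOnlySpan<char> input, int startPos);

    internal static MatchLoop ChooseLoop(bool hasAnchors, bool isAsciiOnly)
    {
        // Each branch returns a dedicated loop with no per-character conditionals;
        // the lambda bodies are stand-ins for the real specialized implementations.
        if (!hasAnchors && isAsciiOnly)
            return static (input, pos) => pos; // ASCII, anchor-free loop
        if (!hasAnchors)
            return static (input, pos) => pos; // general anchor-free loop
        return static (input, pos) => pos;     // anchor-aware loop
    }
}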

An optimized DFA gives a very consistent worst-case throughput of somewhere around 500 MB/s on any pattern, which also accounts for UTF-16 losses on UTF-8 input (this is why the English benchmarks are slower). AVX on benchmarks is a bit of a gamble: it sometimes helps a lot and sometimes does not. High match-count overhead can sometimes dominate a benchmark as well. There are massive gains from AVX in some cases, but overall I'm not exactly sure where the line is, as it depends on the input text as well.
[image: benchmark results]

The worst-case performance should be very good on both ASCII and Unicode by now. Which prefix optimizations to pursue depends on the benchmarks used, or on https://github.com/dotnet/runtime-assets/blob/main/src/System.Text.RegularExpressions.TestData/Regex_RealWorldPatterns.json

I'll refactor and clean up the PR and comments for a bit now.

@ieviev (Contributor, Author) commented Jun 18, 2024

Ah, I also left in an exception to check whether a pattern has over 255 minterms; a pattern from RegexPcreTests did reach it, so a fallback mode is definitely necessary. An int array would be a 256 KB allocation but a couple of times faster than a BDD, so there's a decision there as well. Would pooling the array be an option? (A sketch follows.)
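
For reference, a sketch of what pooling could look like with ArrayPool; whether rent/return fits the matcher's lifetime (the table normally lives as long as the Regex instance) is the open question:

using System;
using System.Buffers;

// Hypothetical sketch: rent the 64K-entry int lookup (a ~256 KB allocation)
// from the shared pool instead of allocating it fresh.
int[] lookup = ArrayPool<int>.Shared.Rent(char.MaxValue + 1);
try
{
    Array.Clear(lookup, 0, char.MaxValue + 1); // pooled arrays are not zeroed
    // ... build the minterm table and use it for matching ...
}
finally
{
    ArrayPool<int>.Shared.Return(lookup);
}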

Simplification, style consistency, dead code deletion, some bounds-check removal, etc.
@stephentoub (Member) left a comment:

I pushed a commit with a last round of changes.

Otherwise, LGTM.

Thanks!

@stephentoub merged commit 0c38d80 into dotnet:main Jul 11, 2024
71 of 83 checks passed
@stephentoub (Member) commented:

I've got another round of cleanup / simplification (some of the code in this PR, some pre-existing) I'm working on and will submit as a separate PR.

@veanes (Contributor) commented Jul 11, 2024

I will be happy to provide comments or feedback if needed.

Labels: area-System.Text.RegularExpressions, community-contribution, tenet-performance