Add PCRE2_EXTRA_VANILLA_SYNTAX to disable PCRE2 extensions #479
I'm considering throwing in a change to Proposed:
The |
In general, I don't support this direction, for various reasons:
I would only support this if all other engines had the same feature, and there were a standards body (even an informal one) that decides which features belong in this list. Another idea: there could be a tool that parses patterns and warns about the differences between engines. That would help people more than having patterns randomly fail to parse. |
Like Zoltan, I am not keen. The engines are constantly evolving, and trying to decide what is "common" and what is not, and keeping that decision up to date, sounds like a lot of effort for not much gain. What do we do if, in future, Perl adds non-atomic positive lookaheads? In the past, PCRE implemented recursion before Perl did (I don't know if it preceded any other engines), and I think it had capture group names using the Python syntax (?P<name>...) before Perl added them. If somebody has the time, then I think users would much more appreciate up-to-date documentation on exactly what each engine supports, but maintaining such a list could be very time consuming. |
OK. That's quite fair. I'll let this PR hang out here for a while before I close it. The motivation was: what if you want to expose regexes in a product (such as Excel), BUT you want to lock down the API surface to something that's really solid, and could even be implemented by third parties? So we'd offer regex features that are widely supported across multiple engines and work consistently, such that the regex dialect could be supported on ~any backend via syntax translation. If we were to expose the entirety of PCRE2, then we would force other implementors (within Microsoft!) to use exactly the same regex library if they want to achieve compatibility. I guess you'd suggest instead what I've been arguing for internally: build a pre-parser that sits in front of PCRE2, so that the application shipping the feature "owns" the syntax that it considers valid, rather than relying on the library to cut itself down to a common/universal feature set. Somewhat to my surprise, however, there was a decision made by Product Managers that they didn't want to ship a cut-down syntax unless it was a supported configuration for PCRE2. (This seemed strange to me.) The likelihood is that our application will ship with the entirety of PCRE2's custom dialect exposed to customers. This will lock us into shipping PCRE2 as the only backend that can support this regex dialect, for the entire many-decades-long expected lifetime of this feature. (Bear in mind that Excel is 40 years old, and new features have to plan for decades of maintenance with zero backwards compatibility regressions.) So, I thought it was at least worth seeing if you'd take this patch! |
I understand the motivation, and why this is useful for you. Actually, I have a story for you: the C++ interface for PCRE1. Google wanted to land a large piece of code, and it was accepted. Then they disappeared, and nobody wanted to maintain their code. Then a few people complained that the code did not work, that the interface was not complete, and so on. It was just a lot of burden, with no gain. For PCRE2, we simply dropped the whole thing. It is good to think 10 years ahead, and we also do that from our perspective. This feature is mostly needed by you (a single entity), and is not of general interest. So who will maintain this change 10 years from now? It is not necessarily bad if company interests appear in the code, but we also need to see that the support is long term, and that the project benefits from it too. |
Thank you, I understand! I'm very sympathetic, and I wish that we were more willing at Microsoft to maintain our "universal regex dialect" internally. |
One thing I would like to mention is that you are likely going to be one of the main users of the For example, I DO see how we could add a flag that would reject Similarly, callouts can be disabled already at build time, because they don't make sense for all users. |
f3d23c0 to af06720
!!! BIKESHEDDING / ARGUMENT WARNING !!!
This PR is harmless in principle, but I expect people will have different opinions on the details.
Personally, I don't mind much.
Goal
Add an extra option, PCRE2_EXTRA_VANILLA_SYNTAX, to remove syntax which is specific to PCRE2. Such syntax will be treated as if PCRE2 didn't even parse it, giving the same syntax errors you'd get from (say) Perl.
The goal here is to remove things which were invented by PCRE2, not because they're bad (I'm not passing judgement), but simply so that users who want a more "vanilla" syntax can have it.
Things that are in .NET, Ruby (oniguruma), Perl, ... are all OK, because they're not invented by PCRE2. So this option isn't bringing PCRE2 into alignment with Perl (which would be stricter than this PR). I'm merely restricting it from offering syntax that's not supported by any other engine.
First draft includes:
(*NOTEMPTY)
(?aX)
(?(VERSION>=10.4)yes|no)
(?C0)
scan_substring
non_atomic_positive_look(ahead|behind)
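For comparison, the pre-parser alternative raised in the conversation (an application-side check that runs before the pattern ever reaches PCRE2) could look roughly like the sketch below. It is not part of this PR and not PCRE2's parser: the table `PCRE2_ONLY`, the function `find_pcre2_only`, and the prefix regexes are all hypothetical, written from the first-draft list above. In particular, reading `(?aX)` as the `(?aD)`, `(?aS)`, `(?aW)`, `(?aP)`, `(?aT)` family and including the short aliases `scs`, `napla`, `naplb`, `(?*` and `(?<*` are my assumptions. The scanner skips backslash escapes and character classes but ignores `\Q...\E` quoting and extended-mode comments, so treat it as an approximation.

```python
import re

# Hypothetical table: label -> regex tried at each unescaped "(" outside a
# character class. These prefixes only approximate the first-draft list.
PCRE2_ONLY = {
    "notempty": re.compile(r"\(\*NOTEMPTY\)"),
    # Assumption: "(?aX)" stands for the (?aD)/(?aS)/(?aW)/(?aP)/(?aT) options.
    "ascii_option": re.compile(r"\(\?[A-Za-z^-]*?a[DSWPT]"),
    "version_condition": re.compile(r"\(\?\(VERSION>?="),
    "callout": re.compile(r"\(\?C"),
    "scan_substring": re.compile(r"\(\*(?:scan_substring|scs):"),
    "non_atomic_lookaround": re.compile(
        r"\(\?<?\*|\(\*(?:napla|naplb|non_atomic_positive_look(?:ahead|behind)):"
    ),
}


def find_pcre2_only(pattern):
    """Return (offset, label) pairs for constructs matched by PCRE2_ONLY."""
    hits = []
    i, n = 0, len(pattern)
    in_class = False
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A "]" directly after "[" or "[^" is a literal, so step past it.
            if pattern[i + 1:i + 2] == "^":
                i += 1
            if pattern[i + 1:i + 2] == "]":
                i += 1
        elif ch == "(":
            for label, rx in PCRE2_ONLY.items():
                if rx.match(pattern, i):
                    hits.append((i, label))
                    break
        i += 1
    return hits
```

An application would call `find_pcre2_only(user_pattern)` and report a syntax error of its own if the result is non-empty, passing the pattern to PCRE2 only when the list is empty. The trade-off discussed in the conversation still applies: the application, not the library, then has to keep this table current as engines evolve.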