Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added patterns library #2190

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

RunDevelopment
Copy link
Member

@RunDevelopment RunDevelopment commented Jan 18, 2020

This adds a new "language" to Prism which provides functions to create and modify patterns.

Motivation:

It's very simple to create languages with Prism, but we provide nothing at all which can be used to extract common pattern parts into variables or to create more complex patterns from simpler ones. So what I usually do and advise other people to do is this. Not pretty, robust or scalable but it gets the job done.

After I redid #1444, I wanted a standardized way for pattern templates (or whatever you want to call that). The implementation there is fast and simple because it only does string replacements, so it has a few downsides: The first is that it doesn't check flags, so it's very easy to get flags wrong which is really hard to debug. The second big downside is that it can't deal with backreferences which is quite a big limitation.

So what I wanted was a way to have pattern templates that are easy to debug and support backreference. And because I use it quite often, I also wanted a way to create nested patterns.

Description:

Before I describe the new function, I will explain 2 important interfaces: Flags and Pattern.

A flags object is an immutable object like so: { g?: boolean, i?: boolean, m?: boolean, ...remainingFlags >. It's a map from each flag to an optional boolean which can be either true (== the flag is required in the final RegExp), false (== the flag is forbidden in the final RegExp), or undefined (== it may be present or not in the final RegExp, doesn't matter).

A pattern is an immutable object like so: { source: string, flags: Flags }. It represents a lightweight version of a RegExp.

With that out of the way: 4 functions are exported from Prism.languages.patterns:

  1. pattern(source: string, flags?: Flags): Pattern
    This function takes the source of a RegExp and optionally flags and returns a new pattern. This is just a convenience function to create patterns.
  2. toRegExp(pattern: string | Pattern): RegExp
    Takes a given pattern or string and converts it to a new RegExp object.
  3. template(pattern: string | Pattern, replacements: (string | Pattern)[]): Pattern
    The templating function. More details below.
  4. nested(pattern: string | Pattern, depth: number): Pattern
    The function to created nested patterns where the placeholder <<self>> will be recursively replaced with the pattern itself up to a certain depth.

Note: Strings and patterns can be used interchangeably. A string will be assumed to be a pattern without flags.

The template function is the heart of this PR. It will handle capturing groups and backreferences in the template pattern and its replacements. It will also check that all patterns are valid regular expressions and that the flags of all patterns are compatible. Example:

var typeKeyword = pattern(/\b(?:class|enum|interface)\b/.source, { i: false });
var string = /(["'])(?:\\.|(?!\1)[^\\])*\1/.source;
template(/(<<0>>\s+)(?:[A-Z]\w*|<<1>>)/.source, [typeKeyword, string])

To make the minified version as small and as fast as possible I added support for the compilation constant Prism.MIN which will be true in minified files and false (or rather undefined) otherwise.
This allowed me to remove all checks from the minified version so only the development version throws errors. The minified version will assume that everything works with just 1.2kB.

Why are the flags so complex?

I had to go with an object because the three states yes/no/don't care cannot be expressed by a string like "gi". The three states are necessary for error checking.

Example: Let's say we want to union /\d+/, /^foo/, and /bar$/m. That's not possible because the m flag has to both be present and not present in the union of /^foo/ and /bar$/m. But we have to analyze the pattern to figure this out because the absence of a flag doesn't tell you whether the flag is absent because the flag doesn't influence the pattern (e.g. /\d+/ == /\d+/m) or because its absence is necessary (e.g. /^foo/).

By having three states, we can check the flag compatiblity of patterns without analyzing the source.

Points for discussion:

All in all, I'm very happy with the result and would really like to see as a part of Prism but there are a few points which I'm unsure about and would like to discuss:

  1. The placeholder pattern. Right now I use <<\d+>> because it's just text from the RegExp's point of view (it doesn't use any special RegExp characters) and because easy to read. You can easily make out placeholders even in long and complex patterns.
    That being said, I'm not very attached to this and maybe the simpler pattern <\d+> would be better? Idk.
  2. I restricted template to use arrays as the replacements map while the underlying implementation can work with anything that's indexable. I placed this restriction so that people have to use short placeholders instead of using (relatively) long names.
    But maybe that's better for readability? I mean, this thing is pretty readable.
  3. Maybe instead of a function, pattern should be a class and toRegExp should be an instance method. Just an idea I had while implementing this. It's basically an object-oriented vs. functional programming style.
  4. Since I couldn't put the test for this in any existing category, I made a new one and called it lib. I'm thinking about merging this with core but I'm unsure what the best course of action will be, so suggestions are very welcome.

@Golmote
Copy link
Contributor

Golmote commented Mar 25, 2020

This is impressive!
I like the idea, but at the same time, it encourages the creation of crazy complicated regexps, which is not something I wanna see in every language definition.

Also, since you'll most likely want to use it often, should it really be a language definition? I feel like if it were to be included, it should be in the core.

Finally, this is quite a large bit of code, with an important complexity. It adds to the things that are to be debugged and maintained, that's important to consider as well.

@RunDevelopment
Copy link
Member Author

Most languages are simple enough that you won't need this. (Most languages today don't need it.)
So if a language really needs this lib, it can just require it. (I want to keep non-essential features out of prism-core and in their own files.)

Regarding "crazy complicated regexps": Yeah, it's true, you can really abuse template and nested to create patterns with I don't even want to look at. It's a powerful technique, so it can be abused.
But in the end, all it allows you to do is to make variables and string them together. It's really just a way to create a BNF-like grammar, so I don't think that there will be any big problems.

@Golmote
Copy link
Contributor

Golmote commented Mar 26, 2020

Hm ok.

Regarding your points of discussion:

  1. I like that placeholder, it's both easy to spot and relatively short.

  2. I'm good with only numbers. We'll see if we later need to expand that to identifiers if some patterns are too hard to read.

  3. An OO syntax might be easier to read. But then I would expect template() and nested() to also be methods of that class, which removes to ability to work directly with strings when patterns are not needed.

Pattern(/[^"'/()]|<<0>>|\(<<self>>*\)/.source).template([regularStringCharacterOrComment]).nested(2);

vs

nested(template(/[^"'/()]|<<0>>|\(<<self>>*\)/.source, [regularStringCharacterOrComment]), 2);

I don't have a super strong opinion on this.

  1. This is a bit weird. Since it is a language definition. I'd expect to find the tests files in the languages folder. But I know the run.js file currently handle tests in a special way. Maybe we could have another extension check to handle this sort of tests?

@RunDevelopment
Copy link
Member Author

Regarding 1-3.: Then let's just leave it as is for the moment.

Regarding 4.: IMO this is as much a language as Markup Templating is. It isn't. But it has to be.
Dependency-wise, we only languages and plugins right now, so libraries like Patterns and Markup templating aren't covered.
They are registered as languages anyway because it doesn't require changes to other components of Prism (like the Autoloader which can only do languages).

IMO we should make a new category for these to fit in. Tests will then be normal mocha/chai tests.
This would also make it easier to share methods between languages and plugins while keeping non-essential methods out of Core. (At the extreme we could even remove stuff like Prism.languages.extend and insertBefore but that's probably overdoing it.)

@RunDevelopment
Copy link
Member Author

Regarding 1 (placeholder): I gave it some more consideration and now think that <0> might be a better choice than <<0>> for a few reasons:

  • It's shorter. 2 characters less can add up to a lot when it's used hundreds of times. (I tested this with C# which currently uses the <<0>> syntax and the shorter syntax alone saves about 200bytes.)
  • It's more common. Other BNF syntaxes also use this notation.
  • IMO, it's as readable as <<0>>.

@mAAdhaTTah
Copy link
Member

This would be nice to refresh and see the bundle size impacts of this PR. Might be worth refactoring a language definitions to take advantage of the patterns library as well.

@github-actions
Copy link

github-actions bot commented Dec 23, 2020

JS File Size Changes (gzipped)

A total of 1 files have changed, with a combined diff of +604 B (+100.0%).

file master pull size diff % diff
components/prism-patterns.min.js 0 Bytes 604 B +604 B +100.0%

Generated by 🚫 dangerJS against 4e05947

@joshgoebel
Copy link

It's shorter. 2 characters less can add up to a lot when it's used hundreds of times.

I suspect gzip removes this advantage though.

@RunDevelopment
Copy link
Member Author

Might be worth refactoring language definitions to take advantage of the patterns library as well.

That's the easy part. Primary candidates are Apex, C#, FTL, JavaDoc, JSDoc, Lilypond, Lisp, QML, SAS, Shell sessions, YAML, and Zig. There are also a few other languages that could benefit from it but those are the most obvious ones.

I intended to make a few refactoring PRs after this one got merged.

It's shorter. 2 characters less can add up to a lot when it's used hundreds of times.

I suspect gzip removes this advantage though.

True. I mainly moved towards <foo> because the regexes are shorter. "Shorter" in the sense of "you see more characters before your screen runs out of pixels," and not "fewer bytes." Less visual clutter so to say.

@joshgoebel
Copy link

How does this deal with a case where the regex itself actually needs to match a <0> style pattern in a language? Or is that simply not considered?

@RunDevelopment
Copy link
Member Author

In this case, you can use literal escapes, e.g. /\<0>/ or /<0\>/.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants