Added patterns library #2190

RunDevelopment · 2020-01-18T20:44:23Z

This adds a new "language" to Prism which provides functions to create and modify patterns.

Motivation:

It's very simple to create languages with Prism, but we provide nothing at all which can be used to extract common pattern parts into variables or to create more complex patterns from simpler ones. So what I usually do and advise other people to do is this. Not pretty, robust or scalable but it gets the job done.

After I redid #1444, I wanted a standardized way for pattern templates (or whatever you want to call that). The implementation there is fast and simple because it only does string replacements, so it has a few downsides: The first is that it doesn't check flags, so it's very easy to get flags wrong which is really hard to debug. The second big downside is that it can't deal with backreferences which is quite a big limitation.

So what I wanted was a way to have pattern templates that are easy to debug and support backreference. And because I use it quite often, I also wanted a way to create nested patterns.

Description:

Before I describe the new function, I will explain 2 important interfaces: Flags and Pattern.

A flags object is an immutable object like so: { g?: boolean, i?: boolean, m?: boolean, ...remainingFlags >. It's a map from each flag to an optional boolean which can be either true (== the flag is required in the final RegExp), false (== the flag is forbidden in the final RegExp), or undefined (== it may be present or not in the final RegExp, doesn't matter).

A pattern is an immutable object like so: { source: string, flags: Flags }. It represents a lightweight version of a RegExp.

With that out of the way: 4 functions are exported from Prism.languages.patterns:

pattern(source: string, flags?: Flags): Pattern
This function takes the source of a RegExp and optionally flags and returns a new pattern. This is just a convenience function to create patterns.
toRegExp(pattern: string | Pattern): RegExp
Takes a given pattern or string and converts it to a new RegExp object.
template(pattern: string | Pattern, replacements: (string | Pattern)[]): Pattern
The templating function. More details below.
nested(pattern: string | Pattern, depth: number): Pattern
The function to created nested patterns where the placeholder <<self>> will be recursively replaced with the pattern itself up to a certain depth.

Note: Strings and patterns can be used interchangeably. A string will be assumed to be a pattern without flags.

The template function is the heart of this PR. It will handle capturing groups and backreferences in the template pattern and its replacements. It will also check that all patterns are valid regular expressions and that the flags of all patterns are compatible. Example:

var typeKeyword = pattern(/\b(?:class|enum|interface)\b/.source, { i: false });
var string = /(["'])(?:\\.|(?!\1)[^\\])*\1/.source;
template(/(<<0>>\s+)(?:[A-Z]\w*|<<1>>)/.source, [typeKeyword, string])

To make the minified version as small and as fast as possible I added support for the compilation constant Prism.MIN which will be true in minified files and false (or rather undefined) otherwise.
This allowed me to remove all checks from the minified version so only the development version throws errors. The minified version will assume that everything works with just 1.2kB.

Why are the flags so complex?

I had to go with an object because the three states yes/no/don't care cannot be expressed by a string like "gi". The three states are necessary for error checking.

Example: Let's say we want to union /\d+/, /^foo/, and /bar$/m. That's not possible because the m flag has to both be present and not present in the union of /^foo/ and /bar$/m. But we have to analyze the pattern to figure this out because the absence of a flag doesn't tell you whether the flag is absent because the flag doesn't influence the pattern (e.g. /\d+/ == /\d+/m) or because its absence is necessary (e.g. /^foo/).

By having three states, we can check the flag compatiblity of patterns without analyzing the source.

Points for discussion:

All in all, I'm very happy with the result and would really like to see as a part of Prism but there are a few points which I'm unsure about and would like to discuss:

The placeholder pattern. Right now I use <<\d+>> because it's just text from the RegExp's point of view (it doesn't use any special RegExp characters) and because easy to read. You can easily make out placeholders even in long and complex patterns.
That being said, I'm not very attached to this and maybe the simpler pattern <\d+> would be better? Idk.
I restricted template to use arrays as the replacements map while the underlying implementation can work with anything that's indexable. I placed this restriction so that people have to use short placeholders instead of using (relatively) long names.
But maybe that's better for readability? I mean, this thing is pretty readable.
Maybe instead of a function, pattern should be a class and toRegExp should be an instance method. Just an idea I had while implementing this. It's basically an object-oriented vs. functional programming style.
Since I couldn't put the test for this in any existing category, I made a new one and called it lib. I'm thinking about merging this with core but I'm unsure what the best course of action will be, so suggestions are very welcome.

Golmote · 2020-03-25T17:47:04Z

This is impressive!
I like the idea, but at the same time, it encourages the creation of crazy complicated regexps, which is not something I wanna see in every language definition.

Also, since you'll most likely want to use it often, should it really be a language definition? I feel like if it were to be included, it should be in the core.

Finally, this is quite a large bit of code, with an important complexity. It adds to the things that are to be debugged and maintained, that's important to consider as well.

RunDevelopment · 2020-03-25T18:25:27Z

Most languages are simple enough that you won't need this. (Most languages today don't need it.)
So if a language really needs this lib, it can just require it. (I want to keep non-essential features out of prism-core and in their own files.)

Regarding "crazy complicated regexps": Yeah, it's true, you can really abuse template and nested to create patterns with I don't even want to look at. It's a powerful technique, so it can be abused.
But in the end, all it allows you to do is to make variables and string them together. It's really just a way to create a BNF-like grammar, so I don't think that there will be any big problems.

Golmote · 2020-03-26T07:30:34Z

Hm ok.

Regarding your points of discussion:

I like that placeholder, it's both easy to spot and relatively short.
I'm good with only numbers. We'll see if we later need to expand that to identifiers if some patterns are too hard to read.
An OO syntax might be easier to read. But then I would expect template() and nested() to also be methods of that class, which removes to ability to work directly with strings when patterns are not needed.

Pattern(/[^"'/()]|<<0>>|\(<<self>>*\)/.source).template([regularStringCharacterOrComment]).nested(2);

vs

nested(template(/[^"'/()]|<<0>>|\(<<self>>*\)/.source, [regularStringCharacterOrComment]), 2);

I don't have a super strong opinion on this.

This is a bit weird. Since it is a language definition. I'd expect to find the tests files in the languages folder. But I know the run.js file currently handle tests in a special way. Maybe we could have another extension check to handle this sort of tests?

RunDevelopment · 2020-03-26T15:06:12Z

Regarding 1-3.: Then let's just leave it as is for the moment.

Regarding 4.: IMO this is as much a language as Markup Templating is. It isn't. But it has to be.
Dependency-wise, we only languages and plugins right now, so libraries like Patterns and Markup templating aren't covered.
They are registered as languages anyway because it doesn't require changes to other components of Prism (like the Autoloader which can only do languages).

IMO we should make a new category for these to fit in. Tests will then be normal mocha/chai tests.
This would also make it easier to share methods between languages and plugins while keeping non-essential methods out of Core. (At the extreme we could even remove stuff like Prism.languages.extend and insertBefore but that's probably overdoing it.)

RunDevelopment · 2020-04-08T15:27:47Z

Regarding 1 (placeholder): I gave it some more consideration and now think that <0> might be a better choice than <<0>> for a few reasons:

It's shorter. 2 characters less can add up to a lot when it's used hundreds of times. (I tested this with C# which currently uses the <<0>> syntax and the shorter syntax alone saves about 200bytes.)
It's more common. Other BNF syntaxes also use this notation.
IMO, it's as readable as <<0>>.

mAAdhaTTah · 2020-12-22T23:31:26Z

This would be nice to refresh and see the bundle size impacts of this PR. Might be worth refactoring a language definitions to take advantage of the patterns library as well.

github-actions · 2020-12-23T14:02:41Z

JS File Size Changes (gzipped)

A total of 1 files have changed, with a combined diff of +604 B (+100.0%).

file	master	pull	size diff	% diff
components/prism-patterns.min.js	0 Bytes	604 B	+604 B	+100.0%

Generated by 🚫 dangerJS against 4e05947

joshgoebel · 2020-12-23T14:54:38Z

It's shorter. 2 characters less can add up to a lot when it's used hundreds of times.

I suspect gzip removes this advantage though.

RunDevelopment · 2020-12-23T15:03:20Z

Might be worth refactoring language definitions to take advantage of the patterns library as well.

That's the easy part. Primary candidates are Apex, C#, FTL, JavaDoc, JSDoc, Lilypond, Lisp, QML, SAS, Shell sessions, YAML, and Zig. There are also a few other languages that could benefit from it but those are the most obvious ones.

I intended to make a few refactoring PRs after this one got merged.

It's shorter. 2 characters less can add up to a lot when it's used hundreds of times.

I suspect gzip removes this advantage though.

True. I mainly moved towards <foo> because the regexes are shorter. "Shorter" in the sense of "you see more characters before your screen runs out of pixels," and not "fewer bytes." Less visual clutter so to say.

joshgoebel · 2020-12-23T15:25:30Z

How does this deal with a case where the regex itself actually needs to match a <0> style pattern in a language? Or is that simply not considered?

RunDevelopment · 2020-12-23T17:21:34Z

In this case, you can use literal escapes, e.g. /\<0>/ or /<0\>/.

RunDevelopment added 2 commits January 18, 2020 17:21

Added patterns library

fbdb900

nested: Added support for backreferences

da9b43c

RunDevelopment added the needs review label Jan 18, 2020

RunDevelopment mentioned this pull request Jan 18, 2020

Pattern builder #1538

Closed

RunDevelopment added 4 commits January 18, 2020 21:48

Fixed premerge check

786d408

Expanded test case

d3fa021

Merge branch 'master' into patterns

f8fea92

Minor improvements

1abdee6

RunDevelopment mentioned this pull request Apr 7, 2020

Scheme: Improvements #2263

Merged

RunDevelopment mentioned this pull request Oct 18, 2020

Non-RegExp patterns #2595

Open

Merge branch 'master' into patterns

0ac89a7

Use single-angle-bracket placeholders

4e05947

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added patterns library #2190

Added patterns library #2190

RunDevelopment commented Jan 18, 2020 •

edited

Loading

Golmote commented Mar 25, 2020

RunDevelopment commented Mar 25, 2020

Golmote commented Mar 26, 2020

RunDevelopment commented Mar 26, 2020

RunDevelopment commented Apr 8, 2020

mAAdhaTTah commented Dec 22, 2020

github-actions bot commented Dec 23, 2020 •

edited

Loading

joshgoebel commented Dec 23, 2020

RunDevelopment commented Dec 23, 2020

joshgoebel commented Dec 23, 2020

RunDevelopment commented Dec 23, 2020

Added patterns library #2190

Are you sure you want to change the base?

Added patterns library #2190

Conversation

RunDevelopment commented Jan 18, 2020 • edited Loading

Golmote commented Mar 25, 2020

RunDevelopment commented Mar 25, 2020

Golmote commented Mar 26, 2020

RunDevelopment commented Mar 26, 2020

RunDevelopment commented Apr 8, 2020

mAAdhaTTah commented Dec 22, 2020

github-actions bot commented Dec 23, 2020 • edited Loading

JS File Size Changes (gzipped)

joshgoebel commented Dec 23, 2020

RunDevelopment commented Dec 23, 2020

joshgoebel commented Dec 23, 2020

RunDevelopment commented Dec 23, 2020

RunDevelopment commented Jan 18, 2020 •

edited

Loading

github-actions bot commented Dec 23, 2020 •

edited

Loading