Unify term and regexp syntax #19

Open
marcelocantos opened this issue Jan 25, 2020 · 0 comments
Labels
P2 Low priority

There are two languages baked into wbnf, the grammar syntax and the regexp syntax, which are already very similar. We should see if we can bring them together as a single unified language.

  1. The following operations have the same syntax and meaning: a|b, ab, a?, a*, a+, a{m,n}, [chars], [^chars], \pN, \PN, re??, re*?, re+?, re{m,n}? and most \-letter combinations.

  2. The following operations have different syntax but the same meaning:

| regexp | wbnf |
|---|---|
| `(?P<name>a)` | `name=a` |
| `(?:a)` | `(a)` |
  3. Regexps have the following operators, which have no counterpart in the grammar:

| type | regexp | proposed wbnf | notes |
|---|---|---|---|
| Numbered capture | `(re)` | | Won't support. Wbnf has `(term)` syntax, but it is non-capturing. |
| Reluctant quantifiers | `re??`, `re*?`, ... | same | Implemented for regexps. Should also be implemented for terms. |
| Flags | `(?flags)` `(?flags:re)` | `?flags` | Disallowed after a term. |
| Lookaround assertions | `(?=re)` `(?!re)` `(?<=re)` `(?<!re)` | `(?= term+ )` `(?! term+ )` | Not supported by RE2, but the lookahead forms might be useful in wbnf as a stopgap until LL(k) or LL(*) is implemented. |
| Anchors | `^` `$` | same | |
  4. Regexps currently act as a natural embodiment of a token, which has important implications for the structure of an output AST and for both the computational efficiency and the cognitive load of working with them. This warrants some kind of syntax to demarcate tokens. The current regexp syntax, /{...}, will probably suffice for this. Anything inside /{...} will be clumped together as a single token, with any internal structure discarded. If the internal structure is needed, it can be extracted by reparsing the text against the internal terms.
    • Currently, /{...} uses the first capturing group as the text of the output token. How will this be done when (...) no longer denotes a capturing group? Maybe /{...@=(token)...}? This could perhaps be extended to support tokens as tuples if multiple names appear inside /{...}.
    • This would also support a useful optimisation. If everything inside /{...} can be expressed as regular expressions, the entire form may be compiled as a single regexp matcher.
    • Another concern is that some use cases (grammar analysers, optimisers, grammar transforms, etc.) might need access to the internal structure of a parsed /{...} node. This can be achieved simply by reparsing the token. If it's in the form /{ rule }, this is as simple as running the parser for rule across the text of the output node. For more complex forms, see Redefine parser rule parameter as a term expression #18.
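The token-and-reparse idea can be sketched outside wbnf. The following is a minimal illustration in Python, not wbnf's implementation: `Token`, `match_token`, `reparse`, and both patterns are hypothetical names invented for this example, and Python's `re` module stands in for whatever matcher wbnf would compile. The token body is matched as one regexp and only its text is kept; the named parts are recovered later by parsing that text again.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """A leaf node: only the matched text and its span survive."""
    text: str
    start: int
    end: int


# Hypothetical token body: an identifier, '=', then an integer.
# (?:...) groups without capturing, like (a) in wbnf.
TOKEN_RE = re.compile(r"(?:[A-Za-z_]\w*)=(?:\d+)")

# The same body with named parts, like name=a in wbnf.
# It is only consulted when a caller asks for the internal structure.
INNER_RE = re.compile(r"(?P<name>[A-Za-z_]\w*)=(?P<value>\d+)")


def match_token(source: str, pos: int = 0) -> Optional[Token]:
    """Match the whole token body as one regexp and keep only its text."""
    m = TOKEN_RE.match(source, pos)
    if m is None:
        return None
    return Token(m.group(0), m.start(), m.end())


def reparse(token: Token) -> dict:
    """Recover the internal structure by parsing the token text again."""
    m = INNER_RE.fullmatch(token.text)
    if m is None:
        raise ValueError(f"token text {token.text!r} does not match its own rule")
    return m.groupdict()


tok = match_token("width=42;")
print(tok)           # Token(text='width=42', start=0, end=8)
print(reparse(tok))  # {'name': 'width', 'value': '42'}
```

Keeping the structured pattern separate means the common path pays for a single flat match, and only tools that need the internals (analysers, transforms) pay for the second parse.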

Here's an initial stab at elements of the new grammar supporting the above:

COMMENT -> scomment=/{ '//' .* } '\n'
         | mcomment=/{?s '/*' ( [^*] | '*'+ [^*/] )* '*'+ '/' };
IDENT   -> /{ '@' | [A-Za-z_\.] \w* };
STR     -> "'" squote=/{ ( `\.` | [^\\'] )* } "'"
         | '"' dquote=/{ ( `\.` | [^\\"] )* } '"'
         | '`' bquote=/{ ( '``' | [^`]   )* } '`';
INT     -> /{ [\d]+ };
RE      -> '[' neg='^'? chars=/{ ( `\]` | [^\]] )+ } ']';
TOKEN   -> '/{' term* '}';
MOD     -> /{ '?' [ims] };
REF     -> /{ '%' IDENT };
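
As a rough check on the "compile /{...} as a single regexp" idea, a few of these rules can be hand-translated into plain regexps. This is a sketch in Python, with its `re` module as a stand-in engine; the translations are one reading of the rules, not output from wbnf. In particular, the `MCOMMENT` pattern assumes the inner group is meant to repeat and that a comment may close on a run of stars. `MCOMMENT_LAZY` shows the same token written with the `s` flag and a reluctant quantifier.

```python
import re

# INT -> /{ [\d]+ }
INT = re.compile(r"\d+")

# IDENT -> /{ '@' | [A-Za-z_\.] \w* }
IDENT = re.compile(r"@|[A-Za-z_.]\w*")

# mcomment, assuming the inner group repeats and the closing delimiter
# may be preceded by extra stars: '/*' ( [^*] | '*'+ [^*/] )* '*'+ '/'
MCOMMENT = re.compile(r"/\*(?:[^*]|\*+[^*/])*\*+/")

# The same token via the s flag (dot matches newline) and a reluctant .*?
MCOMMENT_LAZY = re.compile(r"(?s)/\*.*?\*/")

sample = "/* a * b\n **/ x */"
strict = MCOMMENT.match(sample).group(0)
lazy = MCOMMENT_LAZY.match(sample).group(0)

print(repr(strict))    # '/* a * b\n **/'
print(strict == lazy)  # True
```

Both forms stop at the first closing delimiter on this sample; the explicit character-class form does so without relying on reluctant matching, which matters for engines or term-level parsers that do not yet support it.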