Is there a way to not consider tokens for OR / maxLookahead? #1972

matthew-dean · 2023-08-06T18:43:49Z

matthew-dean
Aug 6, 2023

One of the things that I still struggle with is whitespace-sensitive parsing. Many / most / (all?) of the Chevrotain examples in the docs use skipped whitespace. For a white-space sensitive language, when determining matching paths, it would be great if whitespace tokens were ignored for the purpose of resolving OR alts, even if they were explicitly CONSUMEd in the actual rule.

I assume this would involve creating a custom ILookaheadStrategy and implementing buildLookaheadForAlternation and buildLookaheadForOptional which skips whitespace tokens?

Answered by bd82

bd82 · 2023-08-06T20:25:07Z

bd82
Aug 6, 2023
Maintainer

Hello @matthew-dean

Chevrotain by default uses LL(K) lookahead, meaning it searches at most a fixed number of tokens ahead.
This would conflict with ignoring an arbitrary number of whitespace tokens in the alternatives.

While you could implement a custom lookahead strategy that only takes into account the non-whitespace tokens,
there may be a more generic solution: using the existing LL(*) lookahead plugin for Chevrotain.

https://github.com/langium/chevrotain-allstar

LL(*) should be able to "check" an arbitrary number of tokens ahead to distinguish between alternatives, even when the lookahead paths include optional sequences of whitespace, e.g. (AB vs AC):

_ _ _ A _ B
_ _ _ _ A _ _ C
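
To make the AB vs AC example concrete, here is a minimal sketch in plain JavaScript. It does not use Chevrotain or chevrotain-allstar; `peekFixed` and `peekSkippingWs` are hypothetical helpers that only model the difference between a fixed window of k tokens and a scan that ignores whitespace tokens.

```javascript
// Hypothetical token stream: "_" stands for one whitespace token.
const WS = "_";

// Fixed LL(k) window: the next k tokens, whitespace included.
function peekFixed(tokens, k) {
  return tokens.slice(0, k);
}

// Whitespace-insensitive variant: the next k non-whitespace tokens,
// scanning as far ahead as needed to find them.
function peekSkippingWs(tokens, k) {
  const seen = [];
  for (const tok of tokens) {
    if (tok === WS) continue;
    seen.push(tok);
    if (seen.length === k) break;
  }
  return seen;
}

const inputAB = ["_", "_", "_", "A", "_", "B"];
const inputAC = ["_", "_", "_", "_", "A", "_", "_", "C"];

// With k = 3 the fixed window sees only whitespace in both inputs.
const fixedAB = peekFixed(inputAB, 3).join("");
const fixedAC = peekFixed(inputAC, 3).join("");

// Skipping whitespace, two significant tokens are enough to tell them apart.
const skipAB = peekSkippingWs(inputAB, 2).join("");
const skipAC = peekSkippingWs(inputAC, 2).join("");
```

No fixed k works for the first helper in general, because the run of whitespace can always be made longer than the window.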

7 replies

bd82 Aug 7, 2023
Maintainer

See https://www.typefox.io/blog/allstar-lookahead for more details about the LL(*) Plugin created by Langium / @msujew

matthew-dean Aug 7, 2023
Author

Thanks for the blog post! One thing I will say is that, in experimenting with Antlr4, it does allow you to be expressive and flexible in a grammar that can be easier to reason about. My experience with Chevrotain, by contrast, is that you spend a fair amount of time rewriting your grammar, not just to remove bugs or be more accurate, but basically to navigate parser behavior, or to adapt an existing spec to the equivalent parseable structure for Chevrotain.

For example, there are definitely grammar specs within CSS which absolutely cannot / do not map to LL(k), but which worked fine in Antlr4 with ALL(*). So I'm definitely excited to try this plugin out! Would you ever consider adding it to the Chevrotain Github org as an officially-maintained project? Or adding it to the chevrotain package as an optional import? (If it indeed works as well as it's documented to.)

matthew-dean Aug 7, 2023
Author

Another question, does the https://github.com/langium/chevrotain-allstar plugin deal better with the decision of whether or not to enter a MANY statement?

For example, in the past, this sort of pattern would fail in Chevrotain:

$.MANY(() => {
  $.OPTION(() => $.CONSUME(T.WS))
  $.CONSUME(T.Comma)
  $.OPTION2(() => $.CONSUME2(T.WS))
  $.SUBRULE($.someRepeatingRule)
})
$.OPTION3(() => $.CONSUME3(T.WS))

This would fail if the (optional) repeating rule is not present, but white-space is, because (how you explained it, IIRC), the Chevrotain parser would assume that the white-space meant it could enter the MANY, at which time it fails to find any other matching tokens and throws an error. However, this type of pattern I found worked fine in Antlr4. (I assume because of the ALL(*) algorithm?)
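
The failure described above can be reproduced with a small hand-rolled recursive descent sketch. This is not Chevrotain's implementation; `makeCursor`, `parseNaive` and `parseFactored` are hypothetical names, and the loop condition in `parseNaive` only models the "whitespace can start an iteration" decision. `parseFactored` shows one possible rewrite of the same language, `WS? (Comma WS? Item WS?)*`, in which every iteration starts with a comma.

```javascript
// Grammar under test: (WS? Comma WS? Item)* WS?
function makeCursor(tokens) {
  let pos = 0;
  return {
    peek: () => tokens[pos],
    consume(type) {
      if (tokens[pos] !== type) {
        throw new Error(`Expected ${type} but found ${tokens[pos] ?? "EOF"}`);
      }
      pos++;
    },
    option(type) {
      if (tokens[pos] === type) pos++;
    },
    atEnd: () => pos === tokens.length,
  };
}

// Naive version: one token decides whether to enter the loop. WS can start
// an iteration, so a trailing WS is mistaken for the start of a new one.
function parseNaive(tokens) {
  const c = makeCursor(tokens);
  while (c.peek() === "WS" || c.peek() === "Comma") {
    c.option("WS");
    c.consume("Comma");
    c.option("WS");
    c.consume("Item");
  }
  c.option("WS");
  return c.atEnd();
}

// Factored version: leading WS is consumed once, trailing WS moves inside
// the iteration, so a single Comma token decides the loop.
function parseFactored(tokens) {
  const c = makeCursor(tokens);
  c.option("WS");
  while (c.peek() === "Comma") {
    c.consume("Comma");
    c.option("WS");
    c.consume("Item");
    c.option("WS");
  }
  return c.atEnd();
}

// Helper for trying an input without crashing the caller.
function fails(parse, tokens) {
  try {
    parse(tokens);
    return false;
  } catch (e) {
    return true;
  }
}
```

With the input `["WS"]`, `parseNaive` throws because it expects a comma after entering the loop, while `parseFactored` accepts it.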

I guess I'm asking: do custom lookaheadStrategys affect all decisions to proceed everywhere in the parser? Or only when entering subrules?

msujew Aug 7, 2023
Collaborator

I guess I'm asking: do custom lookaheadStrategys affect all decisions to proceed everywhere in the parser? Or only when entering subrules?

Chevrotain never performs lookahead when entering subrules, only alternatives (OR) and the different kinds of repetitions. And yes, the strategy completely replaces the existing lookahead behavior.

I believe the rule in question should parse fine in the chevrotain-allstar package, not 100% sure though.

For example, there are definitely grammar specs within CSS which absolutely cannot / do not map to LL(k), but which worked fine in Antlr4 with ALL(*).

Note that in addition to ALL(*), ANTLR4 employs a few additional tricks during parsing, such as direct left recursion. While this generally makes grammars easier to write, it usually has huge ramifications on parser performance. See page 29 (and afterwards) of my master's thesis. ANTLR4

matthew-dean Aug 13, 2023
Author

@msujew

I believe the rule in question should parse fine in the chevrotain-allstar package

Just FYI, just finished rebuilding my parser and it doesn't. If there is an optional token (in this case whitespace) following a $.MANY and that token is also an optional first token within a $.MANY, then Chevrotain will throw an error about tokens expected following the optional token (defined within the MANY). This seems to happen regardless of using chevrotain-allstar or not.

bd82 · 2023-08-08T16:20:59Z

bd82
Aug 8, 2023
Maintainer

@matthew-dean wrote:

My experience with Chevrotain, by contrast, is that you spend a fair amount of time rewriting your grammar, not just to remove bugs or be more accurate, but basically to navigate parser behavior, or to adapt an existing spec to the equivalent parseable structure for Chevrotain.

This observation makes sense for two reasons:

Antlr has a more powerful parsing algorithm.
Chevrotain was originally created as a set of utilities to help implement a hand-crafted recursive descent parser.
So there is a lot less abstraction between the grammar and the implemented parser,
e.g.:
1. Antlr4's automatic refactoring of some left-recursive constructs.
2. Verbose programmatic DSL vs. EBNF.
3. The limitation of having to specify the index (CONSUME1/2/3) in the parsing DSL.

0 replies

bd82 · 2023-08-08T16:28:23Z

bd82
Aug 8, 2023
Maintainer

Would you ever consider adding it (LL(*)) to the Chevrotain Github org as an officially-maintained project? Or adding it to the chevrotain package as an optional import? (If it indeed works as well as it's documented to.)

I don't think it matters where it is located (repo / org).
I am personally not interested in additional maintenance responsibility, in particular for advanced code
that will require me to re-read (and grok) a couple of academic papers once every X years if/when I need to debug the relevant code...

I've tried to give extra visibility to the LL(*) plugin on the main README, and I am not opposed to adding a documentation section on the webpage about it.

if someone wants to write these docs 😄
maybe in the "guides" section?

0 replies

matthew-dean · 2023-08-15T13:47:47Z

matthew-dean
Aug 15, 2023
Author

@bd82 @msujew So, I did some work this last weekend, and I decided to just refactor everything by skipping whitespace (and comments) when tokenizing / parsing.

Originally, I thought this just wouldn't work, because I needed "skipped" tokens in my eventual AST, and needed a few parsing paths to be determined by whitespace, so they couldn't just be thrown out.

However, inspired by this bit in the CSS parser (the logic of which I'm pretty sure is incorrect? but was a good starting point), I decided to save comments and whitespace in their own lexing group, and assign those groups to the parser instance before parsing starts. (I couldn't see another way to reference the lexer output in the parser.)

I think this is simplifying everything incredibly. All of the whitespace-ambiguous parsing is eliminated (such as when to enter a MANY or OPTION or not). And then, in the places where no whitespace/comments are allowed, or SOME whitespace is required, I have a function I can pass to a GATE which looks up the groups to determine if there were any tokens present.
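
A rough sketch of that idea, written without Chevrotain so it runs on its own: `lex` is a hypothetical stand-in lexer that files whitespace and comments into a separate `skipped` list, and `noSepBetween` is a hypothetical gate-style check that looks that list up. The token field names (`image`, `startOffset`, inclusive `endOffset`) are chosen by analogy and are an assumption here.

```javascript
// Stand-in lexer: skipped rules go to their own group instead of the token stream.
function lex(text) {
  const rules = [
    { name: "WS", pattern: /\s+/y, skipped: true },
    { name: "Comment", pattern: /\/\*[\s\S]*?\*\//y, skipped: true },
    { name: "Ident", pattern: /[a-zA-Z_-][\w-]*/y, skipped: false },
    { name: "Punct", pattern: /[^\sa-zA-Z_-]/y, skipped: false },
  ];
  const tokens = [];
  const skipped = [];
  let offset = 0;
  while (offset < text.length) {
    let matched = false;
    for (const rule of rules) {
      rule.pattern.lastIndex = offset; // sticky match exactly at `offset`
      const m = rule.pattern.exec(text);
      if (m === null) continue;
      const tok = {
        type: rule.name,
        image: m[0],
        startOffset: offset,
        endOffset: offset + m[0].length - 1, // inclusive
      };
      (rule.skipped ? skipped : tokens).push(tok);
      offset += m[0].length;
      matched = true;
      break;
    }
    if (!matched) throw new Error(`Unexpected character at offset ${offset}`);
  }
  return { tokens, skipped };
}

// True when no skipped token lies between two neighbouring main tokens.
function noSepBetween(prev, next, skipped) {
  return !skipped.some(
    (s) => s.startOffset > prev.endOffset && s.endOffset < next.startOffset
  );
}

const tight = lex("a:hover");
const spaced = lex("a :hover");
const commented = lex("a/* c */:hover");

const tightOk = noSepBetween(tight.tokens[0], tight.tokens[1], tight.skipped);
const spacedOk = noSepBetween(spaced.tokens[0], spaced.tokens[1], spaced.skipped);
const commentedOk = noSepBetween(commented.tokens[0], commented.tokens[1], commented.skipped);
```

In a real parser the `prev` and `next` arguments would come from the parser's current position; how to obtain them is left out of this sketch.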

The only really awkward bit about this is that the Chevrotain API doesn't seem to have a good construct for the concept of a "required gate".

What I came up with is an OR with a single ALT, like this:

$.OR([
  {
    GATE: noSep,
    ALT: () => {
      $.CONSUME(T.Ident)
    }
  }
])

In other words, if the gate is true, consume the token. If the gate is not true, throw an error.

This differs from a gated OPTION, which is more like: if the gate is true, consume the token. If the gate is not true, no worries, because the token is optional.
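
The difference between the two behaviours can be modelled in a few lines. `requiredGate` and `optionalGate` are hypothetical helpers for illustration only, not Chevrotain APIs.

```javascript
// "Required gate": the action must run, so a false gate is an error.
function requiredGate(gate, action) {
  if (!gate()) {
    throw new Error("Gate failed: no viable alternative");
  }
  return action();
}

// "Gated option": a false gate just means the optional part is skipped.
function optionalGate(gate, action) {
  return gate() ? action() : undefined;
}

const consumed = [];
const consumeIdent = () => consumed.push("Ident");

optionalGate(() => false, consumeIdent); // silently skipped
requiredGate(() => true, consumeIdent); // consumes Ident

let requiredFailed = false;
try {
  requiredGate(() => false, consumeIdent);
} catch (e) {
  requiredFailed = true;
}
```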

Is there a better semantic way to define this though?

3 replies

bd82 Aug 23, 2023
Maintainer

Is there a better semantic way to define this though?

Not sure, maybe something like:

$.OR([
  {
    GATE: noSep,
    ALT: () => {
      $.CONSUME(T.Ident)
    }
  },
  {
    ALT: () => {
      // throw custom error here to be more explicit
      // but I don't think custom errors are supported, so parts of `errors_public.ts` would need to be overridden to allow this
    }
  }
])

Alternative Approach

Do these situations where the whitespace/comments are required or not allowed affect the results of the parsing?

e.g. choice between alternatives or entering an optional?

In other words: would you be able to parse a superset of your grammar where comments and whitespace are completely ignored and only at a later stage produce the "parsing errors"?

If this is possible it would be a superior approach, because you would "downgrade" these whitespace / comments constraints to "linting" errors which you can easily show multiples in an editor for example.
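
As a sketch of what such a later "linting" stage could look like, assuming the parse has already succeeded with whitespace ignored: `lintSpacing` is a hypothetical function that takes pairs of adjacent tokens with a spacing rule and collects every violation instead of stopping at the first one. The token shape (`image`, `startOffset`, inclusive `endOffset`) and the rule names are assumptions made for the example.

```javascript
// Collect all spacing violations from a finished parse.
function lintSpacing(checks) {
  const problems = [];
  for (const { prev, next, rule } of checks) {
    // Number of characters skipped between the two tokens.
    const gap = next.startOffset - prev.endOffset - 1;
    if (rule === "forbidden" && gap > 0) {
      problems.push(`Unexpected whitespace/comment before "${next.image}" at offset ${next.startOffset}`);
    } else if (rule === "required" && gap === 0) {
      problems.push(`Expected whitespace before "${next.image}" at offset ${next.startOffset}`);
    }
  }
  return problems;
}

// Tokens for the text "foo :bar": foo(0-2), ":"(4), bar(5-7).
const foo = { image: "foo", startOffset: 0, endOffset: 2 };
const colon = { image: ":", startOffset: 4, endOffset: 4 };
const bar = { image: "bar", startOffset: 5, endOffset: 7 };

// Tokens for the text "and(": and(0-2), "("(3).
const and = { image: "and", startOffset: 0, endOffset: 2 };
const lparen = { image: "(", startOffset: 3, endOffset: 3 };

const problems = lintSpacing([
  { prev: foo, next: colon, rule: "forbidden" }, // violated: one space
  { prev: colon, next: bar, rule: "forbidden" }, // fine: adjacent
  { prev: and, next: lparen, rule: "required" }, // violated: adjacent
]);
```

Both violations are reported in one pass, which is the property that makes this approach attractive for editor tooling.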

matthew-dean Sep 1, 2023
Author

Do these situations where the whitespace/comments are required or not allowed affect the results of the parsing?

Yes.

In other words: would you be able to parse a superset of your grammar where comments and whitespace are completely ignored and only at a later stage produce the "parsing errors"?

No.

The problem is that CSS (and therefore its derivatives) cannot really have a generalized tokenization stage (with a single-mode lexer).

For example, I know Chevrotain's example CSS parser is not meant to be comprehensive, but it parses things like:

foo : bar
: has(blah) {
  color: red;
}

This is, of course, not valid CSS. A naïve approach would be to follow the CSS spec more closely, and tokenize :bar and :has( as a single token, but then your parser becomes much harder to reason about, because you end up with this:

.foo {
  color:red;  // generalized tokenization will produce `:red` as a pseudo-class token
  border:var(--some-var); // generalized tokenization will produce `:var(` as a pseudo-function token
}

The reason why browsers don't have a problem parsing is because tokenization happens in distinct stages, based on context. A qualified rule will produce pseudo-class tokens while a declaration will not.
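
The mis-tokenization is easy to reproduce with a toy single-mode lexer. The rule set below is hypothetical and far simpler than the CSS spec; it only shows that a context-free `:ident` rule fires inside a declaration as well.

```javascript
// Hypothetical single-mode rules, tried in order (first match wins).
const rules = [
  ["WS", /\s+/y],
  ["PseudoFunction", /:[a-z-]+\(/y],
  ["PseudoClass", /:[a-z-]+/y],
  ["Ident", /-*[a-z][a-z0-9-]*/y],
  ["Punct", /[{};:().]/y],
];

function tokenize(text) {
  const out = [];
  let offset = 0;
  while (offset < text.length) {
    let matched = false;
    for (const [type, pattern] of rules) {
      pattern.lastIndex = offset; // sticky match exactly at `offset`
      const m = pattern.exec(text);
      if (m === null) continue;
      if (type !== "WS") out.push({ type, image: m[0] });
      offset += m[0].length;
      matched = true;
      break;
    }
    if (!matched) throw new Error(`Unexpected character at offset ${offset}`);
  }
  return out;
}

const declTokens = tokenize(".foo { color:red; border:var(--some-var); }");

// The declaration values come out as selector-style tokens.
const pseudoClass = declTokens.find((t) => t.type === "PseudoClass");
const pseudoFunction = declTokens.find((t) => t.type === "PseudoFunction");
```

Here `pseudoClass.image` is `:red` and `pseudoFunction.image` is `:var(`, even though neither is a pseudo-class in this position.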

I thought about using a multi-mode lexer for this in CSS, but I kept feeling like it quickly got extremely complex, because a qualified rule or an at-rule can jump in and out of lexing modes depending on the individual spec for that individual rule or pseudo-function.

The easier-to-reason-about approach is like the Chevrotain example CSS parser, which is to separate colons as distinct tokens. But if you then want to parse accurately, and return errors, then you have to consider whitespace between tokens.

This is why I've gone back and forth through different strategies as I've worked with CSS + Chevrotain. There just isn't a straightforward way to parse a language like CSS (or Less or Sass). But Chevrotain is still great to work with so I'm still trying to find creative solutions. Initially I parsed all whitespace and didn't skip it, but I'm loving the approach I took now where I grouped the "skipped" tokens because they usually don't matter, and then look them up when I need to.

matthew-dean Sep 1, 2023
Author

There just isn't a straightforward way to parse a language like CSS

I should rephrase. One could indeed write a multi-mode lexer for CSS and you could probably circumvent these problems. However, I'm trying to build a complete, spec-accurate parser for CSS that I can extend into Less & Sass, and that makes it much more difficult. Less & Sass unfortunately cannot resolve ambiguities between a declaration and a qualified rule at the lexing stage, so you cannot switch lexing modes: it is not clear which is which until you are parsing with ALT rules and GATEs.
