Is there a way to not consider tokens for OR / maxLookahead? #1972
-
One of the things that I still struggle with is whitespace-sensitive parsing. Many / most / (all?) of the Chevrotain examples in the docs use skipped whitespace. For a whitespace-sensitive language, when determining matching paths, it would be great if whitespace tokens were ignored for the purpose of resolving OR alts, even if they were explicitly CONSUMEd in the actual rule. I assume this would involve creating a custom
-
@matthew-dean wrote:
This observation makes sense for two reasons:
-
I don't think it matters where it is located (repo / org). I've tried to give extra visibility to the LL(*) plugin on the main README, and I am not opposed to adding a documentation section on the webpage about it.
-
@bd82 @msujew So, I did some work this last weekend, and I decided to just refactor everything by skipping whitespace (and comments) when tokenizing / parsing. Originally, I thought this just wouldn't work, because I needed "skipped" tokens in my eventual AST, and needed a few parsing paths to be determined by whitespace, so they couldn't just be thrown out.

However, inspired by this bit in the CSS parser (the logic of which I'm pretty sure is incorrect? but was a good starting point), I decided to save comments and whitespace in their own lexing groups, and assign those groups to the parser instance before parsing starts. (I couldn't see another way to reference the lexer output in the parser.)

I think this is simplifying everything incredibly. All of the whitespace-ambiguous parsing is eliminated (such as when to enter a [...]).

The only really awkward bit about this is that the Chevrotain API doesn't seem to have a good construct for the concept of a "required gate". What I came up with is an OR with a single gated alternative:

    $.OR([
      {
        GATE: noSep,
        ALT: () => {
          $.CONSUME(T.Ident)
        }
      }
    ])

In other words, if the gate is true, consume the token. If the gate is not true, throw an error. This differs from a gated [...]

Is there a better semantic way to define this though?
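To make the difference between an optional gate and a "required gate" concrete, here is a minimal toy sketch that does not use Chevrotain at all. `ToyParser`, `optionalIf`, and `requireIf` are hypothetical names, and the `noSep` shown here is only an assumed implementation (comparing token offsets to detect skipped whitespace), not the one from the post.

```typescript
// Toy token: a type name plus start/end offsets in the source text.
interface Tok {
  type: string;
  start: number;
  end: number;
}

class ToyParser {
  private tokens: Tok[];
  private pos = 0;

  constructor(tokens: Tok[]) {
    this.tokens = tokens;
  }

  // Assumed definition: true when the next token starts exactly where the
  // previous one ended, so nothing was skipped between them.
  noSep(): boolean {
    const prev = this.tokens[this.pos - 1];
    const next = this.tokens[this.pos];
    return prev !== undefined && next !== undefined && prev.end === next.start;
  }

  consume(type: string): Tok {
    const tok = this.tokens[this.pos];
    if (tok === undefined || tok.type !== type) {
      throw new Error(`expected ${type}`);
    }
    this.pos++;
    return tok;
  }

  // Optional gate: when the gate is false, the body is silently skipped.
  optionalIf(gate: () => boolean, body: () => void): void {
    if (gate()) body();
  }

  // "Required gate": when the gate is false, parsing fails with an error.
  requireIf(gate: () => boolean, message: string, body: () => void): void {
    if (!gate()) throw new Error(message);
    body();
  }
}
```

With tokens for `.b`, `requireIf(() => p.noSep(), ...)` consumes the identifier; with tokens for `. b` it throws, whereas `optionalIf` would simply move on.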
Hello @matthew-dean
Chevrotain by default uses LL(k) lookahead, meaning it searches at most a fixed number of tokens ahead.
This would conflict with ignoring an arbitrary number of whitespace tokens in the alternatives.
While you could implement a custom lookahead strategy that only takes the non-whitespace tokens into account, there may be a more generic solution: the existing LL(*) lookahead plugin for Chevrotain.
LL(*) should be able to "check" an arbitrary number of tokens ahead to distinguish between alternatives, even when the lookahead paths include optional sequences of whitespace, e.g. (AB vs AC):
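As a rough illustration of the AB vs AC case, the toy sketch below compares a fixed lookahead window with an unbounded scan. It does not use Chevrotain or the LL(*) plugin and is not how either is implemented; the token names and the functions `chooseWithFixedK` and `chooseUnbounded` are hypothetical.

```typescript
// Toy grammar with two alternatives: alt1 = A WS* B, alt2 = A WS* C.
type Tok = "A" | "WS" | "B" | "C";
type Choice = "AB" | "AC" | "undecided";

// Fixed lookahead: inspect at most k tokens, as an LL(k) parser would.
function chooseWithFixedK(tokens: Tok[], k: number): Choice {
  const ahead = tokens.slice(0, k);
  if (ahead[0] !== "A") return "undecided";
  for (const t of ahead.slice(1)) {
    if (t === "B") return "AB";
    if (t === "C") return "AC";
    if (t !== "WS") return "undecided";
  }
  // The window ended while still inside the run of whitespace tokens.
  return "undecided";
}

// Unbounded lookahead: keep scanning until the alternatives diverge.
function chooseUnbounded(tokens: Tok[]): Choice {
  return chooseWithFixedK(tokens, tokens.length);
}

const input: Tok[] = ["A", "WS", "WS", "WS", "C"];
const fixed = chooseWithFixedK(input, 3);
const unbounded = chooseUnbounded(input);
```

Here `fixed` is `"undecided"` because three tokens of lookahead never reach past the whitespace, while `unbounded` is `"AC"`. Raising k only moves the problem, since the whitespace run can always be one token longer.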