WIP: Move jointness info from TokenStream to Token #75528

Closed

Conversation

@matklad (Member) commented Aug 14, 2020

Part of #63689.

The TL;DR of that issue is that rustc currently represents >> as a single token, while proc_macros represent it as a pair of tokens: (>, Joint), (>, _). We want to move the parser to the proc_macro representation of tokens, with two main motivations: a) not having two different things in the compiler, and b) making the parser's interface more obviously right, to help with extracting the parser into a separate library.
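For reference, a minimal sketch of the proc-macro-side representation (written against the proc-macro2 crate, whose API mirrors proc_macro's here; the function is purely illustrative):

```rust
// `>>` on the proc macro side: two `>` Puncts, the first marked Joint because
// the second follows it with no space in between.
use proc_macro2::{Punct, Spacing, TokenTree};

fn shr_tokens() -> Vec<TokenTree> {
    vec![
        TokenTree::Punct(Punct::new('>', Spacing::Joint)),
        TokenTree::Punct(Punct::new('>', Spacing::Alone)),
    ]
}
```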

Moreover, at the moment rustc actually tracks jointness in two ways in a single data structure. Before this PR, TokenStream stores (TokenTree, IsJoint) pairs, while the TokenTree itself can be a composite or decomposed token. This jointness info is pretty easily lost via this impl. This PR by itself doesn't solve the two-representations problem, but it does help with not accidentally losing jointness. In particular, macros by example now preserve jointness, while previously they erased it.
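A rough sketch of the current situation (illustrative shapes, not rustc's exact definitions):

```rust
// Jointness is tracked twice: in the pair stored by the stream, and in
// whether the token itself is composite.
enum IsJoint { Joint, NonJoint }

enum TokenKind {
    Gt,  // decomposed `>`
    Shr, // composite `>>` (in rustc this is roughly BinOp(Shr))
    // ...
}

// The stream stores (tree, jointness) pairs, so `>>` can appear either as
// [(Shr, NonJoint)] or as [(Gt, Joint), (Gt, NonJoint)], and code that
// rebuilds streams can easily forget to preserve the Joint flag.
struct TokenStream(Vec<(TokenKind, IsJoint)>);
```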

Rebase of #64782

@rust-highfive (Collaborator)

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) Aug 14, 2020
@matklad (Member, Author) commented Aug 14, 2020

r? @petrochenkov

@bors (Contributor) commented Aug 15, 2020

☔ The latest upstream changes (presumably #73851) made this pull request unmergeable. Please resolve the merge conflicts.

@petrochenkov (Contributor)

So, I recently tried to do something like this too (to be able to address pretty-printing regressions in #73345).
Some brain dump of thoughts based on that attempt:

  • Let's call rustc_lexer tokens "tokens-0".
  • token::Token tokens (let's call them "tokens-1") as they exist now are the tokens produced by the lexer; they are always "joint", simply because token::Whitespace exists at this abstraction level.
    A non-joint token is just one that is joint with a following token::Whitespace.
    Note how the jointness flag isn't produced by the lexer; it only appears when we pre-parse the token::Tokens produced by the lexer into token streams, skipping the whitespace tokens (see the sketch after this list).
  • Pre-parsed token streams contain tokens at the next abstraction level ("tokens-2"), except that they are currently called TokenTrees rather than "Token"s; it is for tokens at this level that the jointness property starts making sense, and we should introduce it there.
    It would also be great to keep this property for all tokens-2, not only operators, for pretty-printing in particular. For non-operator tokens it can be dropped at the proc macro boundary.
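A minimal sketch of that pre-parsing step (all names here are illustrative, not rustc's): the Joint flag is computed while the whitespace tokens are being skipped.

```rust
// Illustrative sketch only: spacing is attached by the pre-parser, not by the
// lexer itself.
#[derive(Clone, Copy, PartialEq)]
enum RawToken { Whitespace, Comment, Gt, Eq, Ident /* ... */ }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Spacing { Alone, Joint }

fn pre_parse(raw: &[RawToken]) -> Vec<(RawToken, Spacing)> {
    let mut out = Vec::new();
    for (i, tok) in raw.iter().enumerate() {
        if matches!(tok, RawToken::Whitespace | RawToken::Comment) {
            continue; // dropped here; these never reach the token stream
        }
        // Joint iff the very next raw token is significant, i.e. nothing
        // separates this token from the one after it.
        let spacing = match raw.get(i + 1) {
            Some(RawToken::Whitespace) | Some(RawToken::Comment) | None => Spacing::Alone,
            Some(_) => Spacing::Joint,
        };
        out.push((*tok, spacing));
    }
    out
}
```

With this shape, `>>` written without a space comes out as [(Gt, Joint), (Gt, Alone)], while `> >` comes out as two Alone tokens.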

So, I think, the conclusion is that we should:

  • Drop token::Whitespace and friends, and use rustc_lexer tokens at the abstraction level where they are necessary. These are tokens-0 and jointness doesn't exist at this level.
  • Merge TokenTree into token::Token so that TokenStream becomes a list of Tokens. These are tokens-1 (no more tokens-2) and jointness exists at this level.
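A rough sketch of that target state (again illustrative, not real rustc definitions):

```rust
// One token type, no separate TokenTree: the stream is a flat list of tokens
// that each carry their own spacing, and token::Whitespace no longer exists
// at this level.
enum Spacing { Alone, Joint }

enum Delimiter { Paren, Brace, Bracket, NoDelim }

enum TokenKind {
    OpenDelim(Delimiter),
    CloseDelim(Delimiter),
    Gt,
    Ident,
    // ... only kinds that can actually occur in a token stream
}

struct Token {
    kind: TokenKind,
    spacing: Spacing,
    // span, etc.
}

struct TokenStream(Vec<Token>);
```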

This PR moves things closer to that state, but enters a pretty weird intermediate state where token::Whitespace can be joint or non-joint.
It's ok if we know for sure that the "drop token::Whitespace and friends" step will be performed soon after merging this PR, but I don't have too much confidence in that.
I think it would be preferable to drop tokens that cannot be encountered in token streams (*) from the token::Token enum before doing what this PR does.
(* Except for token::OpenDelim and token::CloseDelim, those may require significant parser modifications.)

@petrochenkov (Contributor)

On jointness-from-the-left vs jointness-from-the-right.

The proc macro API currently uses jointness-from-the-right because it's more convenient for parsers: less need for lookahead.

It's less convenient for lexers, because to obtain jointness for a token we need to lex an arbitrary number of the following (whitespace) tokens.

However, if we are producing jointness flags in a pre-parser (token stream parser) rather than in a lexer, as mentioned in the comment above, then keeping jointness-from-the-right everywhere should be ok, I think.
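A small illustration of the difference (the input and all names here are mine, purely for the example), for the source `+= x`, where `+` and `=` touch and whitespace follows `=`:

```rust
#[derive(Debug, PartialEq)]
enum Spacing { Alone, Joint }

fn main() {
    // From-the-right (what the proc macro API uses): the flag on `+` says
    // "the next token touches me", so gluing `+=` needs no parser lookahead,
    // but the lexer must already know what follows `+` before emitting it.
    let from_the_right = [('+', Spacing::Joint), ('=', Spacing::Alone)];

    // From-the-left: the flag on `=` says "I touch the previous token"; easy
    // for a lexer to emit as it goes, but a parser sitting at `+` must look
    // ahead to `=` to know whether it starts a compound operator.
    let from_the_left = [('+', Spacing::Alone), ('=', Spacing::Joint)];

    assert_ne!(from_the_right, from_the_left);
}
```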

@petrochenkov added the S-waiting-on-author label (Status: This is awaiting some action, such as code changes or more information, from the author) and removed the S-waiting-on-review label Aug 16, 2020
@matklad (Member, Author) commented Aug 17, 2020

Yup, this all makes sense, especially "avoiding unstable intermediate states". I do think I now have the capacity to pour into this work (20 hours per week for months), but it makes sense to take more measured steps, especially in the beginning.

So, I'll look into removing the Whitespace, Comment and Shebang tokens, as they are redundant with the spacing info. Unknown and Eof also look suspicious, but for orthogonal reasons, so I'll leave them alone.

I am less sure about merging TokenStream and Token, but that needs a longer explanation.

Longer explanation:

I think the core idea we have here is that "there are many kinds of tokens in Rust": proc macro tokens != MBE tokens, for example, so some amount of bridging will be required somewhere. Here, I'd like to distill the notion of "parser tokens" -- what the input to the parser looks like. This is an unorthodox viewpoint, but I am not sure that the Rust parser really operates on token trees ( :D ). Today, it uses TokenCursor and maintains a stack of unclosed_delims, which effectively flattens the token trees into a list. Moreover, for IDE purposes, we'd want to reasonably parse code like this:

```rust
fn main() {
    let x = foo(
    let y = ();
}
```

Where "reasonably parse" means "the syntax tree should have ArgList node with just (".

So, long-term, I think the parser should work just with an Iterator<Item = Token>, where Token is a flat thing and () is two tokens. The token-tree representation would be reserved as an explicit input/output format for macros (which it already somewhat is, with the /// desugaring for proc macros and the >> gluing for MBE).
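A minimal sketch of that flat-token setup (names and shapes are mine, not rustc's):

```rust
// Illustrative only: the parser consumes a plain iterator, delimiters are
// ordinary tokens, and an unclosed `(` is whatever is left on the stack.
enum FlatToken {
    OpenDelim(char),
    CloseDelim(char),
    Ident(String),
    // ...
}

fn parse(tokens: impl Iterator<Item = FlatToken>) {
    // Plays the role of rustc's current `unclosed_delims` bookkeeping.
    let mut open_delims: Vec<char> = Vec::new();
    for tok in tokens {
        match tok {
            FlatToken::OpenDelim(d) => open_delims.push(d),
            FlatToken::CloseDelim(_) => {
                open_delims.pop();
            }
            _ => { /* regular parsing of idents, literals, operators, ... */ }
        }
    }
    // Anything left here is an unclosed delimiter: the `foo(` in the snippet
    // above can still produce an ArgList node, just an unfinished one.
    let _unclosed = open_delims;
}
```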

Though, even if the above is a good idea, it might make sense to collapse Token & TokenTree as an intermediate state.

@matklad (Member, Author) commented Aug 17, 2020

@petrochenkov could you clarify whether "Make one more step towards fully token-based expansion" refers only to proc macros, or do we also want to make MBEs work without nonterminals? That is, I am reading this as "we can bend backwards compatibility gently enough to completely remove Nt from everywhere", but I wonder if I am being overly optimistic here :)

@petrochenkov added the S-waiting-on-review label and removed the S-waiting-on-author label Aug 17, 2020
@petrochenkov (Contributor)

> Unknown and Eof also look suspicious

FWIW, Unknown looks like a legitimate token-1 that can be part of a token stream; it's just always unexpected during parsing.

> Moreover, for IDE purposes, we'd want to reasonably parse code like this:

AFAIK, rustc now recovers unbalanced delimiters during pre-parsing aka token stream parsing, not during regular parsing (but I'm not sure).
So the input for the regular rustc parser could in theory consist purely of what is now known as TokenTrees.
Migrating away from the flattened version would require some noticeable work though.
The parser working with flattened tokens currently treats groups with empty delimiters entirely incorrectly, flattening them into nothing (#67062).

> could you clarify whether "Make one more step towards fully token-based expansion" refers only to proc macros, or do we also want to make MBEs work without nonterminals?

It refers to everything, including MBEs.
A Nonterminal in the token-based world is a delimited group* with an empty delimiter; that's how proc macros treat them already (see the sketch below the footnote).
To treat them like this everywhere we need to fix #67062 though.

* With rare exceptions like tt, which is not delimited by design. (Or perhaps ident in the future, which is only "grouped" to keep more spans, #72545 (comment).)
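For illustration, a minimal proc-macro-side sketch (using the proc-macro2 crate, whose API mirrors proc_macro's here; the function is mine):

```rust
// A nonterminal arrives on the proc macro side as an invisible group:
// Delimiter::None groups the already-parsed tokens without printing any
// brackets around them.
use proc_macro2::{Delimiter, Group, TokenStream, TokenTree};

fn as_nonterminal(parsed: TokenStream) -> TokenTree {
    TokenTree::Group(Group::new(Delimiter::None, parsed))
}
```

This is how a matched $expr fragment already reaches a proc macro today; treating nonterminals the same way inside the compiler is what requires fixing #67062.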

@petrochenkov added the S-waiting-on-author label and removed the S-waiting-on-review label Aug 17, 2020
@matklad (Member, Author) commented Aug 19, 2020

Let's close this for now, to keep the set of open PRs smaller. There are several baby yaks to be shaved before this one.

@matklad closed this Aug 19, 2020
@matklad mentioned this pull request Sep 1, 2020