WIP: Move jointness info from TokenStream to Token #75528

Closed

Conversation

@matklad (Member) commented Aug 14, 2020

Part of #63689.

The TL;DR of that issue is that rustc currently represents >> as a single token, while proc_macros represent it as a pair of tokens: (>, Joint), (>, _). We want to move the parser to the proc_macro representation of tokens, with two main motivations: a) not having two different things in the compiler, and b) making the parser's interface more obviously right, to help with extracting the parser into a separate library.
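For reference, a minimal sketch of the proc-macro-side representation (written against the proc-macro2 crate, whose API mirrors proc_macro's here; the function is purely illustrative):

```rust
// `>>` on the proc macro side: two `>` Puncts, the first marked Joint because
// the second follows it with no space in between.
use proc_macro2::{Punct, Spacing, TokenTree};

fn shr_tokens() -> Vec<TokenTree> {
    vec![
        TokenTree::Punct(Punct::new('>', Spacing::Joint)),
        TokenTree::Punct(Punct::new('>', Spacing::Alone)),
    ]
}
```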

Moreover, at the moment rustc actually tracks jointness in two ways in a single data structure. Before this PR, TokenStream stores (TokenTree, IsJoint) pairs, while the TokenTree itself can be a composite or decomposed token. This jointness info is pretty easily lost via this impl. This PR by itself doesn't solve the two-representations problem, but it does help with not accidentally losing jointness. In particular, macros by example now preserve jointness, while previously they erased it.
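A rough sketch of the current situation (illustrative shapes, not rustc's exact definitions):

```rust
// Jointness is tracked twice: in the pair stored by the stream, and in
// whether the token itself is composite.
enum IsJoint { Joint, NonJoint }

enum TokenKind {
    Gt,  // decomposed `>`
    Shr, // composite `>>` (in rustc this is roughly BinOp(Shr))
    // ...
}

// The stream stores (tree, jointness) pairs, so `>>` can appear either as
// [(Shr, NonJoint)] or as [(Gt, Joint), (Gt, NonJoint)], and code that
// rebuilds streams can easily forget to preserve the Joint flag.
struct TokenStream(Vec<(TokenKind, IsJoint)>);
```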

Rebase of #64782

@rust-highfive (Collaborator)

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) Aug 14, 2020
@matklad (Member, Author) commented Aug 14, 2020

r? @petrochenkov

@bors (Contributor) commented Aug 15, 2020

☔ The latest upstream changes (presumably #73851) made this pull request unmergeable. Please resolve the merge conflicts.

@petrochenkov (Contributor)

So, I recently tried to do something like this too (to be able to address pretty-printing regressions in #73345).
Some brain dump of thoughts based on that attempt:

  • Let's call rustc_lexer tokens "tokens-0".
  • token::Token tokens (let's call them "tokens-1") as they exist now are the tokens produced by the lexer; they are always "joint", simply because token::Whitespace exists at this abstraction level.
    A non-joint token is just one that is joint with a following token::Whitespace.
    Note how the jointness flag isn't produced by the lexer; it only appears when we pre-parse the token::Tokens produced by the lexer into token streams, skipping the whitespace tokens (see the sketch after this list).
  • Pre-parsed token streams contain tokens at the next abstraction level ("tokens-2"), except that they are currently called TokenTrees rather than "Token"s; it is for tokens at this level that the jointness property starts making sense, and we should introduce it there.
    It would also be great to keep this property for all tokens-2, not only operators, for pretty-printing in particular. For non-operator tokens it can be dropped at the proc macro boundary.
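A minimal sketch of that pre-parsing step (all names here are illustrative, not rustc's): the Joint flag is computed while the whitespace tokens are being skipped.

```rust
// Illustrative sketch only: spacing is attached by the pre-parser, not by the
// lexer itself.
#[derive(Clone, Copy, PartialEq)]
enum RawToken { Whitespace, Comment, Gt, Eq, Ident /* ... */ }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Spacing { Alone, Joint }

fn pre_parse(raw: &[RawToken]) -> Vec<(RawToken, Spacing)> {
    let mut out = Vec::new();
    for (i, tok) in raw.iter().enumerate() {
        if matches!(tok, RawToken::Whitespace | RawToken::Comment) {
            continue; // dropped here; these never reach the token stream
        }
        // Joint iff the very next raw token is significant, i.e. nothing
        // separates this token from the one after it.
        let spacing = match raw.get(i + 1) {
            Some(RawToken::Whitespace) | Some(RawToken::Comment) | None => Spacing::Alone,
            Some(_) => Spacing::Joint,
        };
        out.push((*tok, spacing));
    }
    out
}
```

With this shape, `>>` written without a space comes out as [(Gt, Joint), (Gt, Alone)], while `> >` comes out as two Alone tokens.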

So, I think, the conclusion is that we should:

  • Drop token::Whitespace and friends, and use rustc_lexer tokens at the abstraction level where they are necessary. These are tokens-0 and jointness doesn't exist at this level.
  • Merge TokenTree into token::Token so that TokenStream becomes a list of Tokens. These are tokens-1 (no more tokens-2) and jointness exists at this level.
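A rough sketch of that target state (again illustrative, not real rustc definitions):

```rust
// One token type, no separate TokenTree: the stream is a flat list of tokens
// that each carry their own spacing, and token::Whitespace no longer exists
// at this level.
enum Spacing { Alone, Joint }

enum Delimiter { Paren, Brace, Bracket, NoDelim }

enum TokenKind {
    OpenDelim(Delimiter),
    CloseDelim(Delimiter),
    Gt,
    Ident,
    // ... only kinds that can actually occur in a token stream
}

struct Token {
    kind: TokenKind,
    spacing: Spacing,
    // span, etc.
}

struct TokenStream(Vec<Token>);
```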

This PR moves things closer to that state, but enters a pretty weird intermediate state where token::Whitespace can be joint or non-joint.
It's ok if we know for sure that the "drop token::Whitespace and friends" step will be performed soon after merging this PR, but I don't have too much confidence in that.
I think it would be preferable to drop tokens that cannot be encountered in token streams (*) from the token::Token enum before doing what this PR does.
(* Except for token::OpenDelim and token::CloseDelim, those may require significant parser modifications.)

@petrochenkov (Contributor)

On jointness-from-the-left vs jointness-from-the-right.

The proc macro API currently uses jointness-from-the-right because it's more convenient for parsers: less need for lookahead.

It's less convenient for lexers, because to obtain jointness for a token we need to lex an arbitrary number of the following (whitespace) tokens.

However, if we are producing jointness flags in a pre-parser (token stream parser) rather than in a lexer, as mentioned in the comment above, then keeping jointness-from-the-right everywhere should be ok, I think.
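A small illustration of the difference (the input and all names here are mine, purely for the example), for the source `+= x`, where `+` and `=` touch and whitespace follows `=`:

```rust
#[derive(Debug, PartialEq)]
enum Spacing { Alone, Joint }

fn main() {
    // From-the-right (what the proc macro API uses): the flag on `+` says
    // "the next token touches me", so gluing `+=` needs no parser lookahead,
    // but the lexer must already know what follows `+` before emitting it.
    let from_the_right = [('+', Spacing::Joint), ('=', Spacing::Alone)];

    // From-the-left: the flag on `=` says "I touch the previous token"; easy
    // for a lexer to emit as it goes, but a parser sitting at `+` must look
    // ahead to `=` to know whether it starts a compound operator.
    let from_the_left = [('+', Spacing::Alone), ('=', Spacing::Joint)];

    assert_ne!(from_the_right, from_the_left);
}
```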

@petrochenkov added the S-waiting-on-author label (Status: This is awaiting some action, such as code changes or more information, from the author) and removed the S-waiting-on-review label Aug 16, 2020
@matklad (Member, Author) commented Aug 17, 2020

Yup, this all makes sense, especially "avoiding unstable intermediate states". I do think I now have the capacity to pour into this work (20 hours per week for months), but it makes sense to take more measured steps, especially in the beginning.

So, I'll look into removing the Whitespace, Comment and Shebang tokens, as they are redundant with the spacing info. Unknown and Eof also look suspicious, but for orthogonal reasons, so I'll leave them alone.

I am less sure about merging TokenStream and Token, but that needs a longer explanation.

Longer explanation:

I think the core idea we have here is that "there are many kinds of tokens in Rust": proc macro tokens != MBE tokens, for example, so some amount of bridging will be required somewhere. Here, I'd like to distill the notion of "parser tokens" -- what the input to the parser looks like. This is an unorthodox viewpoint, but I am not sure that the Rust parser really operates on token trees ( :D ). Today, it uses TokenCursor and maintains a stack of unclosed_delims, which effectively flattens the token trees into a list. Moreover, for IDE purposes, we'd want to reasonably parse code like this:

```rust
fn main() {
    let x = foo(
    let y = ();
}
```

Where "reasonably parse" means "the syntax tree should have ArgList node with just (".

So, long-term, I think the parser should work just with an Iterator<Item = Token>, where Token is a flat thing and () is two tokens. The token-tree representation would be reserved as an explicit input/output format for macros (which it already somewhat is, with the /// desugaring for proc macros and the >> gluing for MBE).
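A minimal sketch of that flat-token setup (names and shapes are mine, not rustc's):

```rust
// Illustrative only: the parser consumes a plain iterator, delimiters are
// ordinary tokens, and an unclosed `(` is whatever is left on the stack.
enum FlatToken {
    OpenDelim(char),
    CloseDelim(char),
    Ident(String),
    // ...
}

fn parse(tokens: impl Iterator<Item = FlatToken>) {
    // Plays the role of rustc's current `unclosed_delims` bookkeeping.
    let mut open_delims: Vec<char> = Vec::new();
    for tok in tokens {
        match tok {
            FlatToken::OpenDelim(d) => open_delims.push(d),
            FlatToken::CloseDelim(_) => {
                open_delims.pop();
            }
            _ => { /* regular parsing of idents, literals, operators, ... */ }
        }
    }
    // Anything left here is an unclosed delimiter: the `foo(` in the snippet
    // above can still produce an ArgList node, just an unfinished one.
    let _unclosed = open_delims;
}
```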

Though, even if the above is a good idea, it might make sense to collapse Token & TokenTree as an intermediate state.

@matklad (Member, Author) commented Aug 17, 2020

@petrochenkov could you clarify whether "Make one more step towards fully token-based expansion" refers only to proc macros, or do we also want to make MBEs work without nonterminals? That is, I am reading this as "we can bend backwards compatibility gently enough to completely remove Nt from everywhere", but I wonder if I am being overly optimistic here :)

@petrochenkov added the S-waiting-on-review label and removed the S-waiting-on-author label Aug 17, 2020
@petrochenkov (Contributor)

> Unknown and Eof also look suspicious

FWIW, Unknown looks like a legitimate token-1 that can be part of a token stream; it's just always unexpected during parsing.

> Moreover, for IDE purposes, we'd want to reasonably parse code like this:

AFAIK, rustc now recovers unbalanced delimiters during pre-parsing aka token stream parsing, not during regular parsing (but I'm not sure).
So the input for the regular rustc parser could in theory consist purely of what is now known as TokenTrees.
Migrating away from the flattened version would require some noticeable work though.
The parser working with flattened tokens currently treats groups with empty delimiters entirely incorrectly, flattening them into nothing (#67062).

> could you clarify whether "Make one more step towards fully token-based expansion" refers only to proc macros, or do we also want to make MBEs work without nonterminals?

It refers to everything, including MBEs.
A Nonterminal in the token-based world is a delimited group* with an empty delimiter; that's how proc macros treat them already (see the sketch below the footnote).
To treat them like this everywhere we need to fix #67062 though.

* With rare exceptions like tt, which is not delimited by design. (Or perhaps ident in the future, which is only "grouped" to keep more spans, #72545 (comment).)
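For illustration, a minimal proc-macro-side sketch (using the proc-macro2 crate, whose API mirrors proc_macro's here; the function is mine):

```rust
// A nonterminal arrives on the proc macro side as an invisible group:
// Delimiter::None groups the already-parsed tokens without printing any
// brackets around them.
use proc_macro2::{Delimiter, Group, TokenStream, TokenTree};

fn as_nonterminal(parsed: TokenStream) -> TokenTree {
    TokenTree::Group(Group::new(Delimiter::None, parsed))
}
```

This is how a matched $expr fragment already reaches a proc macro today; treating nonterminals the same way inside the compiler is what requires fixing #67062.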

@petrochenkov added the S-waiting-on-author label and removed the S-waiting-on-review label Aug 17, 2020
@matklad (Member, Author) commented Aug 19, 2020

Let's close this for now, to keep the set of open PRs smaller. There are several baby yaks to be shaved before this one.

@matklad closed this Aug 19, 2020
@matklad mentioned this pull request Sep 1, 2020