
Ignore non-semantic tokens for 'probably_eq' streams. #55971

Merged (1 commit into rust-lang:master on Nov 19, 2018)

Conversation

SergioBenitez (Contributor):

Improves the situation in #43081 by skipping typically non-semantic tokens when checking for 'probably_eq'.

r? @alexcrichton
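
A rough, self-contained sketch of the filtering idea described above (the real change lives in libsyntax and matches on TokenTree, Token, and DelimToken; every type and function name below is invented for this sketch):

// Simplified stand-ins for the compiler's token types.
#[allow(dead_code)]
#[derive(Clone, Debug, PartialEq)]
enum Delim { Brace, NoDelim }

#[allow(dead_code)]
#[derive(Clone, Debug, PartialEq)]
enum Tok {
    Comma,
    Semi,
    OpenDelim(Delim),
    CloseDelim(Delim),
    Ident(String),
    Other(String),
}

// True if the token should participate in the "probably equal" comparison.
// Commas, semicolons, braces, and invisible (NoDelim) delimiters are the kinds
// of tokens the pretty printer tends to add or drop without changing meaning.
fn is_semantic(tok: &Tok) -> bool {
    !matches!(
        tok,
        Tok::Comma
            | Tok::Semi
            | Tok::OpenDelim(Delim::NoDelim)
            | Tok::CloseDelim(Delim::NoDelim)
            | Tok::OpenDelim(Delim::Brace)
            | Tok::CloseDelim(Delim::Brace)
    )
}

// "Probably equal": compare only the semantic tokens of the two streams.
fn probably_eq(a: &[Tok], b: &[Tok]) -> bool {
    a.iter().filter(|t| is_semantic(t)).eq(b.iter().filter(|t| is_semantic(t)))
}

fn main() {
    // The item as the user wrote it...
    let user = vec![
        Tok::Ident("Foo".into()),
        Tok::OpenDelim(Delim::Brace),
        Tok::Ident("a".into()),
        Tok::Other(":".into()),
        Tok::Ident("usize".into()),
        Tok::CloseDelim(Delim::Brace),
    ];
    // ...versus the same item as re-printed by the pretty printer, which adds
    // a trailing comma before the closing brace.
    let mut pretty = user.clone();
    pretty.insert(5, Tok::Comma);

    assert!(probably_eq(&user, &pretty));
}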

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Nov 15, 2018
@alexcrichton (Member) left a comment:

Thanks for this! I left a few comments, and I'm somewhat wary of this. It's possible for the token trees to differ ever so slightly, in pedantic ways where I don't think this PR is correct, but I also don't think we can actually construct such a situation today. The main use case here is handling things like previously expanded macros, #[cfg] removing items, etc. These are "major changes" to the AST rather than surgical changes, so this is likely to be correct today, but I'm curious to get your take too!

match tree {
| TokenTree::Token(_, Token::Comma)
| TokenTree::Token(_, Token::OpenDelim(DelimToken::NoDelim))
| TokenTree::Token(_, Token::CloseDelim(DelimToken::NoDelim))
alexcrichton (Member):

I'd be slightly worried about ignoring these because couldn't they affect precedence? Take the token streams 1 + (2 * 3) vs (1 + 2) * 3, for example (substituting parens for the "no delimiter" groups). Does this come up in practice for one of the tests below?
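
To make the worry concrete, a toy sketch (plain strings stand in for tokens, with "<" and ">" marking invisible NoDelim group boundaries; nothing here is compiler code) showing that once the invisible delimiters are filtered out, the two differently grouped streams compare equal:

fn main() {
    // Tokens for `1 + <2 * 3>` and `<1 + 2> * 3`.
    let a = ["1", "+", "<", "2", "*", "3", ">"];
    let b = ["<", "1", "+", "2", ">", "*", "3"];

    // Drop the invisible delimiters, as the proposed filter would.
    let strip = |ts: &[&str]| -> Vec<String> {
        ts.iter()
            .filter(|t| **t != "<" && **t != ">")
            .map(|t| t.to_string())
            .collect()
    };

    // After stripping, the two streams look identical even though they group
    // the arithmetic differently.
    assert_eq!(strip(&a), strip(&b));
}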

SergioBenitez (Contributor, Author):

Wouldn't that be a DelimToken::Paren? I didn't filter those as I was worried about the same thing.

| TokenTree::Token(_, Token::OpenDelim(DelimToken::NoDelim))
| TokenTree::Token(_, Token::CloseDelim(DelimToken::NoDelim))
| TokenTree::Token(_, Token::OpenDelim(DelimToken::Brace))
| TokenTree::Token(_, Token::CloseDelim(DelimToken::Brace))
alexcrichton (Member):

This is similar to the one above in that it seems like it may cause weird failures, but did this come up in practice in the tests below?

SergioBenitez (Contributor, Author):

It has come up in practice; all of these filters were motivated by existing reports. I'm not sure, off the top of my head, where I saw this issue in practice, but I'm happy to find it if needed!

let mut t2 = other.trees();
fn semantic_tree(tree: &TokenTree) -> bool {
match tree {
| TokenTree::Token(_, Token::Comma)
alexcrichton (Member):

Could you throw a comment here for why we're ignoring all comma tokens?

One case this may get tripped up on is (x) vs (x,), though :(

SergioBenitez (Contributor, Author):

This appears to be the most common source of span loss in practice (see rwf2/Rocket#811, my own comment in #43081 (comment), and dtolnay/syn#482). In general, the pretty printer inserts trailing commas everywhere and users don't, so all three of the following lose span information:

struct Foo { a: usize }
let x = Foo { a: 10 };
match x { Foo(..) => 10 }
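
A small demonstration of the shape of that mismatch, using the proc-macro2 crate as a stand-in for the compiler-internal stream (assumes proc-macro2 = "1" in Cargo.toml; count_tokens is a helper written only for this sketch):

use proc_macro2::{TokenStream, TokenTree};

// Recursively count every token, descending into delimited groups.
fn count_tokens(ts: TokenStream) -> usize {
    ts.into_iter()
        .map(|tt| match tt {
            TokenTree::Group(g) => 2 + count_tokens(g.stream()), // open + close delimiter
            _ => 1,
        })
        .sum()
}

fn main() {
    // As the user wrote it vs. with the trailing comma a pretty printer might add.
    let user: TokenStream = "Foo { a: 10 }".parse().unwrap();
    let pretty: TokenStream = "Foo { a: 10, }".parse().unwrap();

    // The two streams differ by exactly one comma token, which is enough for a
    // strict token-for-token comparison to conclude the item "changed".
    assert_eq!(count_tokens(user) + 1, count_tokens(pretty));
}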

| TokenTree::Token(_, Token::CloseDelim(DelimToken::NoDelim))
| TokenTree::Token(_, Token::OpenDelim(DelimToken::Brace))
| TokenTree::Token(_, Token::CloseDelim(DelimToken::Brace))
| TokenTree::Token(_, Token::Semi)
alexcrichton (Member):

Like above, could a comment be added here explaining why this isn't necessary to compare?

I'm also a little uneasy about this one, like the commas, because it seems like we're saying that semicolons are optional in Rust, and I feel like there's something that can parse one way or another with and without a semicolon.
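
As a minimal illustration (mine, not from this PR) of how a semicolon can change meaning, the trailing semicolon below turns a tail expression into a statement:

// With the semicolon, `1 + 1;` is a statement and the block evaluates to `()`;
// without it, `1 + 1` is a tail expression and the block evaluates to an i32.
fn unit_value() { 1 + 1; }
fn int_value() -> i32 { 1 + 1 }

fn main() {
    unit_value();
    assert_eq!(int_value(), 2);
}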

SergioBenitez (Contributor, Author):

I agree, but the thing to consider here is whether there's going to be a modification to the AST that adds/removes semicolons without preserving semantics. I'd be surprised if that were the case.

#![feature(proc_macro_diagnostic)]
#![feature(proc_macro_hygiene)]
#![feature(proc_macro_quote)]
#![feature(proc_macro_span)]
alexcrichton (Member):

Would it be possible to write this test with only stable proc macro APIs? I've found the best way to do that is to force tokens to round-trip through a procedural macro while making sure the tokens contain something that produces a compiler warning (like an unused mut); that way we ideally won't have to update this over time.

SergioBenitez (Contributor, Author):

I did try that (round-tripping was my first attempt), but I couldn't get the tests to work as I'd expect. I can try again, however.

@SergioBenitez (Contributor, Author) commented on Nov 15, 2018:

> I'm somewhat wary of this. It's possible for the token trees to differ ever so slightly, in pedantic ways where I don't think this PR is correct, but I also don't think we can actually construct such a situation today.

I'm somewhat wary of this too, but no more so than I am of the entire probably_eq ordeal.

My understanding of the need for probably_eq at all (and please do correct me if I'm wrong in any way!) is that we're trying to detect whether the AST has changed in some way that invalidates the TokenStream that was stored when the original AST was parsed. If this is the case, then it seems that the correct solution to the entire problem is to know where the AST is changing and have that code invalidate the cached TokenStream directly.

Presumably this is much harder than it sounds. I, for one, don't know how to easily find all of the instances that change the AST in a way that matters for this issue, and worse, how to enforce that any future code carries out the appropriate invalidation. (Side note: To get a better understanding, I'd love it if you could point to some example in the source where this happens today.) As such, probably_eq exists to check if the TokenStream has been invalidated by generating a TokenStream from the current AST and comparing it to the stored TokenStream. Because the generated TokenStream isn't lossless, we do a probably_eq instead of simply eq.
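
Schematically, that flow looks something like the following simplified sketch (all of the names here are invented, and strings stand in for real spanned tokens):

// Simplified model of the situation: an item carries both the tokens it
// currently pretty-prints to and, optionally, the exact tokens it was parsed
// from (which, in the real compiler, still carry good spans).
struct Item {
    reprinted_tokens: Vec<String>,      // lossy: regenerated from the AST
    cached_tokens: Option<Vec<String>>, // exact tokens captured at parse time
}

// Stand-in for the comparison this PR relaxes; the real check also skips
// non-semantic tokens such as trailing commas and braces.
fn probably_eq(a: &[String], b: &[String]) -> bool {
    a == b
}

// The decision probably_eq supports: hand proc macros the cached,
// span-accurate tokens unless the AST appears to have changed since parsing,
// in which case fall back to the lossy re-printed stream.
fn tokens_for_proc_macro(item: &Item) -> &[String] {
    match &item.cached_tokens {
        Some(cached) if probably_eq(cached, &item.reprinted_tokens) => cached,
        _ => &item.reprinted_tokens,
    }
}

fn main() {
    let toks = |s: &str| s.split_whitespace().map(String::from).collect::<Vec<_>>();
    let item = Item {
        reprinted_tokens: toks("struct Foo { a : usize , }"), // pretty printer adds a trailing comma
        cached_tokens: Some(toks("struct Foo { a : usize }")),
    };
    // With strict equality the cache is rejected and spans are lost; with the
    // relaxed comparison in this PR, the cached tokens would be kept instead.
    assert_eq!(tokens_for_proc_macro(&item), item.reprinted_tokens.as_slice());
}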

The implementation of probably_eq already doesn't consider some important changes. For instance, the following two are considered equal: 1 + 2 and 4 + 5. This is because the Lit comparison looks only at the variant discriminant and not at the contents. The justification for this is that any invalidating modifications to the AST are unlikely to be of this kind (they're likely to be "major changes"!), or, if they are of this kind, they're likely to preserve semantics. I think the same reasoning applies to ignoring tokens like the comma, {, and }.
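
That discriminant-only comparison can be illustrated in isolation; Lit below is a simplified stand-in rather than the compiler's type:

use std::mem::discriminant;

// Simplified stand-in for a literal token.
#[derive(Debug)]
enum Lit {
    Integer(String),
    Str(String),
}

// Compare only which variant it is, not the contents: Integer("1") and
// Integer("4") are "probably equal", but Integer("1") and Str("1") are not.
fn lit_probably_eq(a: &Lit, b: &Lit) -> bool {
    discriminant(a) == discriminant(b)
}

fn main() {
    assert!(lit_probably_eq(&Lit::Integer("1".into()), &Lit::Integer("4".into())));
    assert!(!lit_probably_eq(&Lit::Integer("1".into()), &Lit::Str("1".into())));
}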

Somewhat tangentially: it seems that in an ideal world, we'd never want to return a TokenStream that was created from printing an AST node. Is that correct? In other words, there should never be a case where a TokenStream is "invalidated". Instead, it should be replaced with the "updated" TokenStream. Is that the right way to think about this, or am I misunderstanding something fundamentally?

@alexcrichton (Member):

You're spot on with all the descriptions/comments here, and I had also forgotten this was "so bad" as to consider 1 + 2 equivalent to 4 + 5! Since we do that, I'd actually be comfortable landing everything here!

> Presumably this is much harder than it sounds.

Yeah :(. Fundamentally none of this should be necessary. We should implement #43081 instead, basically implementing something like ToToken for all AST types. Since #43081 is such a significant undertaking, though, it hasn't been done yet.

Since that's not done, the next best thing is to simply cache what we parse and reuse it. Naturally, though, mutation of the AST can happen practically anywhere, so there's no real feasible way for us to track where it's all happening (especially over time as we continue to modify the compiler).

> To get a better understanding, I'd love it if you could point to some example in the source where this happens today

The only one that I can remember (and the main motivation for probably_eq to exist in the first place) was #[cfg] processing. During #[cfg] processing we can remove, for example, functions from an impl block or fields of a struct. The cached token streams, however, aren't updated to reflect this.

I can't think of any other locations that we modify the AST pre-macro-expansion, but I vaguely recall there being at least one more...

Note that as a side effect of this you're guaranteed to get bad spans for this:

#[derive(Deserialize)]
struct Foo {
    #[cfg(something)]
    a: i32,
    #[serde(malformed attribute)]
    b: u32,
}

Ok so to land this PR, I would ideally like to see two things:

  • First, a comment above each case in semantic_tree about why the token is being ignored. For example, the comma definitely makes sense for trailing commas and whatnot.
  • Second, update the test to only use stable proc_macro APIs.

For the second part I think you should be able to get away with this:

// src/lib.rs
#![crate_type = "proc-macro"]

extern crate proc_macro;

use proc_macro::TokenStream;

#[proc_macro_attribute]
pub fn foo(_attr: TokenStream, item: TokenStream) -> TokenStream {
    item.into_iter().collect()
}

and with

extern crate foo;

#[foo::foo]
#[allow(dead_code)]
fn main() {
    let warn_about_me = 3;

    struct A { a: i32 }
}

as the test. That (for me, today) produces a crazy-looking warning with no span. You should be able to just keep throwing syntax into fn main() { ... } that today causes spans to get removed, and for all the examples added here the spans should be preserved.

How's that all sound?

@SergioBenitez (Contributor, Author):

> How's that all sound?

Perfect! Thank you for expanding on the issue. Will tackle this tonight.

@SergioBenitez (Contributor, Author):

@alexcrichton Done.

@alexcrichton (Member):

@bors: r+

@bors (Contributor) commented on Nov 17, 2018:

📌 Commit 78eb516 has been approved by alexcrichton

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests) and removed the S-waiting-on-review label on Nov 17, 2018
@SergioBenitez (Contributor, Author) commented on Nov 19, 2018:

Any chance we could get this in a roll-up or bump the priority? This is a blocker for Rocket's 0.4 release.

@petrochenkov (Contributor):

@bors p=1

@bors (Contributor) commented on Nov 19, 2018:

⌛ Testing commit 78eb516 with merge 5aff307...

bors added a commit that referenced this pull request Nov 19, 2018
Ignore non-semantic tokens for 'probably_eq' streams.

Improves the situation in #43081 by skipping typically non-semantic tokens when checking for 'probably_eq'.

r? @alexcrichton
@bors (Contributor) commented on Nov 19, 2018:

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing 5aff307 to master...

@bors merged commit 78eb516 into rust-lang:master on Nov 19, 2018
kennytm added a commit to kennytm/rust that referenced this pull request Nov 23, 2018
Add unstable Literal::subspan().

Take 2 of rust-lang#55971. Still ~wrong, but now with a comment! (and less of a surface) Unblocks rust-lang#49219.

r? @alexcrichton