Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

librustc_lexer: Refactor the module #66015

Merged
merged 9 commits into from
Nov 6, 2019

Conversation

popzxc
Copy link
Contributor

@popzxc popzxc commented Nov 1, 2019

This PR introduces a refactoring of the librustc_lexer in order to improve readability.

All the changes performed are only cosmetic and do not introduce any changes the lexer logic or performance.

Newly introduced modules literal, token and utils are just copy-pasted from the lib.rs and do not contain even cosmetic changes (I decided to do so so it'll be easier to review changes looking only on diff).

r? @petrochenkov

cc @Centril @matklad

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 1, 2019
// string with single quotes).
if self.first() == '\'' {
self.bump();
let kind = Char { terminated: true };
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I'm not sure why we're consuming the literal suffix above, but do not consume here.
As a result, we have a different errors for non-single character single-quoted literals with suffixes depending on the first symbol:
Playground 1 / Playground 2

That's pretty esoteric, I know, but nevertheless it seems a bit inconsistent to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm not sure if that can even be called a bug since the code in example is completely invalid)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we detect error in the char literal, it's better to recover the next token as identifier, rather than treat it as a suffix

Copy link
Member

@matklad matklad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't have strong opinions on any changes suggested here :)

rutsc_lexer is mostly a from-scratch implementation, so the current shape of code is mostly what I'd consider the most readable implementation.

I am also feel slightly uneasy because it's not trivially clear that this doesn't change behavior. I wish we did lexer spec and full-coverage test-suite already :)

src/librustc_lexer/src/cursor.rs Show resolved Hide resolved
src/librustc_lexer/src/utils.rs Outdated Show resolved Hide resolved
src/librustc_lexer/src/lib.rs Outdated Show resolved Hide resolved
src/librustc_lexer/src/lib.rs Outdated Show resolved Hide resolved
// string with single quotes).
if self.first() == '\'' {
self.bump();
let kind = Char { terminated: true };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we detect error in the char literal, it's better to recover the next token as identifier, rather than treat it as a suffix

src/librustc_lexer/src/lib.rs Outdated Show resolved Hide resolved
src/librustc_lexer/src/lib.rs Show resolved Hide resolved
fn float_exponent(&mut self) -> Result<(), ()> {
/// Eats the float exponent. Returns true if at least one digit was met,
/// and returns false otherwise.
fn eat_float_exponent(&mut self) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other eat_x functions have a contract that, if they return false, they don't consume anything.

This function always consumed something, and, if it returns an Err, you must report it, hence this weird owl-result/bool. It definitely could use a comment though :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, do they? For example, eat_decimal_digits will consume _______ and return false.

src/librustc_lexer/src/lib.rs Outdated Show resolved Hide resolved
src/librustc_lexer/src/lib.rs Outdated Show resolved Hide resolved
@petrochenkov
Copy link
Contributor

I don't think this split into multiple files is an improvement.
The lexer was only recently rewritten and similarly to matklad I'm pretty happy about its current state and readability, and the file is far from being large enough for requiring a split

@petrochenkov
Copy link
Contributor

Could you put the code into its old place? Then I'll be able to review the remaining diff.

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 1, 2019
@popzxc
Copy link
Contributor Author

popzxc commented Nov 2, 2019

Sure, I'll bring everything back soon.

@petrochenkov petrochenkov added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Nov 2, 2019
src/librustc_lexer/src/lib.rs Outdated Show resolved Hide resolved
src/librustc_lexer/src/lib.rs Show resolved Hide resolved
@petrochenkov
Copy link
Contributor

+1 on eat_while and first/second.
Could you move them into separate commits so the rest of the PR can be reviewed more easily?

@petrochenkov petrochenkov added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 3, 2019
// Newline without following '\'' means unclosed quote, stop parsing.
'\n' if self.nth_char(1) != '\'' => break,
// End of file, stop parsing.
EOF_CHAR if self.is_eof() => break,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why the order of match arms was changed here.

Copy link
Contributor Author

@popzxc popzxc Nov 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I had two-level motivation here:

  1. I ordered match arms depending on the termination level (first match has return, then go exceptional cases with break, then go char-skipping arms (escaped char and any other char)).
  2. I thought that it's a bit more readable to have the normal exit condition to be the first match arm.

@petrochenkov
Copy link
Contributor

r? @matklad

@rust-highfive rust-highfive assigned matklad and unassigned petrochenkov Nov 3, 2019
@petrochenkov petrochenkov added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Nov 3, 2019
Copy link
Member

@matklad matklad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me with a comment about '0 expanded to mention error recovery, and with 13da2aa squashed into fdf74a3 (such that there's no back and forth with outlining and inlining methods).

Overall, I must say that some changes here are clearly wins, while others seem more like just a different equivalent way to write the same code. In the future, I would advise splitting uncontroversial strict improvements from stylistic changes, such that it becomes easier to access and merge both independently.

false
} else {
// If the first symbol is valid for identifier
// or it's a digit, it can be a lifetime.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't really be a lifetime if second is a digit. Rather, this is a special-cased error recovery.

@@ -682,15 +670,33 @@ impl Cursor<'_> {
if self.eat_decimal_digits() { Ok(()) } else { Err(()) }
}

// Eats the suffix if it's an identifier.
// Eats the suffix of the literal, e.g. "_u8".
fn eat_literal_suffix(&mut self) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this method can be removed now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to have it for readability. It's obvious why we are calling "eat_literal_suffix" after parsing the literal, but it's not that obvious when we'll call "eat_identifier" instead.

}

(n_hashes, started, finished)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure if this is a simplification:

  • started/finished can be const-propagated so that there's no need to keep state in your head
  • mutable predicates are odd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't completely like this too, but IMO it's better than the current approach.
With the current approach instead of having values of "started" and "finished" in the head, you had to remember what exactly "n_hashes, true, false" mean (which at least for me was a bit harder).

Regarding the mutable predicate: that's the cost of not having a eat_at_most_while. However, the scope of this predicate is pretty small and the context is simple, so I don't find it confusing.

Nevertheless I can revert those changes if you insist :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to revert this: I don't claim that the old version was better. That's really just different ways to work-around the bad signature of the method.

@popzxc
Copy link
Contributor Author

popzxc commented Nov 3, 2019

Regarding the splitting changes into several PRs: sure, I've got my lesson, and sorry for putting it all-together this time.
I'll try to do better in the future :)

@matklad
Copy link
Member

matklad commented Nov 4, 2019

@bors r+

Thanks!

@bors
Copy link
Contributor

bors commented Nov 4, 2019

📌 Commit 31735b0 has been approved by matklad

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 4, 2019
Centril added a commit to Centril/rust that referenced this pull request Nov 6, 2019
…matklad

librustc_lexer: Refactor the module

This PR introduces a refactoring of the `librustc_lexer` in order to improve readability.

All the changes performed are only cosmetic and do not introduce any changes the lexer logic or performance.

Newly introduced modules `literal`, `token` and `utils` are just copy-pasted from the `lib.rs` and do not contain even cosmetic changes (I decided to do so so it'll be easier to review changes looking only on diff).

r? @petrochenkov

cc @Centril @matklad
bors added a commit that referenced this pull request Nov 6, 2019
Rollup of 9 pull requests

Successful merges:

 - #65776 (Rename `LocalInternedString` and more)
 - #65973 (caller_location: point to macro invocation sites, like file!/line!, and use in core::panic!.)
 - #66015 (librustc_lexer: Refactor the module)
 - #66062 (Configure LLVM module PIC level)
 - #66086 (bump smallvec to 1.0)
 - #66092 (Use KERN_ARND syscall for random numbers on NetBSD, same as FreeBSD.)
 - #66103 (Add target thumbv7neon-unknown-linux-musleabihf)
 - #66133 (Update the bundled `wasi-libc` repository)
 - #66139 (use American spelling for `pluralize!`)

Failed merges:

r? @ghost
@bors bors merged commit 31735b0 into rust-lang:master Nov 6, 2019
@popzxc popzxc deleted the refactor-librustc_parser branch November 6, 2019 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants