Split up diagnostics in uncommon_codepoints (potentially splitting up the lint as well) #120228

Open
Manishearth opened this issue Jan 22, 2024 · 4 comments
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. A-unicode Area: Unicode E-help-wanted Call for participation: Help is requested to fix this issue. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. L-uncommon_codepoints Lint: uncommon_codepoints

Comments

@Manishearth
Member

Manishearth commented Jan 22, 2024

Currently we have the uncommon_codepoints lint, which lints on anything which is Identifier_Status=Restricted.

It may be worth improving the diagnostics there by splitting them into several specialized diagnostics. In the long run, some of these could be promoted to separate lints so that each can be allowed individually.

The diagnostics I can think of are:

  1. One that calls out confusables with operators and syntax. We already have this for parse errors but not for lints post-parse. Unicode does not provide this data directly, but we can construct it easily from Unicode data.
  2. One that talks about Technical in general
  3. One that talks about Exclusion in general (this is "scripts that are dead")
  4. One that talks about Limited_Use in general (this is "scripts that are alive but not in widespread digital use yet")
  5. One that talks about Not_NFKC in general.

The first one can be implemented by taking the set of Rust syntax characters, expanding that to their confusables set, and then winnowing it down to the set of characters that is allowed in an identifier. This could belong in a separate check in the unicode-security crate.
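As a rough illustration of that three-step construction (syntax characters, then their confusables, then only those legal in identifiers), here is a minimal sketch. Everything in it is an assumption made for the example: the `SAMPLE_CONFUSABLES` table is a hand-picked handful of pairs rather than the real Unicode confusables data, `SYNTAX_CHARS` is a tiny subset of Rust's syntax characters, and `allowed_in_identifier` uses `char::is_alphanumeric` as a stand-in for the XID properties. None of these names exist in the unicode-security crate.

```rust
use std::collections::BTreeSet;

/// Hypothetical, hand-picked (lookalike, syntax character) pairs.
/// A real check would derive this from Unicode's confusables data.
const SAMPLE_CONFUSABLES: &[(char, char)] = &[
    ('\u{01C3}', '!'), // LATIN LETTER RETROFLEX CLICK
    ('\u{A4F8}', '.'), // LISU LETTER TONE MYA TI
    ('\u{A4FD}', ':'), // LISU LETTER TONE MYA JEU
    ('\u{0589}', ':'), // ARMENIAN FULL STOP (punctuation)
];

/// A small, illustrative subset of Rust syntax characters.
const SYNTAX_CHARS: &[char] = &['!', '.', ':', ';', ','];

/// Stand-in for the XID_Start / XID_Continue check (an approximation).
fn allowed_in_identifier(c: char) -> bool {
    c == '_' || c.is_alphanumeric()
}

/// Step 1: start from the syntax characters.
/// Step 2: expand to the characters confusable with them.
/// Step 3: keep only those that could appear in an identifier.
fn syntax_lookalikes() -> BTreeSet<char> {
    SAMPLE_CONFUSABLES
        .iter()
        .filter(|(_, target)| SYNTAX_CHARS.contains(target))
        .map(|(lookalike, _)| *lookalike)
        .filter(|c| allowed_in_identifier(*c))
        .collect()
}

/// Returns the characters of `ident` that look like Rust syntax.
fn find_syntax_lookalikes(ident: &str) -> Vec<char> {
    let set = syntax_lookalikes();
    ident.chars().filter(|c| set.contains(c)).collect()
}
```

Computing the winnowed set once and then scanning identifiers against it keeps the per-identifier cost to a set lookup per character, which is roughly how such a table could be precomputed in a separate crate.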

The others can be implemented by checking the identifier_type() of characters in the ident.
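A minimal sketch of that per-character classification, under stated assumptions: the `IdentifierType` enum and `sample_identifier_type` function below are local stand-ins written for this example, not the unicode-security API, and the character assignments are a hand-picked sample that has not been checked against the Unicode data files (no Technical sample is included).

```rust
use std::collections::BTreeMap;

/// Local stand-in for the identifier types named in this issue.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum IdentifierType {
    Technical,
    Exclusion,
    LimitedUse,
    NotNfkc,
}

/// Hypothetical lookup with hard-coded sample characters.
/// A real implementation would query the Unicode data instead.
fn sample_identifier_type(c: char) -> Option<IdentifierType> {
    match c {
        // MICRO SIGN and LATIN SMALL LIGATURE IJ change under NFKC.
        '\u{00B5}' | '\u{0133}' => Some(IdentifierType::NotNfkc),
        // RUNIC LETTER FEHU, assumed here to belong to an excluded script.
        '\u{16A0}' => Some(IdentifierType::Exclusion),
        // CHEROKEE LETTER A, assumed here to belong to a limited-use script.
        '\u{13A0}' => Some(IdentifierType::LimitedUse),
        _ => None,
    }
}

/// Groups the restricted characters of `ident` by their type so that
/// each group can be reported with its own diagnostic message.
fn group_by_type(ident: &str) -> BTreeMap<IdentifierType, Vec<char>> {
    let mut groups: BTreeMap<IdentifierType, Vec<char>> = BTreeMap::new();
    for c in ident.chars() {
        if let Some(ty) = sample_identifier_type(c) {
            groups.entry(ty).or_default().push(c);
        }
    }
    groups
}
```

Grouping first means an identifier that mixes several kinds of restricted characters could get one message per kind rather than a single generic one.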

I might be able to mentor this; I can provide diagnostic text for these when needed.

@Manishearth Manishearth added A-lint Area: Lints (warnings about flaws in source code) such as unused_mut. A-diagnostics Area: Messages for errors, warnings, and lints E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. E-help-wanted Call for participation: Help is requested to fix this issue. labels Jan 22, 2024
@fmease fmease added the A-unicode Area: Unicode label Jan 22, 2024
@HTGAzureX1212
Contributor

Hello, I'm interested in taking a go at this. Could anyone mentor me on this?

@HTGAzureX1212
Contributor

@rustbot claim

@Manishearth
Member Author

I'm too busy in the coming weeks to fully mentor, but I can answer questions. Please make a thread in the diagnostics channel on rust-lang.zulipchat.com and ask questions there, cc'ing me ("Manish Goregaokar").

@Manishearth
Member Author

I would start by implementing the checks for 2-5 using the existing APIs.

The relevant code is here:

cx.emit_spanned_lint(UNCOMMON_CODEPOINTS, sp, IdentifierUncommonCodepoints);

You'll want that to emit a different lint message based on context.

The lint messages are pulled from a diagnostics type https://github.com/rust-lang/rust/blob/master/compiler/rustc_lint/src/lints.rs#L1111

which links to

lint_identifier_uncommon_codepoints = identifier contains uncommon Unicode codepoints

I think the first change to make would actually be to make this diagnostic type contain a vector of characters, which it prints out as a list. Once we have that done, we should add more versions of it that have different messages, for Technical, Exclusion, etc.
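To make the "vector of characters printed as a list" idea concrete, here is a standalone sketch. It does not use rustc's diagnostic machinery: the `UncommonCodepoints` struct and its `message` method are hypothetical names for this example, and the plural wording is an assumption (only the singular form appears in the rendered output quoted in this thread).

```rust
/// Hypothetical stand-in for the diagnostic type, holding the
/// offending characters instead of carrying no data at all.
struct UncommonCodepoints {
    codepoints: Vec<char>,
}

impl UncommonCodepoints {
    /// Renders the characters as a quoted, comma-separated list.
    fn message(&self) -> String {
        let list = self
            .codepoints
            .iter()
            .map(|c| format!("'{}'", c))
            .collect::<Vec<_>>()
            .join(", ");
        if self.codepoints.len() == 1 {
            format!("identifier contains an uncommon Unicode codepoint: {}", list)
        } else {
            // Plural wording is assumed for illustration.
            format!("identifier contains uncommon Unicode codepoints: {}", list)
        }
    }
}
```

Once the type carries the characters, variants with different wording for Technical, Exclusion, and so on only need to swap the message prefix.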

fmease added a commit to fmease/rust that referenced this issue Jan 23, 2024
…diagnostics-uncommon-codepoints, r=Manishearth

Split Diagnostics for Uncommon Codepoints: Add List to Display Characters Involved

This Pull Request adds a list of the uncommon codepoints involved in the `uncommon_codepoints` lint, as outlined as a first step in rust-lang#120228.

Example rendered diagnostic:
```
error: identifier contains an uncommon Unicode codepoint: 'µ'
  --> $DIR/lint-uncommon-codepoints.rs:3:7
   |
LL | const µ: f64 = 0.000001;
   |       ^
   |
note: the lint level is defined here
  --> $DIR/lint-uncommon-codepoints.rs:1:9
   |
LL | #![deny(uncommon_codepoints)]
   |         ^^^^^^^^^^^^^^^^^^^
```

(Retrying rust-lang#120258.)
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jan 24, 2024
Rollup merge of rust-lang#120259 - HTGAzureX1212:HTGAzureX1212/split-diagnostics-uncommon-codepoints, r=Manishearth

Split Diagnostics for Uncommon Codepoints: Add List to Display Characters Involved

This Pull Request adds a list of the uncommon codepoints involved in the `uncommon_codepoints` lint, as outlined as a first step in rust-lang#120228.

Example rendered diagnostic:
```
error: identifier contains an uncommon Unicode codepoint: 'µ'
  --> $DIR/lint-uncommon-codepoints.rs:3:7
   |
LL | const µ: f64 = 0.000001;
   |       ^
   |
note: the lint level is defined here
  --> $DIR/lint-uncommon-codepoints.rs:1:9
   |
LL | #![deny(uncommon_codepoints)]
   |         ^^^^^^^^^^^^^^^^^^^
```

(Retrying rust-lang#120258.)
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Feb 26, 2024
…e-identifier-types, r=fmease,Manishearth

Split Diagnostics for Uncommon Codepoints: Add Individual Identifier Types

This pull request further modifies the `uncommon_codepoints` lint, adding the individual identifier types of `Technical`, `Not_NFKC`, `Exclusion` and `Limited_Use` to the diagnostic message.

Example rendered diagnostic:
```
error: identifier contains a Unicode codepoint that is not used in normalized strings: 'ĳ'
  --> $DIR/lint-uncommon-codepoints.rs:6:4
   |
LL | fn dĳkstra() {}
   |    ^^^^^^^
   = note: this character is included in the Not_NFKC Unicode general security profile
```

Second step of rust-lang#120228.
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Feb 26, 2024
Rollup merge of rust-lang#120840 - HTGAzureX1212:HTGAzureX1212/unicode-identifier-types, r=fmease,Manishearth

Split Diagnostics for Uncommon Codepoints: Add Individual Identifier Types

This pull request further modifies the `uncommon_codepoints` lint, adding the individual identifier types of `Technical`, `Not_NFKC`, `Exclusion` and `Limited_Use` to the diagnostic message.

Example rendered diagnostic:
```
error: identifier contains a Unicode codepoint that is not used in normalized strings: 'ĳ'
  --> $DIR/lint-uncommon-codepoints.rs:6:4
   |
LL | fn dĳkstra() {}
   |    ^^^^^^^
   = note: this character is included in the Not_NFKC Unicode general security profile
```

Second step of rust-lang#120228.
@jieyouxu jieyouxu added the L-uncommon_codepoints Lint: uncommon_codepoints label May 13, 2024

4 participants