Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the edition specifier to be an integer #12301

Closed
wants to merge 1 commit into from

Conversation

est31
Copy link
Member

@est31 est31 commented Jun 22, 2023

This PR allows for edition specifiers to be toml integers:

[package]
name = "foo"
version = "0.1.0"
edition = 2018

So far, each edition was fully described by its year number: 2015, 2018, 2021. Also for the future, the intent is that editions will always just be year numbers. Therefore, the natural choice for the type is an integer, instead of the current requirement for toml strings. Using an integer saves two characters in each Cargo.toml. Over the roughly hundred thousand Cargo.toml files on crates.io alone, you'll get some nice savings. Strings are still allowed to support legacy use cases and for the (unlikely) case of non-integral editions in the future.

I've proposed this 5 years ago before the 2018 edition even shipped, and received positive feedback from some people (#5617). I ended up closing it because of unrelated reasons. Also, back then it was less clear whether one would have sth like 2018-beta or such which would have required a string, which eventually didn't end up being a thing (but even then, one could make the case that one could still allow integers for the finally released 2018). With the cargo-script support (cc #12207), I'd say that interest is large again in minimizing the number of characters/keystrokes needed, so I'm bringing back this old proposal.

I feel it's a very small feature in scope so one doesn't need a feature gate to evaluate its impact, but feel free to suggest gating.

@rustbot
Copy link
Collaborator

rustbot commented Jun 22, 2023

r? @weihanglo

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added A-manifest Area: Cargo.toml issues S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 22, 2023
@est31 est31 force-pushed the integral_editions branch 3 times, most recently from 2547237 to 3991892 Compare June 22, 2023 12:30
@@ -979,6 +979,22 @@ impl StringOrVec {
}
}

#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
#[serde(untagged, expecting = "expected a string or an integer")]
pub enum StringOrI64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other areas I suspect we should update if we move forward with this

  • cargo new
  • embedded manifests
  • documentation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epage We shouldn't update cargo new until 2024, to avoid MSRV implications.

Copy link
Member Author

@est31 est31 Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has been some discussion on this in #12301 (review) and #12301 (comment) . I agree that cargo new should be updated only well into the 2023 edition, or even later, not just for MSRV but also for better error messages from cargos that don't support the 2023 edition.

@epage
Copy link
Contributor

epage commented Jun 22, 2023

Previous thoughts on this

steveklabnik (#5617 (comment))

This adds 87 lines of code to save two characters. (Okay, some of it is whitespace, and some of it is tests, but the point is, you wouldn't need those tests without this.)

Is this actually worth it? I'm not convinced.

withoutboats (#5617 (comment))

I prefer not to support integers in this position for consistency. We don't support integers or floating points as version constraints, even though that would have the same motivation. These fields are strings.

matklad (#5617 (comment))

To clarify why I think this is a good idea:

Currently, edition numbers are integers. I think we might introduce something like 2018-pre, or even 2018.1, but the former seems niche, and the latter unlikely. So, editon = 2018 optimizes for the common case. I think the main point here is not saving two characters, but avoiding "2018", which just looks weird: why represent a number as a string if you can represent it as a number?

Note that we already support mixed typed in other places in Cargo.toml, for example, opt-level can be either of 1, "2", "s".

As for consistency with versions/version requirements I agree with @est31: we conceptually use SemVer for package versions and integers for edition versions, so, to me, it doesn't seem that these are the same thing.

withoutboats (#5617 (comment))

I think its fairly likely we will someday have an edition which we want to give a non-integral name. Early versions of the preview proposal used the syntax edition = "2018-preview" instead of turning on a lint, and I could see us using that in a future edition preview. And who knows what other names we will use as the concept evolves over the history of Rust.

Unless the type can encode all forms, I'd rather not support it. If I agreed that editions would always be an integral number, I'd be for this (well, really, I'd question the choice to support strings in that case). But I think editions can be named arbitrary strings.

The biggest problem for me is for a user who is new to TOML, and the way this could confuse them into thinking TOML supports bare words in value position (since it does support them in key position). Even more generally, I'd rather keep our use of TOML as "simply typed" as possible and avoid supporting multiple types without strong motivation (the support for both strings and objects for dependencies members is an example that pulls its weight).

joshtriplett (#5617 (comment))

Personally, I'm entirely in favor of this. Current editions use numbers, future editions will likely use numbers, and this makes Cargo accept numbers so people don't have to give advice like "just put edition equals 2018 in your Cargo.toml, but remember to quote the 2018". (Which then leads to the obvious question of "why?".)

@epage
Copy link
Contributor

epage commented Jun 22, 2023

So reception among the cargo team 5 years ago was mixed.

With the cargo-script support (cc #12207), I'd say that interest is large again in minimizing the number of characters needed, so I'm bringing back this old proposal.

imo a lot of the conversations around "cargo script" are measuring the wrong metric, number of characters. I think a better metric is number of unique keys touched. This is locale-centric though.

Within that framework, quotes are actually more expensive (<shift> + ', even more when you consider 's distance to home row). imo this is a reason I prefer doc comment syntax over macro syntax as macros are very disjoint to type.

Still I think the savings for this change are not significant compared to the content being entered [package]\nedition =

However, I am also sympathetic to reason joshtriplett had 5 years ago for being in favor: easier to communicate the field's value verbally.

@epage
Copy link
Contributor

epage commented Jun 22, 2023

An interesting counter point to this is we later added resolver and made that a string, rather than an integer. I would be interested to know more background on that to see if it affects this discussion.

Copy link
Member

@weihanglo weihanglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some counterpoints.

  • With the current implementation, this is allowed:
    [package]
    name = "foo"
    version = "0.1.0"
    edition = 0x7e5
  • Changing any existing syntax in Cargo.toml could lead to an MSRV bump without notice.
    (dep: feature for example)
  • A bit confusing that with the same set of sources, an old compiler can compile a "2015" package but not a 2015 package.

@epage
Copy link
Contributor

epage commented Jun 22, 2023

That is true that the edition is an identifier, and not a number, and so other numerical representations are invalid. However, I'm unsure if that would matter much in practice.

And yes, this would force an MSRV bump. However, I think its more for the future than for the past.

That doesn't mean I disagree with your points but just providing more considerations

@est31
Copy link
Member Author

est31 commented Jun 22, 2023

With the current implementation, this is allowed: edition = 0x7e5

Good point. But also agree with @epage 's point that this might not matter in practice. I think implementation wise, thanks to cargo using toml_edit it should be possible to easily disallow these cases.

Changing any existing syntax in Cargo.toml could lead to an MSRV bump without notice.

I agree that this is a problem. I would thus recommend waiting with the changing of the docs/cargo new.

This PR could document that specifying an integral edition specifier is possible, but not, say, change cargo new, so newly created projects don't immediately bump into this. Even once edition 2024 arrives, the error that the cargo that is currently on master now gives isn't as nice as if it were edition = "2024". So ideally, it would be phased in even more slowly for cargo new.

@est31
Copy link
Member Author

est31 commented Jun 22, 2023

Previous thoughts on this

I have replied to some of these points in the thread back when they were made. The 2018-pre point I've outlined and also addressed in this PR's description, as well as the point about how this patch is adding a lot of lines to save a few characters.

I think even back then that would have been a worthy tradeoff as there is potentially an extremely large number of crates. In fact, since that comment, Rust has grown extremely and there are hundreds of thousands of crates on crates.io. That's 200 kb savings potential if only the currently available crates switched to the model :). But in general I feel that tools like cargo or rustc should not avoid adding to their codebase if it even saves one character or just makes one error message better.

imo a lot of the conversations around "cargo script" are measuring the wrong metric, number of characters. I think a better metric is number of unique keys touched. This is locale-centric though.

Full agree on that. Number of characters is a good proxy though for number of unique keys touched, so it is useful to study both I think.

@joshtriplett joshtriplett added the I-nominated-to-discuss To be discussed during issue triage on the next Cargo team meeting label Jun 22, 2023
@joshtriplett
Copy link
Member

Nominating for discussion in the next meeting.

I also wonder if we should just FCP this and see if we have consensus/objections.

#[serde(untagged, expecting = "expected a string or an integer")]
pub enum StringOrI64 {
String(String),
I64(i64),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why an i64 instead of a u64?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried an u32 at the start and it didn't work, as in it compiled but the tests as added by this PR failed. The problem was I think that serde represents toml integers via i64 in its data model. In the end I don't think it matters that much: a negative number will still give an error one way or another.

@weihanglo weihanglo removed the I-nominated-to-discuss To be discussed during issue triage on the next Cargo team meeting label Jun 27, 2023
@bors
Copy link
Collaborator

bors commented Jun 29, 2023

☔ The latest upstream changes (presumably #12317) made this pull request unmergeable. Please resolve the merge conflicts.

This helps save the number of characters needed to specify an edition.
So far, each edition was fully described by its year number, 2015, 2018, 2021.
Also for the future, the intent is that editions will always just be
year numbers. Therefore, we don't have to use a toml string for them in Cargo.toml,
an integer is enough. This saves two characters in each Cargo.toml.
@rustbot rustbot added the A-documenting-cargo-itself Area: Cargo's documentation label Jun 30, 2023
@bors
Copy link
Collaborator

bors commented Aug 25, 2023

☔ The latest upstream changes (presumably #12553) made this pull request unmergeable. Please resolve the merge conflicts.

@est31
Copy link
Member Author

est31 commented Aug 27, 2023

For future reference, this was also discussed in this zulip stream. It seems that more cargo team members would support this once there is a unified library to read/parse Cargo.toml files that can be used by external tools, without using the cargo crate directly.

@polarathene
Copy link

2 char savings vs leverage compression at scale

Using an integer saves two characters in each Cargo.toml. Over the roughly hundred thousand Cargo.toml files on crates.io alone, you'll get some nice savings.

How are these savings being measured? An uncompressed representation isn't necessarily a reliable metric.

A server can use a filesystem with transparent compression, a database or similar forms offering compressed storage such as archive formats. This can also be leveraged over network transit.

In a scenario such as crates.io, it could be a non-concern if the Cargo.toml files don't need to be uncompressed. ZSTD for example can be trained across all those Cargo.toml files as input to optimize compression further. The higher the frequency of a pattern, the more likely it'll be part of that trained compression dictionary.

If such savings were important at the scale of crates.io, this approach would be far more effective than the benefit of "two characters in each Cargo.toml", which I doubt would have much of an added improvement with that ZSTD approach.


Verbal instruction vs UX consistentcy

However, I am also sympathetic to reason joshtriplett had 5 years ago for being in favor: easier to communicate the field's value verbally.

people don't have to give advice like "just put edition equals 2018 in your Cargo.toml, but remember to quote the 2018". (Which then leads to the obvious question of "why?".)

"Set edition in your Cargo.toml to the string 2018"? 🤷‍♂️

I don't know about users asking "why?" with that information. They're instructed to do the same for semver, which if the minor or patch versions are omitted, looks like an integer or float.

For the most part, cargo could just improve the error message to reference existing editions 2015, 2018, 2021 - if that helps as an example (eg: last 3 editions). Even if you avoid the clarification and just verbally say "just put edition equals 2018 in your Cargo.toml", the error message makes it clear what to do (it's an integer, expected string):

$ cargo check
error: failed to parse manifest at `/app/Cargo.toml`

Caused by:
  invalid type: integer `2021`, expected a string or workspace
  in `package.edition`

Technically the edition hint exists if you at least provide an invalid string value:

# Newer than latest edition:
$ cargo check
error: failed to parse manifest at `/app/Cargo.toml`

Caused by:
  failed to parse the `edition` key

Caused by:
  this version of Cargo is older than the `2022` edition, and only supports `2015`, `2018`, and `2021` editions.

# Any other string value (including years older than latest edition):
$ cargo check
error: failed to parse manifest at `/app/Cargo.toml`

Caused by:
  failed to parse the `edition` key

Caused by:
  supported edition values are `2015`, `2018`, or `2021`, but `'2018'` is unknown

I'm not sure that supporting integers improves the UX when said user encounters a foreign Cargo.toml using strings instead (eg: contributing to an older project on Github).

  • Yes the error is a non-concern, 2021 and "2021" become valid, supported editions hint displayed if invalid 👍
  • Does the user infer that a dependency can also use 1 instead of "1", or 1.2 instead of "1.2"? Now you run into the same "why?" concern regardless.
$ cargo check
error: failed to parse manifest at `/app/Cargo.toml`

Caused by:
  invalid type: integer `1`, expected a version string like "0.9.8" or a detailed dependency like { version = "0.9.8" }
  in `dependencies.serde`

A similar PR / discussion will likely then shift to the above UX, and reference the change to edition if accepted. After all, you could save many more chars due to multiple dependencies and the same verbal concern can be expressed as shown above.


TL;DR

I just wanted to chime in with the above input:

  • There is probably a better way to get I/O savings (retroactively) on Cargo.toml storage at scale vs the proposed change to accept integers.
  • The verbal UX concern still applies elsewhere in Cargo.toml. If you allow the change for edition, expect the approval to justify similar changes elsewhere like dependencies semver.

@est31
Copy link
Member Author

est31 commented Oct 30, 2023

A server can use a filesystem with transparent compression, a database or similar forms offering compressed storage such as archive formats. This can also be leveraged over network transit.

The hashes of the .crate archives, post compression, are baked into the Cargo.lock files. There isn't much we can do about that, and therefore innovating on compression, which is a good idea, don't get me wrong, is non-trivial. This change in comparison is pretty easy and only takes as many lines because we use serde and serde is not known for allowing low-overhead customizations in data representation.

My argument should be understood in the context of the thing it replied to: people claiming that the gains were rather minor. The argument tries to argue with the scale of crates.io, because that one is easy to make back of the envelope calculations with. But in general it applies to every Cargo project whether it lives on the local disk, only in a private github, or in some other VCS somewhere else.

Does the user infer that a dependency can also use 1 instead of "1", or 1.2 instead of "1.2"? Now you run into the same "why?" concern regardless.

That's easy to answer: all editions are representable by integers, as they are named after year numbers in the gregorian calendar. Not all integers are valid edition numbers, but toml has no concept of an enum, an integer is just as valid as a string (and in my opinion more valid). As for floats, there is two main differences:

  • not all semver requirements are represesentable as floats, as 1.2.3 is not a valid float.
  • order on floats is different to order on semver. "1.23" comes after "1.8" but 1.23 < 1.8. the order on integers maps beautifully to the order of editions (although editions are rather point like things, it's just that usually future editions incorporate the breaking changes of earlier ones).
  • even three-tuples of integers are not a valid model for semver version numbers, as there can be appendices like -pre or -alpha, and semver specifiers can be even more extensive.

all in all, floats (or integers) are way further from semver than integers are from editions.

@polarathene
Copy link

polarathene commented Oct 31, 2023

Compression - Forward-only benefit

The hashes of the .crate archives, post compression, are baked into the Cargo.lock files.
There isn't much we can do about that, and therefore innovating on compression, which is a good idea, don't get me wrong, is non-trivial.

As I'm not familiar with how that's all managed, I assume these Cargo.toml files are in the .crate compressed archives? (EDIT: Yes) If you're already leveraging compression (EDIT: Yes, gzip -9?), then the impact of "integer saves two characters in each Cargo.toml" would already reduce the projected savings benefit? (EDIT: There appears to be no improvement to compressed archive)

These savings would only be for new .crate archives too right?

If the concern is with Cargo.lock files referencing the .crate archives by checksum hashes (EDIT: SHA256 of .crate / .tar.gz), how is that different to the proposed change here? Both would be as incompatible correct? (no benefit to existing .crate archives due to current infrastructure constraints, only benefits new published crates)


Compression - Integer edition no improvement vs ZSTD creates 20% smaller .crate archive

The argument tries to argue with the scale of crates.io, because that one is easy to make back of the envelope calculations with.

Taking the serde 1.0.190 archive .crate it is 76KB as .tar.gz archive (gzip -9?), if I change that to .tar.zst with zstd -19, the .tar file (531KB) compresses down to 61KB.

That's considerable savings already. If this Cargo.toml improvement would only be compatible with a future release of Rust and published crate leveraging it, you'd be far better off supporting zstd?

Meanwhile, I took the uncompressed tar file, extracted it and created the tar again, then edited the Cargo.toml and Cargo.toml.orig files to strip out the edition value quotes, creating a new tar of that and compared sizes with stat --printf="%s" filename (outputs size in bytes). Regardless of .tar or .tar.gz, the bytes reported was the same 🤷‍♂️

It'd seem the 4 bytes saved uncompressed, at least in this case had no impact to the compressed size? How confident are you in these savings being realized vs theoretical?

This is without a custom trained dictionary, so much simpler to implement/support.

Compression - Benefit to contexts: local and VCS

But in general it applies to every Cargo project whether it lives on the local disk, only in a private github, or in some other VCS somewhere else.

I've not tried investigating these scenarios like above. However:

  • AFAIK git does utilize compression, and only decompresses for the working tree checkout? This change additionally adds a delta that may be larger/unnecessary if an existing project adopts the change and switches to integer edition. A slight benefit for new projects sure, and at scale for a VCS hosting provider.
  • On disk storage locally... ok sure? I've not looked into how cargo manages this, I know the .crate archives are downloaded and cached, but are those contents extracted anywhere or are we just talking about projects with their own Cargo.toml file(s) where the amount of bytes saved is far less, especially if the OS uses transparent file compression?

The savings locally on disk, and for CI infrastructure like commonly outsourced to Github would benefit far more by the ZSTD compressed crates, where the savings seem significantly better.


Related - Cargo.lock could reduce checksum length

An additional savings that would probably be frowned upon would be for Cargo.lock format to use BLAKE3 checksums, where the length is more flexible.

You could for example use 28 bytes instead of 32 of SHA256, and IIRC that'd still offer around 112 bits of collision resistance (related blake3 issue, raised by a rust-lang member), roughly equivalent to security strength (entropy) of using RSA 2048-bit keys that is commonly deployed today for certificates with TLS? (112 bits of entropy is fairly strong, IIRC requires roughly the amount of energy needed to boil all the oceans in the world)

That'd be better savings too whenever a Cargo.lock file is present (VCS, local), and blake3 is far more efficient/faster than sha256 👍


UX - Resolving concern displaces it into a larger one

That's easy to answer: all editions are representable by integers

Sorry, I think you misunderstood the point I was trying to convey:

  • edition would support integer or string with this change. To a user the value is observed as interchangeable, or they are initially only familiar with one until they encounter the other.
  • Other fields in Cargo.toml that appear as integers only are expected to be strings (as noted above with the resolver field).
  • If a concern to justify edition as a string is that the user thinks a string is odd/unexpected (or that it's extra effort to communicate the requirement), why would this not apply to integer semver values? After all if a field like edition can support it, why would that not be inferred it might be possible for other fields with string integers? My point here is if an integer as a string is a UX issue, why wouldn't a user likewise get confused in this context, especially if edition were to support integer values?

I understand why semver is using a string. I was just pointing out that one of the reasonings to adopt integer values for edition field still applies to other fields. Allowing it for edition only adds to that confused user UX you're trying to address as now it becomes an example of "this field can do it, why can't this one too?". Then you get a proposal for semver values to support integers (major only), float (major + minor only) and string (everything else).

The "why?" question for edition just carries over to another field like the semver example, and while it was easy for you to answer, that was far more verbose than the answer would be for edition being only a string. If the intent is to reduce confusion by making "why?" redundant for edition, do you not anticipate a "why?" being more frequent for semver supporting values without quotes?

Is there many integer fields a user will encounter in the typical Cargo.toml today?

  • Or is it a better UX to keep it as a string where it quickly becomes apparent that dealing with number values as strings in the config is convention?
  • If the frequency of Cargo.toml integer supported fields today is rare, does this change for edition introduce new overhead to a user with "is this field a number, string, or does it support either?". Ideally the user doesn't have that overhead with the config, a single type for a field is one thing, but when it supports multiple then upon discovering this it's only natural that a curious user may wonder if that is supported elsewhere and if not why?

TL;DR

I'm not against edition supporting integers, the above two concerns expressed are just:

  • Are you really getting savings at scale? At a glance it seems it has no effect on .crate archives (I may have done it wrong 🤷‍♂️ ), adopting ZSTD seems to provide far more value.
  • Resolving a "why?" concern for edition to improve UX may have a ripple effect to other fields, that adds more noise than the original "why?" concern did 🤷‍♂️

@joshtriplett
Copy link
Member

By way of clarification: "save two bytes" / "size" was never the goal under consideration here; it was only ever a question of UX.

@joshtriplett
Copy link
Member

We discussed this once more in today's Cargo meeting, to give one more evaluation to the tradeoffs between churn/maintenance and potential UX improvement. We confirmed that we don't have any consensus for this change, and there are several folks objecting to the change. Given that, I'm going to go ahead and close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-documenting-cargo-itself Area: Cargo's documentation A-manifest Area: Cargo.toml issues S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants