Reproducible crate builds #8864

Merged
merged 4 commits into from
Nov 18, 2020

Commits on Nov 15, 2020

  1. package: use a consistent timestamp

    For each entry in the tar archive, we generate a new timestamp.
    Normally cargo will be fast enough that we get a consistent timestamp,
    but that need not be the case.  There's very little reason to produce
    different timestamps for different files and it's slightly more
    efficient not to need to make multiple queries, so let's instead
    generate a single timestamp for all entries that we generate.
    bk2204 committed Nov 15, 2020
    9cc7ac6
  2. package: honor SOURCE_DATE_EPOCH

    For projects supporting reproducible builds, it's possible to set the
    timestamp used in artifacts by setting SOURCE_DATE_EPOCH to a decimal
    Unix timestamp.  This is helpful because it allows users to produce the
    exact same artifact, regardless of when the project was built, and it
    also means that services which generate crates from source can generate
    a consistent crate without having to store previously built artifacts.
    
    For all these reasons, let's honor the SOURCE_DATE_EPOCH environment
    variable if it's set and use the current timestamp if it's not.
    bk2204 committed Nov 15, 2020
    436b9eb
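
The two Nov 15 commits can be illustrated with a minimal, standard-library-only sketch. It is not the code from this pull request: the helper names `parse_source_date_epoch` and `archive_timestamp` are hypothetical, and falling back to the current time when the variable holds an unparsable value is an assumption, since the commit message only covers the set and unset cases.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Parse a SOURCE_DATE_EPOCH-style value (decimal Unix seconds).
/// Returns None when the value is absent or not a valid decimal integer.
/// (Hypothetical helper, not taken from cargo.)
fn parse_source_date_epoch(value: Option<&str>) -> Option<u64> {
    value.and_then(|s| s.trim().parse::<u64>().ok())
}

/// Pick one timestamp for the whole archive: the override if it parses,
/// otherwise the current time. Treating an invalid override like an unset
/// one is an assumption of this sketch.
fn archive_timestamp(override_value: Option<&str>) -> u64 {
    parse_source_date_epoch(override_value).unwrap_or_else(|| {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    })
}

fn main() {
    let env_value = std::env::var("SOURCE_DATE_EPOCH").ok();
    // Query the clock (or the environment) once, then reuse the result
    // for every generated entry so all entries share one timestamp.
    let mtime = archive_timestamp(env_value.as_deref());
    for name in &["Cargo.toml", "Cargo.lock", ".cargo_vcs_info.json"] {
        println!("{}: mtime={}", name, mtime);
    }
}
```

Running the sketch twice with the same `SOURCE_DATE_EPOCH` value prints identical timestamps for every entry, which is the property the commits aim for.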

Commits on Nov 16, 2020

  1. package: canonicalize tar headers for crate packages

    Currently, when reading a file from disk, we include several pieces of
    data from the on-disk file, including the user and group names and IDs,
    the device major and minor, the mode, and the timestamp.  This means
    that our archives differ between systems, sometimes in unhelpful ways.
    
    In addition, most users probably did not intend to share information
    about their user and group settings, operating system and disk type, and
    umask.  While these aren't huge privacy leaks, cargo doesn't use them
    when extracting archives, so there's no value to including them.
    
    Since using consistent data means that our archives are reproducible and
    don't leak user data, both of which are desirable features, let's
    canonicalize the header to strip out identifying information.
    
    We set the user and group IDs to 0 and the names to root, since that's
    the only user that's typically consistent among Unix systems.  Setting
    these values doesn't create a security risk since tar can't change the
    ownership of files when it's running as a normal unprivileged user.
    
    Similarly, we set the device major and minor to 0.  There is no useful
    value here that's portable across systems, and it does not affect
    extraction in any way.
    
    We also set the timestamp to the same one that we use for generated
    files.  This is probably the biggest loss of relevant data, but
    considering that cargo doesn't otherwise use it and honoring it makes
    the archives unreproducible, we canonicalize it as well.
    
    Finally, we canonicalize the mode of an item we're storing by looking at
    the executable bit and using mode 755 if it's set and mode 644 if it's
    not.  We already use 644 as the default for generated files, and this is
    the same algorithm that Git uses to determine whether a file should be
    considered executable.  The tests don't test this case because there's
    no portable way to create executable files on Windows.
    bk2204 committed Nov 16, 2020
    e46ca84
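
The canonicalization rules described in the Nov 16 commit message can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `EntryMeta` is a hypothetical stand-in for the tar header fields, and testing the owner-executable bit (`0o100`) is an assumption about which bit "the executable bit" refers to.

```rust
/// Hypothetical stand-in for the tar header fields named in the commit
/// message; the real patch works on the tar crate's header type.
#[derive(Debug, Clone, PartialEq)]
struct EntryMeta {
    uid: u64,
    gid: u64,
    username: String,
    groupname: String,
    device_major: u32,
    device_minor: u32,
    mode: u32,
    mtime: u64,
}

/// Collapse a mode to 0o755 when the owner-executable bit is set and to
/// 0o644 otherwise. Checking only the owner bit is an assumption.
fn canonical_mode(mode: u32) -> u32 {
    if mode & 0o100 != 0 {
        0o755
    } else {
        0o644
    }
}

/// Build a header with the identifying fields stripped: root ownership,
/// zeroed device numbers, a canonical mode, and the shared archive timestamp.
fn canonicalize(meta: &EntryMeta, archive_mtime: u64) -> EntryMeta {
    EntryMeta {
        uid: 0,
        gid: 0,
        username: "root".to_string(),
        groupname: "root".to_string(),
        device_major: 0,
        device_minor: 0,
        mode: canonical_mode(meta.mode),
        mtime: archive_mtime,
    }
}

fn main() {
    // Made-up example values for a file as it might appear on disk.
    let on_disk = EntryMeta {
        uid: 1000,
        gid: 1000,
        username: "alice".to_string(),
        groupname: "staff".to_string(),
        device_major: 8,
        device_minor: 1,
        mode: 0o100775,
        mtime: 1_605_000_000,
    };
    let canonical = canonicalize(&on_disk, 1_605_398_400);
    println!("{:?}", canonical);
}
```

Two machines with different users, umasks, and disks would produce the same `EntryMeta` for the same file, as long as they agree on the archive timestamp and on whether the file is executable.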

Commits on Nov 18, 2020

  1. package: canonicalize tar headers for crate packages

    Currently, when reading a file from disk, we include several pieces of
    data from the on-disk file, including the user and group names and IDs,
    the device major and minor, the mode, and the timestamp.  This means
    that our archives differ between systems, sometimes in unhelpful ways.
    
    In addition, most users probably did not intend to share information
    about their user and group settings, operating system and disk type, and
    umask.  While these aren't huge privacy leaks, cargo doesn't use them
    when extracting archives, so there's no value to including them.
    
    Since using consistent data means that our archives are reproducible and
    don't leak user data, both of which are desirable features, let's
    canonicalize the header to strip out identifying information.
    
    Omit the timestamp for generated files and tell the tar crate to copy
    deterministic data.  That will omit all of the data we don't care about
    and also canonicalize the mode properly.
    
    Our tests don't check the specifics of certain fields because they
    differ between the generated files and the files that are archived from
    the disk format.  They are still canonicalized correctly for each type,
    however.
    bk2204 committed Nov 18, 2020
    449ead0