
network zlib stream error confusion #8517

Closed
ehuss opened this issue Jul 19, 2020 · 13 comments · Fixed by #8778, rust-lang/rust#78025, rust-lang/git2-rs#642 or #8998
Labels
A-git Area: anything dealing with git C-bug Category: bug

Comments

@ehuss
Contributor

ehuss commented Jul 19, 2020

Problem
On rust-lang/rust CI, I've been noticing this error happening a lot lately:

    Updating crates.io index
error: failed to get `cc` as a dependency of package `bootstrap v0.0.0 (D:\a\1\s\src\bootstrap)`

Caused by:
  failed to fetch `https://github.com/rust-lang/crates.io-index`

Caused by:
  error inflating zlib stream; class=Zlib (5)

I think the CI has been having network problems in general, but this particular error is confusing; I don't know what would cause it, and I think it would be helpful to understand this error case better.

Here is the corresponding error in libgit2.

Should this error class be added to the spurious error list? What can possibly cause this to happen?

Notes

cargo 1.46.0-beta (4f74d9b2a 2020-07-08)

@ehuss ehuss added C-bug Category: bug A-git Area: anything dealing with git A-networking Area: networking issues, curl, etc. labels Jul 19, 2020
@alexcrichton
Member

This looks reminiscent of #8258 which was fixed in libgit2/libgit2#5536. Rust 1.45 did not include that PR, but 1.46 (beta) and nightly both have the fix. Do you know if these errors are showing up using Cargo from 1.45? (bootstrap not updated?)

@ehuss
Contributor Author

ehuss commented Jul 20, 2020

I'm pretty sure it is the version posted above, cargo 1.46.0-beta (4f74d9b2a 2020-07-08):

extracting D:\a\1\s\build\cache\2020-07-16\cargo-beta-i686-pc-windows-msvc.tar.xz

https://dev.azure.com/rust-lang/rust/_build/results?buildId=35201&view=logs&j=339a807e-05eb-50a3-f69a-0d7a3a5b2cef&t=1f87e16f-7721-57ff-dca8-75b52e5f4eac

@alexcrichton
Member

Bah.

It probably wouldn't hurt to throw this onto the list of spurious errors, but I agree it'd be good to understand. AFAIK zlib inflate errors indicate corrupt data (e.g. not zlib) being fed to the zlib inflater. That previous issue was due to a bug in libgit2 (valid zlib stream, invalidly fed), and this may be that as well. This'll be really hard to track down without a reliable reproduction, though.

Skimming libgit2's commit history I don't see much related to this since that original fix, so I don't think we can get lucky by simply updating the git repo.

alexcrichton added a commit to alexcrichton/cargo that referenced this issue Jul 20, 2020
This may be a bad band-aid for now, but the goal is to help
address rust-lang#8517 where this has been showing up in the wild quite a lot.
bors added a commit that referenced this issue Jul 23, 2020
Flag git zlib errors as spurious errors

This may be a bad band-aid for now, but the goal is to help
address #8517 where this has been showing up in the wild quite a lot.
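The "flag as spurious" approach amounts to retrying operations whose error class looks transient. A minimal Python sketch of that idea (the names and retry policy are illustrative only; Cargo's actual implementation is in Rust and differs in detail):

```python
import time

# Hypothetical retry-on-spurious-error helper. The error classes and the
# GitError type here are illustrative, not Cargo's real API.
SPURIOUS_CLASSES = {"net", "os", "zlib"}  # treated as transient

class GitError(Exception):
    def __init__(self, error_class, message):
        super().__init__(message)
        self.error_class = error_class

def with_retry(op, retries=3, delay=0.0):
    """Run op(), retrying when it raises a GitError of a spurious class."""
    for attempt in range(retries):
        try:
            return op()
        except GitError as e:
            # Give up on non-spurious classes, or once retries are exhausted.
            if e.error_class not in SPURIOUS_CLASSES or attempt == retries - 1:
                raise
            time.sleep(delay)
```

For example, `with_retry(fetch)` would retry a fetch that raised `GitError("zlib", ...)` a couple more times before giving up, which is exactly why a retry that does not help (as seen later in this thread) suggests the failure is not actually spurious.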
@ehuss
Contributor Author

ehuss commented Jul 23, 2020

Another instance: https://dev.azure.com/rust-lang/rust/_build/results?buildId=35201&view=logs&j=339a807e-05eb-50a3-f69a-0d7a3a5b2cef&t=1f87e16f-7721-57ff-dca8-75b52e5f4eac

I wonder if there is any significance that these cases happen on Windows i686 (i686-mingw-1 and i686-msvc-1).

@alexcrichton
Member

Hm, the fact that it was retried there and retrying didn't help makes me think that it's not a spurious failure...

@ehuss
Contributor Author

ehuss commented Jul 23, 2020

Hm, looking at other instances:
rust-lang/rust#73210 (comment)
https://dev.azure.com/rust-lang/e71b0ddf-dd27-435a-873c-e30f86eea377/_build/results?buildId=35451
Failed in three builders: i686-gnu-opt i686-msvc-1 dist-i686-msvc

rust-lang/rust#74245 (comment)
i686-gnu

rust-lang/rust#74245 (comment)
i686-gnu

rust-lang/rust#74245 (comment)
i686-gnu-opt
i686-msvc-1
dist-i686-mingw

rust-lang/rust#74611 (comment)
i686-mingw-2

rust-lang/rust#74495 (comment)
i686-mingw-2

rust-lang/rust#74520 (comment)
i686-msvc-1

rust-lang/rust#74219 (comment)
i686-mingw-1

Raising priority since this seems important.

@ehuss ehuss added the P-high Priority: High label Jul 23, 2020
@mati865
Contributor

mati865 commented Jul 23, 2020

It cannot be spurious; it probably affects all i686 platforms.

@ehuss
Contributor Author

ehuss commented Jul 29, 2020

Notes from my investigation:

This seems to have gone away for now, though I would anticipate it happening again since it has been coming and going. My best hypothesis is that some caching layer between Azure's network and GitHub's git servers is misbehaving.

AFAIK, all of the rust-lang/rust jobs start with an empty registry directory (based on my testing with GHA and Azure, they both start with an empty directory, and rust-lang/rust's CI jobs are not configured to cache it). So I don't think it is related to any partial caching, or incremental git updates.

In rust-lang/rust#74689 I attempted to use net.git-fetch-with-cli, but it also failed:

  process didn't exit successfully: `git fetch --force --update-head-ok 'https://github.com/rust-lang/crates.io-index' 'HEAD:refs/remotes/origin/HEAD'` (exit code: 128)
  --- stderr
  error: gnutls_handshake() failed: A TLS fatal alert has been received. while accessing https://github.com/rust-lang/crates.io-index/info/refs

  fatal: HTTP request failed

Interestingly, that failed on several different hosts (dist-i686, dist-x86_64-linux, dist-x86_64-linux-alt).

I'm still really curious why previous failures all seemed to be on the i686 jobs.

During the outage, I was able to easily reproduce on github actions. It didn't really matter which version of Cargo I was using:
1.47: https://github.com/ehuss/test/runs/903771093?check_suite_focus=true
1.46: https://github.com/ehuss/test/runs/903709326?check_suite_focus=true
1.45: https://github.com/ehuss/test/runs/903775146?check_suite_focus=true
1.43: https://github.com/ehuss/test/runs/903798953?check_suite_focus=true

Unfortunately the errors stopped before I could perform more testing.

So I don't think it is related to any recent libgit2 changes (and since it also failed with net.git-fetch-with-cli, I doubt it is related to libgit2 at all) or #8363.

Will revisit this if it ever starts acting up again and I'm around to catch it. Unfortunately there isn't really a good way to be alerted when it does. 😦

@ehuss ehuss removed the P-high Priority: High label Jul 29, 2020
@ehuss
Contributor Author

ehuss commented Jul 30, 2020

Updates from 2020-07-30 investigation:

Saw errors on GitHub Actions. Debug logging showed it consistently failing with i686 to the IP 140.82.114.4. Was able to consistently repro locally against that IP as well from about 10am till 11:50am (PDT). Other IPs that were not failing include 140.82.112.3 140.82.112.4 140.82.113.3 140.82.113.4. Could not repro with git command-line or 64-bit cargo.

With trace debugging, saw that it consistently failed after downloading around 75MB of data. The last few lines always looked like this:

[2020-07-30T18:25:11Z TRACE cargo::ops::registry] http-debug: { (16384 bytes of data)
[2020-07-30T18:25:11Z TRACE cargo::ops::registry] http-debug: { (16384 bytes of data)
[2020-07-30T18:25:11Z TRACE cargo::ops::registry] http-debug: { (16384 bytes of data)
[2020-07-30T18:25:11Z TRACE cargo::ops::registry] http-debug: { (16384 bytes of data)
[2020-07-30T18:25:11Z TRACE cargo::ops::registry] http-debug: { (9673 bytes of data)
[2020-07-30T18:25:11Z DEBUG cargo::ops::registry] http-debug: * Connection #0 to host github.com left intact

which leads me to believe it finished downloading the entire pack file, since that is what normal behavior looks like (a bunch of 16384 lines, then a final partial line (9673) and then the "left intact" line). Counting the number of bytes against a successful download shows pretty much the same amount of data downloaded.
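Counting those bytes can be automated with a small helper; this is an illustrative Python snippet assuming the `http-debug` trace format shown above, not a tool Cargo ships:

```python
import re

def total_http_bytes(log_text):
    """Sum the "(N bytes of data)" figures from Cargo's http-debug trace lines."""
    return sum(int(n) for n in re.findall(r"\((\d+) bytes of data\)", log_text))

sample = (
    "[2020-07-30T18:25:11Z TRACE cargo::ops::registry] http-debug: { (16384 bytes of data)\n"
    "[2020-07-30T18:25:11Z TRACE cargo::ops::registry] http-debug: { (9673 bytes of data)\n"
)
# For the two sample lines above this sums 16384 + 9673 = 26057 bytes.
```

Running this over a failing log and a successful log makes the "same amount of data downloaded" comparison concrete.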

Ran with a debugger to investigate the part in libgit2 where it fails. This is where it fails. The local variables I was able to capture:

total=35
size=35
zstream.zerr=-5 (Z_BUF_ERROR)

Z_BUF_ERROR is not supposed to be fatal, so I'm a bit confused how it got in that state, or how it is supposed to recover. Here is a dump of the zstream:

$1 = {
  z = {
    next_in = 0xf6fc2000 "\325V\305f\a\001耥\t-\221&Mi\210\303\323Lu\322\373\032ߓ%\277\225\356\255\313\023L\320\005\202P\301\vu\f\350W\317\313묤Э:\232\365P\272AEب\253\264\334;\303.\364ZƙhD[Vܽ\024m\001\256\245\305l0\016\n\223\323d\242ӖcL\"R\363\360\241\340\332Q+\300\230\t\363\322\062\350\003tQ\034\367p_\364\373\334I\255\270\233\262\331\352G?X\356\346O\356`b-)*j\212fBVX\275\267E\224+$8\202n\245pa\037\065\317\061)\031:\234S\352ɩ\267<y\234\305t|\360\235\v\367R\262\305\060R*\211\341\v'\030\315;\270CG\367Yu\026T"...,
    avail_in = 0,
    total_in = 41,
    next_out = 0x5aa31df3 "",
    avail_out = 1,
    total_out = 35,
    msg = 0x0,
    state = 0x5a5675b0,
    zalloc = 0x57ed6392 <zcalloc>,
    zfree = 0x57ed63be <zcfree>,
    opaque = 0x0,
    data_type = 72,
    adler = 419172219,
    reserved = 0
  },
  type = GIT_ZSTREAM_INFLATE,
  in = 0xf6fc2000 "\325V\305f\a\001耥\t-\221&Mi\210\303\323Lu\322\373\032ߓ%\277\225\356\255\313\023L\320\005\202P\301\vu\f\350W\317\313묤Э:\232\365P\272AEب\253\264\334;\303.\364ZƙhD[Vܽ\024m\001\256\245\305l0\016\n\223\323d\242ӖcL\"R\363\360\241\340\332Q+\300\230\t\363\322\062\350\003tQ\034\367p_\364\373\334I\255\270\233\262\331\352G?X\356\346O\356`b-)*j\212fBVX\275\267E\224+$8\202n\245pa\037\065\317\061)\031:\234S\352ɩ\267<y\234\305t|\360\235\v\367R\262\305\060R*\211\341\v'\030\315;\270CG\367Yu\026T"...,
  in_len = 0,
  flush = 4, (Z_FINISH)
  zerr = -5 (Z_BUF_ERROR)
}

I am suspicious of this line, which does not check whether the return value is NULL. Although zstream->z.next_in is not NULL, so that might not be it...

So, it looks like it was asked to unpack 35 bytes. It looks like it succeeded, but for some reason still set Z_BUF_ERROR.

A hypothesis is that there is some edge case in the zstream inflate routine that is getting triggered, which the 32-bit build is sensitive to. It's not clear what.
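For reference, Z_BUF_ERROR ordinarily just means "no progress was possible with the buffers provided": the caller is expected to supply more input or more output space and call inflate again. Python's stdlib zlib wraps exactly that retry loop; here is a small sketch of the drain-in-small-chunks pattern (a stdlib analogue of the recovery libgit2 should be doing, not libgit2's actual code):

```python
import zlib

payload = b"x" * 1000
comp = zlib.compress(payload)

# Mimic inflate() with a tiny output buffer: request at most 16 bytes per
# call and feed the unconsumed input back in, rather than treating the
# "buffer full / input exhausted" condition as fatal.
d = zlib.decompressobj()
out = d.decompress(comp, 16)
while d.unconsumed_tail:
    out += d.decompress(d.unconsumed_tail, 16)
out += d.flush()  # pick up any remaining buffered output
assert out == payload
```

Under this interpretation, a fatal "error inflating zlib stream" on Z_BUF_ERROR would mean libgit2 decided no further progress was possible when in fact the output was already complete, consistent with the total=35/size=35 values captured above.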

For next time I wish I had:

  • Kept a copy of the errant pack files.
  • Print out all of the local variables.

Another thing I'm uncertain about: The "mwindow" is a window into the pack file. I would suspect at the end of the compressed stream in the pack file, this window includes the 20-byte SHA1 checksum at the end of the pack file. Is it possible that zlib was confused by that extra 20-bytes somehow?
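On that last question: zlib itself is normally fine with trailing bytes after the deflate stream, since the stream encodes where it ends and a conforming inflater simply reports the leftover data. A quick check with Python's stdlib zlib (standing in for libgit2's inflater, so only suggestive, not conclusive for libgit2):

```python
import hashlib
import zlib

payload = b"some git object data"
stream = zlib.compress(payload)
trailer = hashlib.sha1(stream).digest()  # 20 bytes, like a pack file checksum

d = zlib.decompressobj()
out = d.decompress(stream + trailer)
assert out == payload
assert d.eof                     # the inflater knows where the stream ended
assert d.unused_data == trailer  # trailing 20 bytes are reported, not an error
```

So if the extra 20 bytes are involved, it would have to be through how libgit2 windows the input, not through zlib's handling of trailing data as such.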

@ehuss
Contributor Author

ehuss commented Jul 30, 2020

I found I had an extra copy of the problematic index. I have posted a repro here: https://github.com/ehuss/zlib-repro

I haven't really figured out much so far. The git_mwindow stuff is kinda wonky, and the window sizes on 32-bit are a lot different from 64-bit. On 64-bit, the window is fairly consistent (the entire size of the pack file, 76,488,803 bytes), but on 32-bit the sizes change a lot and I can't figure out what it is doing. The uncompressed size of all the git objects initially looked like about 1.9G, suspiciously close to the max signed 32-bit integer, but that figure included fs overhead; the uncompressed size is actually closer to 1,059,758,028 bytes.

@ehuss ehuss removed the A-networking Area: networking issues, curl, etc. label Jul 31, 2020
@alexcrichton
Member

I've filed an upstream issue at libgit2/libgit2#5590

@bors bors closed this as completed in 60194f2 Oct 14, 2020
ehuss pushed a commit to ehuss/cargo that referenced this issue Oct 14, 2020
Update git2.

Closes rust-lang#8517
Closes rust-lang#8588
Closes rust-lang#8352
Closes rust-lang#4777
Closes rust-lang#8746

I only added a test for one of these. I can add for the others if you want.
This was referenced Oct 14, 2020
bors added a commit that referenced this issue Oct 14, 2020
[beta] backport libgit2

This is a backport of #8778 to beta. The reason is that #8517 has started showing up again on rust-lang/rust CI, and I think it is important to fix that.

This is risky, because this is a large update with a lot of untested changes. However, I think the risk is worth it to fix #8517.

This also includes #8772 to get Cargo's CI to pass. I think the risk for that is low (the changes, toml-rs/toml-rs@0.5.6...0.5.7, are small). However, I'd be fine with just modifying the tests to pass if you want to drop it.
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 15, 2020
…acrum

[beta] Update cargo

1 commit in 75615f8e69f748d7ef0df7bc0b064a9b1f5c78b2..65cbdd2dc0b7e877577474b98b7d071308d0bb6f
2020-09-29 18:42:19 +0000 to 2020-10-14 16:10:15 +0000
- [beta] backport libgit2 (rust-lang/cargo#8780)

This is primarily to bring in a fix for rust-lang/cargo#8517 which is starting to show up on rust-lang/rust's CI again.
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Oct 17, 2020
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 17, 2020
@ehuss
Contributor Author

ehuss commented Nov 1, 2020

Reopening since this unfortunately wasn't completely fixed.

@ehuss ehuss reopened this Nov 1, 2020
@ehuss
Contributor Author

ehuss commented Nov 5, 2020

@alexcrichton You mentioned maybe using net.git-fetch-with-cli to avoid this, but the problem (mentioned above at #8517 (comment)) is that rust-lang/rust builds on debian-6, whose old version of GnuTLS does not support TLS 1.2, which GitHub requires, and AFAIK it is not easy to update GnuTLS on such an old distro.

Another option would be to pre-fetch the registry using a 64-bit Cargo. Either way, it is not a trivial change to make.
