incident: git binary repo is too big (78GB) - node-test-binary-windows timing out #952

Closed
refack opened this issue Oct 28, 2017 · 17 comments

Comments

@refack
Contributor

refack commented Oct 28, 2017

I've run a git-delete-branch job, but the local repos are too "fat" so the auto GC is timing out.

@Trott
Member

Trott commented Oct 28, 2017

Looks to me like it might only be failing on test-rackspace-win2008r2-x64-4? Going to try taking that one offline and running again...

@Trott
Member

Trott commented Oct 28, 2017

Oh, nope, that's totally wrong, it's a bunch of hosts, but it's strange (to me) that some fail consistently and others don't.

@Trott
Member

Trott commented Oct 28, 2017

Seeing what happens if I take all the failing nodes offline because why not? It's not as if anyone can get a CI run out of them right now.

Took these offline:

  • test-rackspace-win2008r2-x64-4
  • test-azure_msft-win10-x64-3
  • test-azure_msft-win10-x64-1
  • test-azure_msft-win2012r2-x64-3
  • test-azure_msft-win2012r2-x64-2

And here's a run to see if it fixes things or not: https://ci.nodejs.org/job/node-test-commit-windows-fanned/12947/

@refack
Contributor Author

refack commented Oct 28, 2017

Cleaned and brought online:

  • test-rackspace-win2008r2-x64-4
  • test-azure_msft-win10-x64-3
  • test-azure_msft-win10-x64-1
  • test-azure_msft-win2012r2-x64-3
  • test-azure_msft-win2012r2-x64-2

@Trott if you see a stalled job, you can go to its job page (for example
https://ci.nodejs.org/job/node-test-binary-windows/COMPILED_BY=vs2015-x86,RUNNER=win2012r2,RUN_SUBSET=3/ ).
If it's the top job, you can click Workspace and wipe it.

As for intermittent failures, I think it's because this is triggered by git's auto-GC logic...

@Trott
Member

Trott commented Oct 28, 2017

Results are better, but still some failures...took this one offline:

  • test-azure_msft-win10-x64-5

There are two more with consistent build failures but they're in the middle of doing something right now so I don't want to take offline until they really fail again.

https://ci.nodejs.org/computer/test-azure_msft-win10-x64-1/builds
https://ci.nodejs.org/computer/test-rackspace-win2008r2-x64-5/builds

@Trott
Member

Trott commented Oct 28, 2017

Build history is too convincing. Took these offline too:

  • test-azure_msft-win10-x64-1
  • test-rackspace-win2008r2-x64-5

@Trott
Member

Trott commented Oct 28, 2017

Trying again to see if we can get a green Windows build now: https://ci.nodejs.org/job/node-test-commit-windows-fanned/12952/

@Trott
Member

Trott commented Oct 28, 2017

Looks like it's gonna be green this time. Not sure how to fix the offline hosts but at least CI isn't perma-red.

@Trott
Member

Trott commented Oct 29, 2017

Took test-azure_msft-win10-x64-5 offline too. Build failures, obviously.

  • test-azure_msft-win10-x64-5

@refack
Contributor Author

refack commented Oct 29, 2017

test-azure_msft-win10-x64-5 was a little stubborn, but should now be OK.
Also, all Pis are back online.

@refack refack closed this as completed Oct 29, 2017
@refack refack mentioned this issue Oct 30, 2017
2 tasks
@joaocgreis joaocgreis reopened this Nov 1, 2017
@joaocgreis
Member

Some workers were still failing. The node-test-binary-windows job is set up to only fetch the git branch with the binaries, so the problem is not the size of the binary repo but the local git repository. The automatic git gc would take too long.

I created a new job, git-clean-windows, similar to git-clean-rpi and set to run every week, which should prevent this from happening again. I'll close this issue after I confirm it is working as expected.

@refack
Contributor Author

refack commented Nov 5, 2017

@joaocgreis it seems like the local git repo does accumulate binaries even if the branch was deleted from the remote. I reran a node-test-binary-windows job and some workers were able to find the revision:
https://drive.google.com/file/d/0Bz0LZMH4OpErbVZkSm9YSFhCTVU/view?usp=sharing

@gibfahn
Member

gibfahn commented Nov 5, 2017

> @joaocgreis it seems like the local git repo does accumulate binaries even if the branch was deleted from the remote. I reran a node-test-binary-windows job and some workers were able to find the revision:

Does git fetch --prune help? That removes the tracking branches from remote repos (which are not otherwise removed).

I also run this locally to delete local branches which had an upstream, but the upstream was deleted:

# Delete orphaned local branches.
git fetch -p && git branch -vv | awk '/: gone]/{print $1}' | xargs git branch -D 
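
To make the effect of the prune flag concrete, here is a minimal sketch, not taken from the CI jobs. It builds a throwaway remote and clone in a temporary directory (the branch name `binaries` and all paths are made up for illustration), deletes the branch on the remote, and counts the stale remote-tracking ref before and after `git fetch --prune`.

```shell
#!/bin/sh
# Sketch: a stale remote-tracking ref survives a plain fetch
# and is removed by `git fetch --prune`. All names are hypothetical.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Throwaway "remote" plus a clone of it.
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/local" 2>/dev/null
cd "$work/local"
git config user.email "ci@example.invalid"
git config user.name "ci"

# Publish the default branch and a second branch named "binaries".
git commit -q --allow-empty -m "initial"
git push -q origin HEAD
git checkout -q -b binaries
git commit -q --allow-empty -m "binary payload"
git push -q origin binaries

# Delete the branch on the remote side only.
git --git-dir="$work/remote.git" update-ref -d refs/heads/binaries

# Count how many refs named origin/binaries the clone still has.
count_tracking() {
  git for-each-ref --format='%(refname)' refs/remotes/origin/binaries | grep -c . || true
}

git fetch -q --no-prune origin
before=$(count_tracking)
git fetch -q --prune origin
after=$(count_tracking)

echo "origin/binaries refs before prune: $before, after prune: $after"
```

This only removes the ref. Whether the objects behind it are freed is a separate question.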

@refack
Contributor Author

refack commented Nov 5, 2017

I see several optimizations we could make for node-test-binary-windows:

  1. Use shallow checkout
  2. Use git LFS
  3. Add a prune step

I'll clone the job and try this out.

@joaocgreis
Member

I never noticed that prune step before, it sounds promising. Some jobs already do it in their scripts. If it works, feel free to add it all over (also the clean before and after; I added it to test-binary-windows and it seems to be working quite well).

As long as the weekly job to clean the workspaces does its job, we should not see this issue again. Note that git fetch --prune is not git prune, it will not delete objects. A commit can be kept around even if all branches pointing to it are deleted (IIUC because they are still pointed to by the reflog - git reflog). To delete objects, we have to run git gc. But I've had plenty of issues with that as well (tried it in the arm jobs for some time). It is not rare for git gc to hang and I suspect it can leave the repo in an inconsistent state. So the best option is to delete and clone again, it's not pretty but has been working decently well in the Raspberries for a long time.
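
The point about objects outliving their branches can be reproduced in a scratch repository. The following is a sketch under assumptions: the repository, the branch name `binaries` and the file `blob.bin` are invented for the demo, and it runs `git gc` only on a tiny temporary repo, so it says nothing about the hangs described above.

```shell
#!/bin/sh
# Sketch: a commit stays in the object store after its branch is deleted,
# and only disappears once the reflog is expired and gc prunes it.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

git init -q "$work/repo"
cd "$work/repo"
git config user.email "ci@example.invalid"
git config user.name "ci"

git commit -q --allow-empty -m "base"
git checkout -q -b binaries
printf 'stand-in for a large binary\n' > blob.bin
git add blob.bin
git commit -q -m "add binary"
sha=$(git rev-parse HEAD)

# Go back to the previous branch and delete the one holding the binary.
git checkout -q -
git branch -q -D binaries

# Report whether the commit object is still present in the repository.
has_commit() {
  if git cat-file -e "$1^{commit}" 2>/dev/null; then echo present; else echo gone; fi
}

after_delete=$(has_commit "$sha")

# Drop every reflog entry, then prune unreachable objects immediately.
git reflog expire --expire=now --all
git gc -q --prune=now
after_gc=$(has_commit "$sha")

echo "after branch delete: $after_delete, after reflog expire + gc: $after_gc"
```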

About a shallow checkout: by having the full history in the test machines, git can transfer only what is new. Long time since I've tried this, feel free to double-check, but I think a shallow clone will have to transfer everything every time, or at least much more if there are several commits in between.
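
For anyone who wants to double-check, a shallow clone and a shallow fetch can be tried on a scratch repository like this. It is a sketch with invented paths and commits. It shows only how much history ends up locally and does not measure how much data is transferred, which is the open question above.

```shell
#!/bin/sh
# Sketch: compare history depth in a full repo and a --depth 1 clone of it.
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

git init -q "$work/src"
cd "$work/src"
git config user.email "ci@example.invalid"
git config user.name "ci"
for i in 1 2 3; do
  git commit -q --allow-empty -m "commit $i"
done
full_count=$(git rev-list --count HEAD)

# file:// is used so that --depth is honored for a clone on the same machine.
git clone -q --depth 1 "file://$work/src" "$work/shallow"
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)

# Add one more commit upstream, then fetch it shallowly as well.
git commit -q --allow-empty -m "commit 4"
git -C "$work/shallow" fetch -q --depth 1 origin
fetched_count=$(git -C "$work/shallow" rev-list --count FETCH_HEAD)

echo "full: $full_count, shallow clone: $shallow_count, after shallow fetch: $fetched_count"
```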
