CI Job Contention #418

Closed
CleanCut opened this issue Sep 2, 2020 · 9 comments
Labels
A-Build-System: Related to build systems or continuous integration
C-Performance: A change motivated by improving speed, memory usage or compile times

Comments

@CleanCut
Member

CleanCut commented Sep 2, 2020

The way I phrased my information in #129 (comment) may have led to the conclusion that GitHub Actions is unlimited in every way for open source projects. It is true that the total number of minutes has no limit. It is not true that the number of concurrent jobs is unlimited.

There are two limits on the number of concurrent jobs for free, open source plans:

  • 20 total concurrent jobs
  • 5 of which may be macOS jobs

We're hitting at least the first limit. Due to the growing number of jobs in each CI run (#357, #373, etc.), combined with the high level of activity on this project, a number of jobs end up waiting in the queue for a slot to open before they can run.

For example, take a look at this run. If it had been able to launch all the jobs simultaneously, it would have completed in 45 seconds, because clippy failed quickly. But the clippy job didn't get a chance to start for ~6 minutes because of concurrent jobs from other pull requests.

Currently there are 9 jobs defined. That means we can only have 2 concurrent CI runs before we start piling up delays. I would suggest, at a minimum, to collapse 3 of the jobs (2 tests, 1 clean) into 2 of the build jobs, for a total of 6 jobs. Tests already have to build the code, so there shouldn't be any reduction in what CI is checking there. Adding the "clean" stuff on top of one of those may increase the max runtime of that particular job a little bit, but if we pick the job which completes the fastest on average, we may avoid increasing the max job time at all. Even if we do increase max job time by a few seconds, the overall increase in throughput is a clear win.
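For illustration only, folding the checks into one of the existing test jobs could look roughly like the sketch below. The job and step names are hypothetical, not the actual workflow file:

```yaml
# Hypothetical sketch: the tests already build the code, so the former "clean"
# checks ride along as extra steps instead of occupying their own jobs.
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run tests (builds the code anyway)
        run: cargo test --workspace
      - name: Formatting check (folded in from the old clean job)
        run: cargo fmt --all -- --check
      - name: Clippy (folded in from the old clean job)
        run: cargo clippy --workspace -- -D warnings
```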

If no one else beats me to it, I'm willing to make a PR for this later this week. I won't complain if someone else would rather do it, though. 😄

/cc @AngelOnFira

@Moxinilian added the A-Build-System and C-Performance labels Sep 2, 2020
@AngelOnFira
Contributor

Just want to mention, working on this now, will have a PR and writeup in a bit 👌

@AngelOnFira
Contributor

@CleanCut thanks for bringing this up, it's something I've kept an eye on but haven't opened a discussion for 😄 I'd love to go over my thought process and hear about anything I might not be considering 👌

There are two different directions to optimize in. First, we could optimize for fewer jobs per commit: if all of the CI work were in just 3 jobs, as you suggest, more commits could be tested at once. On the other hand, we could optimize for parallel jobs, which reduces the total time a single commit takes to pass CI. I've optimized for the latter.

After trying your suggested optimizations, I got some results. The times might not hold up in the wild, since this is just n=1. A potential downside of combining the jobs is that it's harder to tell whether a test or a check failed. One of the reasons I originally split them into separate jobs was that I wasn't sure how much their caches would clash; having looked deeper into it, the effect seems negligible:

| Configuration | Total job time | Cache size | Cache restore time |
| --- | --- | --- | --- |
| stable, cache, separate check | 2:11 | 87 MB | 3 s |
| stable, cache, separate test | 6:38 | 586 MB | 39 s |
| stable, pre-cache, combined test/check | 7:36 | - | - |
| stable, cache, combined test/check | 4:37 | 595 MB | 38 s |
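As a point of reference, a typical cargo cache step with the actions/cache action looks roughly like the snippet below; the paths and key layout are illustrative, not the real workflow:

```yaml
# Illustrative cache step shared by `cargo check` and `cargo test` in the
# combined job; paths and keys are a guess at a typical Rust CI setup.
- uses: actions/cache@v2
  with:
    path: |
      ~/.cargo/registry
      ~/.cargo/git
      target
    key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
    restore-keys: |
      ${{ runner.os }}-cargo-
```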

Also worth mentioning: I don't think commits to Bevy are as frequent and clustered as you assume. If we take a look at the Actions tab, commits are mostly spaced out, so it's fairly unlikely that multiple commits are running CI at the same time. Even when that happens, as long as each run uses fewer than 10 jobs, they should be able to run concurrently. Extending this, I believe it's quite unlikely that 3 commits are running at the same time.

However, to adapt to this, we can try to optimize for 6 jobs running concurrently per commit, which would allow 3 commits to be tested at once. To do this, we use the 3 operating systems, each testing on both nightly and stable; those are our 6 concurrent jobs. We then leave clean as its own job, but make the rest of the jobs "need" it before they can start. That way, if a commit fails, it fails fast and in the place it's most likely to.
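A rough sketch of that layout, with hypothetical job names and steps (not necessarily what the PR ends up with):

```yaml
# Sketch only: a fast "clean" job gates a 3 OS x 2 toolchain matrix (6 jobs).
jobs:
  clean:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: cargo fmt --all -- --check
      - run: cargo clippy --workspace -- -D warnings
  build-and-test:
    needs: clean
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        toolchain: [stable, nightly]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - uses: actions-rs/toolchain@v1
        with:
          toolchain: ${{ matrix.toolchain }}
          override: true
      - run: cargo test --workspace
```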

So tl;dr a few steps that I have taken instead of bundling the jobs entirely:

  1. Require the separate clean job to complete successfully before the others can start. This lets CI "fail fast" on the job that is most likely to fail, and guarantees it runs first among the required jobs for the commit. It also means failing commits leave a smaller footprint. The clean job itself is short, taking around 3 minutes.
  2. Combine the tests and check into the same job. It seems™️ that they synergize quite well, with a negligible increase in cache size and an improvement in running time. This comes from less per-job overhead, as well as check and test using very similar target folders.

Side note: something is stopping macOS + nightly from working properly. I've disabled it for now, and I'm going to look into why it might be happening on the cargo side. I made #380 for this already, but it's quite inconsistent.

Anyway, this turned into quite a post 😅 I'd love to hear what you think 💯

@CleanCut
Member Author

CleanCut commented Sep 3, 2020

👍 I like the plan. Given that jobs fail fast (one job failing cancels all the others), I'm a little on the fence about whether it's worth putting the clean job first as a separate dependent job, rather than just picking one of the 6 jobs and running the clean steps first. We could certainly give it a try and see how it works out in practice. In any case, I think this plan is a strict improvement over what we have now. 😄

There is one additional point to discuss: The limit of 5 concurrent mac jobs means that with our 2 mac jobs we will never be able to run 3 full CI runs concurrently. Is that worth optimizing for? I see four paths:

  1. Ignore the mac concurrency problem for now, leave it at 2 builds, and see if it bothers us.
  2. Drop to one stable mac build to ensure we can run 3 simultaneous full CI runs.
  3. Investigate whether making the nightly build not required and dependent on a stable build would let the more important build finish first (can you merge while CI is running if only non-required jobs are still going?). See the sketch after this list.
  4. Look into setting up a bot like bors to run a bigger set of runners just prior to merge, and a much smaller set for the typical push.
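For reference, option 3 would look roughly like the snippet below. Whether a check is "required" is configured in branch protection rather than in the workflow file, but `needs` plus `continue-on-error` is one common way to keep a nightly job from blocking; the job names here are hypothetical:

```yaml
# Hypothetical non-blocking nightly job (would sit under the workflow's jobs: map).
test-nightly:
  needs: test-stable         # start only after the hypothetical stable job passes
  continue-on-error: true    # a nightly failure won't fail the overall run
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - uses: actions-rs/toolchain@v1
      with:
        toolchain: nightly
        override: true
    - run: cargo test --workspace
```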

Which way we choose probably isn't all that important, because it's easy to try another option if we're not happy with how things turn out. With that in mind, I recommend option 1 (ignoring the problem) and seeing if that's good enough. 😉 Thanks for taking this on! ❤️

@cart
Member

cart commented Sep 3, 2020

Yup, I like the plan. I also think that failing fast means we might not actually need to worry about dependent jobs. Adding bors is an important next step as well, so we can maintain the integrity of our checks. I think that's something I'll need to add, but if anyone has experience and could give me the rundown, that would be great.

In regard to this issue, I think we have agreement that we should be able to merge #423 (except maybe without the dependency on clean)?

@CleanCut
Member Author

CleanCut commented Sep 3, 2020

#423 implements "the plan" + option 2, but leaves the clean job in as a separate job that everything depends on. It could be merged as-is. 👍

A later PR could combine the clean job with the Linux nightly job if we want to speed up successful jobs a bit more.

@CleanCut
Member Author

CleanCut commented Sep 3, 2020

> I also think adding bors is an important next step so we can maintain the integrity of our checks. I think that's something I'll need to add, but if anyone has experience and could give me the rundown, that would be great.

There are two main advantages to using bors:

  • bors merges master and the PR branch into a dedicated branch and runs CI on the merged version. This ensures that the merged code actually builds and passes tests, even when the Git merge itself is clean.
  • bors will try to combine things into batches using some fancy algorithms. So you can tell bors 10 pull requests are ready to merge, and it will combine them and try to test them all at once, then split them into smaller batches if something fails, and keep going until it has merged everything it can.

Most projects run bors as the free version of bors-ng, pre-packaged as a GitHub App; notable exceptions are rather large projects such as Rust itself, which run bors on their own hardware and customize its functionality even further. I recommend just using the free hosted option.

Configuring bors is pretty straightforward. The only gotcha I encountered when overhauling it for Amethyst was figuring out how to specify which GitHub Actions jobs to mark as required (answer: you match the generated job names that you see in a CI run).
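As an example of that wiring, assuming the standard bors-ng branch names, the CI workflow also needs to trigger on the branches bors pushes to, and the `status` list in bors.toml then has to match the job names GitHub generates:

```yaml
# Sketch of a workflow trigger for a bors-ng setup (standard bors branch names
# assumed). bors.toml's `status` entries must match the generated job names,
# e.g. "build-and-test (ubuntu-latest, stable)".
on:
  push:
    branches:
      - staging   # bors r+ batches are tested here
      - trying    # bors try
  pull_request:
```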

@cart
Member

cart commented Sep 3, 2020

Brilliant. Thanks!

@AngelOnFira
Contributor

Just to add on, #423 only removes macOS nightly because there is a build issue that I need to track down. This will require further examination into which of the 4 choices will work well. I still believe that option one has potential, so I'll keep an eye on the job frequency and see if I can pull together some stats.

@CleanCut
Member Author

CleanCut commented Sep 3, 2020

I trust @AngelOnFira can take it from here. I moved the bors stuff over to #425 for whenever folks want to return to it.

@CleanCut closed this as completed Sep 3, 2020