Allow generating different CI jobs for tests and/or separate outputs #1472

Open
h-vetinari opened this issue Apr 9, 2021 · 23 comments

@h-vetinari
Member

This is a topic I've been thinking about for a while regarding run-times (cf. conda-forge/conda-forge.github.io#902), but it also came up recently in the context of a GPU-specific build queue, see for example the notes of a conda-forge meeting about a month ago (cf. also conda-forge/conda-forge.github.io#1272):

may need to figure out how to build on CPU and test on the GPU since locking the GPU during build is expensive and unneeded.

The core idea is to enable an opt-in per feedstock to run certain test sections (or outputs) in a separate CI job.

  • This would obviously help if the build times are close to timing out (at the cost of persisting the artefact somewhere, though perhaps Azure even supports that natively within the same job group...)
  • It becomes even more important for a possible GPU-specific queue. Since such a queue would have limited resources, it would be a waste to build GPU-enabled packages on that queue when the build process generally does not need GPUs itself (the GPU queue could thus be restricted to just testing).

In fact, I think the best as well as easiest solution might be to allow opting into separate jobs per output. Not only can tests be moved to a separate output (in a multi-output recipe; I've seen this in several places already, cf. e.g. pyarrow), but it would also provide a way to bring long-running recipes under the 6h limit, as long as the build steps can be modularized even a little bit - not least in recipes that are already multi-output (e.g. it would help me a lot on the faiss-feedstock, where the GPU lib takes 3-5h to build, often leaving not enough time for the python bindings).

The one thing that would be necessary, of course, is a simple graph of which outputs depend on which other ones, so that they can get built in the correct order. From looking at the CI in the staged-recipes repo (where e.g. GPU builds aren't started by default), it seems that such functionality should be possible in principle...
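To illustrate the kind of dependency graph I mean, here is a minimal sketch (the outputs and their dependencies are hypothetical, loosely modelled on faiss; this is not anything conda-smithy or conda-build does today):

```python
# Minimal sketch: given a hypothetical mapping of which outputs depend on
# which other outputs, compute an order in which per-output jobs could run.
from graphlib import TopologicalSorter  # Python >= 3.9

# Hypothetical multi-output recipe: the Python bindings need the compiled
# library, and a test-only output needs the bindings.
output_deps = {
    "libfaiss": set(),
    "faiss": {"libfaiss"},
    "faiss-tests": {"faiss"},
}

build_order = list(TopologicalSorter(output_deps).static_order())
print(build_order)  # -> ['libfaiss', 'faiss', 'faiss-tests']
```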

CC @beckermr @isuruf @wolfv @jakirkham

@isuruf
Member

isuruf commented Apr 30, 2021

e.g. it would help me a lot on the faiss-feedstock, where the GPU lib takes between 3-5h to build, often leaving not enough time for the python bindings

I don't see how this proposal would help in this case.

@h-vetinari
Member Author

Thanks for chiming in @isuruf!

@h-vetinari: e.g. it would help me a lot on the faiss-feedstock, where the GPU lib takes between 3-5h to build, often leaving not enough time for the python bindings

@isuruf: I don't see how this proposal would help in this case.

Well, roughly: the infra takes 0.5h, the lib takes 3-5h (on Linux; 4-6h on Windows), and the python bindings + tests take 1h. Splitting off the latter means taking 1h off of a job that's flirting with the 6h timeout, and would definitely increase the success rate (if not to 100%, then from ~30% for the Windows CUDA builds to ~80%).

But I want to stress that this is not the key driver of this issue - I just thought that splitting jobs by output might actually be easier than separating build & test phases (and noted this as an ancillary benefit).

@beckermr
Member

The original reason for this issue is to not tie up GPUs on the CI while building code. If we can arrange other ways to prevent that, then we don't need to do this.

@isuruf
Member

isuruf commented Apr 30, 2021

I just thought that splitting jobs by output might actually be easier than separating build & test phases

No, it's not. While conda-build has functionality to separate the build & test phases, there's nothing that can split jobs by output, and it would be impossible to implement.

@h-vetinari
Member Author

The original reason for this issue is to not tie up GPUs on the CI while building code. If we can arrange other ways to prevent that, then we don't need to do this.

Absolutely, but I don't see how we can get around having to choose a different agent for running the test suite?

While conda-build has functionality to separate the build & test phases, there's nothing that can split jobs by output, and it would be impossible to implement.

I know that conda-build supports that split, but I was also thinking about the part where this needs to be configurable & generatable by smithy. I assume I'm not seeing the full picture re:implementability, but conda-build already generates a DAG of outputs and processes them in the correct order. The only point that I imagined would need to be added is to optionally ingest an artefact rather than running a particular output's build script.
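Roughly what I have in mind, as a sketch only (the helper name and layout are hypothetical, this is not conda-build or conda-smithy API): per output, either ingest an artefact produced by an earlier CI job, or fall back to running that output's build script.

```python
# Hypothetical sketch, not an existing conda-build/conda-smithy helper.
import os
import shutil
import subprocess

def build_or_ingest(output_name: str, artefact_dir: str, build_script: str) -> str:
    """Return a package path for `output_name`, reusing an artefact from a
    previous CI job if one was passed along, otherwise building it here."""
    cached = os.path.join(artefact_dir, f"{output_name}.tar.bz2")
    if os.path.isfile(cached):
        # A previous job already built this output; just ingest the artefact.
        os.makedirs("build_artifacts", exist_ok=True)
        return shutil.copy(cached, "build_artifacts")
    # No artefact available: build this output in the current job.
    subprocess.run(["bash", build_script], check=True)
    return os.path.join("build_artifacts", f"{output_name}.tar.bz2")
```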

@beckermr
Member

@jakirkham This all depends on the software that runs the CI server. That software doesn't have to tie a given container to a specific GPU for the whole job.

@isuruf
Member

isuruf commented Apr 30, 2021

re:implementability, but conda-build already generates a DAG of outputs and processes them in the correct order.

conda-build runs the top-level build.sh first and then copies the working directory for each output. Therefore building one output on one machine and another output on another machine is impossible.

@h-vetinari
Member Author

conda-build runs the top-level build.sh first and then copies the working directory for each output. Therefore building one output on one machine and another output on another machine is impossible.

Some recipes do not have a top-level build.sh, and get by with one script per output based on the same sources (and possibly relying on previously compiled subpackages). Would that be a reasonable restriction?

@h-vetinari
Member Author

@beckermr: This all depends on the software that runs the CI server. That software doesn't have to tie a given container to a specific GPU for the whole job.

[Assuming this was addressed to me 😅]
That's a fair point; I guess I was thinking mostly about Azure (where I'm not aware this would be possible), because it'd still be nice to reuse those agents for long-compiling jobs, and then switch to the GPU queue for testing.

@beckermr
Member

🤦

@beckermr
Member

On Azure we can upload artifacts and pass them between jobs, iiuc.

@h-vetinari
Member Author

On Azure we can upload artifacts and pass them between jobs, iiuc.

Definitely, and if we go with the build/test split, then the question would just be how that split can be configured, generated & executed by smithy (which is what I originally intended this issue to be about - sorry for complicating the discussion with the other option of splitting on outputs).

@beckermr
Member

💯

@isuruf
Member

isuruf commented Apr 30, 2021

On Azure we can upload artifacts and pass them between jobs, iiuc.

This would restrict us to using the same provider for build and test. I'd rather we use the artifacts from cf-staging to test: one CI uploads the artifacts, and the other downloads and runs them.

@beckermr
Member

🤔

@beckermr
Member

beckermr commented Apr 30, 2021

Our current infrastructure cannot securely support this. A well-timed attacker could inject another feedstock's artifact at just the right moment since uploads to cf-staging are not gated in any way.

We could upload to a test label, pull from there, and then make a secure call over https to our admin server to relabel the object if the tests pass.
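Very roughly, the relabel step could look something like this (everything here is hypothetical: the endpoint, payload and token handling do not exist today and would have to be added to the admin server):

```python
# Hypothetical sketch of the "relabel if tests pass" call; this is not an
# existing conda-forge API, only an illustration of the proposed flow.
import requests

ADMIN_SERVER = "https://example.invalid/conda-forge-admin"  # hypothetical URL

def relabel_if_tests_passed(package_fn: str, token: str) -> None:
    """Ask the (hypothetical) admin server to move a package that was
    uploaded under a 'test' label over to 'main' after its test job passed."""
    resp = requests.post(
        f"{ADMIN_SERVER}/relabel",
        json={"package": package_fn, "from_label": "test", "to_label": "main"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
```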

@beckermr
Member

I also do not know how we will be able to orchestrate dependent jobs across CI services.

@beckermr
Member

Actually no, we can't use a test label since we cannot upload to our channels from PRs on forks.

@h-vetinari
Member Author

Staying within Azure sounds like a reasonable first step then? It's not great to be locked in like that, but at least one GPU queue proposal was Azure-based anyway, so that might provide a faster way forward (while figuring out secure artefact sharing within PRs can still come later).

@isuruf
Member

isuruf commented Apr 30, 2021

Do we want to test PRs on GPU queue though?

@h-vetinari
Member Author

Do we want to test PRs on GPU queue though?

That was my assumption. I guess it's another thing that could ultimately be considered not crucial - running any GPU tests on master only might suffice. It would be more painful/manual to not have GPU CI on PRs, but I guess that would in a way be even more resource-conservative 😄

@h-vetinari
Member Author

So we have two options (third one just for completeness):

| Artefact sharing | Cache within Azure | Push to test label | Push to cf-staging |
| --- | --- | --- | --- |
| Can be made secure | ✔️ | ✔️ | |
| Share across CI providers | | ✔️ | ✔️ |
| Can run in PRs | ✔️ | | |
| Implementation effort | | | |

While it would be nice to be able to debug GPU builds in PRs, this could still be done manually with artefact persistence... Sharing between CI providers might be more desirable in comparison, even if it's master-only.

@beckermr
Member

beckermr commented May 1, 2021

Let's start with caching within Azure.
