Allow generating different CI jobs for tests and/or separate outputs #1472
I don't see how this proposal would help in this case.
Thanks for chiming in @isuruf!
Well, roughly the infra takes 0.5h, then the lib takes 3-5h (on Linux; 4-6h on Windows) and the python bindings + tests take 1h. Splitting off the latter means taking 1h off of a job that's flirting with the 6h timeout, and would definitely increase the success rate (if not to 100%, then from ~30% for the Windows CUDA builds to ~80%). But I want to stress that this is not the key driver of this issue - I just thought that splitting jobs by output might actually be easier than separating the build & test phases (and noted this as an ancillary benefit).
The original reason for this issue is to not tie up GPUs on the CI while building code. If we can arrange other ways to prevent that, then we don't need to do this.
No, it's not. While conda-build has functionality to separate the build & test phases, there's nothing there that can split jobs by output, and it would be impossible to implement.
Absolutely, but I don't see how we can get around having to choose a different agent for running the test suite?
I know that conda-build supports that split, but I was also thinking about the part where this needs to be configurable & generatable by smithy. I assume I'm not seeing the full picture re: implementability, but conda-build already generates a DAG of outputs and processes them in the correct order. The only piece that I imagined would need to be added is to optionally ingest an artefact rather than running a particular output's build script.
@jakirkham This all depends on the software that runs the CI server. That software doesn't have to tie a given container to a specific GPU for the whole job.
conda-build runs the top-level build.sh first and then copies the working directory for each output. Therefore building one output on one machine and another output on a different machine is impossible.
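For illustration, a rough sketch of the multi-output layout being discussed; package names and scripts here are purely hypothetical, the point is that the top-level script runs once and every output then starts from a copy of that working directory:

```yaml
# Illustrative multi-output recipe (names and scripts are hypothetical)
package:
  name: somelib-split
  version: 1.0.0

build:
  number: 0
  script: build_lib.sh          # top-level build runs first, once

outputs:
  - name: libsomelib
    script: install_lib.sh      # runs against a copy of the shared working directory
  - name: somelib-python
    script: build_python.sh     # also starts from the same copied working directory
    requirements:
      host:
        - {{ pin_subpackage('libsomelib', exact=True) }}
        - python
```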
Some recipes do not have a top-level build.sh.
[Assuming this was addressed to me 😅]
🤦
On Azure we can upload artifacts and pass them between jobs, IIUIC.
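As a sketch of what that could look like (job names, pools, scripts and paths below are assumptions, not what smithy currently generates), Azure Pipelines lets one job publish a pipeline artifact and a dependent job download it:

```yaml
# Hypothetical azure-pipelines.yml fragment: build on a CPU agent,
# then test on a (hypothetical) GPU pool using the published artifact.
jobs:
  - job: build_lib
    pool:
      vmImage: ubuntu-latest
    steps:
      - script: ./ci/build_lib.sh           # hypothetical build-only script
        displayName: Build library
      - publish: $(Build.SourcesDirectory)/build_artifacts
        artifact: lib-build

  - job: test_bindings
    dependsOn: build_lib
    pool: gpu-queue                          # hypothetical GPU agent pool
    steps:
      - download: current
        artifact: lib-build                  # lands in $(Pipeline.Workspace)/lib-build
      - script: ./ci/test_bindings.sh        # hypothetical test script
        displayName: Build bindings and run tests
```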
Definitely, and if we go with the build/test split, then the question is just how that split could be configured, generated & executed by smithy (which is what I originally intended this issue to be about - sorry for complicating the discussion with the other option about splitting on outputs).
💯
This would restrict us to using the same provider for the build/test. I'd rather we use the artifacts from ...
🤔
Our current infrastructure cannot securely support this. A well-timed attacker could inject another feedstock's artifact at just the right moment since uploads to cf-staging are not gated in any way. We could upload to a test label, pull from there, and then make a secure call over https to our admin server to relabel the object if the tests pass.
I also do not know how we will be able to orchestrate dependent jobs across CI services.
Actually no, we can't use a test label since we cannot upload to our channels from PRs on forks.
Staying within Azure sounds like a reasonable first step then? It's not great to be locked in like that, but at least one GPU queue proposal was Azure-based anyway, so that might provide a faster way forward (while figuring out secure artefact sharing within PRs can still come later).
Do we want to test PRs on GPU queue though?
That was my assumption. I guess GPU testing on PRs is another thing that could ultimately be considered not as crucial, compared to having GPU tests run on master only. It would be more painful/manual to not have GPU CI on PRs, but I guess that would in a way be even more resource-conservative 😄
So we have two options (third one just for completeness):
While it would be nice to be able to debug GPU builds in PRs, this could still be done manually with artefact persistence... Sharing between CI providers might be more desirable in comparison, even if it's master-only.
Let's start with caching within Azure.
This is a topic I've been thinking about for a while regarding run-times (cf. conda-forge/conda-forge.github.io#902), but it also came up recently in the context of a GPU-specific build queue; see, for example, the notes of a conda-forge meeting about a month ago (cf. also conda-forge/conda-forge.github.io#1272):
The core idea is to enable an opt-in per feedstock to run certain test sections (or outputs) in a separate CI job.
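Purely as a strawman for what such an opt-in could look like in conda-forge.yml (none of these keys exist today; they are assumptions for discussion):

```yaml
# Hypothetical conda-forge.yml keys - not implemented, just to make the idea concrete
azure:
  split_build_test: true        # hypothetical: render separate build and test jobs
  test_pool: gpu-queue          # hypothetical: agent pool used only for the test job
```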
In fact, I think the best (and perhaps easiest) solution might be to allow opting into separate jobs per output. Not only can tests be moved to a separate output (in a multi-output recipe; I've seen this in several places already, cf. e.g. pyarrow), but that would also provide a way to bring long-running recipes under the 6h limit, as long as the build steps can be modularized even a little bit - not least in recipes that are already multi-output (e.g. it would help me a lot on the faiss-feedstock, where the GPU lib takes 3-5h to build, often leaving not enough time for the python bindings).
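A rough sketch of that "tests as a separate output" pattern (output and script names below are made up; see the pyarrow recipe for a real-world variant): the tests live in an output that only depends on the already-built library, so it could in principle run as its own job.

```yaml
# Illustrative outputs section; names are hypothetical
outputs:
  - name: libfoo                 # the long-building library
    script: build_lib.sh
  - name: libfoo-tests           # output that exists mainly to run the test suite
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
    test:
      commands:
        - ./run_tests.sh         # hypothetical test driver
```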
The one thing that would of course be necessary is a simple graph of which outputs depend on which others, so that they can be built in the correct order. From looking at the CI in the staged-recipes repo (where e.g. GPU builds aren't started by default), it seems that such functionality should be possible in principle...
CC @beckermr @isuruf @wolfv @jakirkham