
tasks and discussion for new GPU CI queue #1272

Closed
1 of 11 tasks
beckermr opened this issue Mar 12, 2021 · 36 comments · Fixed by #2038
@beckermr
Member

beckermr commented Mar 12, 2021

This issue is to document tasks and todos for the new GPU CI queue.

To-Do Items:

  • decide on software to provision the CI (decided on Drone)
    • best candidates from internal discussions are Drone or Azure
  • develop a process to permission feedstocks on the CI
    • we will want an allow list of feedstocks that are allowed to use the queue
    • we will want to add a job to conda-forge/admin-requests to add feedstocks to this list and provision them with the proper keys/permissions to access the CI
    • add allowed list of users
  • put in changes to smithy to allow separate build and test phases in the CI config files
  • make sure the build phase does not tie up a GPU in the CI system
    • separate queues for build and test
    • move jobs from building on CPU to testing on GPU
  • put in monitoring for the load on the queues
    • we have existing tools that will be able to output the load in five-minute increments to the conda-forge status page
    • we may want more than this however
  • establish and document a clear process of who to contact when things fail or break
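The allow-list gating in the to-do items above could look something like this minimal sketch; the list entries, file layout, and function name are hypothetical illustrations, not conda-forge's actual implementation:

```python
# Hypothetical sketch of the feedstock allow list described above.
# Entry names and the helper function are illustrative only.

ALLOWED_GPU_FEEDSTOCKS = {
    "openmm-feedstock",  # example entry
    "cupy-feedstock",    # example entry
}

def may_use_gpu_queue(feedstock: str, allowed=ALLOWED_GPU_FEEDSTOCKS) -> bool:
    """Return True if this feedstock has been granted access to the GPU queue."""
    return feedstock in allowed
```

An admin-requests job would then append a feedstock name to the list and provision its keys, as outlined above.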

cc @dharhas @jakirkham @kkraus14 @viniciusdc

@mariusvniekerk for azure stuff

I don't have GitHub handles for Ray and Kent. Can someone ping them here?

hackmd: https://hackmd.io/QCX9xMnzS2WeobW0athINA

@beckermr pinned this issue Mar 12, 2021
@jakirkham
Member

cc @ocefpaf @mike-wendt @raydouglass @teoliphant

@jakirkham
Member

cc @h-vetinari (Axel)

@jakirkham
Member

Should add I also don't know Kent's or others' GH handles. So please cc others as needed. Thanks! 😄

@beckermr
Member Author

Ahhhh thanks! I could not find Axel's GitHub handle. See this one too for the other Azure stuff: #1273

@dharhas
Member

dharhas commented Mar 13, 2021

cc @kllewelyn @rtiwariops from openteams.

@leofang
Member

leofang commented Mar 13, 2021

Very exciting!

@h-vetinari
Member

h-vetinari commented Mar 13, 2021

Thanks for opening this @beckermr!

Linking my closing comment from #1062 for reference. TL;DR:

In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some of the (comparatively low) CI infrastructure costs to enable conda-forge to do the building & integration would provide huge bang-per-buck for the people & companies that are building & using such packages.

@dharhas
Member

dharhas commented Mar 22, 2021

Adding @leej3 from Quansight.

@jakirkham
Member

cc @aktech (as it looks like you have been doing work in this area as well)

@viniciusdc
Contributor

@beckermr have we decided upon Azure as the software to provision the CI for the GPU tasks?

@dharhas
Member

dharhas commented Jul 15, 2021

Hey Folks,

Looks like this is finally happening. We expect hardware to arrive in 2-4 weeks. There are still a lot of unanswered questions about the software stack we should run on it and how to get it set up and managed, so I wanted to restart the discussion.

Also adding @jaimergp to the conversation.

@jakirkham
Member

Would it make sense to have a meeting?

@dharhas
Member

dharhas commented Jul 15, 2021

I think a coordination meeting makes sense. We probably need some higher bandwidth time to get broad strokes of what this will look like sorted out.

@jakirkham
Member

Ok. Went ahead and created a poll for us to figure out the best time to meet in the next two weeks. Please make sure to configure your timezone before filling out the poll. Will share the results here and we can go from there.

@leofang
Member

leofang commented Jul 15, 2021

May I invite myself? 😛

@jakirkham
Member

Appreciate the general enthusiasm around this work! 😄

Anyone is welcome. Though my guess is this will be focused on technical issues around integration into conda-forge. So doubt this will be of interest outside those planning to do that work. That said, we can take notes, raise new issues, and summarize here for broader community awareness.

@h-vetinari
Member

Awesome news! Would love to participate, but on holidays for the next two weeks 😅

@beckermr
Member Author

hey @jakirkham! I totally missed this poll. If there is a time for the next meeting already, that is fine. Otherwise, I have filled out the poll.

@jakirkham
Member

Ok, every time has some conflicts for someone. That said, the least conflicting time is 27 July at 9a US Pacific / 11a US Central / 12p US Eastern / 5p UK / 6p European. We can take notes and summarize here for those who miss it. Will send out an invite and we can go from there.

@jakirkham
Member

Alright have sent that out 📬

Think I got everyone who responded to the poll. Though feel free to forward to others whom I may have missed.

Also set it up with Microsoft Teams since that's what I have easy access to. Though if people prefer to use something different, feel free to propose (and be ready to set it up 😉). Otherwise we will stick with Teams.

Thanks all! 😄

@viniciusdc
Contributor

I'm not totally sure about the feasibility of this, but Drone seems to have an admin-management feature for queues.

@dharhas
Member

dharhas commented Jul 28, 2021

Just noticed that drone's open source version is fairly hobbled vs their paid version.

https://www.drone.io/enterprise/opensource/#features

Not sure if we need any of the features that are not present in the OSS version but I thought I'd raise.

@beckermr
Member Author

Thanks for this. We'll have to find out by doing I imagine.

@leofang
Member

leofang commented Jul 28, 2021

Btw, what's the GPU model that the CI would use? I was under the wrong impression yesterday that MIG would work out of the box for any existing model, but it looks like only certain Ampere GPUs support this feature.

@viniciusdc
Contributor

The server under Quansight is currently maintained using OpenStack, and we will be receiving an account for admin management. That said, the architecture we ended up with splits the GPUs across VMs (each containing 2 GPUs); we can change that later if needed, as OpenStack uses configurable profiles (called Flavors).

The idea would then be to use Drone to manage the webhook requests from GitHub and choose one of the VMs (which will have runners installed). Drone already supports OpenStack, so the implementation might be easier than we had previously assumed.

We will also need to think about how we will trigger these special jobs. Are we going to create new flags for those feedstocks? Should we add them to the allow list as well?
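The webhook-to-VM dispatch described here could be sketched roughly as follows; this is an illustrative toy (Drone handles scheduling internally in the real setup), and all names are hypothetical:

```python
# Illustrative sketch (not the actual Drone implementation) of dispatching an
# incoming GitHub webhook job to one of the OpenStack-provisioned runner VMs.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RunnerVM:
    name: str         # e.g. a VM carved out via an OpenStack "Flavor", holding 2 GPUs
    busy: bool = False

def pick_runner(pool: List[RunnerVM]) -> Optional[RunnerVM]:
    """Return the first idle VM in the pool, marking it busy; None if all are busy."""
    for vm in pool:
        if not vm.busy:
            vm.busy = True
            return vm
    return None
```

A real scheduler would also release VMs when jobs finish and enforce the feedstock allow list before dispatching.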

@viniciusdc
Contributor

viniciusdc commented Mar 22, 2022

@beckermr about the current status of the Cirun integration, do you think we could set up a feedstock to run some tests and check how the permissioning will be handled?

@beckermr
Member Author

I do not.

@viniciusdc
Contributor

> I do not.

Should we add a test suite somewhere in the bot's tests to evaluate the GPU builds before enabling it? I am open to any suggestions for testing this integration.

@beckermr
Member Author

We need to hear back from the Cirun folks.

@jakirkham
Member

jakirkham commented Mar 23, 2022

@jaimergp would you be able to share an update on where things stand at the meeting later today? 🙂

@jaimergp
Member

Will do!

@dharhas
Member

dharhas commented Mar 23, 2022

@aktech just realized you are not part of this thread. See the ping above about Cirun status.

@aktech
Copy link

aktech commented Mar 23, 2022

> We need to hear back from the Cirun folks.

@beckermr Let me know if you need anything from me; Cirun works with OpenStack as well.

@leofang
Member

leofang commented May 12, 2022

Just curious, what's the status here?

@jaimergp
Member

Hi Leo!

I gave a little update in the last call, but I should have posted it here too, my apologies!

@viniciusdc and I are working on a test feedstock (outside of conda-forge while prototyping) to build OpenMM on the GPU server. We can successfully run CPU builds with almost no changes in the conda-forge machinery (same Docker images and scripts). We are now debugging the GPU drivers, which are preventing us from correctly invoking the nvidia-docker integrations.

We are now waiting for the datacenter tech support to get back to us on the ticket we submitted some days ago.

So, slowly but surely we are getting there!

@beckermr
Member Author

Resources for automating Cirun:

  • Cirun API client: https://github.com/AktechLabs/cirun-py
  • GitHub API endpoint: https://docs.github.com/en/rest/apps/installations#add-a-repository-to-an-app-installation
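As a rough illustration, the GitHub endpoint linked above ("Add a repository to an app installation") takes a PUT request. The sketch below only constructs the request; the IDs and token are placeholders, and you would send it with an HTTP client of your choice (e.g. `requests.put(url, headers=headers)`):

```python
# Build (but do not send) the request for GitHub's
# "Add a repository to an app installation" endpoint:
#   PUT /user/installations/{installation_id}/repositories/{repository_id}
# The installation/repository IDs and token are placeholders.

def build_add_repo_request(installation_id, repository_id, token):
    url = (
        "https://api.github.com/user/installations/"
        f"{installation_id}/repositories/{repository_id}"
    )
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {token}",
    }
    return "PUT", url, headers

method, url, headers = build_add_repo_request(123, 456, "<token>")
```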
