
tasks and discussion for new GPU CI queue #1272

Closed
1 of 11 tasks
beckermr opened this issue Mar 12, 2021 · 36 comments · Fixed by #2038
@beckermr
Member

beckermr commented Mar 12, 2021

This issue is to document tasks and todos for the new GPU CI queue.

To-Do Items:

  • decide on software to provision the CI (decided on Drone)
    • best candidates from internal discussions are Drone or Azure
  • develop a process to permission feedstocks on the CI
    • we will want an allow list of feedstocks that are allowed to use the queue
    • we will want to add a job to conda-forge/admin-requests to add feedstocks to this list and provision them with the proper keys/permissions to access the CI
    • add allowed list of users
  • put in changes to smithy to allow separate build and test phases in the CI config files
  • make sure the build phase does not tie up a GPU in the CI system
    • separate queues for build and test
    • move jobs from building on CPU to testing on GPU
  • put in monitoring for the load on the queues
    • we have existing tools that will be able to output the load in five-minute increments to the conda-forge status page
    • we may want more than this however
  • establish and document a clear process of who to contact when things fail or break
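The allow-list gating in the to-do items above could look something like this minimal sketch; the list entries, file layout, and function name are hypothetical illustrations, not conda-forge's actual implementation:

```python
# Hypothetical sketch of the feedstock allow list described above.
# Entry names and the helper function are illustrative only.

ALLOWED_GPU_FEEDSTOCKS = {
    "openmm-feedstock",  # example entry
    "cupy-feedstock",    # example entry
}

def may_use_gpu_queue(feedstock: str, allowed=ALLOWED_GPU_FEEDSTOCKS) -> bool:
    """Return True if this feedstock has been granted access to the GPU queue."""
    return feedstock in allowed
```

An admin-requests job would then append a feedstock name to the list and provision its keys, as outlined above.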

cc @dharhas @jakirkham @kkraus14 @viniciusdc

@mariusvniekerk for azure stuff

I don't have GitHub handles for Ray and Kent. Can someone ping them here?

hackmd: https://hackmd.io/QCX9xMnzS2WeobW0athINA

@beckermr pinned this issue Mar 12, 2021
@jakirkham
Member

cc @ocefpaf @mike-wendt @raydouglass @teoliphant

@jakirkham
Member

cc @h-vetinari (Axel)

@jakirkham
Member

Should add I also don't know Kent's or others' GH handles. So please cc others as needed. Thanks! 😄

@beckermr
Member Author

Ahhhh thanks! I could not find Axel's GitHub handle. See this one too for the other Azure stuff: #1273

@dharhas
Member

dharhas commented Mar 13, 2021

cc @kllewelyn @rtiwariops from openteams.

@leofang
Member

leofang commented Mar 13, 2021

Very exciting!

@h-vetinari
Member

h-vetinari commented Mar 13, 2021

Thanks for opening this @beckermr!

Linking my closing comment from #1062 for reference. TL;DR:

In short: it would be amazing if any of the involved people / companies could take this as impetus to chip in something as well. GPU computing is only ever going to get larger, and I believe that sharing some of the (comparatively low) CI infrastructure costs to enable conda-forge to do the building & integration would provide huge bang-per-buck for the people & companies that are building & using such packages.

@dharhas
Member

dharhas commented Mar 22, 2021

Adding @leej3 from Quansight.

@jakirkham
Member

cc @aktech (as it looks like you have been doing work in this area as well)

@viniciusdc
Contributor

@beckermr have we decided upon Azure as the software to provision the CI for the GPU tasks?

@dharhas
Member

dharhas commented Jul 15, 2021

Hey Folks,

Looks like this is finally happening. We expect hardware to arrive in 2-4 weeks. There are still a lot of unanswered questions about the software stack we should run on it and how to get it set up and managed, so I wanted to restart the discussion.

Also adding @jaimergp to the conversation.

@jakirkham
Member

Would it make sense to have a meeting?

@dharhas
Member

dharhas commented Jul 15, 2021

I think a coordination meeting makes sense. We probably need some higher bandwidth time to get broad strokes of what this will look like sorted out.

@jakirkham
Member

Ok. Went ahead and created a poll for us to figure out the best time to meet in the next two weeks. Please make sure to configure your timezone before filling out the poll. Will share the results here and we can go from there.

@leofang
Member

leofang commented Jul 15, 2021

May I invite myself? 😛

@jakirkham
Member

Appreciate the general enthusiasm around this work! 😄

Anyone is welcome. Though my guess is this will be focused on technical issues around integration into conda-forge. So doubt this will be of interest outside those planning to do that work. That said, we can take notes, raise new issues, and summarize here for broader community awareness.

@h-vetinari
Member

Awesome news! Would love to participate, but on holidays for the next two weeks 😅

@beckermr
Member Author

hey @jakirkham! I totally missed this poll. If there is a time for the next meeting already, that is fine. Otherwise, I have filled out the poll.

@jakirkham
Member

Ok, every time has some conflicts for someone. That said, the least conflicting time is 27 July at 9a US Pacific / 11a US Central / 12p US Eastern / 5p UK / 6p European. We can take notes and summarize here for those who miss it. Will send out an invite and we can go from there.

@jakirkham
Member

Alright have sent that out 📬

Think I got everyone who responded to the poll. Though feel free to forward to others whom I may have missed.

Also set it up with Microsoft Teams since that's what I have easy access to. Though if people prefer to use something different, feel free to propose (and be ready to set it up 😉). Otherwise we will stick with Teams.

Thanks all! 😄

@viniciusdc
Contributor

I'm not totally sure about the feasibility of this, but Drone seems to have an admin-management feature for queues.

@dharhas
Member

dharhas commented Jul 28, 2021

Just noticed that drone's open source version is fairly hobbled vs their paid version.

https://www.drone.io/enterprise/opensource/#features

Not sure if we need any of the features that are not present in the OSS version but I thought I'd raise.

@beckermr
Member Author

Thanks for this. We'll have to find out by doing I imagine.

@leofang
Member

leofang commented Jul 28, 2021

Btw, what's the GPU model that the CI would use? I was under the wrong impression yesterday that MIG would work out of the box for any existing model, but it looks like only certain Ampere GPUs support this feature.

@viniciusdc
Contributor

The server under Quansight is currently maintained using OpenStack, and we will be receiving an account for admin management. That said, the architecture we ended up with splits the GPUs across VMs (each containing 2 GPUs); we can change that later if needed, as OpenStack uses configurable profiles (called Flavors).

The idea would then be to use Drone to manage the webhook requests from GitHub and choose one of the VMs (which will have runners installed). Drone already supports OpenStack, so the implementation might be easier than we had previously assumed.

We will also need to think about how we will trigger these special jobs. Are we going to create new flags for those feedstocks? Should we add them to the allow list as well?
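The webhook-to-VM dispatch described here could be sketched roughly as follows; this is an illustrative toy (Drone handles scheduling internally in the real setup), and all names are hypothetical:

```python
# Illustrative sketch (not the actual Drone implementation) of dispatching an
# incoming GitHub webhook job to one of the OpenStack-provisioned runner VMs.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RunnerVM:
    name: str         # e.g. a VM carved out via an OpenStack "Flavor", holding 2 GPUs
    busy: bool = False

def pick_runner(pool: List[RunnerVM]) -> Optional[RunnerVM]:
    """Return the first idle VM in the pool, marking it busy; None if all are busy."""
    for vm in pool:
        if not vm.busy:
            vm.busy = True
            return vm
    return None
```

A real scheduler would also release VMs when jobs finish and enforce the feedstock allow list before dispatching.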

@viniciusdc
Contributor

viniciusdc commented Mar 22, 2022

@beckermr about the current status of the Cirun integration, do you think we could set up a feedstock to run some tests and check how the permissioning will be handled?

@beckermr
Member Author

I do not.

@viniciusdc
Contributor

> I do not.

Should we add a test suite somewhere in the bot's tests to evaluate the GPU builds before enabling it? I am open to any suggestions for testing this integration.

@beckermr
Member Author

We need to hear back from the Cirun folks.

@jakirkham
Member

jakirkham commented Mar 23, 2022

@jaimergp would you be able to share an update on where things stand at the meeting later today? 🙂

@jaimergp
Member

Will do!

@dharhas
Member

dharhas commented Mar 23, 2022

@aktech just realized you are not part of this thread. See the ping above about Cirun status.

@aktech
Copy link

aktech commented Mar 23, 2022

> We need to hear back from the Cirun folks.

@beckermr Let me know if you need anything from me; Cirun works with OpenStack as well.

@leofang
Member

leofang commented May 12, 2022

Just curious, what's the status here?

@jaimergp
Member

Hi Leo!

I gave a little update in the last call, but I should have posted it here too, my apologies!

@viniciusdc and I are working on a test feedstock (outside of conda-forge while prototyping) to build OpenMM on the GPU server. We can successfully run CPU builds with almost no changes in the conda-forge machinery (same Docker images and scripts). We are now debugging the GPU drivers, which are preventing us from correctly invoking the nvidia-docker integrations.

We are now waiting for the datacenter tech support to get back to us on the ticket we submitted some days ago.

So, slowly but surely we are getting there!

@beckermr
Member Author

Resources for automating Cirun:

  • Cirun API client: https://github.com/AktechLabs/cirun-py
  • GitHub API endpoint: https://docs.github.com/en/rest/apps/installations#add-a-repository-to-an-app-installation
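As a rough illustration, the GitHub endpoint linked above ("Add a repository to an app installation") takes a PUT request. The sketch below only constructs the request; the IDs and token are placeholders, and you would send it with an HTTP client of your choice (e.g. `requests.put(url, headers=headers)`):

```python
# Build (but do not send) the request for GitHub's
# "Add a repository to an app installation" endpoint:
#   PUT /user/installations/{installation_id}/repositories/{repository_id}
# The installation/repository IDs and token are placeholders.

def build_add_repo_request(installation_id, repository_id, token):
    url = (
        "https://api.github.com/user/installations/"
        f"{installation_id}/repositories/{repository_id}"
    )
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {token}",
    }
    return "PUT", url, headers

method, url, headers = build_add_repo_request(123, 456, "<token>")
```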
