Add experimental GPU support #5605

matiasinsaurralde · 2023-08-09T21:39:39Z

Implements basic GPU access by relying on the nvidia-container-runtime-hook.
Allows specifying the GPU ID to be exposed to the container on machines that host multiple GPUs.
A new WithGPU method is implemented to allow specifying GPU ID (or just all if you want to expose all GPUs):

...
	ctr := c.Container().From(cudaImage)
	contents, err := ctr.
		// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
		WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
		WithExec([]string{"nvidia-smi", "-L"}).
		Stdout(ctx)
...

There's an integration test that tests functionality for both single GPU and multi GPU environments here.
Requires the host to add a shell script in /usr/bin/nvidia_helper.sh with the following contents (still don't know the best way to ship this and the commands look too verbose and probably unreadable if we inline them directly with WithExec, open to suggestions!): nvidia_helper.sh contents.
For now the dev engine container was switched completely to Ubuntu, and we should be able to improve this part. For example, if the user requires GPU access, use Ubuntu as the base image; if not, keep using Alpine. The reasons were mentioned here. TL;DR: Nvidia doesn't ship official container runtime tooling for Alpine, as you can find here, and it looks generally better and safer to use a standard base image rather than hacking around Alpine to make it work.

Ticket is #4675

TomChv

Amazing PR! I left a few questions and comments and look forward to a deeper review when the PR is ready :D

cmd/shim/main.go

core/schema/container.graphqls

internal/mage/util/engine.go

TomChv · 2023-08-10T13:58:44Z

Requires the host to add a shell script in /usr/bin/nvidia_helper.sh with the following contents (still don't know the best way to ship this and the commands look too verbose and probably unreadable if we inline them directly with WithExec, open to suggestions!): nvidia_helper.sh contents.

We could embed it in the Dagger binary and insert it into the container on the call to WithGPU.

matiasinsaurralde · 2023-08-15T17:03:48Z

Updates:

By default no GPU access is attempted.
An environment variable called _EXPERIMENTAL_DAGGER_GPU_SUPPORT needs to be set to enable GPU support.
If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is set dev engine starts with Ubuntu and installs all Nvidia requirements.
If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is not set, dev engine starts with Alpine with all regular dependencies.
If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is not set and WithGPU is used before WithExec, WithExec returns an error:

paperspace@psal6i8au:~/go/src/github.com/dagger/dagger$ go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration
=== RUN   TestGPUAccess
    gpu_test.go:43: 
        	Error Trace:	/home/paperspace/go/src/github.com/dagger/dagger/core/integration/gpu_test.go:43
        	Error:      	Received unexpected error:
        	            	input:1: container.from.withGPU.withExec GPU support is not enabled, set _EXPERIMENTAL_DAGGER_GPU_SUPPORT
        	            	
        	            	Please visit https://dagger.io/help#go for troubleshooting guidance.
        	Test:       	TestGPUAccess
--- FAIL: TestGPUAccess (1.57s)
FAIL
FAIL	github.com/dagger/dagger/core/integration	1.593s
FAIL

If _EXPERIMENTAL_DAGGER_GPU_SUPPORT is set and WithGPU is used and there are available GPUs, selected GPUs should be exposed to the container as requested.
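The gating rules listed in this update could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `gpuSupportEnabled`, `engineBaseImage` and `checkGPUExec` are hypothetical names, the image names are placeholders, and treating an empty value as "not set" is an assumption. Only the environment variable name and the error text come from the thread.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// gpuSupportEnv is the experimental flag named in the thread.
const gpuSupportEnv = "_EXPERIMENTAL_DAGGER_GPU_SUPPORT"

// gpuSupportEnabled reports whether the flag is set to a non-empty value.
// lookup is injected (os.LookupEnv in real use) so the logic is easy to test.
// Assumption: an empty value counts as "not set".
func gpuSupportEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(gpuSupportEnv)
	return ok && v != ""
}

// engineBaseImage picks the dev engine base image (placeholder names).
func engineBaseImage(gpuEnabled bool) string {
	if gpuEnabled {
		return "ubuntu"
	}
	return "alpine"
}

// checkGPUExec rejects an exec that requests GPUs while the flag is off.
func checkGPUExec(gpuEnabled bool, requestedGPUs []string) error {
	if len(requestedGPUs) > 0 && !gpuEnabled {
		return errors.New("GPU support is not enabled, set " + gpuSupportEnv)
	}
	return nil
}

func main() {
	enabled := gpuSupportEnabled(os.LookupEnv)
	fmt.Println("base image:", engineBaseImage(enabled))
	if err := checkGPUExec(enabled, []string{"all"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Injecting the lookup function keeps the decision logic independent of the process environment, which makes the three cases above easy to exercise in a unit test.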

shykes · 2023-08-15T18:08:37Z

Replying here to a discord message by @matiasinsaurralde :

We introduce an environment variable to enable experimental GPU support. If this is not set we follow the regular Dagger behavior (Alpine is used, etc.).

If the environment variable is set we use an Ubuntu base image and setup all Nvidia dependencies on it.

I've also spent time testing different scenarios like what happens when you run in a host that doesn't have any Nvidia GPUs, etc.

Shim is also conditioned to the experimental GPU support flag, e.g. won't attempt to inject the Nvidia container runtime hook if this is not set.

Basic flow to try it out should be: (i) add export _EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 to hack/dev. (ii) run ./hack/dev bash, (iii) Run the tests (or replace them with anything you would like to try): go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration

Is the idea to move everything to ubuntu when the experimental gate is removed ? I worry about maintaining two different images for a long period of time.

TomChv

Nice improvement, I really like it!

I left a comment related to actual usage outside of Dagger dogfooding :)

Careful btw, the DCO is currently failing :)

TomChv · 2023-08-15T19:36:12Z

internal/mage/util/engine.go

@@ -58,6 +61,17 @@ insecure-entitlements = ["security.insecure"]
 {{ end -}}
 `

+// nvidiaSetupHelper provides the required steps to setup nvidia-container-toolkit:
+const nvidiaSetupHelper = `


What if I want to use it on my computer for my own pipeline? I won't have access to this file since it's internal to mage.
Is there another way to handle that? Or should I prepare a container the same way you do here, but on my own?

@TomChv Good point, thinking this could be moved to the shim, and the file gets initialized there if GPU access is enabled?

I think that could work yeah, give it a try

Have been exploring this a bit, would it be better to setup the Nvidia runtime at this level? https://github.com/matiasinsaurralde/dagger/blob/gpu-access-2/core/container.go#L1034

So WithGPU is called, Nvidia runtime is setup and we still pass the parameters to shim (we'll always need this to signal GPU visibility to the prestart hook).
I don't see an alternative that doesn't involve installing the Nvidia runtime every time we create and start a container. We previously tried mounting Nvidia runtime files from the host into the container (so that no installation step happens), but that turns out to be tricky if the container and the host aren't running similar environments.

On the other side I believe that if WithGPU introduces an additional step to run the helper script and install the Nvidia runtime it could play well with caching. Subsequent runs wouldn't be installing the runtime again. Makes sense?

Let me know if I misunderstood the scenario you described, still thinking about this.

Have been exploring this a bit, would it be better to setup the Nvidia runtime at this level? https://github.com/matiasinsaurralde/dagger/blob/gpu-access-2/core/container.go#L1034

This would be better, but you do not know which image is used by the container. Since the Nvidia runtime can also be set up on Ubuntu (as far as I understand), this step will mostly fail unless the base image is correct.
That would become tricky because part of the setup would be up to the user and part would be on the Dagger side; I think it would create a lack of flexibility.

But with Zenith we might be able to solve this issue thanks to a special GPU environment (think of it as an extension) that could be loaded by the user. This would make it much easier!

@TomChv I didn't know about Zenith, just reading about it.

CUDA only supports four image types for now (Ubuntu, UBI, RockyLinux and CentOS), and the installation steps for the runtime hook are limited too. One set of instructions works for Ubuntu and another for CentOS/RHEL: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#

Wondering if we could try to be smart here and perform a small probe to determine whether the container image is one or the other, e.g. a CentOS/RHEL image will contain the dnf binary but Ubuntu won't, etc.

As an alternative WithGPU could introduce a configuration parameter that takes the distro flavor.
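A probe like the one suggested above might look like the sketch below. It is only an illustration under stated assumptions: `detectFlavor` and `fileExists` are hypothetical names, the binary paths are typical locations rather than anything confirmed in the thread, and checking for `yum` alongside `dnf` is an addition meant to cover older CentOS images.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// detectFlavor guesses the distro family of a root filesystem by probing for
// package-manager binaries. exists is injected so the probe can be tested
// without touching a real filesystem.
func detectFlavor(rootfs string, exists func(path string) bool) string {
	switch {
	case exists(filepath.Join(rootfs, "usr/bin/dnf")),
		exists(filepath.Join(rootfs, "usr/bin/yum")):
		return "rhel" // CentOS/RHEL-style install steps
	case exists(filepath.Join(rootfs, "usr/bin/apt-get")):
		return "debian" // Ubuntu-style install steps
	default:
		return "unknown"
	}
}

// fileExists reports whether a path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	fmt.Println(detectFlavor("/", fileExists))
}
```

An explicit distro parameter on WithGPU, as mentioned in the alternative, would avoid the guesswork at the cost of one more thing for the user to specify.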

What if I want to use it on my computer for my own pipeline? I won't have access to this file since it's internal to mage.

I'm confused - wouldn't you be able to just export _EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 and then ./hack/dev?

@TomChv I have been revisiting this. I think there's some confusion about the environment where the hook runs:

If we change the engine's base image to Ubuntu, we only need to install Nvidia Container Runtime at that level (engine container).

Shim should be able to find the path to nvidia-container-runtime-hook in the engine's container, that's all.

We don't need to install Nvidia tooling in the Dagger-created containers; we assume an image specified by the user already contains the Nvidia dependencies. It could be a CUDA image directly (like nvidia/cuda:11.7.1-base-ubuntu20.04 or nvidia/cuda:11.7.1-base-centos7) or a custom image created by the user that's based on any of these original CUDA images. I will do some additional testing around this topic today.

We don't really need to have different behavior for distro flavors as we have control on which image to use for the engine's container.
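The shim-side behaviour described here (inject the hook only when GPU support is on and GPUs were requested, then locate the hook binary inside the engine container) could be sketched like this. The function names and the sentinel error are hypothetical; only the binary name `nvidia-container-runtime-hook` comes from the thread, and the real shim may resolve and register the hook differently.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// hookBinary is the prestart hook named in the thread.
const hookBinary = "nvidia-container-runtime-hook"

// errHookNotFound is a hypothetical sentinel for a missing hook binary.
var errHookNotFound = errors.New(hookBinary + " not found in PATH")

// shouldInjectHook is true only when the experimental flag is on and the
// container actually requested GPUs.
func shouldInjectHook(gpuEnabled bool, devices []string) bool {
	return gpuEnabled && len(devices) > 0
}

// findHook resolves the hook binary. lookPath is injected (exec.LookPath in
// real use) so the lookup can be tested without the toolkit installed.
func findHook(lookPath func(string) (string, error)) (string, error) {
	path, err := lookPath(hookBinary)
	if err != nil {
		return "", errHookNotFound
	}
	return path, nil
}

func main() {
	if !shouldInjectHook(true, []string{"all"}) {
		return
	}
	path, err := findHook(exec.LookPath)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would register prestart hook:", path)
}
```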

matiasinsaurralde · 2023-08-16T19:59:32Z

Replying here to a discord message by @matiasinsaurralde :

We introduce an environment variable to enable experimental GPU support. If this is not set we follow the regular Dagger behavior (Alpine is used, etc.).

If the environment variable is set we use an Ubuntu base image and setup all Nvidia dependencies on it.

I've also spent time testing different scenarios like what happens when you run in a host that doesn't have any Nvidia GPUs, etc.

Shim is also conditioned to the experimental GPU support flag, e.g. won't attempt to inject the Nvidia container runtime hook if this is not set.

Basic flow to try it out should be: (i) add export _EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 to hack/dev. (ii) run ./hack/dev bash, (iii) Run the tests (or replace them with anything you would like to try): go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration

Is the idea to move everything to ubuntu when the experimental gate is removed ? I worry about maintaining two different images for a long period of time.

@shykes I think that moving to Ubuntu after the experimental feature makes sense. After integrating these changes we could also spend some more time trying out wolfi, my initial experiments weren't successful: #4675 (comment)

Probably Ubuntu is generally better as it's an official Nvidia supported distro.

TomChv · 2023-08-17T10:34:27Z

Probably Ubuntu is generally better as it's an official Nvidia supported distro.

I agree with that, the only disadvantage is that it will make our pipeline a bit slower because ubuntu is heavier than alpine.

sipsma · 2023-08-19T15:49:34Z

core/schema/container.graphqls

+  """
+  Sets GPU access parameters for the given container, currently works for Nvidia only.
+  """
+  withGPU(
+    devices: String


More descriptive docs will help here, I wouldn't know what I'm supposed to set devices to. Also some other basic stuff like whether it's valid to call multiple times (to configure multiple devices), etc.

If the answer is "it's complicated" that's alright, but then we can just have a brief description here and maybe point to our official docs once those exist :-)

sipsma · 2023-08-19T15:50:24Z

core/schema/container.graphqls

+  withGPU(
+    devices: String
+  ): Container!
+


In line with our other APIs, we should also have fields like gpu (to read which gpu is configured, if any) and withoutGPU to remove the setting

sipsma · 2023-08-19T15:58:03Z

core/container.go

@@ -1025,9 +1027,18 @@ func (container *Container) WithPipeline(ctx context.Context, name, description
 	return container, nil
 }

-func (container *Container) WithExec(ctx context.Context, gw bkgw.Client, progSock *Socket, defaultPlatform specs.Platform, opts ContainerExecOpts) (*Container, error) { //nolint:gocyclo
+type ContainerGPUOpts struct {
+	Devices string


If this is a list of devices can we make it a []string here?

sipsma · 2023-08-19T16:00:09Z

core/integration/gpu_test.go

+	ctr := c.Container().From(cudaImage)
+	contents, err := ctr.
+		// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
+		WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).


I think I'd personally prefer if all wasn't a special string and instead we had a separate api call like WithAllGPUs (or similar, could instead be a bool option to WithGPU perhaps, though I like that less).

Just to cut back on the need for users to remember and type one-off strings like that correctly.

sipsma · 2023-08-19T16:30:14Z

internal/mage/engine.go

+		runArgs = append(runArgs, []string{"--gpus", "all"}...)
+	}
+	runArgs = append(runArgs, []string{
+		"--rm",


nit: actually just delete --rm I think, not sure why it was commented out, but it's useful to not remove the dev engine if it dies because you can still look at the logs if it crashed.

Fixed a few weeks ago

sipsma · 2023-08-19T16:45:15Z

internal/mage/util/engine.go

+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+apt-get update
+apt-get install nvidia-container-toolkit -y


Curious what the new image size is for the ubuntu image. Totally okay with the tradeoff here for the moment, just wondering what final number actually ends up being.

sipsma · 2023-08-19T16:48:21Z

core/integration/gpu_test.go

+	contents, err := ctr.
+		// WithGPU(dagger.ContainerWithGPUOpts{Devices: "GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).
+		WithGPU(dagger.ContainerWithGPUOpts{Devices: "all"}).
+		WithExec([]string{"nvidia-smi", "-L"}).


This just lists the GPUs to make sure they are visible right? If so, that's great for a basic test, but have you verified programs that actually utilize the GPU work?

I'm guessing there are probably some Python ML libraries that could be run fairly easily in a WithExec; it would be good to have that test too.

@sipsma I covered this a few weeks ago with a test called TestGPUAccessWithPython. It runs some pytorch computation using GPU inside the Dagger container.

gerhard · 2023-09-11T18:27:04Z

Anything that I can do to help move this along @matiasinsaurralde?

matiasinsaurralde · 2023-09-12T17:25:54Z

I've updated this PR to incorporate b40b4a6, this should unblock anyone who wants to test this: if service containers are disabled and GPU access is enabled the PR will work in its current form.

However as @vito pointed out on Discord we should aim to always support this feature due to the fact that service containers will be enabled by default (see #5557). I'm still rewriting and testing CNI setup for Ubuntu: https://github.com/dagger/dagger/blob/main/internal/mage/util/engine.go#L242

I've also updated the base container image to Ubuntu 22.04 -it was previously 20.04- due to incompatibilities with dnsmasq CLI flags: 0c4abace76e44267aa562d711a7e05dbbdd4e553
Ubuntu is only used when GPU access is enabled though.

matiasinsaurralde · 2023-09-20T04:41:18Z

@shykes / @gerhard / @samalba
A summary of latest changes:

Fixed service containers when using Ubuntu -when GPU access is enabled-. CNI plugin builds were failing because the compilation host was Alpine (probably related to the same issue with musl we initially spotted while working on this feature).
Simplified WithGPU so that it takes a list of devices directly:

ctr := c.Container().From(cudaImage)
ctr.WithGPU([]string{"0", "1"}).
// Or:
ctr.WithGPU([]string{"GPU-5d8950fe-17a6-2fa7-9baa-afa83bba0e2b"}).

Added WithAllGPUs which ends up passing the all keyword to the Nvidia Container Toolkit:

ctr := c.Container().From(cudaImage)
ctr.WithAllGPUs()

Refactored tests a bit. Added two tests that use Pytorch images to run computations on GPU.
If we decide to switch the engine's base image to Ubuntu (Standardize on a single base OS for engine release #5668) I think we have a lot of useful code in this PR.
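The WithGPU / WithAllGPUs shape summarized above can be modelled with a small stand-in type. This is a sketch, not Dagger's implementation: `Container` here only carries the GPU selection, `visibleDevices` is a hypothetical helper, and joining the list with commas is an assumption about how the selection might be handed to the Nvidia Container Toolkit. The `EnabledGPUs` field name is borrowed from the commit titles in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Container is a stand-in for Dagger's container type; only the GPU
// selection is modelled.
type Container struct {
	EnabledGPUs []string
}

// WithGPU returns a copy of the container exposing the given devices
// (indexes or UUIDs). The receiver is left unchanged.
func (c Container) WithGPU(devices []string) Container {
	c.EnabledGPUs = append([]string(nil), devices...)
	return c
}

// WithAllGPUs returns a copy exposing every GPU via the "all" keyword,
// so callers never have to type the special string themselves.
func (c Container) WithAllGPUs() Container {
	c.EnabledGPUs = []string{"all"}
	return c
}

// visibleDevices renders the selection as a comma-separated list.
func (c Container) visibleDevices() string {
	return strings.Join(c.EnabledGPUs, ",")
}

func main() {
	base := Container{}
	fmt.Println(base.WithGPU([]string{"0", "1"}).visibleDevices()) // prints "0,1"
	fmt.Println(base.WithAllGPUs().visibleDevices())               // prints "all"
}
```

Keeping "all" behind a dedicated method matches the review feedback about avoiding one-off strings in user code.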

Need to look into SDK lint issues as I manually tweaked SDK Go code for the past tests. And also just try this with other SDKs.

github-actions · 2023-10-05T01:48:15Z

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

gerhard · 2023-10-27T18:19:13Z

My issues seem to be related to the pinned nvidia-driver package 515 which is too old for this.

I am unable to upgrade this package on this host:

I will restart this with a different image. Yesterday's base Ubuntu 20.04 seems to have worked fine with https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation

matiasinsaurralde · 2023-10-27T18:26:56Z

Have added a quick sample in examples/sdk/go/gpu. After running ./hack/dev bash it should be possible to build it and run it (fc77da6):

$ cd examples/sdk/go/gpu
$ go build
$ ./gpu

Expected output:

Creating new Engine session... OK!
Establishing connection to Engine... 1: connect
1: > in init
1: starting engine 
1: starting engine [0.10s]
1: starting session 
1: [0.14s] OK!
1: starting session [0.05s]
1: connect DONE
OK!

6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
6: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8 
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8 [0.02s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 6.716KiB / 6.716KiB [0.11s]
11: sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 0B / 183B 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 0B / 45.66MiB 
11: sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 0B / 7.575MiB 
11: sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 183B / 183B [0.20s]
11: sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 0B / 26.23MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 3MiB / 45.66MiB 
11: sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 7.575MiB / 7.575MiB [0.39s]
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 23.76MiB / 45.66MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 40MiB / 45.66MiB 
11: sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 16MiB / 26.23MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 45.66MiB / 45.66MiB 
11: sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 45.66MiB / 45.66MiB [0.78s]
11: sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 26.23MiB / 26.23MiB [0.75s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 

11: extracting sha256:56e0351b98767487b3c411034be95479ed1710bb6be860db6df0be3a98653027 [1.85s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 
11: extracting sha256:ccfccf24900f621770df167b6811a582af777df108eee444c85ab551d39e9b77 [0.51s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 
11: extracting sha256:6eeb573e3fe082c86b74ee5c6f0cf7c091d322a1584c89d8f843d69af7c099d3 [1.63s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 
11: extracting sha256:ba47ed5447555d2c58ee4005528f925fc538e093f9fe584e7f0b1e5198ea6b94 [0.01s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: extracting sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 
11: extracting sha256:e5c76c058a4460c199c66189c15efca36f4a512a35fe57273a979ccca58a0176 [0.01s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

9: exec nvidia-smi -L
9: [0.29s] GPU 0: Quadro P4000 (UUID: GPU-ca2c7679-d68c-5af1-f517-f991d89438e4)
9: exec nvidia-smi -L DONE
available GPUs GPU 0: Quadro P4000 (UUID: GPU-ca2c7679-d68c-5af1-f517-f991d89438e4)

By the way I will need to re-test with multiple GPUs after all the latest refactoring, will do it over the weekend.
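For multi-GPU checks, the `nvidia-smi -L` lines shown in the output above could be parsed to count devices and collect their UUIDs. The sketch below assumes the line format seen in this thread (`GPU <index>: <name> (UUID: GPU-...)`); `gpu` and `parseGPUList` are hypothetical names, and other driver versions may print additional lines that this pattern simply ignores.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// gpu holds the fields extracted from one `nvidia-smi -L` line.
type gpu struct {
	Index string
	Name  string
	UUID  string
}

// gpuLine matches lines such as:
//   GPU 0: Quadro P4000 (UUID: GPU-ca2c7679-d68c-5af1-f517-f991d89438e4)
var gpuLine = regexp.MustCompile(`^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-fA-F-]+)\)$`)

// parseGPUList extracts every GPU entry from the command output and skips
// lines that do not match the expected format.
func parseGPUList(out string) []gpu {
	var gpus []gpu
	for _, line := range strings.Split(out, "\n") {
		m := gpuLine.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil {
			continue
		}
		gpus = append(gpus, gpu{Index: m[1], Name: m[2], UUID: m[3]})
	}
	return gpus
}

func main() {
	out := "GPU 0: Quadro P4000 (UUID: GPU-ca2c7679-d68c-5af1-f517-f991d89438e4)\n"
	for _, g := range parseGPUList(out) {
		fmt.Printf("%s %s %s\n", g.Index, g.Name, g.UUID)
	}
}
```

The UUIDs collected this way are the same identifiers that can be passed to WithGPU to expose a single device.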

gerhard · 2023-10-27T22:23:21Z

I am picking this one up now. Third time lucky 🤞

Started with Ubuntu 20.04 server image this time with a P4000 card.

Capturing the commands that I ran as soon as I logged in:

sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y build-essential tmux
# consider tmux-ing it...

### DOCKER
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add the repository to Apt sources:
echo \
  "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world

### NVIDIA
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo apt-get install -y nvidia-driver-535
nvidia-smi

sudo nvidia-ctk runtime configure --runtime=docker

### LOAD NEW DRIVERS
sudo reboot
nvidia-smi

### GOLANG
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
(echo; echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"') >> /home/paperspace/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"
brew install gcc golang

### THIS PR
git clone https://github.com/matiasinsaurralde/dagger.git
cd dagger
git checkout gpu-access-2

And now to check that this works:

_EXPERIMENTAL_DAGGER_GPU_SUPPORT=1 ./hack/dev bash
export DAGGER_GPU_TESTS_ENABLED=1

go test -v -count 1 -timeout 1000s -run=TestGPUAccess ./core/integration
# ...
=== RUN   TestGPUAccess
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04/use_specific_GPU
    gpu_test.go:132: this test requires at least 2 GPUs to run
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8/use_specific_GPU
    gpu_test.go:132: this test requires at least 2 GPUs to run
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-centos7
=== RUN   TestGPUAccess/nvidia/cuda:11.7.1-base-centos7/use_specific_GPU
    gpu_test.go:132: this test requires at least 2 GPUs to run
--- FAIL: TestGPUAccess (26.99s)
    --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04 (7.54s)
        --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubuntu20.04/use_specific_GPU (0.00s)
    --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8 (8.74s)
        --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-ubi8/use_specific_GPU (0.00s)
    --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-centos7 (10.54s)
        --- FAIL: TestGPUAccess/nvidia/cuda:11.7.1-base-centos7/use_specific_GPU (0.00s)
=== RUN   TestGPUAccessWithPython
=== RUN   TestGPUAccessWithPython/pytorch_CUDA_availibility_check
=== RUN   TestGPUAccessWithPython/pytorch_tensors_sample
--- PASS: TestGPUAccessWithPython (136.94s)
    --- PASS: TestGPUAccessWithPython/pytorch_CUDA_availibility_check (133.12s)
    --- PASS: TestGPUAccessWithPython/pytorch_tensors_sample (3.68s)
FAIL
FAIL    github.com/dagger/dagger/core/integration       163.948s
FAIL
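In the run above, the `use_specific_GPU` subtests are reported as failures on a single-GPU host. One possible way to express the preconditions is a small helper that returns a skip reason, which a test could pass to `t.Skip`. This is a hypothetical sketch, not what gpu_test.go does: `skipReason` is an invented name, and only the `DAGGER_GPU_TESTS_ENABLED` variable and the "requires at least 2 GPUs" message come from the thread.

```go
package main

import (
	"fmt"
	"os"
)

// gpuTestsEnv is the opt-in variable for GPU tests mentioned in the thread.
const gpuTestsEnv = "DAGGER_GPU_TESTS_ENABLED"

// skipReason returns a non-empty reason when a GPU test should be skipped,
// and an empty string when it can run.
func skipReason(testsEnabled bool, availableGPUs, requiredGPUs int) string {
	switch {
	case !testsEnabled:
		return gpuTestsEnv + " is not set"
	case availableGPUs < requiredGPUs:
		return fmt.Sprintf("this test requires at least %d GPUs to run", requiredGPUs)
	default:
		return ""
	}
}

func main() {
	enabled := os.Getenv(gpuTestsEnv) != ""
	// Example: a host with one GPU and a test that needs two.
	if reason := skipReason(enabled, 1, 2); reason != "" {
		fmt.Println("skip:", reason)
	}
}
```

Whether a missing second GPU should skip or fail is a judgment call; skipping keeps single-GPU hosts green while still running the rest of the suite.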

Which of the following instances did you provision in Paperspace @matiasinsaurralde for the tests with 2 GPUs?

Check that the Go SDK GPU example works:

cd examples/sdk/go/gpu
go run main.go
Creating new Engine session... OK!
Establishing connection to Engine... 1: connect
1: > in init
1: starting engine
1: starting engine [0.08s]
1: starting session
1: [0.11s] OK!
1: starting session [0.03s]
1: connect DONE
OK!

6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
6: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
6: resolve image config for docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04
11: > in from nvidia/cuda:11.7.1-base-ubuntu20.04
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8
11: resolve docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04@sha256:745cc6cbd3e36d20441a4fee04b7fab8d2785584cf0d2cf667408f5f773ec9e8 [0.01s]
11: pull docker.io/nvidia/cuda:11.7.1-base-ubuntu20.04 DONE

9: exec nvidia-smi -L CACHED
9: exec nvidia-smi -L CACHED
available GPUs GPU 0: Quadro P4000 (UUID: GPU-14985f0a-d0d7-2168-0baa-4a077ac0f6c1)

gerhard

This now works as advertised 🙌

Thank you to all who reviewed this PR & helped move it along - it's been a long time coming!

Thank you for sticking with it @matiasinsaurralde & seeing it through 💪

Next steps (a.k.a. follow-up PRs):

Add docs (as already discussed in other comments)
- Paperspace install instructions in my last comment might come in handy
- ✨ Zenith module? ✨
Add instructions for multi-GPU tests (see my last comment)
Ensure that creating the release works - cc @sipsma
Test that the released CLI & Engine image work as advertised - cc @sipsma
Create a Zenith module that showcases this with an LLM - cc @lukemarsden @samalba

As soon as the checks go green, this will get merged 🚀

* shim: incorporate GPU access hooks and pass GPU visibility parameters
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: extend container to implement WithGPU
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* engine: extend dockerImageProvider to pass GPU support flag
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: extend dev engine container with GPU support flag
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: use Ubuntu when the dev engine is initialized with GPU support
  Also embed helper script for setting up Nvidia Container Toolkit on the Ubuntu base image.
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* sdk: update Go SDK
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: add GPU access tests
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* hack: temp change to disable service containers while enabling GPU support
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: bump Ubuntu version when GPU access is enabled
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* hack: always enable service containers and GPU access
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: update cniPlugins to be compatible with Ubuntu
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: change logic around WithGPU and implement WithAllGPUs
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: fix EnabledGPUs usage
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: refactor GPU integration test with new calls
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* core: only run GPU tests when DAGGER_GPU_TESTS_ENABLED is set
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* hack: remove experimental flags from dev script
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: update mage flows to support building and publishing an additional Dagger image with GPU support
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* api+publish fixups
  Significant (non merge conflict resolution) changes:
  * Prefixed APIs w/ "experimental"
  * Append `--gpus=all` when gpus are enabled in docker-image:// connhelper logic
  * Only publish amd64 image
  * Add gpu image variant to engine:testpublish
  * Run nvidia setup as commands rather than including extra script in image
  Signed-off-by: Erik Sipsma <erik@dagger.io>
* schema: fix context usage in GPU methods
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* engine: Ensure "gpu" suffix is used when pulling the engine's image and GPU support is enabled
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: panic if there's an attempt to build the GPU enabled image with an arch that's not supported
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* mage: restore "no-cache" flag usage when running apk for dev engine containe build
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* Add changelog fragment
  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
* mage: fix engine's dev step so that it loads a GPU enabled image when the appropriate flag is used - for local testing
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* examples: add simple GPU example
  Signed-off-by: Matias Insaurralde <matias@insaurral.de>
* Use latest available dagger Go package & fix replace
  If it's not relative, it's unlikely to work on other machines.
  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
---------
Signed-off-by: Matias Insaurralde <matias@insaurral.de>
Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Erik Sipsma <erik@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Christian Schlatter <schlatter@puzzle.ch>


TomChv requested review from jlongtine, vito and sipsma and removed request for vito August 10, 2023 13:48

TomChv reviewed Aug 10, 2023

cmd/shim/main.go (outdated)

cmd/shim/main.go (outdated)

core/schema/container.graphqls (outdated)

internal/mage/util/engine.go (outdated)

matiasinsaurralde force-pushed the gpu-access-2 branch 2 times, most recently from 7832a1b to d132697 August 15, 2023 15:38

TomChv reviewed Aug 15, 2023

shykes mentioned this pull request Aug 18, 2023

Standardize on a single base OS for engine release #5668

Open

sipsma reviewed Aug 19, 2023

matiasinsaurralde force-pushed the gpu-access-2 branch from f1f4bfc to 8e70537 September 1, 2023 22:03

matiasinsaurralde force-pushed the gpu-access-2 branch 2 times, most recently from 9b634ac to 0c4abac September 12, 2023 17:15

matiasinsaurralde force-pushed the gpu-access-2 branch 2 times, most recently from c1304e4 to 2764a32 September 20, 2023 04:35

matiasinsaurralde marked this pull request as ready for review September 20, 2023 04:36

matiasinsaurralde mentioned this pull request Sep 20, 2023

GPU access #4675

Closed

github-actions bot added the kind/stale label Oct 5, 2023

matiasinsaurralde force-pushed the gpu-access-2 branch from 2764a32 to f6ca78c October 9, 2023 09:01

gerhard removed the kind/stale label Oct 9, 2023

matiasinsaurralde and others added 14 commits October 27, 2023 18:11

mage: update cniPlugins to be compatible with Ubuntu

efbe4e6

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

core: change logic around WithGPU and implement WithAllGPUs

7305a55

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

core: fix EnabledGPUs usage

52e5555

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

core: refactor GPU integration test with new calls

14428e8

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

core: only run GPU tests when DAGGER_GPU_TESTS_ENABLED is set

efe50c6

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
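The commit above gates the GPU tests on an environment variable. A minimal sketch of such a gate follows. Only the variable name comes from the commit title; the helper `gpuTestsEnabled` and the rule that any non-empty value counts as "set" are assumptions made for illustration, not the PR's actual test code.

```go
package main

import (
	"fmt"
	"os"
)

// Name taken from the commit title.
const gpuTestsEnv = "DAGGER_GPU_TESTS_ENABLED"

// gpuTestsEnabled is a hypothetical helper. It reports whether the variable
// is present and non-empty (an assumption about what "set" means here).
// The lookup function is injected so the check can be exercised without
// touching the real environment.
func gpuTestsEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(gpuTestsEnv)
	return ok && v != ""
}

func main() {
	if !gpuTestsEnabled(os.LookupEnv) {
		fmt.Println("skipping GPU tests:", gpuTestsEnv, "is not set")
		return
	}
	fmt.Println("running GPU tests")
}
```

In a real Go test the same check would typically sit at the top of the test function and call `t.Skip` when it returns false.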

hack: remove experimental flags from dev script

c424e1f

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

mage: update mage flows to support building and publishing an additional Dagger image with GPU support

b447b0a

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

schema: fix context usage in GPU methods

9663615

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

engine: Ensure "gpu" suffix is used when pulling the engine's image and GPU support is enabled

28f03a0

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
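As a rough illustration of the suffix logic this commit describes, the sketch below builds an image reference and appends a "gpu" suffix when GPU support is requested. The function name `engineImageRef`, the `-gpu` tag layout, and the repository and version strings are all hypothetical; the PR only states that a "gpu" suffix is used.

```go
package main

import "fmt"

// engineImageRef is a hypothetical helper: it joins repository and version
// into an image reference and, when gpu is true, appends "-gpu" to the tag.
// The exact tag layout is an assumption of this sketch.
func engineImageRef(repo, version string, gpu bool) string {
	ref := fmt.Sprintf("%s:%s", repo, version)
	if gpu {
		ref += "-gpu"
	}
	return ref
}

func main() {
	// Placeholder repository and version, not real image names.
	fmt.Println(engineImageRef("registry.example/engine", "v0.0.0", false))
	fmt.Println(engineImageRef("registry.example/engine", "v0.0.0", true))
}
```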

mage: panic if there's an attempt to build the GPU enabled image with an arch that's not supported

d4f4d9a

Signed-off-by: Matias Insaurralde <matias@insaurral.de>
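The architecture guard can be sketched as below. It assumes amd64 is the only supported architecture, which follows from the "Only publish amd64 image" note in the squashed commit message; the helper names `validateGPUArch` and `mustGPUArch` are hypothetical and not taken from the mage code.

```go
package main

import (
	"fmt"
	"runtime"
)

// validateGPUArch returns an error for any architecture other than amd64.
// Treating amd64 as the only supported value is an assumption based on the
// "Only publish amd64 image" note.
func validateGPUArch(arch string) error {
	if arch != "amd64" {
		return fmt.Errorf("GPU-enabled image is not supported on %q", arch)
	}
	return nil
}

// mustGPUArch mirrors the "panic" behaviour named in the commit title.
func mustGPUArch(arch string) {
	if err := validateGPUArch(arch); err != nil {
		panic(err)
	}
}

func main() {
	// Check first so this demo prints a message instead of panicking.
	if err := validateGPUArch(runtime.GOARCH); err != nil {
		fmt.Println("cannot build GPU image:", err)
		return
	}
	mustGPUArch(runtime.GOARCH)
	fmt.Println("building GPU image for", runtime.GOARCH)
}
```

Splitting the check into an error-returning function and a panicking wrapper keeps the rule testable while still failing fast in a build script.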

mage: restore "no-cache" flag usage when running apk for dev engine containe build

7fdd50a

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

Add changelog fragment

48b780b

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

mage: fix engine's dev step so that it loads a GPU enabled image when the appropriate flag is used - for local testing

aec0d23

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

matiasinsaurralde force-pushed the gpu-access-2 branch from c622c80 to aec0d23 October 27, 2023 18:11

examples: add simple GPU example

fc77da6

Signed-off-by: Matias Insaurralde <matias@insaurral.de>

Use latest available dagger Go package & fix replace

ea9308e

If it's not relative, it's unlikely to work on other machines.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

gerhard approved these changes Oct 27, 2023

gerhard merged commit 8c90760 into dagger:main Oct 27, 2023
44 checks passed

matiasinsaurralde mentioned this pull request Oct 31, 2023

core: omit execution of multi GPU tests on single GPU environments #6031

Merged

gerhard mentioned this pull request Nov 28, 2023

Feat/sdk PHP #6165

Merged

jedevc mentioned this pull request Mar 18, 2024

✨ Official Docker image for the CLI? #6887

Open

matiasinsaurralde commented Aug 9, 2023

TomChv left a comment

TomChv commented Aug 10, 2023

matiasinsaurralde commented Aug 15, 2023

shykes commented Aug 15, 2023

TomChv left a comment (edited)

TomChv Aug 15, 2023

matiasinsaurralde Aug 16, 2023

TomChv Aug 16, 2023

matiasinsaurralde Aug 17, 2023 (edited)

TomChv Aug 17, 2023

matiasinsaurralde Aug 17, 2023 (edited)

vito Aug 18, 2023

matiasinsaurralde Aug 21, 2023 (edited)

matiasinsaurralde commented Aug 16, 2023

TomChv commented Aug 17, 2023

sipsma Aug 19, 2023

sipsma Aug 19, 2023

sipsma Aug 19, 2023

sipsma Aug 19, 2023

sipsma Aug 19, 2023

matiasinsaurralde Sep 20, 2023

sipsma Aug 19, 2023

sipsma Aug 19, 2023

matiasinsaurralde Sep 20, 2023 (edited)

gerhard commented Sep 11, 2023

matiasinsaurralde commented Sep 12, 2023

matiasinsaurralde commented Sep 20, 2023 (edited)

github-actions bot commented Oct 5, 2023

gerhard commented Oct 27, 2023

matiasinsaurralde commented Oct 27, 2023

gerhard commented Oct 27, 2023

gerhard left a comment (edited by matiasinsaurralde)
