
Update use of CDI API #12825

Merged 2 commits into containers:main on Jan 14, 2022
Conversation

elezar (Contributor) commented Jan 12, 2022

This change updates the implementation from #10081 to use the CDI API changes by @klihub merged in cncf-tags/container-device-interface@46367ec. It also serves to indicate how CDI could be used to address #11088.

From a user perspective the major change is that the device name specified through the podman command line (when executing podman run) must be a fully-qualified CDI device name.

This can be tested on an NVIDIA GPU-based system assuming the following:

  1. The nvidia-container-toolkit is installed
  2. The following file has been created at /etc/cdi/nvidia.json
{
  "cdiVersion": "0.2.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "env": [
          "NVIDIA_VISIBLE_DEVICES=0"
        ]
      }
    }
  ],
  "containerEdits": {
    "hooks": [
      {
        "hookName": "prestart",
        "path": "/usr/bin/nvidia-container-toolkit",
        "args": [
          "nvidia-container-toolkit",
          "prestart"
        ]
      }
    ]
  }
}

Note: this could also be expressed as a YAML file at /etc/cdi/nvidia.yaml:

---
cdiVersion: 0.2.0
kind: nvidia.com/gpu
devices:
- name: gpu0
  containerEdits:
    env:
    - NVIDIA_VISIBLE_DEVICES=0
containerEdits:
  hooks:
  - hookName: prestart
    path: "/usr/bin/nvidia-container-toolkit"
    args:
    - nvidia-container-toolkit
    - prestart

In which case selecting the nvidia.com/gpu=gpu0 device yields:

sudo ./bin/podman run -ti --rm  --net=host --device=nvidia.com/gpu=gpu0 ubuntu:18.04 nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB-N (UUID: GPU-edfee158-11c1-52b8-0517-92f30e7fac88)
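For reference, a fully-qualified CDI device name has the form vendor.com/class=device (here nvidia.com/gpu=gpu0 decomposes into vendor nvidia.com, class gpu, and device gpu0). The following stdlib-only Go sketch illustrates the decomposition; it is not the actual container-device-interface parser, which performs stricter validation:

```go
package main

import (
	"fmt"
	"strings"
)

// parseQualified splits a fully-qualified CDI device name of the form
// vendor/class=device into its parts. Illustrative sketch only; the real
// validation lives in the container-device-interface library.
func parseQualified(name string) (vendor, class, device string, ok bool) {
	kind, device, found := strings.Cut(name, "=")
	if !found || device == "" {
		return "", "", "", false
	}
	vendor, class, found = strings.Cut(kind, "/")
	if !found || vendor == "" || class == "" {
		return "", "", "", false
	}
	return vendor, class, device, true
}

func main() {
	v, c, d, ok := parseQualified("nvidia.com/gpu=gpu0")
	fmt.Println(v, c, d, ok) // nvidia.com gpu gpu0 true
}
```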

@openshift-ci bot added the do-not-merge/work-in-progress label (Jan 12, 2022)
registry := cdi.GetRegistry()
unresolved, err := registry.InjectDevices(g.Config, c.config.CDIDevices...)
if len(unresolved) > 0 {
logrus.Debugf("Could not resolve CDI devices: %v", unresolved)
Member

Seems like this ought to be more serious than a Debugf? logrus.Warnf maybe?

Contributor Author

Yes, this should probably be a warning. This would indicate that the state of the registry has changed between determining the list of CDI devices (where GetDevice is called) and here where the actual injection takes place.

One question I did have was whether there would be a recommended way to propagate a reference to the registry from MakeContainer in create_container.go to the use here in generateSpec?

Member

The container's image name and image ID are stored in its configuration at this point, but there's no guarantee those are actually populated; Podman supports creating containers from user-specified root filesystems.

Contributor Author

By "registry" in this case I mean a cache of the CDI devices defined by spec files available on the system (see the interface here). The idea is that spec files in, for example, /etc/cdi/*.json are loaded once and the resulting specifications are then reused.

Member

Ahh. Sorry, registry usually means a very different thing for containers.

I checked, and there's no location for the spec files currently listed that I can find. If one were to be added, containers.conf would be the most logical location: it's a centralized config file that all containers can access, and we could put the path there.

Contributor Author

I think I'm still miscommunicating what I need to know.

When processing the original list of devices from the config in create_container.go, we instantiate a cache that stores the state of the CDI specs for the system (defining where these are read from is currently out of scope). The question is whether there is a way to pass this instance of the cache along so that this is also used in generateSpec in container_internal_linux.go? The motivation for this is that the files defining the CDI definition may have changed on disk so that the list of CDI devices being processed here does not have the desired effect -- hence the warning.

To be clear, I don't think this is a blocker for this particular PR, but it may be good to consider how to plumb this through in future if there is something that makes sense and does not add too much additional overhead. On the other hand, maybe @klihub has insights into the lifetime of the singleton instantiated in the cdi package when used at these two locations -- if we have some guarantees that these are in fact the same object then there would be no need to pass these through explicitly (except making the link clearer).

Member

There are temporary files directories that you could use, but they're all internal to Libpod, so accessing them from pkg/specgen could prove to be a problem. Is this a global directory, or a per-container one? We could definitely do a global CDI state directory inside Libpod's temporary files directory, and expose that to the world.

Contributor Author

This is a global directory. For what it's worth, I have changed the implementation to only look at whether the device specified is a valid CDI device name. This means that we are no longer instantiating this cache twice. I think this is sufficient for the time being. If this becomes a requirement we can open a discussion as to how this should be done correctly.

mheon (Member) commented Jan 12, 2022

/approve
One nit otherwise LGTM

@openshift-ci bot added the approved label (Jan 12, 2022)
@elezar elezar force-pushed the update-cdi branch 5 times, most recently from cc1e59f to ee090cc Compare January 12, 2022 15:32
rhatdan (Member) commented Jan 12, 2022

@elezar is this still a draft?

elezar (Contributor, author) commented Jan 12, 2022

> @elezar is this still a draft?

@rhatdan I would like @klihub and @bart0sh to have a look at this PR as well, since they worked on the new API and a similar implementation for containerd. Otherwise it should be good to go.

rhatdan (Member) commented Jan 12, 2022

OK, let's get their approval.

klihub commented Jan 13, 2022

@elezar LGTM, with a nitpick related to CDI device reference detection.

logrus.Debugf("CDI HasDevice Error: %v", err)
}
if err == nil && isCDIDevice {
if d := cdiDeviceDB.GetDevice(device.Path); d != nil {
klihub:

nitpick: you could also do syntactic filtering/check for CDI device references here (using IsQualifiedName).

One benefit of that would be that unresolvable CDI devices would go through the CDI resolution/injection attempt and come back as unresolved. This would potentially allow you to produce better/more contextually accurate error messages. AFAICT, with the current implementation such devices/errors are only caught later in the container creation pipeline, probably only in the runtime once it tries to inject the device into the container.

elezar (Contributor, author) commented Jan 13, 2022

I wanted to avoid relying on IsQualifiedName here. One motivator is that I want to have some flexibility (in our library) for determining whether or not a device is a CDI device without requiring that clients such as podman be updated to support these changes. I understand that they would require an update, but this would be limited to updating a dependency.

Another motivation is that a fully-qualified device name does not mean that the device has an associated CDI spec.

| # | example | fully qualified | has CDI spec | action at filtering | result at injection |
|---|---------|-----------------|--------------|---------------------|---------------------|
| 1 | nvidia.com/gpu=invalid | yes | no | assumed to be standard device | errors with Error: stat nvidia.com/gpu=invalid: no such file or directory |
| 2 | nvidia.com/gpu=gpu0 | yes | yes | identified as CDI device | resolves and injects modifications for nvidia.com/gpu=gpu0 |
| 3 | gpu0 | no | yes | assumed to be standard device (CDI currently requires a fully-qualified name) | errors with Error: stat gpu0: no such file or directory |
| 4 | /dev/nvidia0 | no | no | assumed to be standard device | injects standard device node /dev/nvidia0 |

Using IsQualifiedName would change the filtering behaviour for 1 and would result in a warning and the container being launched without injecting the device. Is this better behaviour than exiting with the error shown? Is it more desirable to have the container start with no modifications to the spec than to error out for an invalid device?

Thinking about this a bit more, I think raising an error for unresolved CDI devices instead of just issuing a warning would be more in keeping with the current behaviour.

Update: The injection will fail as an unresolved device is considered an error by InjectDevices used to make the spec modifications.
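The filtering decision in the table above can be sketched as a small routing function. This is an illustrative stdlib-only approximation of the syntactic check being discussed (standing in for the library's IsQualifiedName); it is not the actual podman code:

```go
package main

import (
	"fmt"
	"strings"
)

// classify mirrors the table above: a device request that looks like a
// fully-qualified CDI name (vendor/class=device) is routed to CDI
// resolution; anything else is treated as a standard device node.
// Illustrative sketch only, not the actual podman implementation.
func classify(device string) string {
	kind, name, found := strings.Cut(device, "=")
	if found && name != "" && strings.Contains(kind, "/") {
		return "cdi" // resolution may still fail later if no spec defines it
	}
	return "standard"
}

func main() {
	for _, d := range []string{
		"nvidia.com/gpu=invalid", // fully qualified, no spec: fails at CDI resolution
		"nvidia.com/gpu=gpu0",    // fully qualified, has spec: injected via CDI
		"gpu0",                   // not qualified: treated as a standard device
		"/dev/nvidia0",           // plain device node
	} {
		fmt.Printf("%s -> %s\n", d, classify(d))
	}
}
```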

klihub:

That's not how it would behave IMO. If device resolution/injection fails for any of the devices, inject will also return a non-nil error.

Contributor Author

Ah. Good point. I just confirmed that InjectDevices also returns an error if unresolved is non-empty meaning that injection will fail for any unresolved devices.

elezar (Contributor, author) commented Jan 13, 2022

@klihub I have reworked the implementation to use IsQualifiedName as discussed. I abstracted this behind a function isCDIDevice which could later be moved into our API if this makes sense.

This also removes the requirement that the registry be instantiated here meaning that the error presented to the user should be clearer. When selecting a fully-qualified but unresolvable CDI device we now see:

sudo ./bin/podman run -ti --rm --net=host --device=nvidia.com/gpu=gpu3 ubuntu:18.04 nvidia-smi -L
WARN[0000] Ignoring global metacopy option, not supported with booted kernel
Error: error setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=gpu3

UPDATE: Note that I have removed the warning that was printed, as it repeated what is in the error message.

If there were errors creating the registry, these are also made available in the debug logs in the event of resolution errors.

opts = ExtractCDIDevices(s)
registry := cdi.GetRegistry()
if errs := registry.GetErrors(); len(errs) > 0 {
logrus.Debugf("Error creating CDI Registry: %v", errs)

Should we fail container creation here?

Contributor Author

No. These errors may not be related to the devices that we're interested in. For example, the spec from vendorA may be malformed, but we are using devices from vendorB.


ok, makes sense
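The rationale above (a malformed spec from vendorA should not block devices from vendorB) can be illustrated with a sketch: only registry errors relevant to the requested devices would justify failing creation. Note that relevantErrors is a hypothetical helper keyed by vendor purely for illustration; the real CDI registry reports errors keyed by spec file path, and this function is not part of any actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// relevantErrors keeps only registry errors whose vendor matches one of the
// requested fully-qualified device names. Hypothetical sketch: the real CDI
// registry's GetErrors is keyed by spec file path, not vendor.
func relevantErrors(errsByVendor map[string][]string, requested []string) []string {
	var out []string
	for _, dev := range requested {
		vendor, _, _ := strings.Cut(dev, "/")
		out = append(out, errsByVendor[vendor]...)
	}
	return out
}

func main() {
	errs := map[string][]string{
		"vendorA.com": {"malformed spec: vendorA.json"},
	}
	// Only vendorB devices are requested, so vendorA's errors are ignorable.
	fmt.Println(relevantErrors(errs, []string{"vendorB.com/gpu=gpu0"}))
}
```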

@openshift-ci bot added the needs-rebase label (Jan 13, 2022)
@elezar elezar force-pushed the update-cdi branch 3 times, most recently from cb00a64 to 92b85db Compare January 13, 2022 20:33
@openshift-ci bot removed the needs-rebase label (Jan 13, 2022)
@elezar elezar marked this pull request as ready for review January 13, 2022 20:37
@openshift-ci bot removed the do-not-merge/work-in-progress label (Jan 13, 2022)
elezar (Contributor, author) commented Jan 13, 2022

@rhatdan after the discussions with @klihub and @bart0sh and taking their feedback into account, I think this is ready now. If there is anything from your side (or from @mheon) then let me know.

logrus.Warnf("The following errors were triggered when creating the CDI Registry: %v", errs)
}
}
if err != nil {
Member

I think the err check should happen before the data check?

Contributor Author

If we return on err != nil we don't provide the warning / error output. With that said, I could include the output in the if err != nil block as there would (or at least should) not be a case where unresolved is non-empty and err == nil.

Contributor Author

OK. I have simplified things a bit. The error message will already show the unresolved devices, so there is no need to print them again. I am now also relying on the Debugf output to ensure that registry creation errors are available to the user instead of repeating them.

openshift-ci bot commented Jan 14, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elezar, mheon, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rhatdan (Member) commented Jan 14, 2022

@klihub @bart0sh @mheon PTAL

This change updates the CDI API to commit 46367ec063fda9da931d050b308ccd768e824364, which addresses some inconsistencies in the previous implementation.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
bart0sh commented Jan 14, 2022

/lgtm

openshift-ci bot commented Jan 14, 2022

@bart0sh: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mheon (Member) commented Jan 14, 2022

/lgtm

@openshift-ci bot added the lgtm label (Jan 14, 2022)
@openshift-merge-robot openshift-merge-robot merged commit 7ab9922 into containers:main Jan 14, 2022
@elezar elezar deleted the update-cdi branch January 17, 2022 12:35
@elezar elezar mentioned this pull request Jan 17, 2022
vans163 commented Feb 2, 2022

Catching up with this now: how does this address being able to dynamically change which GPUs are passed to which podman instances?
Say I have 4 GPUs and want to start 3 podman instances, where the first instance has access to GPUs 0 and 1, the second to GPU 2, and the third to GPUs 0, 1, 2, and 3?

elezar (Contributor, author) commented Feb 3, 2022

@vans163 your question seems out of scope of this PR which was "just" updating the CDI dependency.

That said, what do you mean by "instances"? If you mean containers, you should be able to start three of them:

podman run --device nvidia.com/gpu=gpu0 --device nvidia.com/gpu=gpu1 ...
podman run --device nvidia.com/gpu=gpu2 ...
podman run --device nvidia.com/gpu=gpu0 --device nvidia.com/gpu=gpu1 --device nvidia.com/gpu=gpu2 --device nvidia.com/gpu=gpu3 ...

If this is not what you mean, please open an issue that describes your use case and how you would expect this to work.

vans163 commented Feb 3, 2022

> If you mean containers you should be able to start three of them ... If this is not what you mean, please open an issue that describes your use case and how you would expect this to work.

Yea this is exactly what I mean, so I would need to register devices 0-3 over here?

  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "env": [
          "NVIDIA_VISIBLE_DEVICES=0"
        ]
      }
    },
    {"name": "gpu1", ..},
    {"name": "gpu2", ..},
    {"name": "gpu3", ..}
  ]

Because then won't the env var "NVIDIA_VISIBLE_DEVICES={gpuX}" values conflict?

Also does this mean we can keep cgroups enabled in nvidia-container-runtime?

#set no-cgroups for nvidia-container-runtime
#TODO: remove this stage once cgroupsV2 support is stable (likely the next major release)
sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml

TheFloatingBrain commented Mar 5, 2022

Having a bit of difficulty with this, running:
sudo ./bin/podman run -ti --rm --net=host --device=nvidia.com/gpu=gpu0 ubuntu:18.04 nvidia-smi -L

I get:
Error: error setting up CDI devices: CDI: device "nvidia.com/gpu=gpu0" not found for spec "nvidia.com/gpu"

Though I can confirm I have the exact file at /etc/cdi/nvidia.json. Some of my config files are in unusual locations; I don't always see my files where others have them, for example /usr/share/containers/containers.conf.

klihub commented Mar 6, 2022

> I get: Error: error setting up CDI devices: CDI: device "nvidia.com/gpu=gpu0" not found for spec "nvidia.com/gpu"

@TheFloatingBrain We have a CDI test utility in the repo that you could use for diagnosis. You can use it to get the list of vendors or devices your CDI Specs declare, check whether your Specs have any errors, test how an OCI Spec gets modified when you inject a particular CDI device into it, and for a few other things.

You can give it a try by cloning the CDI repo (git clone https://github.com/container-orchestrated-devices/container-device-interface) and building it (cd container-device-interface; make all). It should be available as ./bin/cdi from the root of your repo after compilation. Once you have that you can, for instance, list your devices with ./bin/cdi devices, check for any errors with ./bin/cdi validate, and test the result of injecting a given CDI device into an otherwise empty OCI Spec with something like echo "" | ./bin/cdi inject - vendor.com/device=name.
