
Feature request: Running nvidia-docker on a system which has Intel (Power Saving Mode) enabled #612

Closed
koenlek opened this issue Jan 19, 2018 · 11 comments

Comments


koenlek commented Jan 19, 2018

1. Issue or feature description

I usually have my computer running in the "Intel (Power Saving Mode)" (from NVIDIA X Server Settings). This means that I use my integrated Intel GPU, rather than my Nvidia GPU by default. Switching requires logging out and in (or simply rebooting), which is annoying when you have a lot of work open.

I understand that dynamic switching is still not supported under Linux (any progress there would be very welcome!), but I figured that, especially when using Docker, it should be somewhat straightforward to fire up the GPU just to expose it to your Docker container. From a technical point of view, would this be possible? It seems much less complicated than transferring the frame buffer, which is what dynamic GPU switching on macOS and Windows requires. For me and my colleagues, it would be very valuable to have the GPU accessible (for CUDA/cuDNN, TensorFlow, etc.) via nvidia-docker without always compromising on battery life.

3XX0 (Member) commented Jan 19, 2018

The "Intel (Power Saving Mode)" you are referring to is not officially supported by NVIDIA and is something OS vendors add to their distribution. Having said that, it doesn't matter for CUDA and you can leverage your GPU inside your container while you're driving your display through the Intel iGPU.

See https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#do-you-support-optimus-ie-nvidia-dgpu--intel-igpu
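
For anyone checking their own setup, a quick sanity check might look like this (a sketch only; glxinfo ships with the mesa-utils package, and the exact renderer string depends on the distribution):

lspci | grep -iE "vga|3d"            # both the Intel iGPU and the NVIDIA dGPU should be listed
glxinfo | grep "OpenGL renderer"     # in Power Saving Mode this should report the Intel GPU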

koenlek (Author) commented Jan 20, 2018

Dear @3XX0, thanks for your reply. It convinced me that I should indeed be able to get this working, but I tried all kinds of things and couldn't get it to work.

If I use Intel (via "NVIDIA X Server Settings") and run (after a fresh boot):

nvidia-smi # fails
export LD_LIBRARY_PATH=/usr/lib/nvidia-390:$LD_LIBRARY_PATH
nvidia-smi # works

It will give me a valid output:

Sat Jan 20 21:02:11 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.12                 Driver Version: 390.12                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K1100M       Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   43C    P0    N/A /  N/A |      0MiB /  2002MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But as soon as I run the same command in Docker (from the same terminal, with the "patched" LD_LIBRARY_PATH):

docker run --runtime=nvidia -it --rm nvidia/cuda nvidia-smi 

It fails with this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=11581 /var/lib/docker/overlay2/63b31ee4e7d26063c64097c1fc2596d4c511b0e75f9ad6fea334b85d6e6422d2/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libnvidia-fatbinaryloader.so.390.12: cannot open shared object file: no such file or directory\\\\n\\\"\"": unknown.

I'm using the latest versions of Docker (17.12.0-ce, build c97c6d6) and nvidia-docker2 (2.0.2+docker17.12.0-1).

I also tried forcing the GPU on via bbswitch:

sudo tee /proc/acpi/bbswitch <<< ON

But that didn't make the GPU accessible in a container (i.e. docker run --runtime=nvidia -it --rm nvidia/cuda nvidia-smi still failed with the same error).
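
For anyone trying the same thing, the bbswitch proc file can also be read back to confirm whether the card actually powered on. A one-line sketch, assuming the bbswitch module is loaded:

cat /proc/acpi/bbswitch   # prints the dGPU's PCI address followed by ON or OFF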

I also tried installing bumblebee, but running via optirun gives the exact same error.

3XX0 (Member) commented Jan 20, 2018

Right, the problem is that the Intel setting messes with the ldcache, so nvidia-docker can no longer find the driver libraries.

Try editing /etc/nvidia-container-runtime/config.toml and changing environment to:
environment = ["LD_LIBRARY_PATH=/usr/lib/nvidia-390"]
and see if it changes anything.

koenlek (Author) commented Jan 20, 2018

That seems to have gotten me closer. If I now run docker run --runtime=nvidia -it --rm nvidia/cuda nvidia-smi, I get a different error:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

3XX0 (Member) commented Jan 21, 2018

Can you enable debug mode in the configuration file as well and paste the logs here?
I'm pretty sure we need to add a new option to nvidia-container-runtime, but I need confirmation.

koenlek (Author) commented Jan 21, 2018

I uncommented #debug = "/var/log/nvidia-container-runtime-hook.log" in /etc/nvidia-container-runtime/config.toml and reran docker run --runtime=nvidia -it --rm nvidia/cuda nvidia-smi. The contents of /var/log/nvidia-container-runtime-hook.log are now:


-- WARNING, the following logs are for debugging purposes only --

I0121 08:12:15.549369 10797 nvc.c:250] initializing library context (version=1.0.0, build=4a618459e8ba522d834bb2b4c665847fae8ce0ad)
I0121 08:12:15.550156 10797 nvc.c:170] loading kernel module nvidia
I0121 08:12:15.550225 10797 nvc.c:182] loading kernel module nvidia_uvm
I0121 08:12:15.550279 10797 nvc.c:190] loading kernel module nvidia_modeset
I0121 08:12:15.550315 10797 nvc.c:225] using ldcache /etc/ld.so.cache
I0121 08:12:15.550321 10797 nvc.c:226] using unprivileged user 65534:65534
I0121 08:12:15.551283 10803 driver.c:134] starting driver service
I0121 08:12:15.760044 10797 nvc_container.c:299] configuring container with 'compute utility supervised'
I0121 08:12:15.760240 10797 nvc_container.c:315] setting pid to 10761
I0121 08:12:15.760249 10797 nvc_container.c:316] setting rootfs to /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged
I0121 08:12:15.760254 10797 nvc_container.c:317] setting owner to 0:0
I0121 08:12:15.760259 10797 nvc_container.c:318] setting bins directory to /usr/bin
I0121 08:12:15.760273 10797 nvc_container.c:319] setting libs directory to /usr/lib/x86_64-linux-gnu
I0121 08:12:15.760277 10797 nvc_container.c:320] setting libs32 directory to /usr/lib/i386-linux-gnu
I0121 08:12:15.760282 10797 nvc_container.c:321] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0121 08:12:15.760287 10797 nvc_container.c:322] setting mount namespace to /proc/10761/ns/mnt
I0121 08:12:15.760291 10797 nvc_container.c:324] setting devices cgroup to /sys/fs/cgroup/devices/docker/229f78ebec8274234e9c9e31ce980325f53de2d83ad18a7dd4d9390e973b2605
I0121 08:12:15.760298 10797 nvc_info.c:409] requesting driver information with ''
I0121 08:12:15.760885 10797 nvc_info.c:142] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.12
I0121 08:12:15.761102 10797 nvc_info.c:142] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.390.12
I0121 08:12:15.761288 10797 nvc_info.c:142] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.390.12
I0121 08:12:15.761337 10797 nvc_info.c:142] selecting /usr/lib/i386-linux-gnu/libcuda.so.390.12
W0121 08:12:15.761372 10797 nvc_info.c:282] missing library libnvidia-ml.so
W0121 08:12:15.761378 10797 nvc_info.c:282] missing library libnvidia-cfg.so
W0121 08:12:15.761382 10797 nvc_info.c:282] missing library libnvidia-ptxjitcompiler.so
W0121 08:12:15.761387 10797 nvc_info.c:282] missing library libnvidia-fatbinaryloader.so
W0121 08:12:15.761391 10797 nvc_info.c:282] missing library libnvidia-compiler.so
W0121 08:12:15.761396 10797 nvc_info.c:282] missing library libvdpau_nvidia.so
W0121 08:12:15.761400 10797 nvc_info.c:282] missing library libnvidia-encode.so
W0121 08:12:15.761405 10797 nvc_info.c:282] missing library libnvcuvid.so
W0121 08:12:15.761410 10797 nvc_info.c:282] missing library libnvidia-eglcore.so
W0121 08:12:15.761414 10797 nvc_info.c:282] missing library libnvidia-glcore.so
W0121 08:12:15.761419 10797 nvc_info.c:282] missing library libnvidia-tls.so
W0121 08:12:15.761423 10797 nvc_info.c:282] missing library libnvidia-glsi.so
W0121 08:12:15.761428 10797 nvc_info.c:282] missing library libnvidia-fbc.so
W0121 08:12:15.761432 10797 nvc_info.c:282] missing library libnvidia-ifr.so
W0121 08:12:15.761437 10797 nvc_info.c:282] missing library libGLX_nvidia.so
W0121 08:12:15.761441 10797 nvc_info.c:282] missing library libEGL_nvidia.so
W0121 08:12:15.761446 10797 nvc_info.c:282] missing library libGLESv2_nvidia.so
W0121 08:12:15.761450 10797 nvc_info.c:282] missing library libGLESv1_CM_nvidia.so
W0121 08:12:15.761455 10797 nvc_info.c:286] missing compat32 library libnvidia-ml.so
W0121 08:12:15.761459 10797 nvc_info.c:286] missing compat32 library libnvidia-cfg.so
W0121 08:12:15.761464 10797 nvc_info.c:286] missing compat32 library libnvidia-ptxjitcompiler.so
W0121 08:12:15.761469 10797 nvc_info.c:286] missing compat32 library libnvidia-fatbinaryloader.so
W0121 08:12:15.761473 10797 nvc_info.c:286] missing compat32 library libnvidia-compiler.so
W0121 08:12:15.761478 10797 nvc_info.c:286] missing compat32 library libvdpau_nvidia.so
W0121 08:12:15.761492 10797 nvc_info.c:286] missing compat32 library libnvidia-encode.so
W0121 08:12:15.761497 10797 nvc_info.c:286] missing compat32 library libnvcuvid.so
W0121 08:12:15.761501 10797 nvc_info.c:286] missing compat32 library libnvidia-eglcore.so
W0121 08:12:15.761506 10797 nvc_info.c:286] missing compat32 library libnvidia-glcore.so
W0121 08:12:15.761510 10797 nvc_info.c:286] missing compat32 library libnvidia-tls.so
W0121 08:12:15.761515 10797 nvc_info.c:286] missing compat32 library libnvidia-glsi.so
W0121 08:12:15.761519 10797 nvc_info.c:286] missing compat32 library libnvidia-fbc.so
W0121 08:12:15.761524 10797 nvc_info.c:286] missing compat32 library libnvidia-ifr.so
W0121 08:12:15.761528 10797 nvc_info.c:286] missing compat32 library libGLX_nvidia.so
W0121 08:12:15.761533 10797 nvc_info.c:286] missing compat32 library libEGL_nvidia.so
W0121 08:12:15.761537 10797 nvc_info.c:286] missing compat32 library libGLESv2_nvidia.so
W0121 08:12:15.761542 10797 nvc_info.c:286] missing compat32 library libGLESv1_CM_nvidia.so
I0121 08:12:15.761583 10797 nvc_info.c:217] selecting /usr/bin/nvidia-smi
I0121 08:12:15.761600 10797 nvc_info.c:217] selecting /usr/bin/nvidia-debugdump
W0121 08:12:15.761617 10797 nvc_info.c:308] missing binary nvidia-persistenced
W0121 08:12:15.761621 10797 nvc_info.c:308] missing binary nvidia-cuda-mps-control
W0121 08:12:15.761626 10797 nvc_info.c:308] missing binary nvidia-cuda-mps-server
I0121 08:12:15.761633 10797 nvc_info.c:341] listing device /dev/nvidiactl
I0121 08:12:15.761638 10797 nvc_info.c:341] listing device /dev/nvidia-uvm
I0121 08:12:15.761642 10797 nvc_info.c:341] listing device /dev/nvidia-uvm-tools
W0121 08:12:15.761650 10797 nvc_info.c:253] missing ipc /var/run/nvidia-persistenced/socket
W0121 08:12:15.761656 10797 nvc_info.c:253] missing ipc /tmp/nvidia-mps
I0121 08:12:15.761661 10797 nvc_info.c:463] requesting device information with ''
I0121 08:12:15.768223 10797 nvc_info.c:491] listing device /dev/nvidia0 (GPU-5058c0ab-360a-f310-cadf-dfd3e2b29ca0 at 00000000:02:00.0)
I0121 08:12:15.768323 10797 nvc_mount.c:225] mounting tmpfs at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/proc/driver/nvidia
I0121 08:12:15.768689 10797 nvc_mount.c:66] mounting /usr/lib/nvidia-390/bin/nvidia-smi at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/usr/bin/nvidia-smi
I0121 08:12:15.768742 10797 nvc_mount.c:66] mounting /usr/lib/nvidia-390/bin/nvidia-debugdump at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/usr/bin/nvidia-debugdump
I0121 08:12:15.768870 10797 nvc_mount.c:66] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.390.12 at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/usr/lib/x86_64-linux-gnu/libcuda.so.390.12
I0121 08:12:15.768917 10797 nvc_mount.c:66] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.12 at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.12
I0121 08:12:15.768934 10797 nvc_mount.c:343] creating symlink /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.390.12
I0121 08:12:15.768992 10797 nvc_mount.c:98] mounting /dev/nvidiactl at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/dev/nvidiactl
I0121 08:12:15.769021 10797 nvc_mount.c:318] whitelisting device node 195:255
I0121 08:12:15.769063 10797 nvc_mount.c:98] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/dev/nvidia-uvm
I0121 08:12:15.769083 10797 nvc_mount.c:318] whitelisting device node 242:0
I0121 08:12:15.769117 10797 nvc_mount.c:98] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/dev/nvidia-uvm-tools
I0121 08:12:15.769141 10797 nvc_mount.c:318] whitelisting device node 242:1
I0121 08:12:15.769196 10797 nvc_mount.c:98] mounting /dev/nvidia0 at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/dev/nvidia0
I0121 08:12:15.769244 10797 nvc_mount.c:282] mounting /proc/driver/nvidia/gpus/0000:02:00.0 at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged/proc/driver/nvidia/gpus/0000:02:00.0
I0121 08:12:15.769264 10797 nvc_mount.c:318] whitelisting device node 195:0
I0121 08:12:15.769280 10797 nvc_ldcache.c:325] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/69fef432f8d880cdd01149ca781830bda715689b44780a5c1da204beadd47e8c/merged
I0121 08:12:15.851124 10797 nvc.c:286] shutting down library context
I0121 08:12:15.851701 10803 driver.c:169] terminating driver service
I0121 08:12:15.892397 10797 driver.c:208] driver service terminated successfully
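
The warnings above are consistent with the earlier explanation: the driver libraries are installed under /usr/lib/nvidia-390, but in Power Saving Mode they are not listed in /etc/ld.so.cache, so nvidia-container-cli only finds the handful still listed there (the libcuda and libnvidia-opencl entries above) and reports everything else, including libnvidia-ml.so, as missing. A quick way to confirm this on a similar setup (a sketch; paths follow the driver version used in this thread):

ls /usr/lib/nvidia-390/libnvidia-ml.so*    # the library exists under the driver directory
ldconfig -p | grep libnvidia-ml            # but is missing from the ldcache while Power Saving Mode is active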

3XX0 (Member) commented Jan 23, 2018

I implemented what you need in libnvidia-container. Once nvidia-container-runtime adds support for it, you should be able to leverage it from nvidia-docker.

koenlek (Author) commented Jan 26, 2018

@3XX0 Thanks a lot for your efforts! Is there already an issue open at https://github.com/NVIDIA/nvidia-container-runtime to point out that adding such support is desired? Or should I open one?

3XX0 (Member) commented Mar 6, 2018

I believe this is fixed in the latest release. You will need to back up your /etc/ld.so.cache before switching to Power Saving Mode and point /etc/nvidia-container-runtime/config.toml at the backup copy.
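
In practice, that workflow might look like this (a sketch; the backup filename is illustrative, and the ldcache key is the one shown in the next comment):

sudo cp /etc/ld.so.cache /etc/ld.so.cache.nvidia    # take the backup while the NVIDIA profile is active, so the cache still lists the driver libraries
# then set ldcache = "/etc/ld.so.cache.nvidia" in /etc/nvidia-container-runtime/config.toml
docker run --runtime=nvidia -it --rm nvidia/cuda nvidia-smi   # re-test after switching to Power Saving Mode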


Luke035 commented Mar 8, 2018

+1

3XX0 closed this as completed Mar 8, 2018

lazyuser commented Apr 2, 2018

To expand on what @3XX0 wrote in a comment above, I have fixed this issue by running

ldconfig -C /etc/ld.so.cache.for-nvidia-docker /usr/lib/nvidia-384 /usr/lib32/nvidia-384

and then making sure my /etc/nvidia-container-runtime/config.toml contains

environment = ["LD_LIBRARY_PATH=/usr/lib/nvidia-384"]
ldcache = "/etc/ld.so.cache.for-nvidia-docker"
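
This variant avoids having to snapshot /etc/ld.so.cache at the right moment: ldconfig -C builds a separate cache that includes the driver directories passed on the command line, so it stays valid no matter which GPU profile is active. A quick re-test sketch after both changes:

docker run --runtime=nvidia -it --rm nvidia/cuda nvidia-smi   # should report the dGPU even in Power Saving Mode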
