[Feature] Add a custom docker image with GPU support #290

Closed
AuthorShin opened this issue Aug 19, 2022 · 17 comments
@AuthorShin

Description of the feature

Hello,
Is there any chance you could add the GPU widget to the docker image on hub.docker.com under a custom tag?
Since you said: "Docker images do not include the necessary tools (mainly, because I don't want to bloat the image for everyone)."
You could publish it under a custom tag on hub.docker.com, so anyone who wants the GPU widget in the docker image just pulls the image with the GPU tag (for example), and everyone who doesn't need it keeps pulling the latest tag.
This way everyone will be happy :)
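For example (the tag names here are purely hypothetical, just to illustrate the idea):

# hypothetical GPU-enabled tag
docker pull mauricenino/dashdot:gpu
# regular image for everyone else
docker pull mauricenino/dashdot:latest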

Additional context

No response

@MauriceNino
Owner

Hey there, thanks for creating the issue. Unfortunately, there are a few other problems I have with the GPU support:

  1. Different GPUs need different tools, and I don't think I can install all of them in one container
  2. You would have to do a lot of manual work anyway, since you also need to install tools on the host to pass the GPU into a container (see the sketch after this list)
  3. It would make the build (CI) a lot more complicated and probably make the already long times per run even longer
  4. I created Dashdot mainly for my own purposes and I have no need for the GPU module, so I have no real interest in implementing it - and I suspect that it would be a lot of work
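For Nvidia, the host-side part is roughly the following (just a sketch, I haven't tested this myself; package names and commands may differ per distro):

# install the NVIDIA Container Toolkit on the host and restart docker
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# afterwards the GPU can be passed into a container
docker run --gpus all ...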

If you want to use the GPU widget, I suggest running from source. If that does not work and you want to get it running in Docker, it would be really cool if you could report back all the steps you took, so we can put them in the docs :)
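Roughly, running from source means something like this (just a sketch - check the install docs for the exact commands):

git clone https://github.com/MauriceNino/dashdot
cd dashdot
yarn && yarn build:prod
yarn start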

I will keep this open as a feature request. In case anyone wants to spend some time working on it, feel free to PM me on Discord!

@AuthorShin
Author

I thought it was something ready-made that you already had in your drawer :D
Sure, I will try to tinker with it some weekend and see what we see.

@MauriceNino MauriceNino changed the title [Feature] Adding the GPU widget to the docker image on hub.docker.com with custom tag [Feature] Add a custom docker image with GPU support Aug 19, 2022
@simonl169

I'd also like to use it for an Nvidia GPU.
I've already tried some things, but it always breaks the container when I start it...

@lukasmrtvy

lukasmrtvy commented Mar 29, 2023

@MauriceNino Somehow this is working for me, but I get an empty graph: https://i.ibb.co/zNqXYYG/dash2.png

FROM nvidia/cuda:12.1.0-base-ubuntu22.04

RUN apt update && apt install -y git curl pciutils dmidecode && \
    curl -fsSL https://deb.nodesource.com/setup_19.x |  bash - && apt-get install -y nodejs && \
    npm install --global yarn && \
    git clone https://github.com/MauriceNino/dashdot && \
    cd dashdot && \
    yarn && \
    yarn build:prod && \
    rm -rf /var/lib/apt/lists/*

WORKDIR dashdot/

CMD ["yarn", "start"]
docker run .. --runtime=nvidia .. dash

systeminformation output:

cat << EOF > main.js
const si = require('systeminformation');

si.graphics()
  .then(data => console.log(data))
  .catch(error => console.error(error));
EOF


cat << EOF > Dockerfile
FROM nvidia/cuda:12.1.0-base-ubuntu22.04

COPY main.js /app/main.js

WORKDIR /app

RUN apt update && apt install -y curl pciutils dmidecode && \
    curl -fsSL https://deb.nodesource.com/setup_19.x |  bash - && apt-get install -y nodejs && \
    npm install systeminformation 

CMD node main.js 
EOF

docker build -t graph-print . && docker run --rm -it --runtime=nvidia graph-print
{
  controllers: [
    {
      vendor: '',
      model: '',
      bus: '',
      busAddress: '00:01.0',
      vram: 4,
      vramDynamic: false,
      pciID: ''
    },
    {
      vendor: 'NVIDIA Corporation',
      model: 'GP106GL [Quadro P2200] ',
      bus: 'Onboard',
      busAddress: '01:00.0',
      vram: 5120,
      vramDynamic: false,
      driverVersion: '515.86.01',
      subDeviceId: '0x131B10DE',
      name: 'Quadro P2200',
      pciBus: '00000000:01:00.0',
      fanSpeed: 69,
      memoryTotal: 5120,
      memoryFree: 5059,
      temperatureGpu: 37,
      powerDraw: 21.96,
      powerLimit: 75,
      clockCore: 1012,
      clockMemory: 5005
    }
  ],
  displays: []
}
  • The 4MB device is probably the BMC VGA
  • Some ENV variable to hide this kind of device would be useful (BUS_ADDRESS_FILTER?)
  • Producing a dashdot docker image with an nvidia name suffix would be useful too; it's worth a shot (of course it can slow down the CI, but it could be a parallel step, and no one would be mad if it were "slower" than building the main image - btw, the base image is ~30MB, so it's not such a big deal I would say)

@MauriceNino
Owner

Hi @lukasmrtvy, thanks for trying it out and reporting back.

Why are you using --runtime=nvidia? The guides I checked used --gpus all to pass the Nvidia GPU to Docker.
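i.e., something along the lines of:

docker run --gpus all ... mauricenino/dashdot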

Also, the problem with your setup is that the GPU entry in your SI test would need the properties utilizationGpu and utilizationMemory to report usage back (both are missing for you).

I tried your test container after installing GPU support according to [this guide](https://www.howtogeek.com/devops/how-to-use-an-nvidia-gpu-with-docker-containers/) and got the following output:

[
  {
    driverVersion: '526.47',
    subDeviceId: '0x87C11043',
    name: 'NVIDIA GeForce RTX 3070',
    pciBus: '00000000:01:00.0',
    fanSpeed: 0,
    memoryTotal: 8192,
    memoryUsed: 1026,
    memoryFree: 7005,
    utilizationGpu: 7,
    utilizationMemory: 13,
    temperatureGpu: 48,
    temperatureMemory: undefined,
    powerDraw: 23.31,
    powerLimit: 240,
    clockCore: 210,
    clockMemory: 405
  }
]

As you can see, the needed props exist there.
Unfortunately, I can't get it running myself due to various problems (I'm running Linux in WSL2), so I can't really work on the Docker integration, as I have no test bench.

As to your questions:

  1. A device filter is a good idea, I will see that I implement it in the next few days (if you could open an issue for that, so I don't forget it, that would be great)
  2. Creating a docker image is a problem, because CI times out at 1hr (and honestly, runs that long should be forbidden), and the ARM builds take forever and would then take even longer. If I could run the build in GitHub CI, I would not care so much, but my tests with it showed runs longer than 1hr, so it could not complete in time. That's why I am currently running the builds on my private hardware in Drone CI. This seems to be faster (~20m per run currently), but ARM builds are still notably slower.

@caesay
Contributor

caesay commented Jul 22, 2023

Here's what I've tried so far to get this working on Unraid (unsuccessfully).

On Unraid, the general procedure is to install the Nvidia-Driver plugin, which in turn installs the necessary drivers and the Nvidia Container Toolkit.

Once the container toolkit is installed, you can make your Nvidia GPU available to docker containers by adding --runtime nvidia to your docker run command (as seen in a previous comment here).

Additionally, you need to add the container toolkit environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES.

Using this setup, most docker containers can access the nvidia-smi utility out of the box and access installed cards.

For example, running nvidia-smi in a completely empty docker container based on the ubuntu:latest image, as seen below, works just fine.

docker run \
  .... \
  -e 'NVIDIA_VISIBLE_DEVICES'='GPU-{mygpuid}' \
  -e 'NVIDIA_DRIVER_CAPABILITIES'='all' \
  --runtime=nvidia \
  ubuntu:latest nvidia-smi

However, when using the same docker run arguments (runtime and variables) with the mauricenino/dashdot image, the nvidia-smi command does not exist within the docker container. If I try to run that command with your image, I get:
sh: nvidia-smi: not found

I don't really know enough about Docker to know what to troubleshoot next. I have been wondering if there is something fundamentally different about the base image you're using that prevents the nvidia runtime from working?

@caesay
Contributor

caesay commented Jul 22, 2023

I was able to get it working by creating my own image using the Dockerfile shared by @lukasmrtvy. I do not have the same issue with missing statistics; it appears to be working as expected.

@MauriceNino
Owner

@HoreaM managed to get it working too with the following config.

docker-compose:

  dashdot-gpu:
    image: dashdot-gpu:latest
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
              - gpu
    privileged: true
    ports:
      - 7678:3001
    environment:
      DASHDOT_ENABLE_CPU_TEMPS: 'true'
      DASHDOT_ALWAYS_SHOW_PERCENTAGES: 'true'
      DASHDOT_SPEED_TEST_INTERVAL: '1440'
      DASHDOT_ENABLE_STORAGE_SPLIT_VIEW: 'true'
      DASHDOT_WIDGET_LIST: 'os,storage,network,cpu,ram,gpu'
      DASHDOT_FS_DEVICE_FILTER: 'sdd'
      DASHDOT_OVERRIDE_NETWORK_SPEED_UP: '500000000'
      DASHDOT_OVERRIDE_NETWORK_SPEED_DOWN: '500000000'
    volumes:
      - /:/mnt/host:ro
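
Assuming the image was already built locally as dashdot-gpu:latest and the NVIDIA Container Toolkit is set up on the host, it can then be started with:

docker compose up -d dashdot-gpu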

Dockerfile (it's a modified version of the official one, just using cuda/ubuntu as a base instead of alpine):

# BASE #
FROM nvidia/cuda:12.2.0-base-ubuntu20.04 AS base
 
WORKDIR /app
ARG TARGETPLATFORM
ENV DASHDOT_RUNNING_IN_DOCKER=true
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES="compute,video,utility"
ENV TZ=Europe/Bucharest
 
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
 
RUN \
  /bin/echo ">> installing dependencies" &&\
  apt-get update &&\
  apt-get install -y \
    wget \
    mdadm \
    dmidecode \
    util-linux \
    pciutils \
    curl \
    lm-sensors \
    speedtest-cli &&\
  if [ "$TARGETPLATFORM" = "linux/amd64" ] || [ "$(uname -m)" = "x86_64" ]; \
    then \
      /bin/echo ">> installing dependencies (amd64)" &&\
      wget -qO- https://install.speedtest.net/app/cli/ookla-speedtest-1.1.1-linux-x86_64.tgz \
        | tar xmoz -C /usr/bin speedtest; \
  elif [ "$TARGETPLATFORM" = "linux/arm64" ] || [ "$(uname -m)" = "aarch64" ]; \
    then \
      /bin/echo ">> installing dependencies (arm64)" &&\
      wget -qO- https://install.speedtest.net/app/cli/ookla-speedtest-1.1.1-linux-aarch64.tgz \
        | tar xmoz -C /usr/bin speedtest &&\
      apk --no-cache add raspberrypi; \
  elif [ "$TARGETPLATFORM" = "linux/arm/v7" ]; \
    then \
      /bin/echo ">> installing dependencies (arm/v7)" &&\
      wget -qO- https://install.speedtest.net/app/cli/ookla-speedtest-1.1.1-linux-armhf.tgz \
        | tar xmoz -C /usr/bin speedtest &&\
      apk --no-cache add raspberrypi; \
  else /bin/echo "Unsupported platform"; exit 1; \
  fi
 
RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add -
RUN echo "deb https://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list
RUN curl -sL https://deb.nodesource.com/setup_19.x | bash -
 
RUN \
  /bin/echo ">>installing yarn" &&\
  apt-get update &&\
  apt-get install -y \
  yarn
 
# DEV #
FROM base AS dev
 
EXPOSE 3001
EXPOSE 3000
 
RUN \
  /bin/echo -e ">> installing dependencies (dev)" &&\
  apt-get install -y \
    git &&\
  git config --global --add safe.directory /app
 
# BUILD #
FROM base as build
 
ARG BUILDHASH
ARG VERSION
 
RUN \
  /bin/echo -e ">> installing dependencies (build)" &&\
  apt-get install -y \
    git \
    make \
    clang \
    build-essential &&\
  git config --global --add safe.directory /app &&\
  /bin/echo -e "{\"version\":\"$VERSION\",\"buildhash\":\"$BUILDHASH\"}" > /app/version.json
 
RUN \
  /bin/echo -e ">> clean-up" &&\
  apt-get clean && \
  rm -rf \
    /tmp/* \
	/var/tmp/*
 
COPY . ./
 
RUN \
  yarn --immutable --immutable-cache &&\
  yarn build:prod
 
# PROD #
FROM base as prod
 
EXPOSE 3001
 
COPY --from=build /app/package.json .
COPY --from=build /app/version.json .
COPY --from=build /app/.yarn/releases/ .yarn/releases/
COPY --from=build /app/dist/apps/api dist/apps/api
COPY --from=build /app/dist/apps/cli dist/apps/cli
COPY --from=build /app/dist/apps/view dist/apps/view
 
CMD ["yarn", "start"]
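
For reference, building it boils down to running this from inside a checkout of this repo, with the above Dockerfile in place (the tag just has to match the image name in the compose file):

docker build -t dashdot-gpu:latest .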

This requires building your own image right now, but I would like to add it to the main repo down the line - I just haven't had time to implement it yet.

Can you confirm that one is working for you as well, @caesay?

@caesay
Contributor

caesay commented Jul 22, 2023

I dropped your Dockerfile into Portainer on my server to build an image and got the following build failure:

...
...
Step 22/33 : RUN   /bin/echo -e ">> clean-up" &&  apt-get clean &&   rm -rf     /tmp/* 	/var/tmp/*
 ---> Running in ae7986553966
>> clean-up
Removing intermediate container ae7986553966
 ---> c68bc9e2835f
Step 23/33 : COPY . ./
 ---> ca116c40c3d2
Step 24/33 : RUN   yarn --immutable --immutable-cache &&  yarn build:prod
 ---> Running in aa462cd0b41c
yarn install v1.22.19
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
[4/4] Building fresh packages...
success Saved lockfile.
Done in 0.03s.
yarn run v1.22.19
error Couldn't find a package.json file in "/app"
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
The command '/bin/sh -c yarn --immutable --immutable-cache &&  yarn build:prod' returned a non-zero code: 1

@HoreaM
Contributor

HoreaM commented Jul 23, 2023


Sorry, I'm not really an expert in this, but the only thing that comes to mind is that you probably forgot to clone the project first. The way I did it: I first ran git clone https://github.com/MauriceNino/dashdot.git and then replaced the Dockerfile in the project with the one above.
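So, roughly (with the build tag matching the compose file above):

git clone https://github.com/MauriceNino/dashdot.git
cd dashdot
# replace the Dockerfile with the GPU one from above, then
docker build -t dashdot-gpu:latest .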

@svarog-0

svarog-0 commented Oct 14, 2023

Hi, any ideas on how to do this for an Intel iGPU with the i915 driver? Usually for Plex it's enough to do

devices:
      - /dev/dri:/dev/dri
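
With plain docker run, that would presumably be something like:

docker run --device /dev/dri:/dev/dri ...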

@voc0der

voc0der commented Oct 24, 2023

Subscribing to this... for whenever this becomes mainstream, maybe a docker image tag? like dashdot:nvidia-gpu?

Not sure I wanna build this manually all the time since I use watchtower, so just crossing my fingers here.

@mattiasghodsian

> Subscribing to this... for whenever this becomes mainstream, maybe a docker image tag? like dashdot:nvidia-gpu?
>
> Not sure I wanna build this manually all the time since I use watchtower, so just crossing my fingers here.

+1

github-actions bot added a commit that referenced this issue Jan 21, 2024
# [5.8.0](v5.7.0...v5.8.0) (2024-01-21)

### Features

* add separate image for nvidia gpu support ([#1010](#1010)) ([319120d](319120d)), closes [#290](#290)

🎉 This issue has been resolved in version 5.8.0

Please check the changelog for more details.
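For reference, the Nvidia GPU support ships as a separate image, so pulling it looks something like this (tag name assumed here - check the linked changelog/docs for the exact one):

docker pull mauricenino/dashdot:nvidia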

@PilaScat

PilaScat commented Mar 1, 2024

So, nothing for Intel GPUs? Or will it work with the nvidia image?

@MauriceNino
Owner

@PilaScat No, nothing for Intel GPUs right now. If you know how to make it work, please create a PR for it. I have no machine with an Intel GPU for testing, so I can't implement it, unfortunately. The same goes for AMD.

@foresterfx

I ran some builds with docker locally.
Docker args: --device=/dev/dri:/dev/dri
I also added these packages to the Alpine Dockerfile's RUN apk update step:

    linux-firmware-i915 \
    mesa-dri-gallium

I don't get memory or load data populated, though.

I did a bit of research, and NVTOP (a non-Alpine, multi-brand GPU monitoring tool) mentions that Intel is working on exposing more hardware information through HWMON, but it has been about two years since their readme was updated to say that.
NVTOP

Intel patchwork for what it's worth

I ran a bunch of different commands on the system files visible from Alpine. Not much of value for GPU stats that I could see. I tried stressing the iGPU while repeatedly checking those files to see if the current frequency values would change, but they wouldn't.
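Basically just running something like this while putting load on the iGPU:

watch -n1 cat /sys/class/drm/card0/gt_cur_freq_mhz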

Here's the output for what that's worth:

/ # lspci -v
00:02.0 VGA compatible controller: Intel Corporation AlderLake-S GT1 (rev 0c) (prog-if 00 [VGA controller])
        DeviceName: Onboard - Video
        Subsystem: Gigabyte Technology Co., Ltd Device d000
        Flags: bus master, fast devsel, latency 0, IRQ 149, IOMMU group 0
        Memory at 41000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 50000000 (64-bit, prefetchable) [size=256M]
        I/O ports at 3000 [size=64]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
        Capabilities: [d0] Power Management version 2
        Capabilities: [100] Process Address Space ID (PASID)
        Capabilities: [200] Address Translation Service (ATS)
        Capabilities: [300] Page Request Interface (PRI)
        Capabilities: [320] Single Root I/O Virtualization (SR-IOV)
        Kernel driver in use: i915
/ # lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation AlderLake-S GT1 (rev 0c)
/ # cat /sys/class/drm/card0/device/uevent
DRIVER=i915
PCI_CLASS=30000
PCI_ID=8086:4680
PCI_SUBSYS_ID=1458:D000
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d00004680sv00001458sd0000D000bc03sc00i00
/ # ls /sys/class/drm/card0/
card0-DP-1         dev                error              gt_RP1_freq_mhz    gt_boost_freq_mhz  gt_min_freq_mhz    subsystem
card0-HDMI-A-1     device             gt                 gt_RPn_freq_mhz    gt_cur_freq_mhz    metrics            uevent
card0-HDMI-A-2     engine             gt_RP0_freq_mhz    gt_act_freq_mhz    gt_max_freq_mhz    power
/ # ls /sys/class/drm/card0/metrics
/ # cat /sys/class/drm/card0/gt_cur_freq_mhz 
1400
/ # cat /sys/class/drm/card0/gt_max_freq_mhz 
1450
/ # ls /sys/class/drm/
card0           card0-DP-1      card0-HDMI-A-1  card0-HDMI-A-2  renderD128      version
/ # cat /sys/class/drm/card0/gt_boost_freq_mhz 
1450
/ # cat /sys/class/drm/card0/gt_min_freq_mhz 
300
/ # cat /sys/class/drm/card0/gt_RP0_freq_mhz 
1450
/ # cat /sys/class/drm/card0/gt_RP1_freq_mhz 
700
/ # cat /sys/class/drm/card0/gt_RP1_freq_mhz 
700
/ # ls /sys/class/drm/renderD128/
dev        device     power      subsystem  uevent
/ # cat /sys/class/drm/renderD128/uevent
MAJOR=226
MINOR=128
DEVNAME=dri/renderD128
DEVTYPE=drm_minor
/ # cat /sys/class/drm/renderD128/device/uevent
DRIVER=i915
PCI_CLASS=30000
PCI_ID=8086:4680
PCI_SUBSYS_ID=1458:D000
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d00004680sv00001458sd0000D000bc03sc00i00
