Could not run RunnerDeployment with organization setting #60

Closed
rezmuh opened this issue Jun 21, 2020 · 33 comments

@rezmuh

rezmuh commented Jun 21, 2020

Hi, I just installed this for my organization and I can already see that the runners are running:

› kubectl get runner                                                                                                                                                 
NAME                        ORGANIZATION   REPOSITORY   LABELS   STATUS
github-runner-dn442-25m4h   lifepal                              Running
github-runner-dn442-btkzd   lifepal                              Running
github-runner-dn442-pxgh6   lifepal                              Running
github-runner-dn442-xzd2k   lifepal                              Running

And the runners are registered within the organization as well, as shown below.

[Screenshot: Screen Shot 2020-06-21 at 10 43 17]

And this is how I installed the runners:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-runner
spec:
  replicas: 4
  template:
    spec:
      organization: <my org>

However, the actions could not be run. This is the error I got from the Actions log:

[Screenshot: Screen Shot 2020-06-21 at 10 39 13]

It looks like I must have missed a setting somewhere?

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

@rezmuh Hey! Maybe there's an issue in your workflow definition? Would you mind sharing it?

For me, a docker run like this one, in a job on a self-hosted runner, just works: https://github.com/mumoshu/runnertest/blob/master/.github/workflows/test.yaml#L18
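
For reference, such a job might look roughly like this (a minimal sketch only, not the contents of the linked file; the workflow name, image tag, and step names here are placeholders):

name: docker-smoke-test

on: push

jobs:
  docker-run:
    runs-on: self-hosted
    steps:
    # A plain docker CLI call from a job step; on these runners it talks to
    # the Docker daemon through the mounted /var/run/docker.sock
    - name: Run a throwaway container
      run: docker run --rm alpine:3.12 echo "hello from the runner"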

@rezmuh
Author

rezmuh commented Jun 21, 2020

Hi, this is the workflow definition I tried to run.

name: pull-request

on:
  pull_request

env:
  CARGO_TERM_COLOR: always

jobs:
  build:
    runs-on: self-hosted
    container: rust:1.43.0

    services:
      db:
        image: postgres:10.1-alpine
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: postgres
        ports:
        - 5432:5432
        # Add a health check
        options: --health-cmd pg_isready --health-interval 10s --health-timeout 5s --health-retries 5

    steps:
    - uses: actions/checkout@v2
      with:
        ssh-key: ${{ secrets.SSH_KEY }}
        submodules: 'recursive'
    - uses: actions/cache@v2
      with:
        path: |
          ~/.cargo/registry
          ~/.cargo/git
          target
        key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
    - name: Install rustfmt
      run: rustup component add rustfmt --toolchain 1.43.0-x86_64-unknown-linux-gnu
    - name: Build binary
      run: cargo build
    - name: Run tests
      run: cargo test

@rezmuh
Author

rezmuh commented Jun 21, 2020

[Screenshot: Screen Shot 2020-06-21 at 20 47 27]

Here's a longer version of the log. It looks like the issue is with actions/cache?

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

@rezmuh Thanks! Perhaps it has something to do with GH Actions' services support?

Would you mind running the following before any other steps in your workflow? (A rough sketch follows below.)

  • ls -l /var/run/docker.sock
  • chmod 666 /var/run/docker.sock
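
Something like this (a minimal sketch, assuming an otherwise unchanged job; sudo may or may not be required depending on the runner image):

jobs:
  build:
    runs-on: self-hosted
    steps:
    # Debug steps placed before everything else, so we can see the socket's
    # ownership and permissions exactly as the runner user sees them
    - name: Inspect the Docker socket
      run: ls -l /var/run/docker.sock
    - name: Relax the socket permissions (debugging only)
      run: sudo chmod 666 /var/run/docker.sock
    # ...the original steps (checkout, cache, build, test) would follow here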

@rezmuh
Author

rezmuh commented Jun 22, 2020

Still the same. I even removed the services section just to see if it would build without it, but it still stopped at the same step. However, I found another error in the annotations:

[Screenshot: Screen Shot 2020-06-22 at 10 37 27]

Perhaps it has something to do with how my Kubernetes cluster is set up? Does it require a certain type of networking that my cluster doesn't support? If it helps, my Kubernetes cluster is on EKS (version 1.13) and was installed through eksctl.

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

@rezmuh Would you mind confirming that you know that any changes to the workflow YAML must be in the master branch to work? (In other words, I was wondering if you tried to test it with a workflow run triggered via a pull request that changes your workflow definition.)

@rezmuh
Author

rezmuh commented Jun 22, 2020

@mumoshu No, I wasn't aware that changes to the workflow YAML must be in the master branch, because I tried creating a pull-request workflow that is only triggered on pull requests and it worked, even though the master branch doesn't have that workflow file yet.

@rezmuh
Author

rezmuh commented Jun 22, 2020

Btw, I just tried creating a self-hosted runner on a separate EC2 instance and got the same error. I fixed it by installing a Docker daemon and adding the user to the docker group, which allowed the workflow to run.

So I presume this runner uses Docker-in-Docker inside Kubernetes? Then I guess the issue is that the pod/runner could not connect to the Docker-in-Docker daemon?

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

Thanks. I think we're close.

Yes, it's dind inside a K8s pod. Each one-shot runner pod has a sidecar container named docker that runs a Docker daemon.
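
For illustration, the layout is roughly like this (a simplified, assumed sketch; the actual pod the controller generates will differ in names, images, and mounts):

apiVersion: v1
kind: Pod
metadata:
  name: example-runner-pod
spec:
  containers:
  # The runner container that registers with GitHub and executes job steps
  - name: runner
    image: summerwind/actions-runner:latest
    volumeMounts:
    - name: var-run
      mountPath: /var/run          # sees the docker.sock created by the sidecar
  # The "docker" sidecar running the Docker daemon (dind)
  - name: docker
    image: docker:dind
    securityContext:
      privileged: true             # dind requires a privileged container
    volumeMounts:
    - name: var-run
      mountPath: /var/run          # dockerd writes docker.sock here
  volumes:
  - name: var-run
    emptyDir: {}                   # shared between the two containers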

Even though I'm still unable to reproduce your exact issue, the user/group settings don't look "correct" between the runner and dockerd containers.

In the docker container I see:

/ # id $(whoami)
uid=0(root) gid=0(root) groups=0(root),0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
/ # ls -lah /var/run/docker.sock
srw-rw----    1 root     root           0 Jun 22 11:39 /var/run/docker.sock

Whereas in the runner container I see:

$ id $(whoami)
uid=1000(runner) gid=1000(runner) groups=1000(runner),27(sudo)
$ ls -lah /var/run/docker.sock
srw-rw---- 1 root root 0 Jun 22 11:38 /var/run/docker.sock

For me, it seems to be "accidentally" working because docker.sock has rw at the group level, and both the root user in the docker container and the runner user in the runner container belong to group 27. But 27 is video in the docker container and sudo in the runner container, which implies this is just a coincidence.

What are the outputs of id $(whoami) in your environment? That's probably the key.

@rezmuh
Author

rezmuh commented Jun 22, 2020

Let me try to reproduce this in a few hours. However, is it possible to make the runner user part of a docker group? Would that solve the issue?

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

@rezmuh Yeah, probably. Is there any "reserved" gid for the docker group that I can use? We would need to explicitly specify and share the same gid for the docker group between both containers.

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

Too bad that adding a docker group to the dind container would require us to fork the dind image 😢

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

Well, how about adding the runner user to the root (0) group? It might be "mostly" safe as long as we don't use a privileged container.

runner@mumoshu-runnertest-hzvfg-r4xq6:/$ sudo usermod -aG root runner
runner@mumoshu-runnertest-hzvfg-r4xq6:/$ id runner
uid=1000(runner) gid=1000(runner) groups=1000(runner),0(root),27(sudo)

More concretely, how about adding usermod -aG root runner to our runner Dockerfile here?

https://github.com/summerwind/actions-runner-controller/blob/master/runner/Dockerfile#L46

cc/ @summerwind

@rezmuh
Author

rezmuh commented Jun 23, 2020

Sorry for the late response. Weird: the ids I see are very similar to yours.

On the docker container:

# id $(whoami)
uid=0(root) gid=0(root) groups=0(root),0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
# ls -lah /var/run/docker.sock
srw-rw----    1 root     root           0 Jun 23 04:08 /var/run/docker.sock

and on the runner container:

# id $(whoami)
uid=0(root) gid=0(root) groups=0(root),0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
ls -lah /var/run/docker.sock
srw-rw---- 1 root root 0 Jun 23 04:08 /var/run/docker.sock

@rezmuh
Author

rezmuh commented Jun 23, 2020

Btw, here are some errors I saw in the docker container log:

https://gist.github.com/rezmuh/ea44373ff09112fe2368265c7d9e0c5b

Not sure if it helps

@mumoshu
Collaborator

mumoshu commented Jun 23, 2020

@rezmuh Thanks! That helps.

So user commands like docker run within a job step seem to work as expected.

I'm now curious whether this has something to do with the container: rust:1.43.0 used for your job. (You're still using the rust container to run steps, right?)

I tried to reproduce your error on EKS nodes using this workflow definition, but had no luck.

For me it throws a different error:

[Screenshot]

Given the above information, do you have any idea how I can reproduce your issue?

@mumoshu
Collaborator

mumoshu commented Jun 23, 2020

JFYI, with services only, it works on EKS nodes for me:

https://github.com/mumoshu/runnertest/blob/master/.github/workflows/test.yaml#L1

[Screenshot]

@mumoshu
Collaborator

mumoshu commented Jun 23, 2020

@rezmuh Which version of actions-runner-controller are you using?

@rezmuh
Author

rezmuh commented Jun 23, 2020

Not sure if I follow the question. But I installed it using:

$ kubectl apply -f https://github.com/summerwind/actions-runner-controller/releases/latest/download/actions-runner-controller.yaml

Then I used a GitHub personal access token and a RunnerDeployment to deploy (as shown at the beginning of the issue body).

But I don't think the issue comes only from the rust:1.43.0 container. I tested this yesterday with a different workflow in a different repository (it doesn't have services but does use actions built with Docker) and it failed with the same error.

@mumoshu
Collaborator

mumoshu commented Jun 24, 2020

@rezmuh Thanks. I tried to reproduce it using both the latest release of actions-runner-controller and a prerelease version of it, but had no luck.

I'm using a personal access token and a RunnerDeployment, too, so that's probably not the issue.

All I can say is that this can happen when our pods or clusters have different configurations.

How was your EKS cluster created? With eksctl, CDK, or something else? Can you still reproduce the problem on K8s 1.15 or greater?

I'm using eksctl 0.22.0 and K8s 1.16.

@rezmuh
Author

rezmuh commented Jun 24, 2020

I created it using eksctl, but with an older version (0.12.0) and Kubernetes 1.12. Let me try to upgrade both and see if it works.

@rezmuh
Author

rezmuh commented Jun 24, 2020

So, I upgraded my cluster using eksctl 0.22.0 and K8s 1.16, but now I see different errors, like the one below:

[Screenshot: Screen Shot 2020-06-24 at 19 04 25]

This happens in multiple repositories (one using Rust, the other using Node.js).

@rezmuh
Author

rezmuh commented Jun 24, 2020

I also tried to create a new cluster in EKS with the latest version but got the same error :(

@rezmuh
Author

rezmuh commented Jun 24, 2020

Btw, I have figured this out now. It looks like I needed to create a custom image for the action runners, and with that it now works fine.

@rezmuh rezmuh closed this as completed Jun 25, 2020
@mumoshu
Collaborator

mumoshu commented Jun 25, 2020

@rezmuh Yes, those various "no such file or directory" errors are because the custom job image you specified via container lacks executables required by pre-made actions like actions/checkout@v2.

So apparently your original issue was caused by older versions of K8s, which is good to know! Thank you so much for your patience and help!

@erikkn
Contributor

erikkn commented Nov 14, 2020

Hey @rezmuh & @mumoshu, I was hoping to pick your brains on this since I am having the same issue rezmuh mentioned here: #60 (comment)
I wasn't sure whether I should open a new issue, but since I am having the exact same issue I figured it might be best to keep everything in here.

I am also using a custom image for my runner, and my workflow has the container attribute set. When I try to run my job I get the error below. Please note that pulling and running the container works fine (the DinD part):

/usr/bin/docker exec  7b29d4c942123bf7715192b49827897b24041d30ba06b093a2028148cc5ca77f sh -c "cat /etc/*release | grep ^ID"
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"/__e/node12/bin/node\": stat /__e/node12/bin/node: no such file or directory": unknown

When I replace my custom runner image with the summerwind/actions-runner image, everything works fine, which gives me the impression that the error is not caused by the container that I specified in my workflow.

@mumoshu
Collaborator

mumoshu commented Nov 14, 2020

@erikkn Hey! Have you tried installing the Node.js runtime (node) in your custom image?

@erikkn
Contributor

erikkn commented Nov 15, 2020

Hey @mumoshu! Thank you for your reply :)!

Yes, I did install the node package in my runner image, but unfortunately this did not solve my problem. I did some manual debugging, and when I looked into the summerwind/actions-runner image I noticed that the node package is not installed there either.
Please note that I am 'not' using DinD, just a sidecar container that uses the docker:dind image, just like summerwind/actions-runner does, with the proper volumeMounts.

@erikkn
Contributor

erikkn commented Nov 15, 2020

In the interest of full transparency, here is the stdout of my Docker sidecar container:

time="2020-11-15T10:52:10.018015952Z" level=info msg="Loading containers: done."
time="2020-11-15T10:52:10.047691220Z" level=info msg="Docker daemon" commit=4484c46 graphdriver(s)=overlay2 version=19.03.13
time="2020-11-15T10:52:10.048050310Z" level=info msg="Daemon has completed initialization"
time="2020-11-15T10:52:10.095094243Z" level=info msg="API listen on [::]:2376"
time="2020-11-15T10:52:10.095365779Z" level=info msg="API listen on /var/run/docker.sock"
time="2020-11-15T10:52:34.918740107Z" level=info msg="shim containerd-shim started" address=/containerd-shim/ef95dd4484eec1cf8e230e0fe83c2ee23f152a4f23531bbfb97f093aabdfa716.sock debug=false pid=446
time="2020-11-15T10:52:36.035255010Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.035625469Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.040737025Z" level=error msg="Error running exec 8abef485c6c27eefc11437d7b9f7a1814bc7e03309cc5ab92408c75697192b7d in container: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused \"exec: \\\"/__e/node12/bin/node\\\": stat /__e/node12/bin/node: no such file or directory\": unknown"
time="2020-11-15T10:52:36.261430760Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.261527617Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.265450572Z" level=error msg="Error running exec df7363480c392836241889688419a68fa52588a7229b759c9b5d62b934c351c0 in container: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused \"exec: \\\"/__e/node12/bin/node\\\": stat /__e/node12/bin/node: no such file or directory\": unknown"
time="2020-11-15T10:52:36.378291317Z" level=info msg="shim reaped" id=b15e076e911f481a0d61c404dac81896bf4f3d2314a72ef65c91b59c4c036c3a
time="2020-11-15T10:52:36.388600667Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

I stumbled upon this issue (actions/checkout#334), but the file /__e/node12/bin/node test command they mention at the bottom errors out in the official summerwind/actions-runner image as well.

@mumoshu
Collaborator

mumoshu commented Nov 16, 2020

@erikkn Thanks for the report! I'm now reading actions/checkout#334 (comment) and wondering if we need to copy all the dependencies included in the actions runner's release archive into our runner image, and provide the missing shared libraries.

It will probably take me a few days to finish that. Until then, it would be great if you could keep using checkout v1.
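
For anyone else hitting this, the interim workaround is just to pin the step back to v1 (a minimal sketch, assuming an otherwise unchanged job):

jobs:
  build:
    runs-on: self-hosted
    steps:
    # Pin checkout to v1 until the runner image bundles what v2 needs
    - uses: actions/checkout@v1
    # ...the rest of the job's steps stay the same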

@erikkn
Contributor

erikkn commented Nov 16, 2020

Thanks a lot! I am quite new to this scene, but I am very keen to help out where I can; please let me know if there is anything I can do.

@erikkn
Contributor

erikkn commented Nov 16, 2020

Running into the same issue with the cache step as commented here: #190 (comment)

Let me know if you want to sync up about those libraries, or if you have a list somewhere, so I can try to help you tackle this.

@mumoshu
Collaborator

mumoshu commented Nov 19, 2020

Linking this to @erikkn's awesome PR #203. I'll take a deeper look soon 🙏
