Could not run RunnerDeployment with organization setting #60

Closed
rezmuh opened this issue Jun 21, 2020 · 33 comments

@rezmuh

rezmuh commented Jun 21, 2020

Hi, I just installed this for my organization and I can already see that the runners are running:

› kubectl get runner                                                                                                                                                 
NAME                        ORGANIZATION   REPOSITORY   LABELS   STATUS
github-runner-dn442-25m4h   lifepal                              Running
github-runner-dn442-btkzd   lifepal                              Running
github-runner-dn442-pxgh6   lifepal                              Running
github-runner-dn442-xzd2k   lifepal                              Running

And the runners are registered within the organization as well, as shown below.

[Screenshot: Screen Shot 2020-06-21 at 10 43 17]

And this is how I installed the runners:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-runner
spec:
  replicas: 4
  template:
    spec:
      organization: <my org>

However, the actions could not be run. This is the error I got from the Actions log:

[Screenshot: Screen Shot 2020-06-21 at 10 39 13]

It looks like I must have missed a setting somewhere?

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

@rezmuh Hey! Maybe there's an issue in your workflow definition? Would you mind sharing it?

For me, a docker run like this one, in a job on a self-hosted runner, just works: https://github.com/mumoshu/runnertest/blob/master/.github/workflows/test.yaml#L18
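
For reference, such a job might look roughly like this (a minimal sketch only, not the contents of the linked file; the workflow name, image tag, and step names here are placeholders):

name: docker-smoke-test

on: push

jobs:
  docker-run:
    runs-on: self-hosted
    steps:
    # A plain docker CLI call from a job step; on these runners it talks to
    # the Docker daemon through the mounted /var/run/docker.sock
    - name: Run a throwaway container
      run: docker run --rm alpine:3.12 echo "hello from the runner"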

@rezmuh
Author

rezmuh commented Jun 21, 2020

Hi, this is the workflow definition I tried to run.

name: pull-request

on:
  pull_request

env:
  CARGO_TERM_COLOR: always

jobs:
  build:
    runs-on: self-hosted
    container: rust:1.43.0

    services:
      db:
        image: postgres:10.1-alpine
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: postgres
        ports:
        - 5432:5432
        # Add a health check
        options: --health-cmd pg_isready --health-interval 10s --health-timeout 5s --health-retries 5

    steps:
    - uses: actions/checkout@v2
      with:
        ssh-key: ${{ secrets.SSH_KEY }}
        submodules: 'recursive'
    - uses: actions/cache@v2
      with:
        path: |
          ~/.cargo/registry
          ~/.cargo/git
          target
        key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
    - name: Install rustfmt
      run: rustup component add rustfmt --toolchain 1.43.0-x86_64-unknown-linux-gnu
    - name: Build binary
      run: cargo build
    - name: Run tests
      run: cargo test

@rezmuh
Author

rezmuh commented Jun 21, 2020

[Screenshot: Screen Shot 2020-06-21 at 20 47 27]

Here's a longer version of the log. It looks like the issue is with actions/cache?

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

@rezmuh Thanks! Perhaps it has something to do with GH Actions' services support?

Would you mind running the following before any other steps in your workflow? (A rough sketch follows below.)

  • ls -l /var/run/docker.sock
  • chmod 666 /var/run/docker.sock
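
Something like this (a minimal sketch, assuming an otherwise unchanged job; sudo may or may not be required depending on the runner image):

jobs:
  build:
    runs-on: self-hosted
    steps:
    # Debug steps placed before everything else, so we can see the socket's
    # ownership and permissions exactly as the runner user sees them
    - name: Inspect the Docker socket
      run: ls -l /var/run/docker.sock
    - name: Relax the socket permissions (debugging only)
      run: sudo chmod 666 /var/run/docker.sock
    # ...the original steps (checkout, cache, build, test) would follow here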

@rezmuh
Author

rezmuh commented Jun 22, 2020

Still the same. I even removed the services section just to see if it would build without it, but it still stopped at the same step. However, I found another error in the annotations:

[Screenshot: Screen Shot 2020-06-22 at 10 37 27]

Perhaps it has something to do with how my Kubernetes cluster is set up? Does it require a certain type of networking that my cluster doesn't support? If it helps, my Kubernetes cluster is on EKS (version 1.13) and was installed through eksctl.

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

@rezmuh Would you mind confirming that you know that any changes to the workflow YAML must be in the master branch to work? (In other words, I was wondering if you tried to test it with a workflow run triggered via a pull request that changes your workflow definition.)

@rezmuh
Author

rezmuh commented Jun 22, 2020

@mumoshu No, I wasn't aware that changes to the workflow YAML must be in the master branch, because I tried creating a pull-request workflow that is only triggered on pull requests and it worked, even though the master branch doesn't have that workflow file yet.

@rezmuh
Author

rezmuh commented Jun 22, 2020

Btw, I just tried creating a self-hosted runner on a separate EC2 instance and got the same error. I fixed it by installing a Docker daemon and adding the user to the docker group, which allowed the workflow to run.

So I presume this runner uses Docker-in-Docker inside Kubernetes? Then I guess the issue is that the pod/runner could not connect to the Docker-in-Docker daemon?

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

Thanks. I think we're close.

Yes, it's dind inside a K8s pod. Each one-shot runner pod has a sidecar container named docker that runs a Docker daemon.
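
For illustration, the layout is roughly like this (a simplified, assumed sketch; the actual pod the controller generates will differ in names, images, and mounts):

apiVersion: v1
kind: Pod
metadata:
  name: example-runner-pod
spec:
  containers:
  # The runner container that registers with GitHub and executes job steps
  - name: runner
    image: summerwind/actions-runner:latest
    volumeMounts:
    - name: var-run
      mountPath: /var/run          # sees the docker.sock created by the sidecar
  # The "docker" sidecar running the Docker daemon (dind)
  - name: docker
    image: docker:dind
    securityContext:
      privileged: true             # dind requires a privileged container
    volumeMounts:
    - name: var-run
      mountPath: /var/run          # dockerd writes docker.sock here
  volumes:
  - name: var-run
    emptyDir: {}                   # shared between the two containers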

Even though I'm still unable to reproduce your exact issue, the user/group settings don't look "correct" between the runner and dockerd containers.

In the docker container I see:

/ # id $(whoami)
uid=0(root) gid=0(root) groups=0(root),0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
/ # ls -lah /var/run/docker.sock
srw-rw----    1 root     root           0 Jun 22 11:39 /var/run/docker.sock

Whereas in the runner container I see:

$ id $(whoami)
uid=1000(runner) gid=1000(runner) groups=1000(runner),27(sudo)
$ ls -lah /var/run/docker.sock
srw-rw---- 1 root root 0 Jun 22 11:38 /var/run/docker.sock

For me, it seems to be "accidentally" working because docker.sock has rw at the group level, and both the root user in the docker container and the runner user in the runner container belong to group 27. But 27 is video in the docker container and sudo in the runner container, which implies this is just a coincidence.

What are the outputs of id $(whoami) in your environment? That's probably the key.

@rezmuh
Author

rezmuh commented Jun 22, 2020

Let me try to reproduce this in a few hours. However, is it possible to make the runner user part of a docker group? Would that solve the issue?

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

@rezmuh Yeah, probably. Is there any "reserved" gid for the docker group that I can use? We would need to explicitly specify and share the same gid for the docker group between both containers.

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

Too bad that adding a docker group to the dind container would require us to fork the dind image 😢

@mumoshu
Collaborator

mumoshu commented Jun 22, 2020

Well, how about adding the runner user to the root (0) group? It might be "mostly" safe as long as we don't use a privileged container.

runner@mumoshu-runnertest-hzvfg-r4xq6:/$ sudo usermod -aG root runner
runner@mumoshu-runnertest-hzvfg-r4xq6:/$ id runner
uid=1000(runner) gid=1000(runner) groups=1000(runner),0(root),27(sudo)

More concretely, how about adding usermod -aG root runner to our runner Dockerfile here?

https://github.com/summerwind/actions-runner-controller/blob/master/runner/Dockerfile#L46

cc/ @summerwind

@rezmuh
Author

rezmuh commented Jun 23, 2020

Sorry for the late response. Weird: the ids I see are very similar to yours.

On the docker container:

# id $(whoami)
uid=0(root) gid=0(root) groups=0(root),0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
# ls -lah /var/run/docker.sock
srw-rw----    1 root     root           0 Jun 23 04:08 /var/run/docker.sock

and on the runner container:

# id $(whoami)
uid=0(root) gid=0(root) groups=0(root),0(root),1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy),20(dialout),26(tape),27(video)
ls -lah /var/run/docker.sock
srw-rw---- 1 root root 0 Jun 23 04:08 /var/run/docker.sock

@rezmuh
Author

rezmuh commented Jun 23, 2020

Btw, here are some errors I saw in the docker container log:

https://gist.github.com/rezmuh/ea44373ff09112fe2368265c7d9e0c5b

Not sure if it helps

@mumoshu
Collaborator

mumoshu commented Jun 23, 2020

@rezmuh Thanks! That helps.

So user commands like docker run within a job step seem to work as expected.

I'm now curious whether this has something to do with the container: rust:1.43.0 used for your job. (You're still using the rust container to run steps, right?)

I tried to reproduce your error on EKS nodes using this workflow definition, but had no luck.

For me it throws a different error:

[Screenshot]

Given the above information, do you have any idea how I can reproduce your issue?

@mumoshu
Collaborator

mumoshu commented Jun 23, 2020

JFYI, with services only, it works on EKS nodes for me:

https://github.com/mumoshu/runnertest/blob/master/.github/workflows/test.yaml#L1

[Screenshot]

@mumoshu
Collaborator

mumoshu commented Jun 23, 2020

@rezmuh Which version of actions-runner-controller are you using?

@rezmuh
Author

rezmuh commented Jun 23, 2020

Not sure if I follow the question. But I installed it using:

$ kubectl apply -f https://github.com/summerwind/actions-runner-controller/releases/latest/download/actions-runner-controller.yaml

Then I used a GitHub personal access token and a RunnerDeployment to deploy (as shown at the beginning of the issue body).

But I don't think the issue comes only from the rust:1.43.0 container. I tested this yesterday with a different workflow in a different repository (it doesn't have services but does use actions built with Docker) and it failed with the same error.

@mumoshu
Collaborator

mumoshu commented Jun 24, 2020

@rezmuh Thanks. I tried to reproduce it using both the latest release of actions-runner-controller and a prerelease version of it, but had no luck.

I'm using a personal access token and a RunnerDeployment, too, so that's probably not the issue.

All I can say is that this can happen when our pods or clusters have different configurations.

How was your EKS cluster created? With eksctl, CDK, or something else? Can you still reproduce the problem on K8s 1.15 or greater?

I'm using eksctl 0.22.0 and K8s 1.16.

@rezmuh
Author

rezmuh commented Jun 24, 2020

I created it using eksctl, but with an older version (0.12.0) and Kubernetes 1.12. Let me try to upgrade both and see if it works.

@rezmuh
Author

rezmuh commented Jun 24, 2020

So, I upgraded my cluster using eksctl 0.22.0 and K8s 1.16, but now I see different errors, like the one below:

[Screenshot: Screen Shot 2020-06-24 at 19 04 25]

This happens in multiple repositories (one using Rust, the other using Node.js).

@rezmuh
Author

rezmuh commented Jun 24, 2020

I also tried to create a new cluster in EKS with the latest version but got the same error :(

@rezmuh
Author

rezmuh commented Jun 24, 2020

Btw, I have figured this out now. It looks like I needed to create a custom image for the action runners, and with that it now works fine.

@rezmuh rezmuh closed this as completed Jun 25, 2020
@mumoshu
Collaborator

mumoshu commented Jun 25, 2020

@rezmuh Yes, those various "no such file or directory" errors are because the custom job image you specified via container lacks executables required by pre-made actions like actions/checkout@v2.

So apparently your original issue was caused by older versions of K8s, which is good to know! Thank you so much for your patience and help!

@erikkn
Contributor

erikkn commented Nov 14, 2020

Hey @rezmuh & @mumoshu, I was hoping to pick your brains on this since I am having the same issue rezmuh mentioned here: #60 (comment)
I wasn't sure whether I should open a new issue, but since I am having the exact same issue I figured it might be best to keep everything in here.

I am also using a custom image for my runner, and my workflow has the container attribute set. When I try to run my job I get the error below. Please note that pulling and running the container works fine (the DinD part):

/usr/bin/docker exec  7b29d4c942123bf7715192b49827897b24041d30ba06b093a2028148cc5ca77f sh -c "cat /etc/*release | grep ^ID"
OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"/__e/node12/bin/node\": stat /__e/node12/bin/node: no such file or directory": unknown

When I replace my custom runner image with the summerwind/actions-runner image, everything works fine, which gives me the impression that the error is not caused by the container that I specified in my workflow.

@mumoshu
Collaborator

mumoshu commented Nov 14, 2020

@erikkn Hey! Have you tried installing the Node.js runtime (node) in your custom image?

@erikkn
Contributor

erikkn commented Nov 15, 2020

Hey @mumoshu! Thank you for your reply :)!

Yes, I did install the node package in my runner image, but unfortunately this did not solve my problem. I did some manual debugging, and when I looked into the summerwind/actions-runner image I noticed that the node package is not installed there either.
Please note that I am 'not' using DinD, just a sidecar container that uses the docker:dind image, just like summerwind/actions-runner does, with the proper volumeMounts.

@erikkn
Contributor

erikkn commented Nov 15, 2020

In the interest of full transparency, here is the stdout of my Docker sidecar container:

time="2020-11-15T10:52:10.018015952Z" level=info msg="Loading containers: done."
time="2020-11-15T10:52:10.047691220Z" level=info msg="Docker daemon" commit=4484c46 graphdriver(s)=overlay2 version=19.03.13
time="2020-11-15T10:52:10.048050310Z" level=info msg="Daemon has completed initialization"
time="2020-11-15T10:52:10.095094243Z" level=info msg="API listen on [::]:2376"
time="2020-11-15T10:52:10.095365779Z" level=info msg="API listen on /var/run/docker.sock"
time="2020-11-15T10:52:34.918740107Z" level=info msg="shim containerd-shim started" address=/containerd-shim/ef95dd4484eec1cf8e230e0fe83c2ee23f152a4f23531bbfb97f093aabdfa716.sock debug=false pid=446
time="2020-11-15T10:52:36.035255010Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.035625469Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.040737025Z" level=error msg="Error running exec 8abef485c6c27eefc11437d7b9f7a1814bc7e03309cc5ab92408c75697192b7d in container: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused \"exec: \\\"/__e/node12/bin/node\\\": stat /__e/node12/bin/node: no such file or directory\": unknown"
time="2020-11-15T10:52:36.261430760Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.261527617Z" level=error msg="stream copy error: reading from a closed fifo"
time="2020-11-15T10:52:36.265450572Z" level=error msg="Error running exec df7363480c392836241889688419a68fa52588a7229b759c9b5d62b934c351c0 in container: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused \"exec: \\\"/__e/node12/bin/node\\\": stat /__e/node12/bin/node: no such file or directory\": unknown"
time="2020-11-15T10:52:36.378291317Z" level=info msg="shim reaped" id=b15e076e911f481a0d61c404dac81896bf4f3d2314a72ef65c91b59c4c036c3a
time="2020-11-15T10:52:36.388600667Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

I stumbled upon this issue (actions/checkout#334), but the file /__e/node12/bin/node test command they mention at the bottom errors out in the official summerwind/actions-runner image as well.

@mumoshu
Collaborator

mumoshu commented Nov 16, 2020

@erikkn Thanks for the report! I'm now reading actions/checkout#334 (comment) and wondering if we need to copy all the dependencies included in the actions runner's release archive into our runner image, and provide the missing shared libraries.

It will probably take me a few days to finish that. Until then, it would be great if you could keep using checkout v1.
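
For anyone else hitting this, the interim workaround is just to pin the step back to v1 (a minimal sketch, assuming an otherwise unchanged job):

jobs:
  build:
    runs-on: self-hosted
    steps:
    # Pin checkout to v1 until the runner image bundles what v2 needs
    - uses: actions/checkout@v1
    # ...the rest of the job's steps stay the same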

@erikkn
Contributor

erikkn commented Nov 16, 2020

Thanks a lot! I am quite new to this scene, but I am very keen to help out where I can; please let me know if there is anything I can do.

@erikkn
Contributor

erikkn commented Nov 16, 2020

Running into the same issue with the cache step as commented here: #190 (comment)

Let me know if you want to sync up about those libraries, or if you have a list somewhere, so I can try to help you tackle this.

@mumoshu
Collaborator

mumoshu commented Nov 19, 2020

Linking this to @erikkn's awesome PR #203. I'll take a deeper look soon 🙏
