
Ephemeral runners and cancelled jobs #1853

Closed
npalm opened this issue Mar 15, 2022 · 1 comment

npalm commented Mar 15, 2022

Description

For non-ephemeral runners the status of the workflow job is checked, and scaling is only done for queued jobs. For ephemeral runners this check is not applied, because the assumption was that every job needs a runner.

We found out that once we start scaling to a couple of hundred runners, this idea does not work as expected. We got a large number of cancelled jobs, for example due to a job timeout, but the events were still on the queue. This typically happens when we have reached the maximum number of runners. The Lambdas will create all the runners, but they remain idle since the jobs are cancelled. This is not a problem with a few cancelled jobs, but with a huge number of cancelled jobs it can cause a large fleet of useless runners.

Solution

We have tested a modified scale-up Lambda where we applied the job status check in the same way as for non-ephemeral runners. In our case this solved the problem. However, since there is no correlation between a job and a specific runner, this approach could mean that some events are not used for scaling in cases where they should be. As mitigation we keep a very small fleet of runners in the pool to pick up those missed events.
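A minimal sketch of what such a check could look like in a scale-up Lambda, assuming an Octokit client and a hypothetical `createRunner` helper (the event shape and names are illustrative, not the repository's actual code):

```typescript
import { Octokit } from '@octokit/rest';

// Illustrative shape of the workflow_job event forwarded via the queue.
interface ScaleUpEvent {
  owner: string;
  repo: string;
  jobId: number;
}

// Hypothetical helper that would launch the EC2 instance for the runner.
async function createRunner(event: ScaleUpEvent): Promise<void> {
  // ...launch the instance here...
}

// Only create an ephemeral runner when the job that triggered the event is
// still queued; cancelled or already-running jobs are ignored.
async function scaleUpIfQueued(client: Octokit, event: ScaleUpEvent): Promise<void> {
  const { data: job } = await client.actions.getJobForWorkflowRun({
    owner: event.owner,
    repo: event.repo,
    job_id: event.jobId,
  });

  if (job.status !== 'queued') {
    console.info(`Job ${event.jobId} is '${job.status}', skipping runner creation.`);
    return;
  }

  await createRunner(event);
}
```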

@alexellis

Hi @npalm

I'm working on an adjacent solution using Firecracker and pools of agents, but solely with ephemeral runners to ensure complete isolation and a fresh environment for each run.

We're building similar solutions at a conceptual level, but with a very different technical approach. If you'd like to compare notes, feel free to send me an email. See a demo / find out more: https://github.com/self-actuated/actuated

My question for you is: if you were only using ephemeral runners and creating new VMs for each workflow job event, how do you handle a cancelled workflow run? Let's say that your run created 20 jobs, so 20 VMs were started.

If each job is allocated to a runner and starts executing, then when the run is cancelled the runner exits and everything is cleaned up.

But the challenge is if that run and its 20 jobs are cancelled before being allocated to a runner. At that point we have 20 VMs running and no good way of knowing that we should shut them down or reap them.
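One possible way to handle that gap, sketched here purely as an illustration (an idle-timeout reaper; not something either project necessarily implements, and the names are made up):

```typescript
// Illustrative view of a pooled VM: when it was launched and whether a job
// has been assigned to it yet.
interface RunnerVM {
  id: string;
  launchedAt: Date;
  hasJob: boolean;
}

// Assumed grace period before an unused VM is considered orphaned.
const MAX_IDLE_MS = 10 * 60 * 1000;

// Run periodically: pick out VMs that never received a job, e.g. because the
// run that requested them was cancelled before allocation, so they can be
// terminated.
function selectVMsToReap(vms: RunnerVM[], now: Date = new Date()): RunnerVM[] {
  return vms.filter(
    (vm) => !vm.hasJob && now.getTime() - vm.launchedAt.getTime() > MAX_IDLE_MS,
  );
}
```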

Alex
