[4.x] Prevent horizon:purge from killing too many processes #820

rafalglowacz · 2020-04-07T11:13:05Z

Summary

This fixes the problem described in #769. To recap: if there's more than one Laravel application on one server, each with their own Horizon instance, running php artisan horizon:purge for one of the instances will kill valid Horizon processes of other instances.

Workaround

The issue is supposed to be alleviated by using a supervisor daemon which will restart all processes after they get killed, but

it seems like a workaround
when you read the code behind horizon:purge, its intent seems to be just killing invalid processes, e.g. the $description is "Terminate any rogue Horizon processes". That's confusing and makes it hard to determine what's actually going on
the documentation doesn't say the daemon is required, so in my case I have a sysadmin who doesn't want to install it and insists on fixing the underlying issue and the client company agrees with him.

Proposed fix

MasterSupervisor: the hash

My company gave me some time to work on fixing the issue. The fix is based around adding an extra bit of information to the master supervisor's base name. It's a 4-character hash, similar to the random one already appended to base name, but this one instead of being random is based on the current directory, so will remain constant for the current Laravel application.

ProcessInspector: determining current processes

Instead of just doing pgrep -f [h]orizon which will also find processes of other Laravel apps, we only look for processes that include the current master supervisor base name. All supervisors and workers include it, but master supervisors themselves don't, so ProcessInspector@current will not return all the results we need. That's why we now also have ProcessInspector@mastersWithoutSupervisors - it returns ids of master supervisor processes that don't have supervisor processes underneath them. It can't determine whether the masters are from current Laravel app or not, but they don't have the expected child processes, so this time it's not a problem.

Testing the PR

In order to see the described behavior before and after the PR, you can use the following steps:

set up two Laravel apps with Horizon on the same machine and run Horizon on both of them
find out the master supervisor PID (e.g. using ps -fx) of one of them and kill it with with kill -9
start a new master supervisor in place of the one you killed (needed for the monitoring information to update)
run php artisan horizon:purge on the app where you killed the master. Before the PR, it will kill not just the rouge processes of this instance, but also the valid processes of the other one. After the PR, it will just kill the rouge processes of the current instance

That's it, let me know what you think!

This change modifies the master supervisor basename to include the hash of the current directory. The process inspector that determines the processes to kill uses this hash to only look for processes that were created by the Horizon command running in the current Laravel instance's directory, not by all Laravel instances running on the server. This fixes #769.

…-instances-v4' into H/rafalglowacz/horizon-purge-killing-processes-of-other-instances

rafalglowacz · 2020-04-07T11:48:46Z

I had to make a merge and a few style fixes. Should I make a rebase now or will that come at the end if the PR gets approved?

taylorotwell · 2020-04-08T13:23:55Z

I will note that only one master Horizon process is meant to be running per server.

themsaid · 2020-04-08T15:11:48Z

@rafalglowacz Any idea why not just the changes in this PR? #822

taylorotwell · 2020-04-09T14:31:15Z

@rafalglowacz can you give us some feedback on #822

rafalglowacz · 2020-04-09T15:37:13Z

I'll check it out tomorrow. I thought I'd be able to have a look today, but something else came up.

rafalglowacz · 2020-04-10T11:51:27Z

I tried the code in #822 and the effect was the same as before the changes - the Horizon processes of another instance got killed. When the process inspector in #822 looks for current processes, it first gets the processes including the path hash, but one line later it adds all masters. So it won't individually kill every worker of other Laravel apps, but it will kill their masters, so the end result is the same.

I guess you want to keep it as simple as possible. I checked if I can use #822 and just add mastersWithoutSupervisors from my PR to prevent killing valid masters. That works but then rogue supervisors don't get terminated. Hence the changes in ProcessInspector@monitoring and the new method ProcessInspector@monitoredMastersWithSupervisors in my PR.

themsaid · 2020-04-10T11:53:40Z

@rafalglowacz But it won't kill other masters since these masters are stored in Redis and we have a reference to these, so even masters for other projects will be returned and thus will be considered as valid.

rafalglowacz · 2020-04-10T12:08:53Z

Are you referring to this bit in ProcessInspector@monitoring?

->merge(
    Arr::pluck(app(MasterSupervisorRepository::class)->all(), 'pid')
)->all();

Other projects will use separate Redis databases, so they won't be returned by this method called inside the current project.

themsaid · 2020-04-10T14:17:06Z

@rafalglowacz and where in this PR do you collect these processes that use a different redis database?

rafalglowacz · 2020-04-10T14:59:53Z

Ideally each project would be able to kill its own rogue processes. The path hash takes care of that for workers and supervisors, but that leaves the master supervisors. That's why ProcessInspector@mastersWithoutSupervisors collects all masters that don't have any child processes (supervisors). So that's the only place dealing with processes of other projects, i.e. processes that use a different database.

themsaid · 2020-04-10T15:01:25Z

@rafalglowacz what about rogue masters that has supervisors. We should be able to purge those anyway.

…s-of-other-instances

rafalglowacz · 2020-04-13T07:31:53Z

I'm not familiar with that case. How would it manifest itself? Would there be a master process with supervisor processes, but no info about them in horizon:master:* and horizon:supervisor:* in Redis?

themsaid · 2020-04-13T07:36:35Z

You say each app may use a different redis connection/database. That's why I originally shared that purging processes 100% is not possible since people can configure each app to read from a different connection so we won't be able to collect every valid process running on the server.

I keep recommending not to run multiple queue heavy apps on the same server because if redis crashes by 1 app the queue for the rest of apps will stop. That's why the purge command doesn't really care if it's killing other valid processes, they'll start again anyway but having them running on a single server is not recommended in the first place.

Anyway, I think your PR isn't really fixing it completely since a rogue master with supervisors will still be running. Killing processes of other apps isn't a problem really, the supervisors will start these processes again so no harm is done.

rafalglowacz · 2020-04-13T08:42:06Z

The problem my company originally had was only with rogue processes like in #178

i see these process as root process, not inside the master php artisan horizon process.

So as far as I can see, if the master process is still there and the other processes are inside it, it means one of two things:

The master and its supervisors are working fine, workers are spawned and the monitoring data in Redis gets updated. I guess they can't be considered rogue in that case?
The processes exist but somehow (I didn't see that IRL) they're not updating horizon:master:* and/or horizon:supervisor:* in Redis. In that case those values will expire and the processes can be picked up by horizon:purge.

So in what case is there a regression compared to the current horizon:purge command?

koenvu · 2020-04-21T09:34:08Z

This PR breaks our horizon routine, which is as follows:

Have a supervisor job monitor Horizon by starting it with php /home/username/current/artisan horizon. current is a symlink to a release directory, like /home/username/releases/20200421113020.
In our "zero downtime" deployment process we call horizon:terminate after switching the current symlink. We do this of course to make sure horizon is running on the new codebase. If we were to do it before changing the symlink, supervisor could start a new horizon process from the old directory, so we have to do it after changing the symlink.
Because __DIR__ resolves symlinks, horizon:terminate is using a different path to calculate the basename than what was used to start the master supervisor.

Even when using a custom nameResolver the path is still appended. For now we have reverted to v4.2.1, but I'm wondering what your thoughts on this breaking change are!

[4.x] Revert #820

pgrenaud · 2020-05-08T19:13:34Z

@koenvu THANK YOU! I just spend 2 hours on the phone with a colleague trying to figure out what the hell was wrong with our horizon. "Why are we not able to terminate horizon on this app, while all the other ones are working fine." Looking at the basename method in the code and then thinking "wait, this should not actually work anywhere", to finally realize that we were running a different version to then find this. What a journey!

Rafał Głowacz added 3 commits March 11, 2020 09:52

fix style issues

03fdb2d

driesvints changed the title ~~Prevent horizon:purge from killing too many processes~~ [4.x] Prevent horizon:purge from killing too many processes Apr 7, 2020

Rafał Głowacz added 3 commits April 7, 2020 11:35

Merge branch 'H/rafalglowacz/horizon-purge-killing-processes-of-other…

b0ffe1e

…-instances-v4' into H/rafalglowacz/horizon-purge-killing-processes-of-other-instances

fix style issues

b3293ac

fix style issue

b036c2c

themsaid mentioned this pull request Apr 8, 2020

[4.x] Prevent horizon:purge from killing too many processes #822

Closed

Merge branch '4.x' into H/rafalglowacz/horizon-purge-killing-processe…

12cd9da

…s-of-other-instances

taylorotwell merged commit bce9659 into laravel:4.x Apr 14, 2020

themsaid mentioned this pull request Apr 21, 2020

[4.x] Revert #820 #831

Merged

taylorotwell added a commit that referenced this pull request Apr 21, 2020

Merge pull request #831 from themsaid/pr15030

03feeb6

[4.x] Revert #820

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[4.x] Prevent horizon:purge from killing too many processes #820

[4.x] Prevent horizon:purge from killing too many processes #820

rafalglowacz commented Apr 7, 2020

rafalglowacz commented Apr 7, 2020

taylorotwell commented Apr 8, 2020

themsaid commented Apr 8, 2020 •

edited

Loading

taylorotwell commented Apr 9, 2020

rafalglowacz commented Apr 9, 2020

rafalglowacz commented Apr 10, 2020

themsaid commented Apr 10, 2020

rafalglowacz commented Apr 10, 2020

themsaid commented Apr 10, 2020

rafalglowacz commented Apr 10, 2020

themsaid commented Apr 10, 2020

rafalglowacz commented Apr 13, 2020

themsaid commented Apr 13, 2020 •

edited

Loading

rafalglowacz commented Apr 13, 2020

koenvu commented Apr 21, 2020 •

edited

Loading

pgrenaud commented May 8, 2020

[4.x] Prevent horizon:purge from killing too many processes #820

[4.x] Prevent horizon:purge from killing too many processes #820

Conversation

rafalglowacz commented Apr 7, 2020

rafalglowacz commented Apr 7, 2020

taylorotwell commented Apr 8, 2020

themsaid commented Apr 8, 2020 • edited Loading

taylorotwell commented Apr 9, 2020

rafalglowacz commented Apr 9, 2020

rafalglowacz commented Apr 10, 2020

themsaid commented Apr 10, 2020

rafalglowacz commented Apr 10, 2020

themsaid commented Apr 10, 2020

rafalglowacz commented Apr 10, 2020

themsaid commented Apr 10, 2020

rafalglowacz commented Apr 13, 2020

themsaid commented Apr 13, 2020 • edited Loading

rafalglowacz commented Apr 13, 2020

koenvu commented Apr 21, 2020 • edited Loading

pgrenaud commented May 8, 2020

themsaid commented Apr 8, 2020 •

edited

Loading

themsaid commented Apr 13, 2020 •

edited

Loading

koenvu commented Apr 21, 2020 •

edited

Loading