Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.x] Prevent horizon:purge from killing too many processes #820

Merged
merged 7 commits into from
Apr 14, 2020
Merged

[4.x] Prevent horizon:purge from killing too many processes #820

merged 7 commits into from
Apr 14, 2020

Conversation

rafalglowacz
Copy link

Summary

This fixes the problem described in #769. To recap: if there's more than one Laravel application on one server, each with their own Horizon instance, running php artisan horizon:purge for one of the instances will kill valid Horizon processes of other instances.

Workaround

The issue is supposed to be alleviated by using a supervisor daemon which will restart all processes after they get killed, but

  1. it seems like a workaround
  2. when you read the code behind horizon:purge, its intent seems to be just killing invalid processes, e.g. the $description is "Terminate any rogue Horizon processes". That's confusing and makes it hard to determine what's actually going on
  3. the documentation doesn't say the daemon is required, so in my case I have a sysadmin who doesn't want to install it and insists on fixing the underlying issue and the client company agrees with him.

Proposed fix

MasterSupervisor: the hash

My company gave me some time to work on fixing the issue. The fix is based around adding an extra bit of information to the master supervisor's base name. It's a 4-character hash, similar to the random one already appended to base name, but this one instead of being random is based on the current directory, so will remain constant for the current Laravel application.

ProcessInspector: determining current processes

Instead of just doing pgrep -f [h]orizon which will also find processes of other Laravel apps, we only look for processes that include the current master supervisor base name. All supervisors and workers include it, but master supervisors themselves don't, so ProcessInspector@current will not return all the results we need. That's why we now also have ProcessInspector@mastersWithoutSupervisors - it returns ids of master supervisor processes that don't have supervisor processes underneath them. It can't determine whether the masters are from current Laravel app or not, but they don't have the expected child processes, so this time it's not a problem.

Testing the PR

In order to see the described behavior before and after the PR, you can use the following steps:

  • set up two Laravel apps with Horizon on the same machine and run Horizon on both of them
  • find out the master supervisor PID (e.g. using ps -fx) of one of them and kill it with with kill -9
  • start a new master supervisor in place of the one you killed (needed for the monitoring information to update)
  • run php artisan horizon:purge on the app where you killed the master. Before the PR, it will kill not just the rouge processes of this instance, but also the valid processes of the other one. After the PR, it will just kill the rouge processes of the current instance

That's it, let me know what you think!

Rafał Głowacz added 3 commits March 11, 2020 09:52
This change modifies the master supervisor basename to include the hash of the current directory. The process inspector that determines the processes to kill uses this hash to only look for processes that were created by the Horizon command running in the current Laravel instance's directory, not by all Laravel instances running on the server.

This fixes #769.
This change modifies the master supervisor basename to include the hash of the current directory. The process inspector that determines the processes to kill uses this hash to only look for processes that were created by the Horizon command running in the current Laravel instance's directory, not by all Laravel instances running on the server.

This fixes #769.
@driesvints driesvints changed the title Prevent horizon:purge from killing too many processes [4.x] Prevent horizon:purge from killing too many processes Apr 7, 2020
Rafał Głowacz added 3 commits April 7, 2020 11:35
…-instances-v4' into H/rafalglowacz/horizon-purge-killing-processes-of-other-instances
@rafalglowacz
Copy link
Author

I had to make a merge and a few style fixes. Should I make a rebase now or will that come at the end if the PR gets approved?

@taylorotwell
Copy link
Member

I will note that only one master Horizon process is meant to be running per server.

@themsaid
Copy link
Member

themsaid commented Apr 8, 2020

@rafalglowacz Any idea why not just the changes in this PR? #822

@taylorotwell
Copy link
Member

@rafalglowacz can you give us some feedback on #822

@rafalglowacz
Copy link
Author

I'll check it out tomorrow. I thought I'd be able to have a look today, but something else came up.

@rafalglowacz
Copy link
Author

I tried the code in #822 and the effect was the same as before the changes - the Horizon processes of another instance got killed. When the process inspector in #822 looks for current processes, it first gets the processes including the path hash, but one line later it adds all masters. So it won't individually kill every worker of other Laravel apps, but it will kill their masters, so the end result is the same.

I guess you want to keep it as simple as possible. I checked if I can use #822 and just add mastersWithoutSupervisors from my PR to prevent killing valid masters. That works but then rogue supervisors don't get terminated. Hence the changes in ProcessInspector@monitoring and the new method ProcessInspector@monitoredMastersWithSupervisors in my PR.

@themsaid
Copy link
Member

@rafalglowacz But it won't kill other masters since these masters are stored in Redis and we have a reference to these, so even masters for other projects will be returned and thus will be considered as valid.

@rafalglowacz
Copy link
Author

Are you referring to this bit in ProcessInspector@monitoring?

->merge(
    Arr::pluck(app(MasterSupervisorRepository::class)->all(), 'pid')
)->all();

Other projects will use separate Redis databases, so they won't be returned by this method called inside the current project.

@themsaid
Copy link
Member

@rafalglowacz and where in this PR do you collect these processes that use a different redis database?

@rafalglowacz
Copy link
Author

Ideally each project would be able to kill its own rogue processes. The path hash takes care of that for workers and supervisors, but that leaves the master supervisors. That's why ProcessInspector@mastersWithoutSupervisors collects all masters that don't have any child processes (supervisors). So that's the only place dealing with processes of other projects, i.e. processes that use a different database.

@themsaid
Copy link
Member

@rafalglowacz what about rogue masters that has supervisors. We should be able to purge those anyway.

@rafalglowacz
Copy link
Author

I'm not familiar with that case. How would it manifest itself? Would there be a master process with supervisor processes, but no info about them in horizon:master:* and horizon:supervisor:* in Redis?

@themsaid
Copy link
Member

themsaid commented Apr 13, 2020

You say each app may use a different redis connection/database. That's why I originally shared that purging processes 100% is not possible since people can configure each app to read from a different connection so we won't be able to collect every valid process running on the server.

I keep recommending not to run multiple queue heavy apps on the same server because if redis crashes by 1 app the queue for the rest of apps will stop. That's why the purge command doesn't really care if it's killing other valid processes, they'll start again anyway but having them running on a single server is not recommended in the first place.

Anyway, I think your PR isn't really fixing it completely since a rogue master with supervisors will still be running. Killing processes of other apps isn't a problem really, the supervisors will start these processes again so no harm is done.

@rafalglowacz
Copy link
Author

The problem my company originally had was only with rogue processes like in #178

i see these process as root process, not inside the master php artisan horizon process.

So as far as I can see, if the master process is still there and the other processes are inside it, it means one of two things:

  1. The master and its supervisors are working fine, workers are spawned and the monitoring data in Redis gets updated. I guess they can't be considered rogue in that case?
  2. The processes exist but somehow (I didn't see that IRL) they're not updating horizon:master:* and/or horizon:supervisor:* in Redis. In that case those values will expire and the processes can be picked up by horizon:purge.

So in what case is there a regression compared to the current horizon:purge command?

@taylorotwell taylorotwell merged commit bce9659 into laravel:4.x Apr 14, 2020
@koenvu
Copy link

koenvu commented Apr 21, 2020

This PR breaks our horizon routine, which is as follows:

  • Have a supervisor job monitor Horizon by starting it with php /home/username/current/artisan horizon. current is a symlink to a release directory, like /home/username/releases/20200421113020.
  • In our "zero downtime" deployment process we call horizon:terminate after switching the current symlink. We do this of course to make sure horizon is running on the new codebase. If we were to do it before changing the symlink, supervisor could start a new horizon process from the old directory, so we have to do it after changing the symlink.
  • Because __DIR__ resolves symlinks, horizon:terminate is using a different path to calculate the basename than what was used to start the master supervisor.

Even when using a custom nameResolver the path is still appended. For now we have reverted to v4.2.1, but I'm wondering what your thoughts on this breaking change are!

@themsaid themsaid mentioned this pull request Apr 21, 2020
taylorotwell added a commit that referenced this pull request Apr 21, 2020
@pgrenaud
Copy link

pgrenaud commented May 8, 2020

@koenvu THANK YOU! I just spend 2 hours on the phone with a colleague trying to figure out what the hell was wrong with our horizon. "Why are we not able to terminate horizon on this app, while all the other ones are working fine." Looking at the basename method in the code and then thinking "wait, this should not actually work anywhere", to finally realize that we were running a different version to then find this. What a journey!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants