Horizon Orphan Process #178

dbpolito · 2017-10-03T18:03:38Z

I'm running horizon on a latest forge machine as documented... the daemon with php artisan horizon and on deployments php artisan horizon:terminate but time to time i need to manually run php artisan horizon:purge.

This is the output i just got after after hours of last release:

$ php artisan horizon:purge
Observed Orphan: 17043
Observed Orphan: 30634
Observed Orphan: 31084
Observed Orphan: 31085

I can confirm it's orphans by running htop on tree mode (press f5) and i see these process as root process, not inside the master php artisan horizon process.

And also, every time i run purge, it ALWAYS wrongly see 1 process as orphan:

$ php artisan horizon:purge
Observed Orphan: 17059

I haven't found a pattern yet why this is happening.

The text was updated successfully, but these errors were encountered:

agustinprod · 2017-10-03T20:32:55Z

I'm having the same issue. I woke up to my main server using all the ram and 4 gb of swap disk. Running purge killed 78 processes. I'm running horizon on a Ubuntu 16.04 with php 7.1.9-1

themsaid · 2017-10-05T13:22:50Z

Hello everyone,

Ideally you shouldn't have any rogue processes running horizon, but for exceptions like you shared we built the purge command for you to run on regular basis if your sever is having many rogue processes, this is a safe approach until we're able to investigate the issue more.

So I recommend that you schedule the command to run every day.

On the other hand, can you please share some information about your setup? I'm trying to understand what might cause rogue processes.

dillingham · 2017-10-05T15:57:52Z

Forge Daemon command: php artisan horizon directory: /home/forge/sitename.com/current
In envoyer when I just added php artisan horizon:purge to my deployment hook, it purged 20+
I began investigating when I noticed old code being used even though I horizon:terminate

agustinprod · 2017-10-06T17:23:45Z

Our envoy deploy script uses horizon terminate to kill the processes, the high number of orphan procesess maybe are related to this. We added a horizon purge too at the deployement script. @themsaid if you find difficult to reproduce this bug, we could arrange ssh access to my vps.

jkudish · 2017-10-11T20:41:16Z

Sorry to hijack the thread, but I'm seeing weird behaviour with Horizon running old code after I've deployed fresh code. Wondering if it's related to the orphans.

I've got my stuff setup through envoyer/forge.

I've setup the following hook in Envoyer, I currently have it running it after Purge Old Releases -- is that where others set it up too or does it make more sense to have it at a different spot?

For now, as a workaround I have been manually restarting the deamon in forge whenever I deploy, but this is annoying and defeats the purpose of having Envoyer in the first place:

I also notice that if I ssh into the server and run php artisan horizon:purge it always find at least 1 orpha process, but sometimes up to a dozen.

Any ideas @themsaid ?

taylorotwell · 2017-10-11T20:55:51Z

It's hard to track down why a process would go orphaned. You can technically run horizon:purge after every deploy if you wanted to.

jkudish · 2017-10-11T22:46:34Z

Hi @taylorotwell :)

I do already, as per the screenshot above, but they still seem to go orphaned all the time. I don't really know what that means or if it's a real issue though. But it not running my latest code after a deploy is more of a problem.

dbpolito · 2017-10-16T21:44:45Z

Well, this basically happens when you run php artisan horizon:terminate while you have running jobs...

After one deploy:

$ php artisan horizon:purge
Observed Orphan: 548
Observed Orphan: 1879

$ php artisan horizon:purge
Observed Orphan: 548
Observed Orphan: 1898

$ php artisan horizon:purge
Observed Orphan: 1941

Interesting thing is that it displayed 548 twice.

After another deploy:

$ php artisan horizon:purge
Observed Orphan: 1857
Observed Orphan: 1862
Observed Orphan: 2301
$ php artisan horizon:purge
Observed Orphan: 2320

My job takes an average of 30 seconds, so maybe the problem is how much time terminate waits for the jobs to complete to kill them...

peterlupu · 2017-11-01T23:21:07Z

@jkudish I'm experience similar things on my server.
When I deploy, at the end, I run php artisan horizon:terminate.
Sometimes when running terminate, Sending TERM Signal To Process: **** is not displayed. Even when ran manually from the console.

I've checked this and something seems off:

Made a Job (no params in constructor, so they're not serialized) which checks an array from a class for a number.
If it cannot find said number, I throw and Exception.
I deploy the code with an array of [1,2,3] and run the job checking for the number 5.
It fails with the above Exception.
I deploy my code again, adding 5 to the array, so the job would stop failing.
But that doesn't happen, I end up having to run php artisan horizon:terminate a few more times before it picks up on the change. Not sure if it's time related.

mfn · 2017-11-07T09:46:10Z

Hello everyone 😄

First off: great product!

However we deployed Horizon yesterday to production and it didn't take long until we realized we have orphaned processes :-/

Regarding the "always wrongly 1 orphan"

@dbpolito (at al)

And also, every time i run purge, it ALWAYS wrongly see 1 process as orphan:

I can reproduce this. I always get shown a PID which is untraceable as to what process it belongs. I did do some debugging (e.g. using pgrep -f … -a or even ps auxw snapshotting) but never "saw" a process which had this particular ID.

I also used https://github.com/a2o/snoopy to log which processes are spawned and this particular PID never showed up. E.g. in on case I got reported an orphaned PID 24575 but that specific PID wasn't tracked; but other invocations were:

Nov  7 10:23:05 dev snoopy[24568]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -P 21632
Nov  7 10:23:05 dev snoopy[24570]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -P 21631
Nov  7 10:23:05 dev snoopy[24572]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon -a
Nov  7 10:23:05 dev snoopy[24574]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon:purge -a
Nov  7 10:23:05 dev snoopy[24576]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon
Nov  7 10:23:05 dev snoopy[24578]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon:purge

I don't think it's a coincidence that all pgrep invocations skip a PID when the next one is executed; like it is consuming a PID internally.

This PID is received from the call to:

$this->exec->run('pgrep -f horizon')

as part of \Laravel\Horizon\ProcessInspector::current.

But even when I do run var_dump( $this->exec->run('pgrep -f horizon a'); I get a different PID in the end. I.e. two consecutive runs of pgrep -f horizon a and pgrep -f horizon` produce a different last PID in the result which ends up being "the one orphan" I always see.

OTOH I noticed that due to using pgrep -f horizon the killing of processes is very greedy. I'm using supervisord and have configured my horizon master process to log into a file named horizon.log.
I was tailing this file with tail -f horizon.log and ./artisan horizon:purge will kill this command.

So depending on what other processes you have running on the system which have the name horizon in the name/arguments, this may be dangerous and you can experience random process killing. I guess the changes are slim but it never hurts to know this.

Our orphaned process problem

Yesterday we realized we had two orphaned processes. Their characteristics:

horizon:work processes
references --supervisor=… which was not present anymore
their parent PID was init and not any other PHP process (not horizon:supervisor nor horizon master)

We're running Ubuntu 14.04 LTS with supervisord 3.0b2-1. Here's how a healthy process tree looks like:

init(1)-+-acpid(1230)
        |-supervisord(1398)-+-php(1613)-+-php7.1(1771)-+-php7.1(1915)
        |                   |           |              `-php7.1(18509)
        |                   |           |-php7.1(1772)-+-php7.1(1914)
        |                   |           |              `-php7.1(1918)
        |                   |           |-php7.1(1773)---php7.1(1912)
        |                   |           |-php7.1(1774)---php7.1(1930)
        |                   |           |-php7.1(1775)-+-php7.1(1913)
        |                   |           |              `-php7.1(1917)
        |                   |           |-php7.1(1776)-+-php7.1(1916)
        |                   |           |              `-php7.1(1920)
        |                   |           |-php7.1(1777)---php7.1(1921)
        |                   |           |-php7.1(1778)-+-php7.1(1923)
        |                   |           |              |-php7.1(1924)
        |                   |           |              |-php7.1(1926)
        |                   |           |              `-php7.1(1927)
        |                   |           `-php7.1(1779)---php7.1(1928)
        |                   |-php(28235)
        |                   |-php(28236)
        |                   |-php(28269)
        |                   `-php(28409)

Here how the rogue processes did appear (I removed the surroundings):

init(1)-+-acpid(1230)
        |-php(10430)
        |-php(10494)

I know these were the correct processes because with ps auxw I saw their arguments matching the information I described above.

When does re-parenting happen? E.g. https://unix.stackexchange.com/a/152400/7924

You can not start a process as the child of the shell, and then "reparent" it so another process becomes it's parent.
init with PID 1 is an exception, processes can become it's child as it collects processes that lost their original parent process.

So why can/does this happen in case of supervisor/horizon/etc.?

Some guesses (?):

the horizon supervisor process of these workers did not wait until they were finished
(because maybe the mater horizon process did not wait or give not enough time to the supervisor?)
((because maybe the supervisord did not give enough time to the horizon master?))
(((maybe the workers were busy this very moment and couldn't finish but the supervisor exited?)))

Lots of guessing as you see. For now we will also add the purge command but seeing that there's something off it's clear a code fix would be desired. I can assist with more information / debugging once I get new orphaned processes.

PS: we're also using https://github.com/wa0x6e/Cake-Resque to drive background job from a CakePHP project. It has its own set of problems but we never experienced such kind if orphaned. Or: at least not at that rate.

ghobaty · 2017-11-08T15:37:30Z

We are seeing the same weird issue where Horizon keeps running old code, even after using php artisan horizon:terminate and php artisan horizon:purge (using forge daemon).
Any ideas what is going wrong and how to fix this?
@jkudish could you find any working workarounds?

peterlupu · 2017-11-10T08:57:03Z

@elghobaty Try also running php artisan queue:restart. Also, see #213.

jkudish · 2017-11-15T20:44:08Z

@elghobaty I have my deploy restarting php-fpm and then php artisan horizon:terminate and php artisan horizon:purge and i haven't had any issues lately

Slickspacestech · 2018-01-25T07:25:38Z

I've noticed in our staging environment the same thing, orphaned horizon workers, stale code etc.

I find if I use horizon:terminate while supervisor is running it seems to cause it, where as if I stop the horizon supervisor, then pull, then start again it's okay

barryvdh · 2018-01-30T15:34:20Z

Should running 'purge' still let the workers finish their active jobs?

I'm also seeing a lot of orphaned process, running purge kills them cleanly. I'm using Deployer (so different release directories, symlink the active one to 'current' and using Forge to manage the Deamon)

royduin · 2018-01-30T15:53:49Z

Same issue here! And good question @barryvdh, the processes should finish their job, are they with php artisan horizon:purge?

This is Taylor his article about deployments with Horizon on Forge: https://medium.com/@taylorotwell/deploying-horizon-to-laravel-forge-fc9e01b74d84, php artisan horizon:terminate should be enough...

bbashy · 2018-01-31T10:58:18Z

I don't get this command

$ while true; do
while> php artisan horizon:purge
while> done
Observed Orphan: 28070
Observed Orphan: 28080
Observed Orphan: 28090
Observed Orphan: 28100
Observed Orphan: 28110
Observed Orphan: 28120
Observed Orphan: 28130
Observed Orphan: 28140
Observed Orphan: 28150
Observed Orphan: 28160
Observed Orphan: 28170
Observed Orphan: 28180
Observed Orphan: 28190
Observed Orphan: 28200
Observed Orphan: 28210
Observed Orphan: 28220
Observed Orphan: 28230
Observed Orphan: 28240
Observed Orphan: 28250
Observed Orphan: 28260
Observed Orphan: 28270
Observed Orphan: 28280
Observed Orphan: 28290
Observed Orphan: 28300
Observed Orphan: 28310
Observed Orphan: 28320
Observed Orphan: 28330
Observed Orphan: 28344
Observed Orphan: 28354
Observed Orphan: 28364
Observed Orphan: 28374
Observed Orphan: 28384
^C%

marianvlad · 2018-02-01T02:04:16Z

I have the same problem. I'm using Daemon and every time I'm deploying I need to kill processes manually pkill -f horizon because deleting the daemon still keeps some processes up with the old code. I'm using horizon on another server and I don't need to do that. Looks like it's something wrong with ubuntu or with php function for killing processes because artisan horizon:terminate it doesn't work every time.

taylorotwell · 2018-02-09T03:20:19Z

Yeah I would like to figure out a root cause or maybe easy recreation of the issue so we could track it down. I don't have a good understand of why it is happening since all horizon:work processes would be started from a supervisor and its unclear to me how the parent supervisor could die without also killing all its child processes.

taylorotwell · 2018-02-09T16:10:38Z

Hi all,

We have tagged 1.2.0 with a slightly different approach to terminating processes. Can some of you try it and report if you see any improvements?

marianvlad · 2018-02-09T16:11:35Z

I will try right now.

taylorotwell · 2018-02-09T16:12:04Z

@marianvlad thanks

bbashy · 2018-02-09T16:12:42Z

Trying now

🎉

Updating laravel/horizon (v1.1.1 => v1.2.0): Downloading (100%)

taylorotwell · 2018-02-09T16:13:57Z

If this doesn't fix it I'll probably be putting a nice little bounty on this 😄

themsaid · 2018-02-09T16:17:30Z

Just a notice, the horizon:purge command doesn't work as expected, so if you run it and get a single rogue process ignore it, it's not an indicator. Only if you get multiple process it means there are actual orphans.

marianvlad · 2018-02-09T16:21:38Z

I ran horizon:terminate 3 times an then I deleted the daemon. In htop I have these https://i.imgur.com/t9RFWuE.png

themsaid · 2018-02-09T16:23:18Z

@marianvlad these could be from before, run the horizon:purge command, start the daemon, then monitor from there.

barryvdh · 2018-02-09T16:24:22Z

Can't we just send the queue:restart signal from horizon terminate? That seems to work reliable.

taylorotwell · 2018-02-09T16:36:32Z

@barryvdh that could be an option. To be honest, I've forgotten why it didn't work like that in the first place but Horizon changed a bit while I was in the process of writing it.

We can wait until we get a bit more feedback on the 1.2.0 termination process and then consider other options if we're not seeing improvement.

taylorotwell · 2018-02-09T17:10:12Z

@marianvlad did you make sure you started with zero horizon processes? Any more feedback?

taylorotwell · 2018-02-12T18:37:26Z

Hey all,

Just wanted to note we have made some further tweaks on the latest 1.2.2 release, including using the queue:restart cache logic approach in addition to a couple other fixes.

Please let us know how this release works for you. We rely on your feedback ❤️

marianvlad · 2018-02-12T18:59:35Z

I did multiple tests on 1.2.0 on a clean project everything worked perfectly except in my real project.

My setup looks like this:

config/queue.phh

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => 'default',
    'retry_after' => 7200, // because my jobs takes time
    'block_for' => null, // without this thing, I use 5.5
],

config/horizon.php

'production' => [
    'supervisor-1' => [
        'connection' => 'redis',
        'queue' => ['default'],
        'balance' => false,
        'processes' => 3,
        'tries' => 1,
    ],
    'supervisor-2' => [
        'connection' => 'redis',
        'queue' => ['another-test'],
        'balance' => false,
        'processes' => 3,
        'tries' => 1,
    ],
    'supervisor-3' => [
        'connection' => 'redis',
        'queue' => ['test'],
        'balance' => false,
        'processes' => 5,
        'tries' => 1,
    ],
],

A sample of my job:

<?php

namespace App\Jobs;

use Bunch\Of\Classes; //

class PostJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $timeout = 3600;

    protected $domain;
    protected $video;
    protected $temp;

    /**
     * Create a new job instance.
     *
     * @return void
     */
    public function __construct(Domain $domain, Video $video)
    {
        $this->domain = $domain;
        $this->video = $video;

        // this temp is not visible inside failed method if I put in handle() method
        $this->temp = $this->tempPath();
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        if (! $condition) {
            // some checks
            $this->cleanTemp(); // clean directory
            return;
        }

        $melody = resolve(Melody::class)
            ->setDomain($this->domain->endpoint)
            ->setKey('key-blah-blah');

        // Check if domain is up before
        if ($melody->ping()) {
            $storage = $this->domain->storages->random();

            $twig = resolve('twig');

            $template = $this->domain->template();
            $thumbnail = $this->video->thumbnail_url;

            if (! $thumbnail && ! urlHealth($thumbnail)) {
                $this->thumbnail();

                $thumbnail = asset("storage/temp/".basename($this->temp)."/thumbnail.jpg");
            }

            // Crate video preview
            $preview = (new PreviewService($this->domain, $this->video, $this->temp))->handle();

            $twigData = FormatVideo::twigAttributes($this->video);

            // Upload the video
            $upload = (new UploadService(
                $storage,
                $this->video->videoPath(),
                str_slug($twigData['username']) . '_' . str_slug($this->video->title . ' ' . $twigData['date']).'.'.last(explode('.', $this->video->videoPath()))
            ))->handle();

            if (! $upload) {
                $this->cleanTemp();
                return;
            }

            $storage->increaseUpload($this->video->size);

            $title = $twig->createTemplate($template['title'])->render($twigData);
            $description = str_limit($twig->createTemplate($template['description'])->render($twigData), 3000);
            $tags = FormatVideo::getTagsByType($this->video, $template, $twigData);

            $melody = $melody
                ->uploadVideo($title, $description, [
                    'media_type' => 'local',
                    'thumbnail'  => str_replace(' ', '+', $thumbnail),
                    'mp4'        => $preview,
                    'length'     => gmdate('H:i:s', $this->video->duration),
                    'category'   => $this->category(),
                    'tags'       => $tags,
                    'metas' => [
                        ['key' => 'source_url', 'value' => $this->video->source_url],
                        ['key' => 'download_link', 'value' => $upload[0]]
                    ]
                ]);

            if (array_has($melody, 'type') && $melody['type'] == 'success') {
                $this->cleanTemp();

                $this->video->finishPosting($this->domain, [
                    'source_id' => $melody['id'],
                    'url' => $melody['url'],
                    'title' => $title,
                    'upload' => $upload[0],
                    'storage_id' => $storage->id
                ]);

                $this->domain->increment('total_posted');
            } else {
                // Delete uploaded file
                (new DeleteUploadedService($storage, $upload[1]))->handle();
                $this->cleanTemp();
            }
        } else {
            $this->video->resetPosted($this->domain);
            $this->cleanTemp();
        }
    }

    public function failed(Exception $exception)
    {
        if ($exception instanceof ServerException) {
            $this->video->resetPosted($this->domain);
        }

        if ($exception instanceof ProcessFailedException) {
            $output = $exception->getProcess()->getErrorOutput();
            
            if (str_contains($output, 'Retry limit reached')) {
                if (! str_contains($output, "Unexpected state: 'processing'")) {
                    $this->video->resetPosted($this->domain);
                }
            }
        }

        $this->cleanTemp();
    }

    protected function cleanTemp()
    {
        app('files')->deleteDirectory($this->temp);
    }
}

DeleteUploadedService, UploadService are just classes with Symfony\Process calling a docker container.

themsaid · 2018-02-12T19:03:13Z

@marianvlad please upgrade to 1.2.2 and then report your findings.

ghost · 2018-02-12T22:09:26Z

Our team just tried with 1.2.2 and we're still having the same issue of the workers being restarted, however we observe differences between this and running normally via terminal window.

dragosperca · 2018-02-13T08:16:21Z

If killing rogue/child processes sometimes work, sometimes doesn't - a solution would be to introduce a (worker) self-kill option:

when horizon starts, it creates a unique token stored in Redis (into a known key) and passed as a parameter to all workers
when a worker is executing the endless loop to look for jobs, it checks that the key with which it was initialised is the same as the one found in Redis
if it is, it continues the loop. If it doesn't, this means that instance is a rogue process and needs to exit processing & self-kill

This would introduce an extra lookup in Redis for each loop, but the cost is a small one to pay for not having rogue processes.

Also to consider is when the main horizon process is stopped, but rogue processes continue to exist. I suppose an extra check here would be a heart beat issued by the supervisor. If this heart beat is older than a few seconds, workers can safely die.

barryvdh · 2018-02-13T08:18:58Z

Isn't that exactly what queue:restart did and is now done with the last commits:

horizon/src/Console/TerminateCommand.php

Line 48 in 62c60ef

    
           $this->laravel['cache']->forever('illuminate:queue:restart', $this->currentTime());

dragosperca · 2018-02-13T11:33:29Z

I should have checked the last commits, but I believe there's a small difference. This cache check is implemented in the "queue:restart" command, and if this one is not executed, rogue processes will not die.

My thinking was to make this a bit more invasive, executed within the main loop cycle. It would link a worker process to a know token, and workers would die automatically if their parent no longer exists or the token changed.

taylorotwell · 2018-02-13T14:35:00Z

@TruckersMP-Kat we would need way more information to diagnose anything.

taylorotwell · 2018-02-13T14:35:11Z

@dragosperca are you currently using 1.2.2?

dragosperca · 2018-02-13T15:12:13Z

i am using Horizon on a project, and thought about using it for another larger, critical mission one. I don't experience the current issue, I was just reading all the issues and trying to understand failure points. My comment was just an idea to avoid rogue processes.

Before Horizon, we were using the standard worker & supervisor setup and did a similar thing you implemented in 1.2.2, that is, a marker that when set, would make workers die. Since supervisor was there to start them if any process died, we had a "self-kill and revive" process. We implemented it in the main (worker) loop though.

This marker would be set when new code would be deployed to the server.

Anyway, just want to say thank you for the great work you've been doing with Laravel and the ecosystem around it. We all love it!

taylorotwell · 2018-02-13T17:30:30Z

OK Thanks. We're still looking for any feedback from people running Horizon 1.2.2.

barryvdh · 2018-02-13T17:49:43Z

On my testserver this seems to be working correctly. I'll deploy it to staging/production tomorrow hopefully and see how it turns out.

Does it matter from which version the termininate command is run, in a zero-downtime environment ala Envoyer (so different release directories). I assume it's safe to just run it from the next release dir (to become current), just before symlinking?

marianvlad · 2018-02-13T17:51:54Z

Anyone with problems using failed() method inside job to do things like update something in database?

themsaid · 2018-02-13T17:53:29Z

@barryvdh I don't think it makes a difference.

@marianvlad different topic, different issue please :) This issue is already large and it'd be easier if we just keep it focused on one matter.

taylorotwell · 2018-02-17T00:00:10Z

Anyone? 😄

mfn · 2018-02-17T05:37:32Z

Sorry, soon; hope to upgrade on Monday!

We're basically quite constantly hit by it the every other day, but this week we gave 5.6 a priority first ;-) (PS: Thanks for the new logging infrastructure 👍 ).

aocneanu · 2018-02-17T05:48:25Z

I updated yesterday Horizon to 1.2.2 on five workers that are doing the same kind of queued jobs. A minute ago I ran terminate and purge on all the five workers, and on three of them I had one orphan process (one on each), the other two were clean. Before the update I had 5 to 10 orphans daily on every worker.

tomschlick · 2018-02-19T19:03:05Z

Just deployed 1.2.2 to our production environment at work. Will report back what we find over the next few days.

marianvlad · 2018-02-24T16:09:36Z

@taylorotwell After 1.2.2, everything works perfectly. No more orphan jobs and old code.

taylorotwell · 2018-02-26T13:18:46Z

@tomschlick you see anything? @marianvlad thanks!

tomschlick · 2018-02-26T14:34:05Z

@taylorotwell just checked our logs from last week and we saw two instances of orphans

Sending TERM Signal To Process: 9750
Observed Orphan: 8781
Observed Orphan: 8862
Observed Orphan: 10459

Sending TERM Signal To Process: 10498
Observed Orphan: 8781
Observed Orphan: 8862
Observed Orphan: 12571

Weirdly enough two of the processes had the same process id, even though the deploys took place 15 minutes apart. Horizon appeared to terminate them and restart correctly so not sure how that's possible 🤷‍♂️

themsaid · 2018-02-26T14:51:48Z

Just a notice, the horizon:purge command doesn't work as expected, so if you run it and get a single rogue process ignore it, it's not an indicator. Only if you get multiple process it means there are actual orphans.

So in this case the two orphans 8781 and 8862 are the actual orphans, however it seems that even the purge command didn't kill them so that could mean they're really stuck on a long process that the next loop didn't run yet.

What's your timeout value?

tomschlick · 2018-02-26T14:56:19Z

Timeout value is 1800 for most of our workers.

dbpolito · 2018-02-27T18:48:15Z

I'm not able to reproduce this issue anymore... So it looks like fixed to me... ❤️

dbpolito · 2018-02-27T18:49:41Z

As i created the ticket and seems it got fixed, i'm closing this one... We can start new tickets and mention this one if necessary.

vyuldashev · 2018-05-11T11:57:11Z

We are encountering strange issues. Sometimes one queue is stuck and does not process any jobs. Even when supervisor is stopped there are horizon:work processes in the list. horizon:purge also does not help as it does not find any.

Here is our config and the problem is only with default:

»> config('horizon')
=> [
     "use" => "queue",
     "prefix" => "horizon:",
     "waits" => [
       "redis:default" => 300,
     ],
     "trim" => [
       "recent" => 60,
       "failed" => 10080,
     ],
     "environments" => [
       "production" => [
         "supervisor-1" => [
           "connection" => "redis",
           "queue" => [
             "default",
           ],
           "balance" => "simple",
           "processes" => 10,
           "tries" => 0,
         ],
         "supervisor-2" => [
           "connection" => "redis",
           "queue" => [
             "sms",
             "phone_data",
           ],
           "balance" => "auto",
           "processes" => 10,
           "tries" => 0,
         ],
       ],
       "local" => [
         "supervisor-1" => [
           "connection" => "redis",
           "queue" => [
             "default",
           ],
           "balance" => "simple",
           "processes" => 3,
           "tries" => 3,
         ],
       ],
     ],
   ]

dbpolito closed this as completed Feb 27, 2018

ghobaty mentioned this issue Mar 13, 2018

Fix Purge always reporting 1 orphan process issue #311

Merged

jorgenb mentioned this issue Sep 18, 2019

kill not really killing orphaned Horizon procs #666

Closed

rafalglowacz mentioned this issue Apr 13, 2020

[4.x] Prevent horizon:purge from killing too many processes #820

Merged

This issue was closed.

Horizon Orphan Process #178

Horizon Orphan Process #178

Comments

dbpolito commented Oct 3, 2017

agustinprod commented Oct 3, 2017

themsaid commented Oct 5, 2017

dillingham commented Oct 5, 2017 • edited Loading

agustinprod commented Oct 6, 2017

jkudish commented Oct 11, 2017

taylorotwell commented Oct 11, 2017

jkudish commented Oct 11, 2017

dbpolito commented Oct 16, 2017

peterlupu commented Nov 1, 2017

mfn commented Nov 7, 2017

Regarding the "always wrongly 1 orphan"

Our orphaned process problem

ghobaty commented Nov 8, 2017

peterlupu commented Nov 10, 2017

jkudish commented Nov 15, 2017

Slickspacestech commented Jan 25, 2018

barryvdh commented Jan 30, 2018

royduin commented Jan 30, 2018

bbashy commented Jan 31, 2018 • edited Loading

marianvlad commented Feb 1, 2018

taylorotwell commented Feb 9, 2018

taylorotwell commented Feb 9, 2018

marianvlad commented Feb 9, 2018

taylorotwell commented Feb 9, 2018

bbashy commented Feb 9, 2018

taylorotwell commented Feb 9, 2018

themsaid commented Feb 9, 2018

marianvlad commented Feb 9, 2018

themsaid commented Feb 9, 2018

barryvdh commented Feb 9, 2018 • edited Loading

taylorotwell commented Feb 9, 2018

taylorotwell commented Feb 9, 2018

taylorotwell commented Feb 12, 2018 • edited Loading

marianvlad commented Feb 12, 2018 • edited Loading

themsaid commented Feb 12, 2018

ghost commented Feb 12, 2018

dragosperca commented Feb 13, 2018

barryvdh commented Feb 13, 2018

dragosperca commented Feb 13, 2018

taylorotwell commented Feb 13, 2018

taylorotwell commented Feb 13, 2018

dragosperca commented Feb 13, 2018

taylorotwell commented Feb 13, 2018

barryvdh commented Feb 13, 2018

marianvlad commented Feb 13, 2018

themsaid commented Feb 13, 2018

taylorotwell commented Feb 17, 2018

mfn commented Feb 17, 2018 • edited Loading

aocneanu commented Feb 17, 2018

tomschlick commented Feb 19, 2018

marianvlad commented Feb 24, 2018

taylorotwell commented Feb 26, 2018

tomschlick commented Feb 26, 2018

themsaid commented Feb 26, 2018

tomschlick commented Feb 26, 2018

dbpolito commented Feb 27, 2018

dbpolito commented Feb 27, 2018

vyuldashev commented May 11, 2018

dillingham commented Oct 5, 2017 •

edited

Loading

bbashy commented Jan 31, 2018 •

edited

Loading

barryvdh commented Feb 9, 2018 •

edited

Loading

taylorotwell commented Feb 12, 2018 •

edited

Loading

marianvlad commented Feb 12, 2018 •

edited

Loading

mfn commented Feb 17, 2018 •

edited

Loading