Horizon Orphan Process #178

Closed
dbpolito opened this issue Oct 3, 2017 · 59 comments

Comments

@dbpolito
Contributor

dbpolito commented Oct 3, 2017

I'm running Horizon on the latest Forge machine as documented: the daemon runs php artisan horizon, and deployments run php artisan horizon:terminate. But from time to time I need to manually run php artisan horizon:purge.

This is the output I just got, hours after the last release:

$ php artisan horizon:purge
Observed Orphan: 17043
Observed Orphan: 30634
Observed Orphan: 31084
Observed Orphan: 31085

I can confirm they're orphans by running htop in tree mode (press F5): these processes show up as top-level processes, not under the master php artisan horizon process.


Also, every time I run purge, it ALWAYS wrongly sees 1 process as an orphan:

$ php artisan horizon:purge
Observed Orphan: 17059

I haven't found a pattern yet for why this is happening.

@agustinprod

I'm having the same issue. I woke up to my main server using all the RAM and 4 GB of swap. Running purge killed 78 processes. I'm running Horizon on Ubuntu 16.04 with PHP 7.1.9-1.

@themsaid
Member

themsaid commented Oct 5, 2017

Hello everyone,

Ideally you shouldn't have any rogue processes running Horizon, but for exceptions like the ones you shared we built the purge command, which you can run on a regular basis if your server has many rogue processes. This is a safe approach until we're able to investigate the issue further.

So I recommend that you schedule the command to run every day.

On the other hand, can you please share some information about your setup? I'm trying to understand what might cause rogue processes.

@dillingham

dillingham commented Oct 5, 2017

Forge Daemon command: php artisan horizon directory: /home/forge/sitename.com/current
In Envoyer, when I added php artisan horizon:purge to my deployment hook, it purged 20+.
I began investigating when I noticed old code being used even though I run horizon:terminate.

@agustinprod

Our Envoy deploy script uses horizon:terminate to kill the processes; the high number of orphan processes may be related to this. We added horizon:purge to the deployment script too. @themsaid, if you find it difficult to reproduce this bug, we could arrange SSH access to my VPS.

@jkudish

jkudish commented Oct 11, 2017

Sorry to hijack the thread, but I'm seeing weird behaviour with Horizon running old code after I've deployed fresh code. Wondering if it's related to the orphans.

I've got my stuff set up through Envoyer/Forge.

I've set up the following hook in Envoyer. I currently have it running after Purge Old Releases. Is that where others set it up too, or does it make more sense to have it at a different spot?

image

For now, as a workaround, I have been manually restarting the daemon in Forge whenever I deploy, but this is annoying and defeats the purpose of having Envoyer in the first place:

image

I also notice that if I SSH into the server and run php artisan horizon:purge, it always finds at least 1 orphan process, but sometimes up to a dozen.

Any ideas @themsaid ?

@taylorotwell
Member

It's hard to track down why a process would go orphaned. You can technically run horizon:purge after every deploy if you wanted to.

@jkudish

jkudish commented Oct 11, 2017

Hi @taylorotwell :)

I do already, as per the screenshot above, but they still seem to go orphaned all the time. I don't really know what that means or if it's a real issue though. But it not running my latest code after a deploy is more of a problem.

@dbpolito
Contributor Author

Well, this basically happens when you run php artisan horizon:terminate while you have running jobs...

After one deploy:

$ php artisan horizon:purge
Observed Orphan: 548
Observed Orphan: 1879

$ php artisan horizon:purge
Observed Orphan: 548
Observed Orphan: 1898

$ php artisan horizon:purge
Observed Orphan: 1941

Interesting thing is that it displayed 548 twice.

After another deploy:

$ php artisan horizon:purge
Observed Orphan: 1857
Observed Orphan: 1862
Observed Orphan: 2301
$ php artisan horizon:purge
Observed Orphan: 2320

My job takes an average of 30 seconds, so maybe the problem is how much time terminate waits for the jobs to complete to kill them...

@peterlupu
Contributor

@jkudish I'm experiencing similar things on my server.
When I deploy, at the end, I run php artisan horizon:terminate.
Sometimes when running terminate, Sending TERM Signal To Process: **** is not displayed, even when run manually from the console.

I've checked this and something seems off:

  1. Made a Job (no params in constructor, so they're not serialized) which checks an array from a class for a number.
  2. If it cannot find said number, I throw an Exception.
  3. I deploy the code with an array of [1,2,3] and run the job checking for the number 5.
  4. It fails with the above Exception.
  5. I deploy my code again, adding 5 to the array, so the job would stop failing.
  6. But that doesn't happen, I end up having to run php artisan horizon:terminate a few more times before it picks up on the change. Not sure if it's time related.

@mfn
Contributor

mfn commented Nov 7, 2017

Hello everyone 😄

First off: great product!

However we deployed Horizon yesterday to production and it didn't take long until we realized we have orphaned processes :-/

Regarding the "always wrongly 1 orphan"

@dbpolito (et al.)

Also, every time I run purge, it ALWAYS wrongly sees 1 process as an orphan:

I can reproduce this. I'm always shown a PID that can't be traced to any process. I did some debugging (e.g. using pgrep -f … -a or even ps auxw snapshotting) but never "saw" a process with this particular ID.

I also used https://github.com/a2o/snoopy to log which processes are spawned, and this particular PID never showed up. E.g. in one case an orphaned PID 24575 was reported, but that specific PID wasn't tracked; other invocations were:

Nov  7 10:23:05 dev snoopy[24568]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -P 21632
Nov  7 10:23:05 dev snoopy[24570]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -P 21631
Nov  7 10:23:05 dev snoopy[24572]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon -a
Nov  7 10:23:05 dev snoopy[24574]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon:purge -a
Nov  7 10:23:05 dev snoopy[24576]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon
Nov  7 10:23:05 dev snoopy[24578]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon:purge

I don't think it's a coincidence that all pgrep invocations skip a PID when the next one is executed; like it is consuming a PID internally.

This PID is received from the call to:

$this->exec->run('pgrep -f horizon')

as part of \Laravel\Horizon\ProcessInspector::current.

But even when I run var_dump($this->exec->run('pgrep -f horizon -a')); I get a different PID at the end. I.e. two consecutive runs of pgrep -f horizon -a and pgrep -f horizon produce a different last PID in the result, which ends up being "the one orphan" I always see.
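
One possible explanation, not confirmed anywhere in this thread: PHP's exec() runs its command through sh -c, so (assuming $this->exec->run() ends up in exec()) the shell that launches pgrep carries "pgrep -f horizon" in its own command line and is matched by the pattern. That would fit the log above, where the reported PID 24575 sits between two logged pgrep PIDs. Such a PID is gone as soon as pgrep returns, so checking that a candidate still exists before treating it as an orphan would filter it out. A minimal shell sketch of that check (filter_live_pids is a hypothetical helper, not part of Horizon):

```shell
#!/bin/sh
# Read candidate PIDs on stdin, one per line, and print only those that
# still exist. `kill -0` sends no signal; it only tests whether the PID
# can be signalled by the current user.
filter_live_pids() {
    while read -r pid; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "$pid"
        fi
    done
}

# Intended usage (not executed here):
#   pgrep -f horizon | filter_live_pids
```

Note that kill -0 also fails for processes owned by another user, so this would need to run as the user that owns the workers.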

OTOH I noticed that due to using pgrep -f horizon the killing of processes is very greedy. I'm using supervisord and have configured my horizon master process to log into a file named horizon.log.
I was tailing this file with tail -f horizon.log and ./artisan horizon:purge will kill this command.

So depending on what other processes you have running on the system which have horizon in their name/arguments, this may be dangerous and you can experience random process killing. I guess the chances are slim, but it never hurts to know this.
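
To illustrate the greedy match, here is a small sketch on made-up process lines (the PIDs and arguments are sample data, and match_greedy / match_artisan are hypothetical helpers). Since pgrep -f matches its pattern against the full command line, a pattern of just horizon also hits tail -f horizon.log, while a narrower pattern such as 'artisan horizon' does not:

```shell
#!/bin/sh
# Sample "PID ARGS" lines; PIDs and arguments are made up for illustration.
sample_processes() {
    cat <<'EOF'
1613 php artisan horizon
1771 php artisan horizon:supervisor example-supervisor redis
1915 php artisan horizon:work redis --supervisor=example-supervisor
4242 tail -f horizon.log
EOF
}

# Same idea as `pgrep -f horizon`: the pattern may match anywhere in the line.
match_greedy() {
    awk '/horizon/ { print $1 }'
}

# Same idea as `pgrep -f 'artisan horizon'`: only artisan horizon commands match.
match_artisan() {
    awk '/artisan horizon/ { print $1 }'
}

# The greedy list includes 4242 (the tail process); the narrower one does not.
echo "greedy:   $(sample_processes | match_greedy | tr '\n' ' ')"
echo "narrower: $(sample_processes | match_artisan | tr '\n' ' ')"
```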

Our orphaned process problem

Yesterday we realized we had two orphaned processes. Their characteristics:

  • horizon:work processes
  • references --supervisor=… which was not present anymore
  • their parent PID was init and not any other PHP process (not horizon:supervisor nor horizon master)
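
A rough way to list candidates with these characteristics, assuming orphans are re-parented to PID 1 as on this machine (find_orphan_workers is a hypothetical helper name, and the ps invocation in the comment is only a suggested input source):

```shell
#!/bin/sh
# Read lines of "PID PPID ARGS..." on stdin and print the PIDs of
# horizon:work processes whose parent is init (PID 1).
find_orphan_workers() {
    awk '$2 == 1 && /horizon:work/ { print $1 }'
}

# Intended usage (not executed here):
#   ps -e -o pid= -o ppid= -o args= | find_orphan_workers
```

This only reads and filters; it kills nothing, so the result can be compared against what horizon:purge reports.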

We're running Ubuntu 14.04 LTS with supervisord 3.0b2-1. Here's what a healthy process tree looks like:

init(1)-+-acpid(1230)
        |-supervisord(1398)-+-php(1613)-+-php7.1(1771)-+-php7.1(1915)
        |                   |           |              `-php7.1(18509)
        |                   |           |-php7.1(1772)-+-php7.1(1914)
        |                   |           |              `-php7.1(1918)
        |                   |           |-php7.1(1773)---php7.1(1912)
        |                   |           |-php7.1(1774)---php7.1(1930)
        |                   |           |-php7.1(1775)-+-php7.1(1913)
        |                   |           |              `-php7.1(1917)
        |                   |           |-php7.1(1776)-+-php7.1(1916)
        |                   |           |              `-php7.1(1920)
        |                   |           |-php7.1(1777)---php7.1(1921)
        |                   |           |-php7.1(1778)-+-php7.1(1923)
        |                   |           |              |-php7.1(1924)
        |                   |           |              |-php7.1(1926)
        |                   |           |              `-php7.1(1927)
        |                   |           `-php7.1(1779)---php7.1(1928)
        |                   |-php(28235)
        |                   |-php(28236)
        |                   |-php(28269)
        |                   `-php(28409)

Here's how the rogue processes appeared (I removed the surroundings):

init(1)-+-acpid(1230)
        |-php(10430)
        |-php(10494)

I know these were the correct processes because with ps auxw I saw their arguments matching the information I described above.

When does re-parenting happen? E.g. https://unix.stackexchange.com/a/152400/7924

You can not start a process as the child of the shell, and then "reparent" it so another process becomes its parent.
init with PID 1 is an exception, processes can become its child as it collects processes that lost their original parent process.

So why can/does this happen in case of supervisor/horizon/etc.?

Some guesses (?):

  • the horizon supervisor process of these workers did not wait until they were finished
  • (because maybe the master horizon process did not wait or did not give enough time to the supervisor?)
  • ((because maybe the supervisord did not give enough time to the horizon master?))
  • (((maybe the workers were busy this very moment and couldn't finish but the supervisor exited?)))

Lots of guessing as you see. For now we will also add the purge command but seeing that there's something off it's clear a code fix would be desired. I can assist with more information / debugging once I get new orphaned processes.

PS: we're also using https://github.com/wa0x6e/Cake-Resque to drive background jobs from a CakePHP project. It has its own set of problems, but we never experienced this kind of orphaned process. Or at least not at that rate.

@ghobaty
Contributor

ghobaty commented Nov 8, 2017

We are seeing the same weird issue where Horizon keeps running old code, even after using php artisan horizon:terminate and php artisan horizon:purge (using forge daemon).
Any ideas what is going wrong and how to fix this?
@jkudish could you find any working workarounds?

@peterlupu
Contributor

@elghobaty Try also running php artisan queue:restart. Also, see #213.

@jkudish

jkudish commented Nov 15, 2017

@elghobaty I have my deploy restarting php-fpm and then running php artisan horizon:terminate and php artisan horizon:purge, and I haven't had any issues lately.

@Slickspacestech

I've noticed in our staging environment the same thing, orphaned horizon workers, stale code etc.

I find that using horizon:terminate while supervisor is running seems to cause it, whereas if I stop the Horizon supervisor, then pull, then start again, it's okay.

@barryvdh
Contributor

Should running 'purge' still let the workers finish their active jobs?

I'm also seeing a lot of orphaned processes; running purge kills them cleanly. I'm using Deployer (so different release directories, symlinking the active one to 'current') and Forge to manage the daemon.

@royduin

royduin commented Jan 30, 2018

Same issue here! And good question @barryvdh: the processes should finish their jobs, but do they with php artisan horizon:purge?

This is Taylor's article about deployments with Horizon on Forge: https://medium.com/@taylorotwell/deploying-horizon-to-laravel-forge-fc9e01b74d84, php artisan horizon:terminate should be enough...

@bbashy

bbashy commented Jan 31, 2018

I don't get this command

$ while true; do
while> php artisan horizon:purge
while> done
Observed Orphan: 28070
Observed Orphan: 28080
Observed Orphan: 28090
Observed Orphan: 28100
Observed Orphan: 28110
Observed Orphan: 28120
Observed Orphan: 28130
Observed Orphan: 28140
Observed Orphan: 28150
Observed Orphan: 28160
Observed Orphan: 28170
Observed Orphan: 28180
Observed Orphan: 28190
Observed Orphan: 28200
Observed Orphan: 28210
Observed Orphan: 28220
Observed Orphan: 28230
Observed Orphan: 28240
Observed Orphan: 28250
Observed Orphan: 28260
Observed Orphan: 28270
Observed Orphan: 28280
Observed Orphan: 28290
Observed Orphan: 28300
Observed Orphan: 28310
Observed Orphan: 28320
Observed Orphan: 28330
Observed Orphan: 28344
Observed Orphan: 28354
Observed Orphan: 28364
Observed Orphan: 28374
Observed Orphan: 28384
^C%

@marianvlad

I have the same problem. I'm using a daemon, and every time I deploy I need to kill processes manually with pkill -f horizon, because deleting the daemon still leaves some processes up with the old code. I'm using Horizon on another server and I don't need to do that there. It looks like something is wrong with Ubuntu or with the PHP function for killing processes, because artisan horizon:terminate doesn't work every time.

@taylorotwell
Member

Yeah I would like to figure out a root cause or maybe an easy recreation of the issue so we could track it down. I don't have a good understanding of why it is happening, since all horizon:work processes would be started from a supervisor and it's unclear to me how the parent supervisor could die without also killing all its child processes.

@taylorotwell
Member

Hi all,

We have tagged 1.2.0 with a slightly different approach to terminating processes. Can some of you try it and report if you see any improvements?

@marianvlad

I will try right now.

@taylorotwell
Member

@marianvlad thanks

@bbashy

bbashy commented Feb 9, 2018

Trying now

🎉

Updating laravel/horizon (v1.1.1 => v1.2.0): Downloading (100%)

@taylorotwell
Member

If this doesn't fix it I'll probably be putting a nice little bounty on this 😄

@themsaid
Member

themsaid commented Feb 9, 2018

Just a notice: the horizon:purge command doesn't work as expected, so if you run it and get a single rogue process, ignore it; it's not an indicator. Only if you get multiple processes does it mean there are actual orphans.
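
As a sketch, that rule of thumb can be applied to captured purge output like this (has_real_orphans is a hypothetical helper; the "Observed Orphan:" line format is taken from the outputs posted in this thread):

```shell
#!/bin/sh
# Read horizon:purge output on stdin. Succeed (exit status 0) only when more
# than one "Observed Orphan:" line is present, since a single reported PID
# is described above as not being a reliable indicator.
has_real_orphans() {
    count=$(grep -c '^Observed Orphan:' || true)
    [ "$count" -gt 1 ]
}

# Intended usage (not executed here):
#   php artisan horizon:purge | has_real_orphans && echo "real orphans found"
```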

@marianvlad

I ran horizon:terminate 3 times and then I deleted the daemon. In htop I have these: https://i.imgur.com/t9RFWuE.png

@themsaid
Member

themsaid commented Feb 9, 2018

@marianvlad these could be from before, run the horizon:purge command, start the daemon, then monitor from there.

@barryvdh
Contributor

barryvdh commented Feb 9, 2018

Can't we just send the queue:restart signal from horizon:terminate? That seems to work reliably.

@taylorotwell
Member

@barryvdh that could be an option. To be honest, I've forgotten why it didn't work like that in the first place but Horizon changed a bit while I was in the process of writing it.

We can wait until we get a bit more feedback on the 1.2.0 termination process and then consider other options if we're not seeing improvement.

@taylorotwell
Member

@marianvlad did you make sure you started with zero horizon processes? Any more feedback?

@taylorotwell
Member

taylorotwell commented Feb 12, 2018

Hey all,

Just wanted to note we have made some further tweaks on the latest 1.2.2 release, including using the queue:restart cache logic approach in addition to a couple other fixes.

Please let us know how this release works for you. We rely on your feedback ❤️

@marianvlad

marianvlad commented Feb 12, 2018

I did multiple tests on 1.2.0. On a clean project everything worked perfectly, but not in my real project.

My setup looks like this:

config/queue.php

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => 'default',
    'retry_after' => 7200, // because my jobs take a long time
    'block_for' => null, // I don't have this option; I'm on 5.5
],

config/horizon.php

'production' => [
    'supervisor-1' => [
        'connection' => 'redis',
        'queue' => ['default'],
        'balance' => false,
        'processes' => 3,
        'tries' => 1,
    ],
    'supervisor-2' => [
        'connection' => 'redis',
        'queue' => ['another-test'],
        'balance' => false,
        'processes' => 3,
        'tries' => 1,
    ],
    'supervisor-3' => [
        'connection' => 'redis',
        'queue' => ['test'],
        'balance' => false,
        'processes' => 5,
        'tries' => 1,
    ],
],

A sample of my job:

<?php

namespace App\Jobs;

use Bunch\Of\Classes; //

class PostJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $timeout = 3600;

    protected $domain;
    protected $video;
    protected $temp;

    /**
     * Create a new job instance.
     *
     * @return void
     */
    public function __construct(Domain $domain, Video $video)
    {
        $this->domain = $domain;
        $this->video = $video;

        // this temp is not visible inside failed method if I put in handle() method
        $this->temp = $this->tempPath();
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        if (! $condition) {
            // some checks
            $this->cleanTemp(); // clean directory
            return;
        }

        $melody = resolve(Melody::class)
            ->setDomain($this->domain->endpoint)
            ->setKey('key-blah-blah');

        // Check if domain is up before
        if ($melody->ping()) {
            $storage = $this->domain->storages->random();

            $twig = resolve('twig');

            $template = $this->domain->template();
            $thumbnail = $this->video->thumbnail_url;

            if (! $thumbnail && ! urlHealth($thumbnail)) {
                $this->thumbnail();

                $thumbnail = asset("storage/temp/".basename($this->temp)."/thumbnail.jpg");
            }

            // Create video preview
            $preview = (new PreviewService($this->domain, $this->video, $this->temp))->handle();

            $twigData = FormatVideo::twigAttributes($this->video);

            // Upload the video
            $upload = (new UploadService(
                $storage,
                $this->video->videoPath(),
                str_slug($twigData['username']) . '_' . str_slug($this->video->title . ' ' . $twigData['date']).'.'.last(explode('.', $this->video->videoPath()))
            ))->handle();

            if (! $upload) {
                $this->cleanTemp();
                return;
            }

            $storage->increaseUpload($this->video->size);

            $title = $twig->createTemplate($template['title'])->render($twigData);
            $description = str_limit($twig->createTemplate($template['description'])->render($twigData), 3000);
            $tags = FormatVideo::getTagsByType($this->video, $template, $twigData);

            $melody = $melody
                ->uploadVideo($title, $description, [
                    'media_type' => 'local',
                    'thumbnail'  => str_replace(' ', '+', $thumbnail),
                    'mp4'        => $preview,
                    'length'     => gmdate('H:i:s', $this->video->duration),
                    'category'   => $this->category(),
                    'tags'       => $tags,
                    'metas' => [
                        ['key' => 'source_url', 'value' => $this->video->source_url],
                        ['key' => 'download_link', 'value' => $upload[0]]
                    ]
                ]);

            if (array_has($melody, 'type') && $melody['type'] == 'success') {
                $this->cleanTemp();

                $this->video->finishPosting($this->domain, [
                    'source_id' => $melody['id'],
                    'url' => $melody['url'],
                    'title' => $title,
                    'upload' => $upload[0],
                    'storage_id' => $storage->id
                ]);

                $this->domain->increment('total_posted');
            } else {
                // Delete uploaded file
                (new DeleteUploadedService($storage, $upload[1]))->handle();
                $this->cleanTemp();
            }
        } else {
            $this->video->resetPosted($this->domain);
            $this->cleanTemp();
        }
    }

    public function failed(Exception $exception)
    {
        if ($exception instanceof ServerException) {
            $this->video->resetPosted($this->domain);
        }

        if ($exception instanceof ProcessFailedException) {
            $output = $exception->getProcess()->getErrorOutput();
            
            if (str_contains($output, 'Retry limit reached')) {
                if (! str_contains($output, "Unexpected state: 'processing'")) {
                    $this->video->resetPosted($this->domain);
                }
            }
        }

        $this->cleanTemp();
    }

    protected function cleanTemp()
    {
        app('files')->deleteDirectory($this->temp);
    }
}

DeleteUploadedService, UploadService are just classes with Symfony\Process calling a docker container.

@themsaid
Member

@marianvlad please upgrade to 1.2.2 and then report your findings.

@ghost

ghost commented Feb 12, 2018

Our team just tried with 1.2.2 and we're still having the same issue of the workers being restarted, however we observe differences between this and running normally via terminal window.

@dragosperca

If killing rogue/child processes sometimes works and sometimes doesn't, a solution would be to introduce a (worker) self-kill option:

  • when horizon starts, it creates a unique token stored in Redis (into a known key) and passed as a parameter to all workers
  • when a worker is executing the endless loop to look for jobs, it checks that the key with which it was initialised is the same as the one found in Redis
  • if it is, it continues the loop. If it isn't, this means that instance is a rogue process and needs to stop processing and kill itself

This would introduce an extra lookup in Redis for each loop, but the cost is a small one to pay for not having rogue processes.

Also to consider is the case where the main horizon process is stopped but rogue processes continue to exist. I suppose an extra check here would be a heartbeat issued by the supervisor. If this heartbeat is older than a few seconds, workers can safely die.
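
A minimal sketch of the proposed guard, written in shell for brevity. Everything here is hypothetical: a plain file stands in for the Redis key, timestamps are passed in as arguments, and none of the function names exist in Horizon.

```shell
#!/bin/sh
# True when the token stored in file $1 equals token $2, the one the worker
# was given when it started. The file is a stand-in for the Redis key.
token_is_current() {
    [ -r "$1" ] && [ "$(cat "$1")" = "$2" ]
}

# True when heartbeat timestamp $1 is at most $3 seconds older than "now" ($2).
heartbeat_is_fresh() {
    [ $(( $2 - $1 )) -le "$3" ]
}

# Guard evaluated once per loop iteration of a worker.
# Args: token_file my_token last_heartbeat now max_age_seconds
# Prints "continue" when the worker may pick up the next job, else "exit".
worker_decision() {
    if token_is_current "$1" "$2" && heartbeat_is_fresh "$3" "$4" "$5"; then
        echo continue
    else
        echo exit
    fi
}
```

A worker would call worker_decision at the top of each loop and stop as soon as it prints exit, which covers both a changed token (new deploy) and a supervisor that stopped sending heartbeats.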

@barryvdh
Contributor

Isn't that exactly what queue:restart did and is now done with the last commits:

$this->laravel['cache']->forever('illuminate:queue:restart', $this->currentTime());

@dragosperca

I should have checked the last commits, but I believe there's a small difference. This cache check is implemented in the "queue:restart" command, and if this one is not executed, rogue processes will not die.

My thinking was to make this a bit more invasive, executed within the main loop cycle. It would link a worker process to a known token, and workers would die automatically if their parent no longer exists or the token changed.

@taylorotwell
Member

@TruckersMP-Kat we would need way more information to diagnose anything.

@taylorotwell
Member

@dragosperca are you currently using 1.2.2?

@dragosperca

I am using Horizon on a project, and thought about using it for another, larger, mission-critical one. I don't experience the current issue; I was just reading all the issues and trying to understand failure points. My comment was just an idea to avoid rogue processes.

Before Horizon, we were using the standard worker & supervisor setup and did a similar thing you implemented in 1.2.2, that is, a marker that when set, would make workers die. Since supervisor was there to start them if any process died, we had a "self-kill and revive" process. We implemented it in the main (worker) loop though.

This marker would be set when new code would be deployed to the server.

Anyway, just want to say thank you for the great work you've been doing with Laravel and the ecosystem around it. We all love it!

@taylorotwell
Member

OK Thanks. We're still looking for any feedback from people running Horizon 1.2.2.

@barryvdh
Contributor

On my testserver this seems to be working correctly. I'll deploy it to staging/production tomorrow hopefully and see how it turns out.

Does it matter from which version the terminate command is run in a zero-downtime environment à la Envoyer (so different release directories)? I assume it's safe to just run it from the next release dir (to become current), just before symlinking?

@marianvlad

Anyone with problems using failed() method inside job to do things like update something in database?

@themsaid
Member

@barryvdh I don't think it makes a difference.

@marianvlad different topic, different issue please :) This issue is already large and it'd be easier if we just keep it focused on one matter.

@taylorotwell
Member

Anyone? 😄

@mfn
Contributor

mfn commented Feb 17, 2018

Sorry, soon; hope to upgrade on Monday!

We're hit by it pretty constantly, every other day or so, but this week we gave 5.6 priority first ;-) (PS: Thanks for the new logging infrastructure 👍 ).

@aocneanu

Yesterday I updated Horizon to 1.2.2 on five workers that are doing the same kind of queued jobs. A minute ago I ran terminate and purge on all five workers; on three of them I had one orphan process each, and the other two were clean. Before the update I had 5 to 10 orphans daily on every worker.

@tomschlick

Just deployed 1.2.2 to our production environment at work. Will report back what we find over the next few days.

@marianvlad

@taylorotwell After 1.2.2, everything works perfectly. No more orphan jobs and old code.

@taylorotwell
Member

@tomschlick you see anything? @marianvlad thanks!

@tomschlick

@taylorotwell just checked our logs from last week and we saw two instances of orphans

Sending TERM Signal To Process: 9750
Observed Orphan: 8781
Observed Orphan: 8862
Observed Orphan: 10459
Sending TERM Signal To Process: 10498
Observed Orphan: 8781
Observed Orphan: 8862
Observed Orphan: 12571

Weirdly enough two of the processes had the same process id, even though the deploys took place 15 minutes apart. Horizon appeared to terminate them and restart correctly so not sure how that's possible 🤷‍♂️

@themsaid
Member

Just a notice: the horizon:purge command doesn't work as expected, so if you run it and get a single rogue process, ignore it; it's not an indicator. Only if you get multiple processes does it mean there are actual orphans.

So in this case the two orphans 8781 and 8862 are the actual orphans, however it seems that even the purge command didn't kill them so that could mean they're really stuck on a long process that the next loop didn't run yet.
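
PIDs that keep showing up across purge runs can be pulled out of a deploy log with standard tools. A minimal sketch, assuming the log contains "Observed Orphan:" lines in the format posted above (recurring_orphans is a hypothetical helper):

```shell
#!/bin/sh
# Read one or more horizon:purge outputs on stdin and print each PID that
# was reported as an orphan more than once.
recurring_orphans() {
    sed -n 's/^Observed Orphan: \([0-9][0-9]*\)$/\1/p' | sort -n | uniq -d
}

# Intended usage (not executed here), with a hypothetical log file name:
#   recurring_orphans < deploy.log
```

Fed the eight log lines from the comment above, this prints 8781 and 8862, the two PIDs that appear after both deploys.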

What's your timeout value?

@tomschlick

Timeout value is 1800 for most of our workers.

@dbpolito
Contributor Author

I'm not able to reproduce this issue anymore... So it looks fixed to me... ❤️

@dbpolito
Contributor Author

As I created the ticket and it seems to be fixed, I'm closing this one... We can start new tickets and mention this one if necessary.

@vyuldashev

We are encountering strange issues. Sometimes one queue is stuck and does not process any jobs. Even when supervisor is stopped there are horizon:work processes in the list. horizon:purge also does not help as it does not find any.

Here is our config and the problem is only with default:

>>> config('horizon')
=> [
     "use" => "queue",
     "prefix" => "horizon:",
     "waits" => [
       "redis:default" => 300,
     ],
     "trim" => [
       "recent" => 60,
       "failed" => 10080,
     ],
     "environments" => [
       "production" => [
         "supervisor-1" => [
           "connection" => "redis",
           "queue" => [
             "default",
           ],
           "balance" => "simple",
           "processes" => 10,
           "tries" => 0,
         ],
         "supervisor-2" => [
           "connection" => "redis",
           "queue" => [
             "sms",
             "phone_data",
           ],
           "balance" => "auto",
           "processes" => 10,
           "tries" => 0,
         ],
       ],
       "local" => [
         "supervisor-1" => [
           "connection" => "redis",
           "queue" => [
             "default",
           ],
           "balance" => "simple",
           "processes" => 3,
           "tries" => 3,
         ],
       ],
     ],
   ]

This issue was closed.