
Redis database grew too big, causing general Redis troubles #271

Closed
mfn opened this issue Jan 4, 2018 · 8 comments

mfn commented Jan 4, 2018

Today we had the following issue:

  • our Redis clients sporadically couldn't connect to Redis
  • upon inspection we found that the Redis database holding Horizon had > 1 million keys
  • its size accounted for ~960 MB (i.e. 99%) of our whole Redis instance

Errors we received were:

  • Connection closed
  • Redis::pconnect(): connect() failed: Connection timed out
  • RedisException: read error on connection

(Note: these errors were recorded from non-Laravel, PHP-based applications; i.e. as explained below, the Horizon database size seemed to affect Redis as a whole.)

What we did:

  • stopped Horizon via supervisor
  • ran flushdb in the Horizon database (warning: be sure you've selected the right database; see the sketch after this list)
    Running that command took 40 seconds
  • started horizon via supervisor
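For reference, here is a minimal phpredis sketch of the flush step, assuming database 2 as in our setup; host, port and the database number are placeholders, and Horizon should already be stopped:

```php
<?php
// Hypothetical recovery sketch (phpredis). Host/port and database number are
// placeholders from our setup; stop Horizon before running this.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$redis->select(2);            // the database Horizon writes to in our setup
var_dump($redis->dbSize());   // sanity check: does the key count match the runaway database?

// Destructive: removes every key in the currently selected database only.
$redis->flushDB();
```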

In our case the Horizon Redis database was number 2; here's the relevant line from the output of Redis' internal INFO command:

db2:keys=1192438,expires=22,avg_ttl=89314947

The avg_ttl looks suspiciously high.

Usually when inspecting such problems I use the KEYS * command, but I didn't dare run it here, as it blocks Redis completely while it runs and we couldn't do that in production.

As such, at this time we don't have any information about what keys were in there.
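In hindsight, SCAN would have been a non-blocking way to at least sample what was in there. A rough phpredis sketch, again assuming database 2; the prefix grouping is just a guess at a useful breakdown, not anything Horizon-specific:

```php
<?php
// Non-blocking inspection sketch using SCAN instead of KEYS * (phpredis).
// The database number and the prefix grouping are assumptions from our setup.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->select(2);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);

$counts = [];
$it = null;
while ($keys = $redis->scan($it, '*', 1000)) {            // iterate the keyspace in batches
    foreach ($keys as $key) {
        // Group by the first two colon-separated segments to see which prefixes dominate.
        $prefix = implode(':', array_slice(explode(':', $key), 0, 2));
        $counts[$prefix] = ($counts[$prefix] ?? 0) + 1;
    }
}
arsort($counts);
print_r(array_slice($counts, 0, 20, true));               // top 20 prefixes by key count
```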

We can definitely rule out other applications writing to the same database; it is used exclusively by Horizon.

Our configuration:

  • 13 queues
  • 38 workers
  • ~60 jobs per Minute
  • we had enabled monitoring for each job (each job has a default queue)
  • horizon:snapshot running every 5 minutes (see the scheduler sketch after this list)
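For context, the snapshot command is scheduled the standard Laravel way; a sketch assuming the usual console Kernel setup from the Horizon docs:

```php
<?php
// app/Console/Kernel.php (excerpt): how the 5-minute snapshot is typically wired up.
// Standard Laravel scheduler usage; class names and paths are the framework defaults.

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Take metric snapshots for the Horizon dashboard every five minutes.
        $schedule->command('horizon:snapshot')->everyFiveMinutes();
    }
}
```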

Does anyone have a clue what could cause this?
We have been running this in production for approximately two months now.


mfn commented Jan 5, 2018

After this "reset" yesterday, today's INFO output for that database looks much saner:

db2:keys=4312,expires=4265,avg_ttl=1648948

When I look at the output from yesterday again, something feels very off: keys=1192438,expires=22 => only 22 keys set to expire, out of over 1 million (if I read that output correctly).
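One way to verify that reading next time, without blocking Redis, would be to sample keys via SCAN and count how many have no TTL. A rough sketch, assuming phpredis and database 2 again:

```php
<?php
// Sketch: count keys without an expiry (TTL == -1) to cross-check the expires=22 figure.
// phpredis and database 2 are assumptions from our setup.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->select(2);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);

$total = 0;
$noTtl = 0;
$it = null;
while ($keys = $redis->scan($it, '*', 1000)) {
    foreach ($keys as $key) {
        $total++;
        if ($redis->ttl($key) === -1) {   // -1 means the key exists but has no expiry set
            $noTtl++;
        }
    }
}
printf("%d of %d keys have no TTL\n", $noTtl, $total);
```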


mfn commented Jan 6, 2018

Today's INFO output:

db2:keys=889,expires=848,avg_ttl=1530403

I'm going to watch this a few days.

One thing I remember we did before the problem: we enabled lots of "Monitor Tags", basically one for every job type we have.

We haven't done this again since we purged the database.

Can this be connected?


fgilio commented Jul 29, 2018

I arrived here after having all our "Monitor Tags" disappear, and I think it might be related to this.

We're processing hundreds of jobs per minute, and the Monitoring tab would show thousands of entries (some around 100k) in the Jobs column; now there's no tag being monitored at all.

EDIT: Maybe Horizon could add a counter-type monitor, so it would only increment a counter instead of keeping a record of every job. At least that's what I needed in this case.
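To illustrate the idea (purely hypothetical, not Horizon's API; the key name and helper are made up): a single counter per tag keeps memory roughly constant no matter how many jobs run.

```php
<?php
// Illustration of the counter-monitor idea only; this is not how Horizon works today.
// The key name and the helper function are made up for this sketch (phpredis assumed).
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function recordJobForTag(Redis $redis, string $tag): void
{
    // One integer per tag instead of one record per job, so memory stays flat.
    $redis->incr('monitor:counter:' . $tag);
}

recordJobForTag($redis, 'App\Jobs\SendInvoice');   // example tag
```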


mfn commented Jul 30, 2018

We never re-enabled "tagging" for each job, and the problem never appeared again. I didn't bother to investigate further, as we really didn't need the detailed metrics (it was just "nice to have").


ndberg commented Sep 13, 2018

Had the same problem occur three times already. The Horizon Redis database keeps growing. Does anybody have a solution for this?


mfn commented Sep 13, 2018

@ndberg after we disabled tagging, we never had this problem again. But on the other hand, I don't recall creating that many jobs at once again either (> 1 million).


ndberg commented Sep 13, 2018

So I should test disabling tagging. I have used tags for all jobs, and I have a similar environment to yours, with fewer queues and workers:

  • 3 queues
  • 5 workers
  • ~1,000 jobs / day
  • enabled monitoring for each job (each job has a default queue)
  • horizon:snapshot running every 5 minutes

driesvints (Member) commented:

This could be solved by #333. I'll keep this open for now so we don't lose track of it.

This issue was closed.