RQ: implement reliable timeout #4305

Closed · rauchy wants to merge 9 commits

Conversation

@rauchy (Contributor) commented Nov 13, 2019

RQ's timeout implementation is based on a signal invoked in the "work horse" process, which might be blocked by the executing code. We should implement a more reliable timeout -- probably by handling it in the parent process.

Relevant RQ issues: rq/rq#323, rq/rq#1142
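
For context, the pattern being described is RQ's in-process, signal-based timeout: the work horse arms a SIGALRM whose handler raises an exception when the limit elapses. A minimal sketch of that pattern (illustrative only; the exception class and helper below are not RQ's actual code):

import signal

class JobTimeoutException(Exception):
    """Raised inside the work horse when the alarm fires."""

def timeout_handler(signum, frame):
    raise JobTimeoutException("job exceeded its timeout")

def run_with_signal_timeout(func, timeout_seconds):
    # Arm SIGALRM; the handler raises inside this same process.
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout_seconds)
    try:
        return func()
    finally:
        signal.alarm(0)  # cancel any pending alarm

The weakness is that the Python-level handler only runs once the interpreter regains control between bytecodes; a job stuck in a blocking C extension call may never see the exception, which is why enforcing the limit from the parent (worker) process is more reliable.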

@arikfr (Member, Author) commented Oct 28, 2019

RQ's timeout functionality is called in the work horse process:

https://github.com/rq/rq/blob/e43bce4467c3e1800d75d9cedf75ab6e7e01fe8c/rq/worker.py#L815

@rauchy (Contributor) commented Nov 13, 2019

As suggested in rq/rq#323, the most reliable approach here would be to enforce the limit both inside the work horse (as it is currently implemented) and with a fallback limit in the worker. This is effectively a mechanism that resembles Celery's hard & soft limits.

One problem I see is that if we implement the hard limit using an alarm, we'll have to give up on work horse monitoring by the worker, since that uses an alarm as well and a process can only have one pending alarm at a time. We can get around this using a threading timer. WDYT?
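
A rough sketch of the threading-timer idea (hypothetical helper, not code from this PR): the worker starts a background timer next to its alarm-based monitoring and force-kills the work horse once the hard limit passes.

import os
import signal
import threading

def start_hard_limit_timer(horse_pid, hard_limit_seconds):
    # Runs in a background thread, so it doesn't compete with the
    # SIGALRM-based horse monitoring in the worker's main thread.
    def kill_horse():
        try:
            os.kill(horse_pid, signal.SIGKILL)
        except OSError:
            pass  # the work horse already exited

    timer = threading.Timer(hard_limit_seconds, kill_horse)
    timer.daemon = True
    timer.start()
    return timer  # cancel() this when the job finishes in time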

@arikfr (Member, Author) commented Nov 13, 2019

we'll have to give up on work horse monitoring by the worker

What are you referring to here?

@rauchy (Contributor) commented Nov 13, 2019

This line sets up an alarm that triggers a HorseMonitorTimeoutException after (probably) 30s and helps maintain an active heartbeat. If we use a Unix alarm for the hard limit, we'll have to give up on that functionality. The only way I can see to keep both intact is a threading timer for the job timeout.
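
Paraphrasing that monitoring loop (a simplified sketch, not RQ's exact code; send_heartbeat is a hypothetical placeholder): the worker arms an alarm around its blocking waitpid call, and each time the alarm fires it refreshes the heartbeat and goes back to waiting.

import os
import signal

class HorseMonitorTimeoutException(Exception):
    pass

def monitor_work_horse(horse_pid, job_monitoring_interval=30):
    def handler(signum, frame):
        raise HorseMonitorTimeoutException()

    signal.signal(signal.SIGALRM, handler)
    while True:
        try:
            signal.alarm(job_monitoring_interval)
            _, status = os.waitpid(horse_pid, 0)  # blocks until the horse exits
            signal.alarm(0)
            return status
        except HorseMonitorTimeoutException:
            # Work horse is still running: refresh the heartbeat and keep waiting.
            send_heartbeat()

def send_heartbeat():
    pass  # placeholder; in RQ this extends the worker/job TTLs in Redis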

@arikfr (Member, Author) commented Nov 13, 2019

@rauchy we can have a single alarm signal used for both things, no?

@rauchy (Contributor) commented Nov 13, 2019

You can only schedule a single alarm at a time. Are you suggesting we catch HorseMonitorTimeoutException and check whether job.timeout has already passed? The downside is that we are bound to intervals of job_monitoring_interval, so the hard limit could kick in anywhere between 0 and 30 seconds (typically) after the timeout has passed, which might be too soon (the soft limit inside the work horse should get a chance to fire first). Then again, we can simply wait for a delta of 15 seconds beyond the timeout so that it'll be killed on the next iteration.
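
That check could look roughly like this (a sketch of the approach under discussion; the helper name and the 15-second grace constant come from this conversation, not from RQ itself):

from datetime import datetime

GRACE_PERIOD = 15  # seconds to wait beyond job.timeout before the hard kill

def soft_limit_exceeded(job, monitor_started):
    # monitor_started is the UTC datetime recorded when the worker began
    # monitoring the work horse for this job.
    seconds_under_monitor = (datetime.utcnow() - monitor_started).total_seconds()
    return seconds_under_monitor > job.timeout + GRACE_PERIOD

The worker would run this check whenever it catches HorseMonitorTimeoutException and only then terminate the work horse, giving the in-horse soft limit a chance to fire first.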

@arikfr (Member, Author) commented Nov 13, 2019

Are you suggesting to catch HorseMonitorTimeoutException and check if job.timeout has already passed?

Exactly.

The downside is indeed a real one. We can experiment with lowering the monitoring interval, but we need to make sure it has no unwanted side effects.
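
For reference, RQ exposes that interval as a worker constructor argument, so lowering it would be a configuration change roughly like this (queue and connection names are illustrative):

from redis import Redis
from rq import Queue, Worker

redis_conn = Redis()
# Check on the work horse every 5 seconds instead of the default 30.
worker = Worker(
    [Queue("default", connection=redis_conn)],
    connection=redis_conn,
    job_monitoring_interval=5,
)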

@rauchy requested a review from arikfr on November 13, 2019 21:32
redash/worker.py (Outdated)
@@ -93,3 +97,68 @@ def add_periodic_tasks(sender, **kwargs):
    for params in extensions.periodic_tasks.values():
        # Add it to Celery's periodic task registry, too.
        sender.add_periodic_task(**params)


class HardTimeLimitingWorker(Worker):

Review comment (Member):

How about we move this into its own module in the redash.tasks package? The redash.schedule module belongs there as well.

I would move all the rq stuff from redash.worker into redash.tasks and eventually remove this module when we say goodbye to Celery.

Review comment (Member):

Also worth adding some documentation on why we added this class.

@arikfr (Member) left a review comment:

Just one small thing.

@@ -3,3 +3,5 @@
refresh_schemas, cleanup_query_results, empty_schedules)
from .alerts import check_alerts_for_query
from .failure_report import send_aggregated_errors
from .hard_limiting_worker import *
from .schedule import *

Review comment (Member):

Nit: please don't do * imports.

@arikfr (Member, Author) commented Nov 17, 2019

One more thing for this PR: we need to make sure that all the tasks have sensible timeout settings.

@rauchy (Contributor) commented Nov 26, 2019

One more thing for this PR: we need to make sure that all the tasks have sensible timeout settings.

All jobs have the default timeout of 3 minutes at the moment, except for sync_user_details, which maxes out at 60 seconds. These feel like sensible values to me. WDYT?

Obviously, when we get to query executions, these will be dynamic.
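
For illustration, per-job timeouts in RQ are set on the queue or at enqueue time; something along these lines, assuming RQ >= 1.0 (the task function here is a stand-in, and Redash's actual call sites may differ):

from redis import Redis
from rq import Queue

redis_conn = Redis()
# 3-minute default for everything enqueued on this queue.
queue = Queue("default", connection=redis_conn, default_timeout=180)

def sync_user_details():
    ...  # stand-in for the real task

# Override the default for a specific job (RQ >= 1.0 calls this job_timeout).
queue.enqueue(sync_user_details, job_timeout=60)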

@arikfr (Member, Author) commented Dec 5, 2019

Do we need to merge this, or does #4413 cover this too?

@rauchy (Contributor) commented Dec 10, 2019

I based #4413 on this because of the similarities in the custom worker. Given that #4413 will take some more time for testing, if you think this is ready to ship, let's merge it on its own.

grace_period = 15

def soft_limit_exceeded(self, job):
    seconds_under_monitor = (utcnow() - self.monitor_started).seconds

Review comment:

Small consideration. The timedelta docs say that .seconds satisfies 0 <= seconds < 3600*24 (the number of seconds in one day). If jobs that run longer than a day are supported, this should probably use .total_seconds() instead.

Suggested change:
-    seconds_under_monitor = (utcnow() - self.monitor_started).seconds
+    seconds_under_monitor = (utcnow() - self.monitor_started).total_seconds()

@arikfr (Member, Author) commented Jan 19, 2020

#4413 is merged, so I guess we can close this one?

@arikfr (Member, Author) commented Jan 21, 2020

Ping, @rauchy.

@rauchy (Contributor) commented Jan 21, 2020

Yes, this is all included in master by now.

@rauchy closed this Jan 21, 2020
@jezdez deleted the hard-time-limit branch January 22, 2020 10:38