Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set celery task limits and fix worker timeout #1486

Merged
merged 14 commits into from
Jun 7, 2024

Conversation

aarontp
Copy link
Member

@aarontp aarontp commented May 30, 2024

Description of the change

  • Sets various retry and timeout options for celery task setup
  • Updates the worker timeout to kill all child processes if execution time limit is reached
  • Adds counters for soft timeouts and uncaught exceptions
  • Also fixes issue with server timeout which was causing timed out tasks not to be cleaned up properly.
  • Removes unused SINGLE_RUN feature and config option.
  • Changes worker start args from '--pool=solo to --concurrency=1 as a solo pool executes tasks directly instead of forking and cannot process celery timeouts.

After this change there will be four different timeouts that can occur:

  • Turbinia worker timeout: This timeout happens when executing anything through TurbiniaTask.execute() and uses the timeout values set in the Job config in the turbinia config file.
  • Celery Soft timeout: This will send a catchable celery.exceptions.SoftTimeLimitExceeded exception to the worker and should be caught in the TurbiniaTask.execute() or TurbiniaTask.run_wrapper() methods depending on the state of the Task. This is set slightly longer than the Turbinia workout timeout to allow the worker to clean-up and give results when possible. The buffer is defined by task_manager.CELERY_SOFT_TIMEOUT_BUFFER.
  • Celery Hard timeout: After some buffer time after the other timeouts (task_manager.CELERY_HARD_TIMEOUT_BUFFER), celery is configured to kill the worker running the task. This is not catchable so we will not get any results back from the worker so the server will also need to time out the task.
  • Turbinia Server task timeout: If the server does not get any results back from the worker it will time out the task and consider it unrecoverable. This happens task_manager.SERVER_TASK_TIMEOUT_BUFFER seconds after the Task timeout. This buffer is large because the timer starts from when the task is enqueued and can't account for any scheduling and other wait time. This is to avoid timing out valid tasks due to worker congestion.

Applicable issues

Additional information

Checklist

  • All tests were successful.
  • Unit tests added.
  • Documentation updated.

@jleaniz
Copy link
Collaborator

jleaniz commented May 31, 2024

@aarontp removing '--pool=solo' from worker.py:253 should enable time limits and avoid the blocking issues we were experiencing.

@aarontp
Copy link
Member Author

aarontp commented May 31, 2024

@aarontp removing '--pool=solo' from worker.py:253 should enable time limits and avoid the blocking issues we were experiencing.

Indeed, that worked, thanks!

@aarontp aarontp marked this pull request as ready for review June 4, 2024 02:49
@aarontp aarontp requested a review from jleaniz June 4, 2024 02:49
Copy link
Collaborator

@jleaniz jleaniz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, leaving this for you to make any changes based on our chat. Thanks for the PR!

turbinia/worker.py Show resolved Hide resolved
turbinia/workers/__init__.py Show resolved Hide resolved
@aarontp aarontp requested a review from jleaniz June 6, 2024 21:20
@aarontp
Copy link
Member Author

aarontp commented Jun 6, 2024

PTAL, I added the soft limit exception to the execute() method as well.

Copy link
Collaborator

@jleaniz jleaniz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jleaniz jleaniz merged commit 7678a63 into google:master Jun 7, 2024
5 checks passed
aleu04 pushed a commit to aleu04/turbinia that referenced this pull request Jun 7, 2024
* Set celery task limits and fix worker timeout

* Remove solo pool and change concurrency=1

* Add soft/hard limit buffers

* Fix server timeout

* Late import for psutil

* small fixes, tests

* fix run tests

* test format string

* Fix process_result test

* fix execute test

* revert unnecessary config changes

* Handle soft timeout exception in execute()

* update timeout message

* Yaaaaaaaapf
@aarontp aarontp deleted the retry-fixes branch June 7, 2024 18:14
jleaniz added a commit that referenced this pull request Jun 13, 2024
* Added the TurbiniaRequest hashed object in Redis

* Converted TurbiniaTasks to hash objects in Redis

* Made get_request_data more efficient

* Made get_requests_summary more efficient

* Fixed some issues in request_status

* Remove GCP dependencies (#1440)

* Remove gcp dependencies

* Update dockerfiles

* Update dockerfiles

* Update gcp error reporting

* Updates to formatting

* Add unit test

* Update unit test

* Clean up

* Update unit test

* Update error reporting

* Update file

* Update config template

* Catch exception

* Updates

* fix lint

* Lint fixes

* Updates

* Updates

* Various updates and fixes

* Updates

* --- (#1483)

updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Updates

* Update typo

* Updates to unit tests

* Updates to unit tests and linter fixes

* Update table width UI

* Bump the pip group across 2 directories with 1 update (#1487)

Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests).
Bumps the pip group with 1 update in the /turbinia/api/cli directory: [requests](https://github.com/psf/requests).


Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix small UI bug

* Minor typos/errors

* Set celery task limits and fix worker timeout (#1486)

* Set celery task limits and fix worker timeout

* Remove solo pool and change concurrency=1

* Add soft/hard limit buffers

* Fix server timeout

* Late import for psutil

* small fixes, tests

* fix run tests

* test format string

* Fix process_result test

* fix execute test

* revert unnecessary config changes

* Handle soft timeout exception in execute()

* update timeout message

* Yaaaaaaaapf

* Review fixes

* Updates and yapf fix

* Minor updates

* Update docstrings

* Updates to evidence_size

* Updates

* Change log level for message

* Lint

* bug fixes

* Address review comments

* Fix docstrings

* Minor UI update

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Igor Rodrigues <igormr@nyu.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Aaron Peterson <aaronp@gmail.com>
jleaniz added a commit that referenced this pull request Jun 13, 2024
* Added the TurbiniaRequest hashed object in Redis

* Converted TurbiniaTasks to hash objects in Redis

* Made get_request_data more efficient

* Made get_requests_summary more efficient

* Fixed some issues in request_status

* Remove GCP dependencies (#1440)

* Remove gcp dependencies

* Update dockerfiles

* Update dockerfiles

* Update gcp error reporting

* Updates to formatting

* Add unit test

* Update unit test

* Clean up

* Update unit test

* Update error reporting

* Update file

* Update config template

* Catch exception

* Updates

* fix lint

* Lint fixes

* Updates

* Updates

* Various updates and fixes

* Updates

* --- (#1483)

updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...




* Updates

* Update typo

* Updates to unit tests

* Updates to unit tests and linter fixes

* Update table width UI

* Bump the pip group across 2 directories with 1 update (#1487)

Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests).
Bumps the pip group with 1 update in the /turbinia/api/cli directory: [requests](https://github.com/psf/requests).


Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...




* Fix small UI bug

* Minor typos/errors

* Set celery task limits and fix worker timeout (#1486)

* Set celery task limits and fix worker timeout

* Remove solo pool and change concurrency=1

* Add soft/hard limit buffers

* Fix server timeout

* Late import for psutil

* small fixes, tests

* fix run tests

* test format string

* Fix process_result test

* fix execute test

* revert unnecessary config changes

* Handle soft timeout exception in execute()

* update timeout message

* Yaaaaaaaapf

* Review fixes

* Updates and yapf fix

* Minor updates

* Update docstrings

* Updates to evidence_size

* Updates

* Change log level for message

* Lint

* bug fixes

* Address review comments

* Fix docstrings

* Minor UI update

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Igor Rodrigues <igormr@nyu.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Aaron Peterson <aaronp@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Redis timeout
2 participants