Improve Samza AM retry count logging #1701

Merged: 3 commits merged on Aug 8, 2024

MikeBarskii (Contributor)

LISAMZA-43659

Description: According to org/apache/samza/clustermanager/ContainerProcessManager.java:520, the whole app is shut down because of too many container failures only when both of the following rules hold:

  1. The failure count for a task group ID must be greater than the configured retry count.
  2. The last failure (the one prior to the current one) must have happened less than the retry window (in ms) ago.
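
The two rules above combine into a single predicate. The following is a minimal sketch, not Samza's actual ContainerProcessManager code: the class and method names (RetryPolicySketch, shouldShutdown) are hypothetical, and it assumes the failure count already includes the current failure and that all durations are in milliseconds.

```java
// Hypothetical sketch of the two shutdown rules; not the real ContainerProcessManager logic.
final class RetryPolicySketch {

  private RetryPolicySketch() {}

  /**
   * Returns true when both shutdown rules hold for a processor.
   *
   * @param failureCount       failures for the task group, assumed to include the current one
   * @param retryCount         configured retry count
   * @param msSincePrevFailure ms between the previous failure and the current one
   * @param retryWindowMs      configured retry window in ms
   */
  static boolean shouldShutdown(int failureCount, int retryCount,
                                long msSincePrevFailure, long retryWindowMs) {
    // Rule 1: the failure count must be strictly greater than the retry count.
    boolean tooManyFailures = failureCount > retryCount;
    // Rule 2: the previous failure must fall strictly inside the retry window.
    boolean withinWindow = msSincePrevFailure < retryWindowMs;
    // Both conditions are required; either one alone does not trigger a shutdown.
    return tooManyFailures && withinWindow;
  }
}
```

Because the rules are joined with a logical AND, a processor that fails often, but with failures spaced further apart than the retry window, does not meet the shutdown condition described here.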

Issue: The log message at org/apache/samza/clustermanager/ContainerProcessManager.java:575 does not clearly reflect point 2 of this behavior:

Processor ID: {} (current Container ID: {}) has failed {} times, with last failure {} ms ago. This is greater than retry count of {} and window of {} ms

Fix: Add information about point 2 to the log message:

Processor ID: {} (current Container ID: {}) has failed {} times. This is greater than the retry count of {}. The failure occurred {} ms after the previous one, which is less than the retry window of {} ms.
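
To show how the new message reads once its placeholders are filled in, here is a dependency-free sketch. Samza logs through an SLF4J-style logger with {} placeholders; this sketch swaps those for String.format specifiers so it runs without SLF4J. The class and method names (RetryLogMessageSketch, format) and any argument values are hypothetical.

```java
// Hypothetical sketch that renders the new log line; uses String.format instead of SLF4J {} placeholders.
final class RetryLogMessageSketch {

  private RetryLogMessageSketch() {}

  // Same wording as the new log line, with each {} replaced by a format specifier.
  private static final String TEMPLATE =
      "Processor ID: %s (current Container ID: %s) has failed %d times. "
          + "This is greater than the retry count of %d. "
          + "The failure occurred %d ms after the previous one, "
          + "which is less than the retry window of %d ms.";

  static String format(String processorId, String containerId, int failureCount,
                       int retryCount, long msSincePrevFailure, long retryWindowMs) {
    // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
    return String.format(java.util.Locale.ROOT, TEMPLATE, processorId, containerId,
        failureCount, retryCount, msSincePrevFailure, retryWindowMs);
  }
}
```

Compared with the old message ("with last failure {} ms ago ... window of {} ms"), the new wording states the time between the two failures and explicitly compares it against the retry window, which is the condition in point 2.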

Michael Barskii and others added 3 commits July 12, 2024 13:03
* Print current timestamp

* Fix typo

* Fix build issue related to grolifant okhttp

---------

Co-authored-by: Haolan Ye <hye@linkedin.com>
dxichen merged commit 5b4b1b7 into apache:master on Aug 8, 2024
1 check passed