[FLINK-35126] Rework default checkpoint progress check window #850

Merged
merged 1 commit into from
Jul 15, 2024


@gyfora gyfora commented Jul 4, 2024

What is the purpose of the change

Currently, the checkpoint progress health check window is configured as a fixed Duration. This makes it hard to enable by default, since a sensible window depends on the job's checkpoint interval.

At the same time, the operator already contains logic for a minimum progress check interval computed from the checkpoint timeout, tolerable failures, and checkpoint interval.
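
As a rough illustration of that computation, the sketch below applies the formula quoted in this PR's documentation change, `max(checkpointingInterval, checkpointTimeout) * (tolerableCheckpointFailures + 2)`. The class and method names are hypothetical and are not the operator's actual code. The example values assume a checkpoint timeout equal to the 10 minute interval.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not the operator's actual class. */
public final class CheckpointProgressWindow {

    private CheckpointProgressWindow() {}

    /**
     * Minimum progress check window:
     * max(checkpointingInterval, checkpointTimeout) * (tolerableCheckpointFailures + 2).
     */
    public static Duration minimumWindow(
            Duration checkpointingInterval,
            Duration checkpointTimeout,
            int tolerableCheckpointFailures) {
        if (tolerableCheckpointFailures < 0) {
            throw new IllegalArgumentException("tolerableCheckpointFailures must be >= 0");
        }
        // Pick the longer of the two durations as the base unit of the window.
        Duration base =
                checkpointingInterval.compareTo(checkpointTimeout) >= 0
                        ? checkpointingInterval
                        : checkpointTimeout;
        return base.multipliedBy(tolerableCheckpointFailures + 2L);
    }

    public static void main(String[] args) {
        // Interval 10 minutes, assumed timeout 10 minutes, 0 tolerable failures.
        Duration window =
                minimumWindow(Duration.ofMinutes(10), Duration.ofMinutes(10), 0);
        System.out.println(window); // prints PT20M, i.e. 20 minutes
    }
}
```

Note that when the checkpoint timeout is longer than the interval, the timeout drives the result: with a 1 minute interval, a 10 minute timeout and 3 tolerable failures, this sketch yields 50 minutes.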

Furthermore, for any job with checkpointing enabled, this health check is very valuable to have on by default, similar to the restart health check. This PR therefore also proposes enabling the feature by default, with the window set to the minimum checkpoint progress check interval.

Brief change log

  • Enable checkpoint progress health check by default (for jobs with checkpointing configured)
  • Default the window to the calculated minimum that already enforces the lower bound

Verifying this change

Unit tests + manual verification

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no

<td>Duration</td>
<td>If no checkpoints are completed within the defined time window, the job is considered unhealthy. This must be bigger than checkpointing interval.</td>
<td>If no checkpoints are completed within the defined time window, the job is considered unhealthy. The minimum window size is `max(checkpointingInterval, checkpointTimeout) * (tolerableCheckpointFailures + 2)`, which also serves as the default value when checkpointing is enabled. For example with checkpoint interval 10 minutes and 0 tolerable failures, the default progress check window will be 20 minutes.</td>

Cool, that we added a description. This would be a hidden gem otherwise.

@gyfora gyfora merged commit cb90b10 into apache:main Jul 15, 2024
169 checks passed