Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add design for repository maintenance job #7375

Merged
merged 1 commit into from
Mar 25, 2024

Conversation

qiuming-best
Copy link
Contributor

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #(issue)
#7291

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

@qiuming-best qiuming-best added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Feb 1, 2024
@github-actions github-actions bot added the Area/Design Design Documents label Feb 1, 2024
Copy link

codecov bot commented Feb 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.64%. Comparing base (270b1de) to head (ebd90bb).
Report is 51 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7375      +/-   ##
==========================================
- Coverage   61.76%   61.64%   -0.13%     
==========================================
  Files         262      263       +1     
  Lines       28433    28634     +201     
==========================================
+ Hits        17563    17651      +88     
- Misses       9640     9741     +101     
- Partials     1230     1242      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@qiuming-best qiuming-best force-pushed the repo-maintenance branch 2 times, most recently from 7968b25 to f68cbb9 Compare February 1, 2024 05:48
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Show resolved Hide resolved
"/tmp/credentials",
filesystem.NewFileSystem(),
)
cmd.CheckError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where can we see this error message? Will it dump to any of Velero's debug log or backupRepository CR's message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two kinds of messages, one is the fmt.Fprintf type of message, which would be redirected to termination-log, and another kind of message would be log could be retrieved by job log

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we wanna add a utility in velero to handle the error when the it's running in a job, to selectively dump the error in termination-log, so the velero server can read it via k8s api-server, and that should be added in the design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the FileHook section in the design, which could dump errors into termination-log

design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
"/tmp/credentials",
filesystem.NewFileSystem(),
)
cmd.CheckError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we wanna add a utility in velero to handle the error when the it's running in a job, to selectively dump the error in termination-log, so the velero server can read it via k8s api-server, and that should be added in the design.

design/repository-maintenance.md Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Outdated Show resolved Hide resolved
design/repository-maintenance.md Show resolved Hide resolved
design/repository-maintenance.md Show resolved Hide resolved
design/repository-maintenance.md Show resolved Hide resolved
@qiuming-best qiuming-best force-pushed the repo-maintenance branch 4 times, most recently from 801289c to f7aaaa0 Compare February 23, 2024 07:43
@qiuming-best qiuming-best force-pushed the repo-maintenance branch 4 times, most recently from 494354b to 210cfc1 Compare February 27, 2024 01:51
@qiuming-best qiuming-best force-pushed the repo-maintenance branch 6 times, most recently from ba46cc3 to 0d92c65 Compare February 28, 2024 02:37

Our CLI command is designed as follows:
```shell
$ velero repo-maintenance --repo-name $repo-name --repo-type $repo-type --backup-storage-location $bsl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this command trigger maintenance job every time the command is run? If so, wouldn't it be better to let users set a schedule for maintenance jobs? I hope I haven't misunderstood the design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this command calls the Kopia library or Restic to do maintenance and is used in a maintenance job, the command also reads some other configurations from the job such as environment variables and secret, etc. So using this command is less efficient than using Kopia or Restic to handle the repository directly for users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that repo (and BSL) needs to be accessible from where the command is being run? Wouldn't that run counter to the some other issues that are being worked on (such as #7344)? The idea is that BSL may not be reachable from the machine that runs the command. More over, repo may not even be on object storage. For these reasons, I think it is better to have the Velero server run maintenance job. I agree with the concerns mentioned in the document such as memory/CPU requirements but that is true even for large backups. More over, all Velero functionality is available in a declarative way so far and the requirement to run the command breaks that. I feel that users should be able to schedule the maintenance job just like backups and Velero server should run it as per schedule.

My comments may have come a bit late if code changes have already begun but I am curious to know what other developers think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command will be run in a k8s job and the job will be created by velero pod.
We do not expect user to run this command from the client side.

Signed-off-by: Ming Qiu <mqiu@vmware.com>
Comment on lines +139 to +142
# Only have one job one time
completions: 1
# Not parallel running job
parallelism: 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shawn-hurley have warned me in the past about using Jobs (from #2601 discussion) that this process has to be able to tolerate being started twice.

Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never", the same program may sometimes be started twice.

src

Keep that in mind, I haven't reviewed if repo maintenance starting twice "accidentally" would cause an issue or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the starting twice case would happen very close to each other when the jobs object status haven't been updated to Running yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My napkin algorithm would be to have some kind of leader election even if the expected running job is one. Such that if there is more than one, the later one is no-op or stopped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reminding me. If the job started twice "accidentally" for the maintenance scenario, it may have a bug but will not cause issues.

As the maintenance job updates the LastMaintenanceTime in backuprepositories CR only affects the next maintenance time which doesn't decide whether the repository is ready or not. it may lead to multiple maintenances in one maintenance frequency in this starting twice "accidentally" scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for the same repository, only the maintenance job which gets the repository file lock could run maintenance, and it's guaranteed by the repository itself that only one maintenance at one time for each repository

@qiuming-best qiuming-best merged commit 24941b4 into vmware-tanzu:main Mar 25, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area/Design Design Documents kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants