Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add log retention for database logger #8622

Merged
merged 5 commits into from
Mar 28, 2024
Merged

Conversation

erikwilson
Copy link
Contributor

@erikwilson erikwilson commented Dec 29, 2023

Description

Adds log retention for database logger.

Test Plan

Automated tests added.

Manual testing:

  1. Deploy / start a cluster with this change.
  2. Deploy an hpsearch experiment configured with a log retention of 0 days.
  3. Run the log cleanup command to ensure the logs are removed.
  4. Modify the server configuration to use a short duration or cron expression for log cleanup.
  5. Restart the server.
  6. Start another hpsearch experiment with default configuration.
  7. Ensure logs are retained between cleanup cycles.
  8. Update experiment to have log retention of 0 days.
  9. Ensure that logs are removed on next cleanup cycle.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

DET-9995

@cla-bot cla-bot bot added the cla-signed label Dec 29, 2023
Copy link

netlify bot commented Dec 29, 2023

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit ee1e140
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6604ab72610d1a0008b899fd
😎 Deploy Preview https://deploy-preview-8622--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

master/pkg/model/logging_config.go Outdated Show resolved Hide resolved
master/internal/api_master.go Outdated Show resolved Hide resolved
master/internal/api_master.go Outdated Show resolved Hide resolved
master/internal/api_master.go Outdated Show resolved Hide resolved
master/internal/core.go Outdated Show resolved Hide resolved
master/pkg/model/logging_config.go Outdated Show resolved Hide resolved
master/internal/rm/agentrm/agent.go Outdated Show resolved Hide resolved
master/internal/db/postgres_tasks.go Outdated Show resolved Hide resolved
master/internal/db/postgres_tasks.go Outdated Show resolved Hide resolved
master/internal/db/postgres_tasks.go Outdated Show resolved Hide resolved
@erikwilson
Copy link
Contributor Author

erikwilson commented Jan 9, 2024

For reference we had a discussion about the golang duration & postgres interval stuff, and decided to keep duration as days. This is because golang duration parsing does not support units larger than hours, we would need to create or find a wrapper for postgres support, and postgres interval math is complex and could be problematic. Here is a blog post with some more info:
https://blog.bossylobster.com/2021/10/go-duration-postgresql-interval.html

Copy link

codecov bot commented Jan 9, 2024

Codecov Report

Attention: Patch coverage is 56.15385% with 57 lines in your changes are missing coverage. Please review.

Project coverage is 47.17%. Comparing base (8f5de35) to head (ee1e140).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8622      +/-   ##
==========================================
+ Coverage   47.14%   47.17%   +0.03%     
==========================================
  Files        1150     1152       +2     
  Lines      141671   141773     +102     
  Branches     2415     2417       +2     
==========================================
+ Hits        66785    66888     +103     
+ Misses      74696    74695       -1     
  Partials      190      190              
Flag Coverage Δ
backend 43.00% <59.45%> (+0.14%) ⬆️
harness 64.31% <36.84%> (-0.02%) ⬇️
web 38.96% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
master/internal/checkpoint_gc.go 66.30% <100.00%> (+0.37%) ⬆️
master/internal/command/command.go 67.96% <100.00%> (+0.13%) ⬆️
master/internal/config/config.go 73.57% <100.00%> (-0.22%) ⬇️
master/internal/core_experiment.go 61.50% <100.00%> (+0.34%) ⬆️
master/internal/db/postgres_tasks.go 68.86% <100.00%> (+0.11%) ⬆️
master/internal/trial.go 41.58% <100.00%> (+0.11%) ⬆️
master/pkg/model/task.go 15.26% <ø> (ø)
master/pkg/schemas/expconf/experiment_config.go 64.21% <ø> (ø)
master/pkg/schemas/zgen_schemas.go 1.24% <ø> (ø)
master/pkg/tasks/task.go 6.07% <ø> (ø)
... and 9 more

... and 5 files with indirect coverage changes

@erikwilson erikwilson force-pushed the erikw/feat/log-retention branch 2 times, most recently from fc8b731 to 78625da Compare January 16, 2024 21:47
@stoksc stoksc self-assigned this Jan 23, 2024
Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked offline but leaving this review still just incase i lost something

@@ -34,7 +34,7 @@ stages:
set -ex
mkdir -p /tmp/det
cat >/tmp/det/agent.yaml <<EOF
master_host: 127.0.0.1
master_host: host.docker.internal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to revert this when you land?

master/internal/api_master.go Outdated Show resolved Hide resolved
StartTime: time.Now().UTC(),
JobID: &jobID,
LogVersion: model.CurrentTaskLogVersion,
LogRetentionDays: taskSpec.LogRetentionDays,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you leave a really good comment on the field about how nulls work, etc?

i'm coming back to this review after a while and thought: what happens if the admin wants to retain most of the logs for a lot shorter time and sets the cluster wide retention time to a week from a month, and what does this field mean, then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to add some comments to this method:
https://github.com/determined-ai/determined/pull/8622/files#diff-0aff34e1ab2c0076df8608480773216fd6b78a64d46d3f485aee6eb7586659faR87-R89

Please let me know if you think it needs more clarification or comments.

master/internal/core.go Outdated Show resolved Hide resolved
r, err := db.sql.Exec(fmt.Sprintf(`
WITH log_retention_tasks AS (
SELECT task_id, end_time, COALESCE(log_retention_days, %d) AS current_retention_days FROM tasks
WHERE task_id IN (SELECT DISTINCT task_id FROM task_logs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a minor optimization to not try to delete from tasks without logs, or more than that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, it seems like this would help if we have a database with a very large number of tasks but fewer task logs, which is likely to be the situation once retention is enabled.

this might be a good candidate for scale testing & optimization tho.

master/internal/db/postgres_tasks.go Outdated Show resolved Hide resolved
master/internal/db/postgres_tasks.go Outdated Show resolved Hide resolved
master/internal/logretention/logretention.go Outdated Show resolved Hide resolved
master/pkg/model/logging_config.go Outdated Show resolved Hide resolved
@determined-ci determined-ci requested a review from a team January 30, 2024 04:16
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Jan 30, 2024
@erikwilson erikwilson marked this pull request as ready for review January 30, 2024 20:55
@erikwilson erikwilson requested review from a team as code owners January 30, 2024 20:55

.. code:: yaml

log_retention_days: 90
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It introduces a bit of complexity to have differing syntax for whether you're setting the default retention policy in the master or setting the retention policy for this one particular experiment. I think either syntax works for me, I just don't want to have to remember both and keep them straight.

Checking in, is this the desired choice? For comparison, log_policies has both the same syntax and same name in both master config and experiment config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names are somewhat unfortunate but due to collisions with log/logging naming in the configs this seems like the best compromise. The server can also take a schedule and default number of days for retention, while the experiment can only configure the number of days and not the schedule.

@hkumar92
Copy link
Contributor

@erikwilson this does not support the CLI or WebUI modifications of retention on individual experiments right? Is there a reason that wasn't done?

@erikwilson
Copy link
Contributor Author

@erikwilson this does not support the CLI or WebUI modifications of retention on individual experiments right? Is there a reason that wasn't done?

This does contain some of the CLI modifications, the rest of the changes can be added in another PR. For the most part we have tried to separate backend and WebUI PRs, as they have historically been done by different teams and developers.

@hkumar92
Copy link
Contributor

Thanks! One concern here is whether the CLI/WebUI/SDK can modify the retention periods for experiments once submitted? If not, I'd rather have a simpler interface where all you can configure for an experiment is whether it is subject to log deletion or not (and we can decide if we need to give users more control later).

@erikwilson erikwilson requested a review from hamidzr March 13, 2024 17:22
@salonig23 salonig23 force-pushed the erikw/feat/log-retention branch 7 times, most recently from 6451894 to 6cf216e Compare March 21, 2024 18:24
@erikwilson erikwilson requested a review from a team as a code owner March 21, 2024 18:24
@erikwilson erikwilson requested a review from jgongd March 21, 2024 18:24
@@ -119,6 +124,9 @@ def logs(args: Namespace) -> None:
help="number of lines to show, counting from the end "
"of the log (default is all)")
]),
Cmd("cleanup-logs", cleanup_logs, "cleanup expired task logs", [
Group(cli.output_format_args["json"], cli.output_format_args["yaml"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be necessary to have Group(cli.output_format_args["json"], cli.output_format_args["yaml"]) as cleanup_logs() prints one line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we relocate this command to task.py, since this is related to task?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the yaml/json options. But I think this should stay in master because that's what product wanted

Copy link
Contributor

@jgongd jgongd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@salonig23 salonig23 force-pushed the erikw/feat/log-retention branch 3 times, most recently from da7f1dc to 89c07fb Compare March 26, 2024 23:43
@salonig23 salonig23 merged commit 85bb3c8 into main Mar 28, 2024
90 of 101 checks passed
@salonig23 salonig23 deleted the erikw/feat/log-retention branch March 28, 2024 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants