Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: collect det task logs as artifacts for ci jobs #9459

Merged
merged 8 commits into from
Jun 12, 2024
Merged

Conversation

hamidzr
Copy link
Contributor

@hamidzr hamidzr commented May 31, 2024

Ticket

https://hpe-aiatscale.atlassian.net/browse/RM-237

Description

collect and save tasks (currently just first trial of an experiment's) logs as ci artifacts for e2e tests.
adds about 4s to each job.

a quick log size estimate as is: looks like each run the total of all the logs collected today is <10MB
we have about 1300 test-e2e workflow runs each 30 days that adds up to about 13GB of storage use that should auto clear as the artifacts age out

Test Plan

samples: https://app.circleci.com/pipelines/github/determined-ai/determined/56392/workflows/04457f70-98f8-407c-835f-e5bca81a481b/jobs/2630701/artifacts
sample with task logs https://app.circleci.com/pipelines/github/determined-ai/determined/56396/workflows/24e85025-e118-4dd6-b095-301aa6e15dee/jobs/2631032/artifacts
https://app.circleci.com/pipelines/github/determined-ai/determined/56392/workflows/04457f70-98f8-407c-835f-e5bca81a481b/jobs/2630703/artifacts
https://app.circleci.com/pipelines/github/determined-ai/determined/56392/workflows/04457f70-98f8-407c-835f-e5bca81a481b/jobs/2630711/artifacts

I recycled some old code I had to do this. there's probably a better way to authenticate now

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

@cla-bot cla-bot bot added the cla-signed label May 31, 2024
Copy link

netlify bot commented May 31, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 4b15b31
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6669bb2448c0350008852696

@hamidzr hamidzr changed the title test: collect det task logs for as artifacts for ci jobs test: collect det task logs as artifacts for ci jobs May 31, 2024
@hamidzr hamidzr force-pushed the hz-collect-ci-logs branch 2 times, most recently from 9d965c3 to 7a54a84 Compare June 5, 2024 17:59
Copy link

codecov bot commented Jun 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.97%. Comparing base (86e6b68) to head (4b15b31).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9459   +/-   ##
=======================================
  Coverage   48.97%   48.97%           
=======================================
  Files        1238     1238           
  Lines      160512   160512           
  Branches     2784     2784           
=======================================
+ Hits        78609    78611    +2     
+ Misses      81728    81726    -2     
  Partials      175      175           
Flag Coverage Δ
backend 43.82% <ø> (+<0.01%) ⬆️
harness 63.85% <ø> (ø)
web 44.13% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 4 files with indirect coverage changes

@hamidzr hamidzr marked this pull request as ready for review June 5, 2024 19:08
@hamidzr hamidzr requested a review from a team as a code owner June 5, 2024 19:08
Copy link
Contributor

@davidfluck-hpe davidfluck-hpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with some general observations/questions. I won't request any of these changes, but I want to make a note of them (in no particular order).

  1. I'm not sure which Determined library functions return errors or throw exceptions, but as it is, I don't see a lot of error handling. How likely is the happy path, and how informative will the resultant error messages be if they break CI?

  2. I see a lot of string interpolation/manipulation for paths. Consider Python's pathlib for a higher-level interface for working with OS paths.

  3. What is the expected log volume?

  4. Is there a way to use an existing Determined venv in CircleCI instead of installing fire and determined each time? The compute cycles add up, and I'd rather install a package once ahead of time instead of on every build.

@hamidzr
Copy link
Contributor Author

hamidzr commented Jun 12, 2024

Thanks, all good points

I'm not sure which Determined library functions return errors or throw exceptions, but as it is, I don't see a lot of error handling. How likely is the happy path, and how informative will the resultant error messages be if they break CI?

This one I'm not sure I'm understanding correctly.

log volume

two optimizations:

  • compress the output which is all txt but would take away the browser usability
  • only upload on failures

I added a bit to the cmd to make a report on the size. They only stay around for 30 days afaik

Is there a way to use an existing Determined venv in CircleCI instead of installing fire and determined each time? The compute cycles add up, and I'd rather install a package once ahead of time instead of on every build.

probably? but also that adds a bit of complexity and reduces the portability of this command. I think it already runs as part of the venv that the parent job sets up and if the pkg is found, it should skip it eg here it doesn't download determined https://app.circleci.com/pipelines/github/determined-ai/determined/56582/workflows/f1305a3a-cd98-4741-a914-6b621ccd0448/jobs/2645863/parallel-runs/0/steps/0-128

@hamidzr
Copy link
Contributor Author

hamidzr commented Jun 12, 2024

a quick log size estimate as is: looks like each run the total of all the logs collected today is <10MB
we have about 1300 test-e2e workflow runs each 30 days that adds up to about 13GB of storage use that should auto clear as the artifacts age out

edit: also addressed this

I see a lot of string interpolation/manipulation for paths. Consider Python's pathlib for a higher-level interface for working with OS paths.

@hamidzr hamidzr assigned hamidzr and unassigned davidfluck-hpe Jun 12, 2024
@davidfluck-hpe
Copy link
Contributor

This one I'm not sure I'm understanding correctly.

How likely are those Determined library functions to return errors/raise exceptions? Is it possible for any of them to fail and raise an unhandled exception? I don't see much error handling, so I'm wondering if those Determined library functions are always going to succeed, or if there's a possibility that we're not handling errors. For example: resp = b.get_GetExperiments(self.session, archived=False, states=states). Can get_GetExperiments() fail? If so, how do we handle it?

@hamidzr
Copy link
Contributor Author

hamidzr commented Jun 12, 2024

This one I'm not sure I'm understanding correctly.

How likely are those Determined library functions to return errors/raise exceptions? Is it possible for any of them to fail and raise an unhandled exception? I don't see much error handling, so I'm wondering if those Determined library functions are always going to succeed, or if there's a possibility that we're not handling errors. For example: resp = b.get_GetExperiments(self.session, archived=False, states=states). Can get_GetExperiments() fail? If so, how do we handle it?

thanks for explaining, I get it now : ) outside of unexpected bugs that I might be adding here the usual network failures should be covered by our default retry logic that all the API calls via our Session gets

# Default retry logic
DEFAULT_MAX_RETRIES = urllib3.util.retry.Retry(
total=5,
backoff_factor=0.5, # {backoff factor} * (2 ** ({number of total retries} - 1))
status_forcelist=[502, 503, 504], # Bad Gateway, Service Unavailable, Gateway Timeout
)

any other class of errors you're thinking of?

@hamidzr hamidzr merged commit d0d30cf into main Jun 12, 2024
87 of 100 checks passed
@hamidzr hamidzr deleted the hz-collect-ci-logs branch June 12, 2024 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants