Improve logging when operator failed before generating artifact #1252

Merged
merged 5 commits into main from improve_logging
Apr 28, 2023

Conversation

@likawind likawind commented Apr 27, 2023

Describe your changes and why you are making these changes

This PR improves logging when an operator fails before generating artifact metadata. The current behavior is:

  • The operator fails without generating artifact metadata.
  • We attempt to upload the artifact metadata.
  • We fail to read that metadata, so most error logs end up showing a missing-storage-object failure instead of the operator's actual failure.

To improve this, we modify the logging with minimal changes to the current life-cycle management behavior:

  • If the artifact metadata cannot be read, we log the error rather than failing the function call. We expect orchestration to handle downstream execution based on the operator's state.
    • When reading the metadata, we assume that an empty metadata path plus an empty metadata field implies empty metadata, rather than an error state.
  • In the orchestrator, we log the operator's failure reason together with the operator's ID when workflow execution stops due to operator failures.
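
As a rough illustration of the last bullet, the orchestrator-side log line could be assembled along the lines of the sketch below. The types `ExecError`, `ExecState`, and `Operator`, the function `failureSummary`, and the message wording are all hypothetical stand-ins for illustration; they are not taken from the actual codebase.

```go
package main

import (
	"fmt"
	"log"
)

// ExecError, ExecState, and Operator are hypothetical, simplified types
// used only for this sketch.
type ExecError struct {
	Tip     string
	Context string
}

type ExecState struct {
	Status string
	Error  *ExecError // nil when no failure details were recorded
}

type Operator struct {
	ID    string
	Name  string
	State ExecState
}

// failureSummary builds one log line that carries both the operator's ID
// and its failure reason, so the real cause is visible where the workflow stops.
func failureSummary(op Operator) string {
	reason := "no failure reason recorded"
	if op.State.Error != nil {
		reason = fmt.Sprintf("tip: %s; context: %s", op.State.Error.Tip, op.State.Error.Context)
	}
	return fmt.Sprintf("workflow execution stopped: operator %s (%s) failed: %s", op.Name, op.ID, reason)
}

func main() {
	op := Operator{
		ID:   "op-1",
		Name: "extract",
		State: ExecState{
			Status: "failed",
			Error:  &ExecError{Tip: "check the operator code", Context: "traceback goes here"},
		},
	}
	log.Println(failureSummary(op))
}
```

Guarding on `op.State.Error != nil` keeps the summary safe to build even when an operator is marked failed without any recorded error details.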

Related issue number (if any)

ENG-2815

Testing

We do not expect this to change any life-cycle management behavior. I'd consider it a success as long as all integration tests pass.

Checklist before requesting a review

  • I have created a descriptive PR title. The PR title should complete the sentence "This PR...".
  • I have performed a self-review of my code.
  • I have included a small demo of the changes. For the UI, this would be a screenshot or a Loom video.
  • If this is a new feature, I have added unit tests and integration tests.
  • I have run the integration tests locally and they are passing.
  • I have run the linter script locally (See python3 scripts/run_linters.py -h for usage).
  • All features on the UI continue to work correctly.
  • Added one of the following CI labels:
    • run_integration_test: Runs integration tests
    • skip_integration_test: Skips integration tests (Should be used when changes are ONLY documentation/UI)

@likawind likawind added the run_integration_test Triggers integration tests label Apr 27, 2023
@@ -222,11 +221,18 @@ func (a *ArtifactImpl) updateArtifactResultAfterComputation(
)
if err != nil {
log.Errorf("Unable to read artifact result metadata from storage and unmarshal: %v", err)
Contributor

Curious: is this error ever expected? If it's a truly fatal error, we should probably also add a Tip to the execution state.

Contributor

And make it a system error or something.

Contributor Author

This issue is often triggered by the upstream op failing without creating the metadata file. I don't really want to update the exec status here, since in most cases the artifact should be canceled and the caller decides. The main goal is to stop this from returning an err object to the caller, which often gives us a false signal that masks the real failure (the op failure). So I think logging it in the context is fine for now.

Contributor Author

I can add a comment here
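
The intent described in this thread (log the read failure instead of returning an err object to the caller) might look roughly like the sketch below. `Metadata`, `readMetadata`, `updateAfterComputation`, and the `read` callback are hypothetical stand-ins rather than the actual `ArtifactImpl` code, and the empty-path shortcut is a simplification of the "empty path plus empty field" rule from the PR description.

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
)

// Metadata is a hypothetical, simplified artifact metadata shape.
type Metadata struct {
	Schema []string `json:"schema,omitempty"`
}

// readMetadata treats an empty path as "no metadata was produced" and
// returns empty metadata instead of an error. The read callback stands in
// for whatever storage client is actually used (an assumption).
func readMetadata(path string, read func(string) ([]byte, error)) (Metadata, error) {
	if path == "" {
		return Metadata{}, nil
	}
	data, err := read(path)
	if err != nil {
		return Metadata{}, err
	}
	var md Metadata
	if err := json.Unmarshal(data, &md); err != nil {
		return Metadata{}, err
	}
	return md, nil
}

// updateAfterComputation logs a read failure rather than propagating it,
// so the caller does not mistake it for the operator's real failure.
// The boolean reports whether metadata was successfully obtained.
func updateAfterComputation(path string, read func(string) ([]byte, error)) (Metadata, bool) {
	md, err := readMetadata(path, read)
	if err != nil {
		log.Printf("Unable to read artifact result metadata from storage and unmarshal: %v", err)
		return Metadata{}, false
	}
	return md, true
}

func main() {
	// Simulates the upstream operator failing before writing the metadata file.
	missing := func(string) ([]byte, error) {
		return nil, errors.New("storage object does not exist")
	}
	if _, ok := updateAfterComputation("some/metadata/path", missing); !ok {
		log.Println("metadata unavailable; leaving the artifact state to the caller")
	}
}
```

Returning a boolean instead of an error is one possible way to keep the read failure out of the caller's error path while still letting it decide how to treat the artifact.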

return
execState.Error.Context = fmt.Sprintf(
"%s\nError reading metadata: %v",
execState.Error.Context,
Contributor

Is execState.Error != nil in this case? I worry this will panic if there wasn't an issue with the computation but something bad happened when we ReadFromStorage.

Contributor Author

Yeah, that's a good catch. We should check the pointer before assigning values.
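
A minimal sketch of the nil check discussed above, assuming `execState.Error` is a pointer to a struct with a `Context` string field. The types, the helper `appendMetadataError`, and the choice to only log (and leave the state untouched) when no error is recorded are assumptions for illustration, not the merged code.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ExecError and ExecState are hypothetical, simplified types.
type ExecError struct {
	Tip     string
	Context string
}

type ExecState struct {
	Error *ExecError // nil when the computation itself reported no error
}

// errMissingObject is an example read error used by this sketch.
var errMissingObject = errors.New("storage object does not exist")

// appendMetadataError appends the read error to the existing error context.
// It checks the pointer first, so it cannot panic when the computation
// succeeded but reading from storage failed. It reports whether the context
// was updated.
func appendMetadataError(state *ExecState, readErr error) bool {
	if state == nil || state.Error == nil {
		// Assumed behavior: log only and leave the exec state unchanged.
		log.Printf("Error reading metadata: %v", readErr)
		return false
	}
	state.Error.Context = fmt.Sprintf(
		"%s\nError reading metadata: %v",
		state.Error.Context,
		readErr,
	)
	return true
}

func main() {
	failed := &ExecState{Error: &ExecError{Context: "operator failed"}}
	appendMetadataError(failed, errMissingObject)
	fmt.Println(failed.Error.Context)

	succeeded := &ExecState{}
	appendMetadataError(succeeded, errMissingObject) // logs, does not panic
}
```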

@likawind likawind requested a review from kenxu95 April 28, 2023 20:40
@kenxu95 kenxu95 left a comment

Nice, very clean!

@likawind likawind merged commit 09d55b6 into main Apr 28, 2023
@vsreekanti vsreekanti deleted the improve_logging branch May 9, 2023 15:45
Labels
run_integration_test Triggers integration tests

2 participants