Store error taxonomy data in the analytics database instead of OpenSearch #698

Draft
wants to merge 6 commits into main
Conversation

@mvandenburgh (Member) commented Nov 10, 2023

@jjnesbitt I tried to describe my approach in the commit messages. Let me know if you have any suggestions about the design I chose for adding the error processor; I'm not 100% satisfied with it, but it was the best solution I could come up with given that we need everything to be packaged in Docker images.

I also tried to structure this with an eye towards unifying the two "job processor" fastapi servers in the future.
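For illustration only, a minimal sketch of what a unified service could look like: one FastAPI app exposing both webhook handlers. The endpoint paths and handler names are assumptions, not the current code of either server.

```python
# Hypothetical sketch of a single FastAPI app serving both "job processor"
# webhooks; paths and names are assumptions, not the existing services.
from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/build-timings")
async def build_timings_webhook(request: Request) -> dict:
    # Payload would be handed to the scheduled ci-analytics Job via JOB_INPUT_DATA.
    job_input_data = await request.json()
    return {"status": "scheduled"}


@app.post("/error-taxonomy")
async def error_taxonomy_webhook(request: Request) -> dict:
    # Same image, different invocation: classify the failure and store it.
    job_input_data = await request.json()
    return {"status": "scheduled"}
```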

Since we'd also like to use this Django app for error taxonomy upload (and potentially any other analytics-related things beyond build timings in the future), I refactored this Docker image by
- Renaming it to something more generic (`ci-analytics`)
- Removing the invocation of the build timing upload script from the Dockerfile and instead specifying it in the job template (sketched below)

I kept the old one there for now instead of deleting it; once we confirm that the new analytics-DB-based workflow works, I'll remove it. Both images' job-template.yaml files were updated, so we need to bump those as well.
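For illustration, a rough sketch of the shape the new, generic job template might take, written here as the equivalent Python dict rather than YAML; the image name, module path, and metadata below are assumptions, not the actual contents of `job-template.yaml`:

```python
# Illustrative only: the real job-template.yaml lives in the repo; the names,
# image tag, and module path here are assumptions for the sketch.
build_timing_job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"generateName": "upload-build-timings-"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "ci-analytics",
                        # Generic image; no script baked into the Dockerfile entrypoint.
                        "image": "ci-analytics:latest",
                        # The invocation now lives in the template itself:
                        "command": ["python", "-m", "upload_build_timings"],
                    }
                ],
            }
        }
    },
}
```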
jjnesbitt previously approved these changes Nov 13, 2023

@jjnesbitt (Collaborator) left a comment:


Overall looks good, just a couple questions/comments.

I agree that we should unify our two webhooks. I'd like to do that sooner rather than later, maybe even before #697 is merged.

    # Pass the webhook payload to the job's containers via an environment variable.
    for container in job_template["spec"]["template"]["spec"]["containers"]:
        container.setdefault("env", []).extend(
            [dict(name="JOB_INPUT_DATA", value=json.dumps(job_input_data))]
        )

    for template in ("job-template.yaml", "job-template-old.yaml"):
        ...
@jjnesbitt (Collaborator), on the `for template` line:

I think we should just switch over to the new job template. If it fails, we'll see it either in Sentry or just by monitoring the cluster. The worst case is that we lose the error taxonomy for a few jobs, which isn't really a problem. And if we ever need to remediate something immediately (not sure what that would look like), we can always revert this PR.
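For illustration, a minimal sketch of that suggestion, assuming a Kubernetes Python client and the template file names discussed above; the function name and namespace are placeholders rather than the repository's actual webhook code:

```python
import json

import yaml
from kubernetes import client, config


def submit_analytics_job(job_input_data: dict) -> None:
    """Submit only the new ci-analytics job template (old template dropped)."""
    config.load_incluster_config()
    batch = client.BatchV1Api()

    # Only "job-template.yaml"; if something breaks, reverting the PR
    # restores the old OpenSearch-based job.
    with open("job-template.yaml") as f:
        job_template = yaml.safe_load(f)

    for container in job_template["spec"]["template"]["spec"]["containers"]:
        container.setdefault("env", []).extend(
            [dict(name="JOB_INPUT_DATA", value=json.dumps(job_input_data))]
        )

    # Namespace is a placeholder, not the cluster's actual namespace.
    batch.create_namespaced_job(namespace="pipeline", body=job_template)
```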

@mvandenburgh (Member, Author) replied:

I don't have a grafana dashboard ready for the new one yet, since I wanted to wait until we have some data in the database to make one. If we remove the opensearch job now, we'll lose access to the error taxonomy dashboard until I get the new one working.

@jjnesbitt dismissed their stale review November 13, 2023 17:19

Didn't mean to hit "Approve"

Failed jobs don't have a `Job` record saved.
@zackgalbreath (Collaborator) commented:
is this PR obsolete now that #749 has been merged?

@jjnesbitt (Collaborator) replied:

> is this PR obsolete now that #749 has been merged?

That PR consolidates the app that handles creating the OpenSearch record into the Django app, but the data still goes to OpenSearch. The goal of this PR is to store the error taxonomy classification itself in the analytics database, removing the need to push that data to OpenSearch at all.
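For illustration, a minimal sketch of that direction as a Django model in the analytics app; the model and field names are guesses, not the schema in this PR:

```python
from django.db import models


class ErrorTaxonomy(models.Model):
    """Error taxonomy classification stored alongside the other analytics data."""

    job_id = models.PositiveBigIntegerField(unique=True)
    attempt_number = models.PositiveSmallIntegerField()
    retried = models.BooleanField(default=False)
    # The taxonomy label assigned to the failed job, e.g. "spack_error" (illustrative).
    error_taxonomy = models.CharField(max_length=64)
    error_taxonomy_version = models.CharField(max_length=32)
    created = models.DateTimeField(auto_now_add=True)
```

A table along these lines would also let the new Grafana dashboard query the analytics database directly instead of OpenSearch.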
