Make kebechet respond to release tickets on all failures #629

Open
tumido opened this issue Dec 1, 2020 · 23 comments
Labels
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
hacktoberfest: Issues targeting the hacktoberfest participants.
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
priority/backlog: Higher priority than priority/awaiting-more-evidence.
sig/user-experience: Issues or PRs related to the User Experience of our Services, Tools, and Libraries.

Comments

@tumido
Member

tumido commented Dec 1, 2020

Is your feature request related to a problem? Please describe.
Releasing via Kebechet is very convenient and straightforward when it works. When it doesn't, and the failure is not related to permissions (the user is not a maintainer and such), it's really hard to triage.

Describe the solution you'd like
Get an error message if any of the build/release steps fails.

Describe alternatives you've considered
n/a

Additional context
aicoe-aiops/categorical-encoding#16

@saisankargochhayat
Contributor

We already mention in the release issue when the person trying to create the release is not a maintainer.
For example: thoth-station/storages#2109

@goern
Member

goern commented Dec 3, 2020

@tumido are you good with this behavior? can we close this issue?

@tumido
Member Author

tumido commented Dec 3, 2020

@goern please don't close, I don't think we understand each other here 🙂

@saisankargochhayat yeah, that's true and that's precisely why I've excluded those cases in the description, see:

... and it's not related to permissions (user is not a maintainer and such)

In our case the issue was hard to triage because Kebechet failed to push the tag: the release had already been made outside of Kebechet via git tag, so the tag already existed, while the version in version.py was outdated (it didn't match the tag). See:
aicoe-aiops/categorical-encoding#15
aicoe-aiops/categorical-encoding#16
aicoe-aiops/categorical-encoding#17
And finally solved here:
aicoe-aiops/categorical-encoding#19

As you can see, we were very much in the dark about what was happening, and Kebechet didn't tell us why it failed. This ticket is about failures like this in general, not about this exact failure type. My ask here is whether we can make Kebechet report the status every time, in any failure case.

@saisankargochhayat
Contributor

From what I understand, in a scenario like this a comment on the release issue is what we want -
aicoe-aiops/categorical-encoding#16 (comment)
Is that correct?

@tumido
Member Author

tumido commented Dec 3, 2020

Well, I don't think we should pay attention to this particular cause. It's not about fancy reporting for one specific, narrow reason of failure. This should be about plain old-school reporting for any kind of failure.

If the bot can provide any insight into what happened, it will have saved us from filing 3 more triage trial issues. A link to the job run, the failed steps, the log of the step, whatever: that's what I'd like to see, the "debug" data.
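
To make the request above concrete, here is a minimal sketch of what such a "debug" comment could contain. It is only an illustration: `FailureReport` and `render_failure_comment` are hypothetical names, not part of Kebechet, and the field set (failed step, error, job run link, log tail) simply mirrors the items listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureReport:
    """Debug data a bot could attach to a release issue (hypothetical structure)."""

    step: str                       # name of the step that failed
    error: str                      # short error message
    run_url: Optional[str] = None   # link to the job run, if known
    log_tail: Optional[str] = None  # raw log of the failed step, if available


def render_failure_comment(report: FailureReport, max_log_lines: int = 20) -> str:
    """Build a Markdown comment body describing the failure."""
    fence = "`" * 3  # Markdown code fence, built here to keep this example readable
    lines = [
        "Release failed.",
        "",
        f"- Failed step: `{report.step}`",
        f"- Error: {report.error}",
    ]
    if report.run_url:
        lines.append(f"- Job run: {report.run_url}")
    if report.log_tail:
        # Keep only the last few log lines so the comment stays short.
        tail = report.log_tail.splitlines()[-max_log_lines:]
        lines += ["", "Log (last lines):", "", fence, *tail, fence]
    return "\n".join(lines)
```

The returned string could then be posted on the release issue by whatever client the bot already uses.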

@saisankargochhayat
Contributor

saisankargochhayat commented Dec 3, 2020

So as a general principle, for any exception we encounter we do post an issue comment notifying the user. This seems like a corner case, where the release was created manually instead of through Kebechet, which messed things up. But feel free to let me know if you find anywhere else where reporting the error could be helpful to the user. Source code link: https://github.com/thoth-station/kebechet/blob/master/kebechet/managers/version/version.py

Maybe it's a good idea to state in the version manager's documentation that whenever you release manually, you should ensure that the source code version string and the release tag both indicate the same version.
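
The consistency rule above (source version string and release tag must agree) could also be checked before a release is attempted. The sketch below is an assumption about how such a check might look, not Kebechet's actual logic; `normalize_tag` and `preflight_problems` are hypothetical helpers, and the list of existing tags is passed in rather than read from git.

```python
from typing import Iterable, List


def normalize_tag(tag: str) -> str:
    """Strip a leading 'v' so 'v1.2.3' and '1.2.3' compare equal."""
    return tag[1:] if tag.startswith("v") else tag


def preflight_problems(
    source_version: str, next_version: str, existing_tags: Iterable[str]
) -> List[str]:
    """Return human-readable problems that would make the release fail."""
    tags = {normalize_tag(t) for t in existing_tags}
    problems = []
    # The tag we are about to push must not exist yet (e.g. from a manual release).
    if normalize_tag(next_version) in tags:
        problems.append(f"tag for {next_version} already exists")
    # If the repository has tags, the current source version should be one of them.
    if tags and normalize_tag(source_version) not in tags:
        problems.append(f"source version {source_version} has no matching tag")
    return problems
```

For the case described earlier in this thread, `preflight_problems("0.1.0", "0.1.1", ["v0.1.0", "v0.1.1"])` reports that the tag for 0.1.1 already exists, which is the kind of message that could be posted on the release issue.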

@goern goern added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 7, 2021
@tumido
Member Author

tumido commented Jan 12, 2021

I think a more generic failure handling would be appreciated on the user side.

Right now I'm debugging another issue, with a different repo where the release process failed silently (the release PR has been opened, the git tag was pushed, yet no image was delivered to quay). There's no message from any of the bots on any of the issues (sesheta even closed the release issue as a success). See for yourself:

aicoe-aiops/mailing-list-analysis-toolkit#24

https://quay.io/repository/aicoe/mailing-list-analysis-toolkit?tab=tags

I wasn't able to locate the Tekton pipeline responsible for that release, so I've triggered the "Deliver container image" issue pipeline for the missing image:
aicoe-aiops/mailing-list-analysis-toolkit#27

The build failed on some networking error (now I know that it was a networking issue, since I was able to locate the Tekton job):
https://tekton-dashboard-openshift-pipelines.apps.ocp4.prod.psi.redhat.com/#/namespaces/aicoe-infra-prod/pipelineruns/aicoe-issue-qdx7b

Yet the bots are still silent on the issues. This is not about a single corner case. This is about a generic "safety measure", e.g. if I hit any error, I report it. Can we make AICoE-CI do that, please?

cc @harshad16

@harshad16
Member

@tumido thanks for pointing it out. On the aicoe-ci side, we are trying to get this message to the user either on the PR or on the opened issue. Some changes are needed to get this to a state where it is more convenient for the user to get more information. We will try to get these details to the user.

On the topic of Kebechet, a useful feature would be responding on the issue with the reason it is stale. The reason is that the pod running the Kebechet run has failed, but because it failed, no message is relayed all the way to the GitHub issue. We should plan on managing this, either by reporting the error traceback to the GitHub issue or PR (for that we would have to monitor the exceptions), or via a sidecar container which responds on the GitHub issue with the log of the failed main container.
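
The first option above (catching exceptions and relaying the traceback) could look roughly like the sketch below. This is an illustration under assumptions, not Kebechet code: `run_and_report` is a hypothetical wrapper, and `post_comment` stands in for whatever function actually posts to the GitHub issue or PR.

```python
import traceback
from typing import Callable


def run_and_report(task: Callable[[], None], post_comment: Callable[[str], None]) -> bool:
    """Run task; on any exception, post the traceback via post_comment.

    Returns True if the task succeeded and False if a failure was reported.
    """
    try:
        task()
    except Exception:
        fence = "`" * 3  # Markdown code fence around the traceback
        body = (
            "The run failed with an unhandled exception:\n\n"
            f"{fence}\n{traceback.format_exc()}{fence}"
        )
        post_comment(body)
        return False
    return True
```

A wrapper like this only helps while the process is still alive to report. If the pod itself is killed, nothing inside it can post, which is the situation the sidecar idea is aimed at.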

@tumido
Member Author

tumido commented Jan 12, 2021

@harshad16 I know it's hard to catch every possibility and I know aicoe-ci is doing its best and I'm totally rooting for you! Yet we're still pushing the limits and demands and opening new issues... 😄

The sidecar container or something sounds like a wonderful idea (it also sounds like a lot of work)! Looking forward to the bright future 👍

@sesheta
Member

sesheta commented Apr 29, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2021
@goern
Member

goern commented Apr 30, 2021

/remove-lifecycle stale

@harshad16 what is the status of this?

@sesheta sesheta removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2021
@tumido
Member Author

tumido commented Apr 30, 2021

So... since I always wanted to learn the CI ropes, I've experimented with this when building my own CI for the OperateFirst Slack bot...

I think a comment like this from the bots would be enough:
tumido/slack-first#54 (comment)

I'm updating the same comment in various stages of the CI with the most recent actions taken. It helps me understand which workflow and at which step it got stuck.

If it would be possible to have something like this for AICoE-CI, I think it would be a huge jump forward in usability.
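
The single, repeatedly updated status comment described above could be rendered along these lines. This is a rough sketch with hypothetical names (`STATUS_MARKS`, `render_status_comment`); the CI would re-render the body after each step and edit the same comment instead of posting a new one.

```python
from typing import Dict, List

# Text markers for each state; any step without a recorded state counts as pending.
STATUS_MARKS = {"pending": "[ ]", "running": "[~]", "success": "[x]", "failed": "[!]"}


def render_status_comment(workflow: str, steps: List[str], status: Dict[str, str]) -> str:
    """Render one comment body listing every step of a workflow with its state."""
    lines = [f"Workflow `{workflow}`", ""]
    for step in steps:
        state = status.get(step, "pending")
        lines.append(f"- {STATUS_MARKS[state]} {step}: {state}")
    return "\n".join(lines)
```

Because all steps are listed up front, a reader can see at a glance which workflow ran and at which step it got stuck.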

@goern
Member

goern commented May 1, 2021

I'm all in for more chatops, as long as we keep it accessible to us Red Hats using Google Chat ;)

Shall we send out events from the CI to a Kafka topic and have different consumers send messages to Slack or gchat?

@goern
Member

goern commented Jun 10, 2021

/priority backlog

@sesheta sesheta added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jun 10, 2021
@sesheta
Member

sesheta commented Jul 15, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 15, 2021
@goern
Member

goern commented Jul 16, 2021

/remove-lifecycle rotten
/help
/good-first-issue

@sesheta
Member

sesheta commented Jul 16, 2021

@goern:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/remove-lifecycle rotten
/help
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 16, 2021
@sesheta
Member

sesheta commented Aug 24, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 24, 2021
@goern
Member

goern commented Sep 15, 2021

/remove-lifecycle rotten

@sesheta sesheta removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 15, 2021
@goern goern added the hacktoberfest Issues targeting the hacktoberfest participants. label Sep 15, 2021
@goern
Member

goern commented Oct 27, 2021

/assign goern

@goern
Member

goern commented Oct 27, 2021

/sig user-experience

@sesheta sesheta added the sig/user-experience Issues or PRs related to the User Experience of our Services, Tools, and Libraries. label Oct 27, 2021
@sesheta
Member

sesheta commented Jan 25, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2022
@harshad16
Member

/lifecycle frozen

@sesheta sesheta added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 25, 2022
@goern goern removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 26, 2023
@goern goern removed their assignment Jan 26, 2023
5 participants