-
Notifications
You must be signed in to change notification settings - Fork 464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MCO-811: MachineConfigNode introduction for MCO State Reporting #1490
MCO-811: MachineConfigNode introduction for MCO State Reporting #1490
Conversation
@cdoern: This pull request references MCO-811 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Skipping CI for Draft Pull Request. |
4a1237e
to
df1e756
Compare
@cdoern: This pull request references MCO-811 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
df1e756
to
1facaff
Compare
@cdoern: This pull request references MCO-811 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@cdoern: This pull request references MCO-811 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1facaff
to
0ef72be
Compare
@cdoern: This pull request references MCO-811 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
A few comments on the status so far. I will be moving the MachineConfiguration operator/v1 reporting to another enhancement since that is quite separate from machineconfignode. @JoelSpeed I think we have differing views on how the I can consider changing the message to something like |
0ef72be
to
8ab3b20
Compare
@cdoern: This pull request references MCO-811 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@cdoern: This pull request references MCO-811 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
||
|
||
```console | ||
$ oc get machineconfignodes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think when an upgrade gets triggered, all the nodes need to be updated to latest config. Hence, reporting UPDATED
as false
for all nodes would be more aligned. Once a node has been updated, they get set to true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, maybe instead of updating
, what we really want is idle
? Then the user can still gauge updated status via MCP and use this to see what nodes are in action
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to Idle, that seems like it could be helpful. What would trigger Idle to go into false?
I had expected, kind of as @sinnykumari mentions
Name. Updated. Updating
my-node-1. False. True
my-node-2. False. False
Then the conditions when you expand it would say on my-node-2
, not yet queued for update or something to explain why they aren't updating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make a card to track an idle state, though that will cause unnecessary API calls that might have a "blip" effect where Updated only is true for a fraction of a second. Might instead want to switch updated to idle
f61c40f
to
f933511
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdoern Please open the diff view on the PR and make sure to respond to/resolve any open conversations, I can see several that have not had any response, or where the feedback has not made its way to the document itself.
As far as I understand it, the EP is up to date now with the state of the TP feature in 4.15. Before we merge, I'd like any outstanding points or feedback that you intend to fix before GA to be listed somewhere, I've suggested a couple in the EP, that could be a good place for it.
I think once we are tracking a list of TODOs and the existing feedback has been actioned or added to the TODOs, then we can merge this and start looking at the improvements towards GA
Implementation wise, there has been discussion of who should own these objects: the MCD or a new Controller called the MachineStateController. While either approach would work, there are positive and negative impacts to each. | ||
|
||
|
||
## MachineConfigDaemon Approach |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which approach did we end up with? Which controller is owning/updating the MCN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Daemon is. I'll add this in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the approach we are going with, it should probably be in the proposal section rather than alternatives, and the original proposal should be written down as an alternative
f933511
to
8e162e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we have cards for the follow up, and link those in the TP -> GA, I think we can merge this. When the work to action the TP -> GA goes through, it would be good to get the EP updated as well
Implementation wise, there has been discussion of who should own these objects: the MCD or a new Controller called the MachineStateController. While either approach would work, there are positive and negative impacts to each. | ||
|
||
|
||
## MachineConfigDaemon Approach |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the approach we are going with, it should probably be in the proposal section rather than alternatives, and the original proposal should be written down as an alternative
The states are mostly in the past tense. This is because processes like `Drained` are defined by completion primarily not the progress. So a user will have `UpdateExecuted` == Unknown and `Drained` == Unkown until the Drain actually completes. | ||
However, the unknown phase will be accompanied by a message for how the drain is currently going or if the drain has gone wrong. | ||
|
||
The condition transitions back to False once the update process is completed. Once Updated == true is the case, all previous states get set to false and their reasons/messages show the fact that this was the message from the previous update cycle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it would be good to make sure at the beginning of an update, everything is cleared down, and then marked complete as you go through each phase.
Might need to consider what happens when some updates don't pass through all phases though
8e162e6
to
b7d3e29
Compare
#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template. |
/override markdownlint |
/override ci/prow/markdownlint |
@cdoern: cdoern unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@cdoern: cdoern unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/override ci/prow/markdownlint @cdoern when it comes time to override again after it's approved, your team lead or staff engineer can override the job for you. |
@dhellmann: Overrode contexts on behalf of dhellmann: ci/prow/markdownlint In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with the content of the proposal. It reflects my understanding of the current implementation, and while there is some outstanding feedback, particularly about the UX of the feature, we now have cards in Jira to track those.
There's just one dissonance at the moment, the Alternatives section talks about two ways to implement the mechanics of the proposal, I would have expected one of these to be in the proposal section so it's clear which was chosen and implemented.
If you want to fix that and then poke @sinnykumari adding the final acks to make this merge, I think it's time
b7d3e29
to
cbc7f9a
Compare
also @sinnykumari you need to override lint using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work Charlie and everyone for the review. This has turned out great!
Before adding approval label, have one clarifying question.
Today, cluster admin looks at MCP status to see overall state. With MCN in place, there is chance that admin would be confused that why both differs. For situation like power outage, manual node cordoning, SRIOV operator performing some actions. Since MCN is more tied to MCO operation compared to MCP showing overall health of nodes in the cluster, we may want to call this out explicitly in the enhancement perhaps in Summary or any better place?
Introduce an enhancement discussing the MachineConfigNode introduction Signed-off-by: Charlie Doern <cdoern@redhat.com>
@sinnykumari I just added a statement to that effect. Though, I will push back a bit since in an outage, finding out why your nodes failed to update is equally as useful as checking the MCP status. Sure, there are helpful internally looking at MCO procedures but at the end of the day they are a way to track node update status and since the MCO owns the update code, it just so happens that a lot of these actions are tied to the MCO. |
cbc7f9a
to
7f02b07
Compare
Thanks for updating with needed changes. |
Since there are no outstanding comment, adding lgtm as well for the merge. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sinnykumari, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/markdownlint In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@cdoern: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This is an enhancement discussing the MachineConfigNode introduction for increased MCO state reporting.
See implementation here:
openshift/machine-config-operator#4012
openshift/api#1596
openshift/api#1621