Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCO-811: MachineConfigNode introduction for MCO State Reporting #1490

Merged
merged 1 commit into from
Feb 22, 2024

Conversation

cdoern
Copy link
Contributor

@cdoern cdoern commented Oct 6, 2023

This is an enhancement discussing the MachineConfigNode introduction for increased MCO state reporting.

See implementation here:
openshift/machine-config-operator#4012
openshift/api#1596
openshift/api#1621

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 6, 2023

@cdoern: This pull request references MCO-811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

Introduce an enhancement discussing the MachineConfigState introduction as well as increased State reporting in the MachineConfigutationStatus operator object.

See implementation here:
openshift/machine-config-operator#3942
openshift/api#1596

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 6, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 6, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 6, 2023
@cdoern cdoern force-pushed the machineconfigstate branch 2 times, most recently from 4a1237e to df1e756 Compare October 9, 2023 20:27
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 10, 2023

@cdoern: This pull request references MCO-811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This is an enhancement discussing the MachineConfigState introduction as well as increased State reporting in the MachineConfigutationStatus operator object.

See implementation here:
openshift/machine-config-operator#3942
openshift/api#1596

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 12, 2023

@cdoern: This pull request references MCO-811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This is an enhancement discussing the MachineConfigState introduction as well as increased State reporting in the MachineConfigutationStatus operator object.

See implementation here:
openshift/machine-config-operator#3970
openshift/api#1596

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 12, 2023

@cdoern: This pull request references MCO-811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This is an enhancement discussing the MachineConfigState introduction as well as increased State reporting in the MachineConfigutationStatus operator object.

See implementation here:
openshift/machine-config-operator#3970
openshift/api#1596
openshift/api#1621

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cdoern cdoern marked this pull request as ready for review October 13, 2023 00:12
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 13, 2023
@cdoern cdoern changed the title MCO-811: [WIP] MCO State Reporting MCO-811: MCO State Reporting Oct 13, 2023
@openshift-ci openshift-ci bot requested review from jan--f and knobunc October 13, 2023 00:12
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 17, 2023

@cdoern: This pull request references MCO-811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This is an enhancement discussing the MachineConfigNode introduction as well as increased State reporting in the MachineConfigutationStatus operator object.

See implementation here:
openshift/machine-config-operator#3970
openshift/api#1596
openshift/api#1621

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
enhancements/machine-config/mco-state-reporting.md Outdated Show resolved Hide resolved
@cdoern
Copy link
Contributor Author

cdoern commented Oct 24, 2023

A few comments on the status so far. I will be moving the MachineConfiguration operator/v1 reporting to another enhancement since that is quite separate from machineconfignode.

@JoelSpeed I think we have differing views on how the False Conditions should look. I was thinking of keeping the conditions as is and just setting the status to false, that is how the MCP seems to do it.

I can consider changing the message to something like Completed Drain. Desired Drainer was: but I would like to keep the Reason since otherwise I would need to create 2x the amount of StateProgress's to track the negative states as well.

@cdoern cdoern changed the title MCO-811: MCO State Reporting MCO-811: MachineConfigNode introduction for MCO State Reporting Oct 24, 2023
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 24, 2023

@cdoern: This pull request references MCO-811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This is an enhancement discussing the MachineConfigNode introduction for increases MCO state reporting.

See implementation here:
openshift/machine-config-operator#3970
openshift/api#1596
openshift/api#1621

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 24, 2023

@cdoern: This pull request references MCO-811 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This is an enhancement discussing the MachineConfigNode introduction for increased MCO state reporting.

See implementation here:
openshift/machine-config-operator#3970
openshift/api#1596
openshift/api#1621

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.



```console
$ oc get machineconfignodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when an upgrade gets triggered, all the nodes need to be updated to latest config. Hence, reporting UPDATED as false for all nodes would be more aligned. Once a node has been updated, they get set to true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, maybe instead of updating, what we really want is idle? Then the user can still gauge updated status via MCP and use this to see what nodes are in action

Copy link
Contributor

@JoelSpeed JoelSpeed Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to Idle, that seems like it could be helpful. What would trigger Idle to go into false?

I had expected, kind of as @sinnykumari mentions

Name.       Updated.   Updating
my-node-1.  False.         True
my-node-2.  False.          False

Then the conditions when you expand it would say on my-node-2, not yet queued for update or something to explain why they aren't updating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this resolved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make a card to track an idle state, though that will cause unnecessary API calls that might have a "blip" effect where Updated only is true for a fraction of a second. Might instead want to switch updated to idle

@cdoern cdoern force-pushed the machineconfigstate branch 2 times, most recently from f61c40f to f933511 Compare January 12, 2024 17:16
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdoern Please open the diff view on the PR and make sure to respond to/resolve any open conversations, I can see several that have not had any response, or where the feedback has not made its way to the document itself.

As far as I understand it, the EP is up to date now with the state of the TP feature in 4.15. Before we merge, I'd like any outstanding points or feedback that you intend to fix before GA to be listed somewhere, I've suggested a couple in the EP, that could be a good place for it.

I think once we are tracking a list of TODOs and the existing feedback has been actioned or added to the TODOs, then we can merge this and start looking at the improvements towards GA

Implementation wise, there has been discussion of who should own these objects: the MCD or a new Controller called the MachineStateController. While either approach would work, there are positive and negative impacts to each.


## MachineConfigDaemon Approach
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which approach did we end up with? Which controller is owning/updating the MCN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Daemon is. I'll add this in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the approach we are going with, it should probably be in the proposal section rather than alternatives, and the original proposal should be written down as an alternative

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have cards for the follow up, and link those in the TP -> GA, I think we can merge this. When the work to action the TP -> GA goes through, it would be good to get the EP updated as well

enhancements/machine-config/machine-config-node.md Outdated Show resolved Hide resolved
enhancements/machine-config/machine-config-node.md Outdated Show resolved Hide resolved
Implementation wise, there has been discussion of who should own these objects: the MCD or a new Controller called the MachineStateController. While either approach would work, there are positive and negative impacts to each.


## MachineConfigDaemon Approach
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the approach we are going with, it should probably be in the proposal section rather than alternatives, and the original proposal should be written down as an alternative

The states are mostly in the past tense. This is because processes like `Drained` are defined by completion primarily not the progress. So a user will have `UpdateExecuted` == Unknown and `Drained` == Unkown until the Drain actually completes.
However, the unknown phase will be accompanied by a message for how the drain is currently going or if the drain has gone wrong.

The condition transitions back to False once the update process is completed. Once Updated == true is the case, all previous states get set to false and their reasons/messages show the fact that this was the message from the previous update cycle.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it would be good to make sure at the beginning of an update, everything is cleared down, and then marked complete as you go through each phase.
Might need to consider what happens when some updates don't pass through all phases though

@dhellmann
Copy link
Contributor

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

@cdoern
Copy link
Contributor Author

cdoern commented Feb 14, 2024

/override markdownlint

@cdoern
Copy link
Contributor Author

cdoern commented Feb 14, 2024

/override ci/prow/markdownlint

Copy link
Contributor

openshift-ci bot commented Feb 14, 2024

@cdoern: cdoern unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers.

In response to this:

/override ci/prow/markdownlint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

openshift-ci bot commented Feb 14, 2024

@cdoern: cdoern unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers.

In response to this:

/override markdownlint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann
Copy link
Contributor

/override ci/prow/markdownlint

@cdoern when it comes time to override again after it's approved, your team lead or staff engineer can override the job for you.

Copy link
Contributor

openshift-ci bot commented Feb 14, 2024

@dhellmann: Overrode contexts on behalf of dhellmann: ci/prow/markdownlint

In response to this:

/override ci/prow/markdownlint

@cdoern when it comes time to override again after it's approved, your team lead or staff engineer can override the job for you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the content of the proposal. It reflects my understanding of the current implementation, and while there is some outstanding feedback, particularly about the UX of the feature, we now have cards in Jira to track those.

There's just one dissonance at the moment, the Alternatives section talks about two ways to implement the mechanics of the proposal, I would have expected one of these to be in the proposal section so it's clear which was chosen and implemented.

If you want to fix that and then poke @sinnykumari adding the final acks to make this merge, I think it's time

@cdoern
Copy link
Contributor Author

cdoern commented Feb 21, 2024

also @sinnykumari you need to override lint using /override ci/prow/markdownlint

Copy link
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work Charlie and everyone for the review. This has turned out great!
Before adding approval label, have one clarifying question.
Today, cluster admin looks at MCP status to see overall state. With MCN in place, there is chance that admin would be confused that why both differs. For situation like power outage, manual node cordoning, SRIOV operator performing some actions. Since MCN is more tied to MCO operation compared to MCP showing overall health of nodes in the cluster, we may want to call this out explicitly in the enhancement perhaps in Summary or any better place?

Introduce an enhancement discussing the MachineConfigNode introduction

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@cdoern
Copy link
Contributor Author

cdoern commented Feb 22, 2024

@sinnykumari I just added a statement to that effect.

Though, I will push back a bit since in an outage, finding out why your nodes failed to update is equally as useful as checking the MCP status. Sure, there are helpful internally looking at MCO procedures but at the end of the day they are a way to track node update status and since the MCO owns the update code, it just so happens that a lot of these actions are tied to the MCO.

@sinnykumari
Copy link
Contributor

Thanks for updating with needed changes.
/approve
/override ci/prow/markdownlint

@sinnykumari
Copy link
Contributor

Since there are no outstanding comment, adding lgtm as well for the merge.
/lgtm

Copy link
Contributor

openshift-ci bot commented Feb 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2024
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2024
Copy link
Contributor

openshift-ci bot commented Feb 22, 2024

@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/markdownlint

In response to this:

Thanks for updating with needed changes.
/approve
/override ci/prow/markdownlint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

openshift-ci bot commented Feb 22, 2024

@cdoern: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 4d28317 into openshift:master Feb 22, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.