
OCPCLOUD-1578: Add enhancement for converting Machine API resource to Cluster API #1465

Merged

Conversation

JoelSpeed
Contributor

To enable us to migrate from Machine API to Cluster API, we need to implement a conversion system that will convert existing resources to Cluster API, and allow the continued use of Machine API for customers who need it.

This enhancement covers the first phase of conversion, allowing a bi-directional sync of Machine API and Cluster API resources.

In the future, a second enhancement will describe how we move from this bi-directional project, with a choice over which API is authoritative, to Cluster API always implementing the Machine lifecycle.

@openshift-ci openshift-ci bot requested review from dulek and knobunc August 30, 2023 13:41
@JoelSpeed JoelSpeed changed the title Add enhancement for converting Machine API resource to Cluster API OCPCLOUD-1578: Add enhancement for converting Machine API resource to Cluster API Aug 30, 2023
@openshift-ci-robot

openshift-ci-robot commented Aug 30, 2023

@JoelSpeed: This pull request references OCPCLOUD-1578 which is a valid jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 30, 2023
Contributor

@mdbooth mdbooth left a comment

Initial pass. Reviewed as far as 'Summary of the rules outlined above'.

1. Cluster admin uses `oc edit` to update the value of the `machines.openshift.io/authoritative-api` annotation to `cluster.x-k8s.io`
1. The validating webhook verifies the value of the annotation is valid
1. The validating webhook observes that no Cluster API equivalent resource has been created, or, the equivalent Cluster API resource is stale
1. The validating webhook fetches the syncrhonisation failure condition from the Machine API MachineSet and returns this, along with a reject message, to the Cluster Admin
Contributor

trivia: synchronisation

1. The validating webhook verifies the value of the annotation is valid
1. The validating webhook observes that no Cluster API equivalent resource has been created, or, the equivalent Cluster API resource is stale
1. The validating webhook fetches the syncrhonisation failure condition from the Machine API MachineSet and returns this, along with a reject message, to the Cluster Admin
1. Based on the rejected API call, the cluster admin is informat that they cannot currently migrate this MachineSet
Contributor

trivia: informed


When errors occur during synchronisation, the controller will write conditions to the Machine API resource to report the failure.

On successful sychronsiation, the controller will add an annotation to the non-authoritative resource to indicate the observed generation of the last successful synchronsiation.
Contributor

trivia: synchronisation (twice, different each time 😂), although honestly yours are better.

Machine synchronisation will happen in much the same way as in MachineSets.
A Machine API Machine will map to a Machine and an InfrastructureMachine in Cluster API.

If a Machine is owned by a MachineSet, then the existing InfrastructureTemplate from the MachineSet can be used to create the InfrastructureMachine.
Contributor

I would not do this. I think the CAPI machine objects should always be generated from the Machine object without reference to the template that they were supposedly created from.

Contributor Author

Yeah good point, rewording


When Cluster API Machines are authoritative, if a Machine API Machine exists for the Cluster API Machine,
the controller will synchronise the Machine and InfrastructureMachine content onto the Machine API Machine.
Since the Machine API Machine represents a superset of the Cluster API resources (for supported features), there should be no errors converting in this direction.
Contributor

What does this look like for unsupported features?

Contributor Author

Any feature that is unsupported in CAPI wouldn't be converted to create the CAPI resource in the first place.
Any feature that is unsupported in MAPI would result in a backwards conversion error. In theory, for Machines this should be fine, since the spec is immutable once created.
For MachineSets, if the user updated the CAPI MachineSet to have an incompatible template and the backwards conversion failed, we would report on the MAPI MachineSet status that the conversion cannot complete, and we would stop updating the observed synchronised generation, which then prevents backwards updates
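
For illustration, a minimal sketch (in Go) of how a failed backwards conversion could be recorded as a condition while leaving the synced-generation marker untouched; the condition and annotation names follow the enhancement text, everything else is a hypothetical shape rather than the actual implementation:

```go
// Hypothetical sketch only: records the outcome of a sync attempt.
package syncstatus

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Names taken from the enhancement text; exact keys may differ.
	syncedGenerationAnnotation = "machine.openshift.io/synchronised-generation"
	synchronizedConditionType  = "Synchronized"
)

// recordSyncResult sets the Synchronized condition and, only on success,
// advances the synced-generation annotation. Leaving the annotation behind
// on failure is what later blocks the switch back to the failing direction.
func recordSyncResult(conditions *[]metav1.Condition, annotations map[string]string, authoritativeGeneration int64, convErr error) {
	if convErr != nil {
		meta.SetStatusCondition(conditions, metav1.Condition{
			Type:    synchronizedConditionType,
			Status:  metav1.ConditionFalse,
			Reason:  "ConversionFailed",
			Message: convErr.Error(),
		})
		return
	}

	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:   synchronizedConditionType,
		Status: metav1.ConditionTrue,
		Reason: "ResourceSynchronized",
	})
	annotations[syncedGenerationAnnotation] = fmt.Sprintf("%d", authoritativeGeneration)
}
```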

Comment on lines 275 to 276
When a Cluster API MachineSet creates a new Machine, the synchronisation controller will create its Machine API equivalent if, and only if, the Cluster API MachineSet has a Machine API equivalent.
This will allow users to switch a MachineSet back to Machine API at a later date if they wish to do so.
Contributor

This is problematic if we're allowing mutable MachineSets, described above.

  • User creates MAPI MachineSet
  • User sets CAPI MachineSet authoritative
  • User updates CAPI MachineSet's InfrastructureTemplate to one containing incompatible features
  • User sets MAPI MachineSet to authoritative

Contributor Author

If the backwards conversion cannot happen, then the move to MAPI being authoritative will be blocked by the webhook operations. So yes, I think this is problematic, but we already have a mitigation in place.

Comment on lines 263 to 266
ControlPlaneMachineSets (additionally to MachineSets) have a concept of failure domains.
This is also a concept in Cluster API.
When converting ControlPlaneMachineSets to Cluster API, we will ensure that the appropriate failure domain configuration is set up inside the Cluster API InfrastructureCluster,
and that the failure domains from the ControlPlaneMachineSet Machine API configuration are reflected across.
Contributor

Note that in Cluster API a failure domain is a string reference to something understood by the InfrastructureProvider, whereas in CPMS a failure domain is a complete object. This gets complicated when failure domains are modified. CPMS handles this gracefully, but in Cluster API a failure domain may not be modified. No controller will notice that, e.g., a failure domain referenced by a MachineTemplate has been modified and trigger a rollout. The result will just be inconsistent state. The solution to this is not to modify them, and to instead create new failure domains with a new name.

This may mean creating failure domains based on the hash of their contents as described for MachineSets, but note that previously-created failure domains should likely not be removed.

Contributor Author

Do you think there is an API change we can introduce upstream here? To prevent removal or modification of failure domains? Removal I guess we can only prevent if there is still a Machine using the failure domain. I suspect we can put a case forward for making a change.

I think in general a little more thought needs to go into the failure domain stuff for CPMS and Cluster API going forward, I was intending a light touch general idea for this enhancement with a future enhancement to go into the depth of it
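
As a purely illustrative aside on the "hash of their contents" idea above (not something the enhancement specifies), deterministic failure domain names could be derived like this, so a modified definition appears as a new entry rather than an in-place mutation:

```go
// Hypothetical sketch: content-addressed failure domain names.
package failuredomains

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// failureDomainName produces a stable name such as "us-east-1a-3f8a2c1d".
// The spec argument stands in for whatever CPMS failure domain structure
// is being converted; any change to it yields a different name.
func failureDomainName(base string, spec any) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", fmt.Errorf("marshalling failure domain: %w", err)
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%s-%x", base, sum[:4]), nil
}
```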

Contributor

@mdbooth mdbooth left a comment

Is it feasible to do the following 2 things:

  • Make the MAPI object the canonical source for authoritative-api in all cases
  • Instead of a sync operator, run a sync controller for an object alongside the MAPI resource controller which owns it

I think in practice this would amount to adding a new boilerplate controller and/or code in MAO and requiring providers to implement it. This would also naturally have the conversion code live alongside the controller code, which seems like a good outcome for maintenance and testing. It could also solve a bunch of synchronisation headaches by naturally serialising them.

This will look up the current state of the Machine API equivalent resource and determine whether the controller should be reconciling that specific resource or not.
This will hopefully be a relatively small patch that we will need to carry, and should be written so that the logic is reusable and we can reduce duplication of the code.

To ensure we always have the latest version, caches should not be used when performing this check (Cluster API is based on controller-runtime which by default uses a caching client).
Contributor

If you're worried about correctness: there is no such thing as having the latest version. Removing caching just makes the problem harder to test by making the error window smaller. I think we can either:

  • Run all associated controllers in the same manager with the same informers
  • Ensure the protocol handles update latency safely

Suspect the latter will be simpler. Incidentally, moving all authoritative-api state logic to the MAPI object status might help here, as it would serialise it in a single object.

If you're worried about performance: in practice a controller-runtime informer will see an update pretty quickly because of its associated watch. I suspect this optimisation would not be necessary.

Contributor Author

Suspect the latter will be simpler. Incidentally, moving all authoritative-api state logic to the MAPI object status might help here, as it would serialise it in a single object.

Even with this, there is a potential that both receive an event to reconcile, and they both think the status makes them authoritative right? The correct one will be able to update the API, the incorrect one will be rejected on a stale resource version, but that doesn't mean they didn't make out of band changes.

That said, having the changeover handled by the sync controller does have its benefits. If the change is accepted, that means we can ensure/finish any sync of the config needed before we update the status version of the field.

Do you think the controllers would always reconcile based on just the status, or, do you expect they would check for spec/status and compare them? ie, should we stop the controllers reconciling the infra when there is a discrepancy between the desired and actual authoritative API 🤔

Contributor

The actual resource controllers should look only at the status.

The sync controller should take additional resource-specific steps to ensure that the state is consistent before flipping.

Contributor Author

In the case of a new resource, this means the sync controller must run before the actual controller itself. I wonder if that could lead us into issues if the sync cannot happen for some reason. I guess it is ok if the controller always copies the authoritative resource over on the first reconcile (the delay is minimal) and then only accepts the change based on conditions later; I think that should be safe. Will explore.

Comment on lines 412 to 414
In addition to this, the webhook will validate, on a change of authoritative API, that the synchronisation is up to date.
If the value of the `machine.openshift.io/synchronised-generation` annotation does not match the current generation,
the change will be rejected, with a request to retry once the synchronisation has occurred.
Contributor

I don't understand this requirement. Surely we can always update the authoritative API and the sync controller can catch up later?

I suspect this would create significant churn during initial resource creation where we do things like:

Reconcile(foo):
   ...
   foo.Status.Bar = <whatever>
   patch(foo.Status)
   return /* Expect to be reconciled again immediately */

In this case we will be called again immediately but fail because the sync controller hasn't caught up yet. The early rounds of controller-runtime's default backoff are very short, so I wouldn't be surprised if this happened 4-5 times as a matter of course.

Contributor Author

The intention is that this is only checked when there's a change of the authoritative API, i.e. if you change from MAPI to CAPI or vice versa, so I don't think it will actually cause issues in most cases.

However, I think if we move to the controller "accepting" a change to the authoritative API, that we will be able to drop this anyway

By extending the integration tests, we can sculpt the desired inputs for mirror resources existing, not existing, and authoritative in both API versions,
and then determine whether the reconcile continued or exited based on these conditions.

#### Full migration E2E
Contributor

I wanted to write here:

We can set a cluster-wide flag to make CAPI authoritative in all cases then run our existing test suites, which should continue to pass.

However, we can't do that because our existing test suites will continue to use MAPI, and we don't allow continued synchronisation of resources from MAPI->CAPI when CAPI is authoritative.

This suggests a couple of problems to me:

  • We can't leverage our existing battle-hardened test suite as a regression suite, at least for machine operations
  • We can't make CAPI globally authoritative until all our users stop using MAPI, which we may never achieve

Contributor Author

We can't leverage our existing battle-hardened test suite as a regression suite, at least for machine operations

Given the existing tests in origin, we can update these to use unstructured, since they only leverage MachineSet scaling.

For other test suites, you're correct.

However, one of the plans for the future is that we eventually turn the feature into a one-way sync. With this (feature gated) approach we can still run the existing test suite, leveraging the MAPI suite with everything being run by CAPI controllers

This is also how we will make CAPI the backend globally eventually, but we have no intention of forcing CAPI authoritative in the sense of the API at least, as a part of this project

we will make sure that Cluster API controllers in the 4.N-1 release do not reconcile when a mirror resource exists with the same name.
This is independent of the authoritative annotation.

This will mean that on downgrade all Machine management reverts to Machine API controllers.
Contributor

Can we prevent downgrade when CAPI->MAPI is no longer possible for some objects?

Contributor Author

Good question. A downgrade is actually an upgrade according to CVO, so if we can detect that there's going to be an issue with the direction of the upgrade, we could set the Upgradeable=False condition.

Contributor Author

Coming back to this, no, we couldn't do this. We would block all upgrades, or none, which would just be problematic

@jeana-redhat

I understand that this is not final, but the information here is very detailed and provides a huge head-start for docs. I have done some initial planning and will keep an eye on this as things evolve.

Contributor

@elmiko elmiko left a comment

this is reading really well to me, thanks for the hard work everyone. i've left a bunch of comments.

If the Autoscaler last changed the replica count, it should be mirrored back to the Machine API MachineSet if it is authoritative.
- This creates an exception to the rules outlined above and adds complication to the conversion, but may be less of a burden than carrying a patch in the autoscaler

Both avenues above will need to be explored further before a decision is made on how the Autoscaler will handle mixed MachineSet clusters.
Contributor

it seems to me that the earlier we can switch the autoscaler to observing capi objects, the earlier we can drop our mapi carry patches.

Contributor Author

I think once we get started with the main conversion stuff it would be good to start getting the autoscaler ready, but until MAPI machines are being reflected into CAPI I don't think there's anything we can do there

Contributor

reading through the enhancement again, i have more thoughts about this.

  • we could expand the ClusterAutoscaler resource type, either through normal API or annotation, to allow the user to specify the authoritative API
  • based on which API is specified as authoritative, the CAO will deploy the proper configuration of CAS

the trouble, as mentioned earlier, is the carry patches we have on the autoscaler. we need to have different behavior between mapi and capi and it's more than just changing the GVR for the resources. so, we'll need to figure out if we can make mapi authoritative for some amount of releases and then switch to capi, or if the user needs to specify when deploying their ClusterAutoscaler manifest. then we need to figure out if we are adding a carry patch to the CAS or what.


To avoid this potential, our approach is to have the authoritative annotation exist in only one location: on the Machine API resource.

### Keep the authoritative annotation on the Cluster API resource
Contributor

is this part of the "Alternatives" section?

Contributor Author

Yes, it's an alternative to keeping the annotation on the MAPI resource, though that's not so obvious right now

Member

@damdo damdo left a comment

Thanks for thinking through and writing this up, it looks very thorough.
I left a couple of questions where I have doubts or desire clarification, but overall it makes sense to me.


We will implement a two-way synchronisation controller that synchronises Machines and related resources between the Machine API and Cluster API equivalents.

Using an annotation on the Machine API resource, users will be able to choose which API is authoritative.
Member

I'm wondering how the 2way-sync-controller would behave for events coming from, for example, the CAPI resource, having to check its MAPI counterpart all the time to determine whether to reconcile it or not.

Or are you planning to cache this somehow?

Contributor Author

Events are only created for create, update and delete operations. What should happen is that the controller updates the CAPI resource, writes that, gets an event back for the update, then reconciles the CAPI Machine, checks that the CAPI Machine is up to date with respect to the MAPI Machine, and then exits without writing an update. So for each write I would expect a second reconcile to stabilise. I don't think it will create an infinite loop, if that's the concern.

Member

are you planning to cache this somehow?

See below (under Synchronisation of resources). To my understanding, once the syncer is happy that the non-authoritative resource is copacetic, it will write the authoritative resource's metadata.generation to the non-authoritative resource. Subsequent reconciles in the syncer will no-op if the generations match. If they don't, the syncer regenerates the non-authoritative resource and updates it, along with the new generation of the authoritative resource. Note that in some cases the regenerated non-authoritative resource may be identical to its previous incarnation, in which case only the generation needs to be bumped. Regardless: the non-authoritative resource will be updated any time the generation of the authoritative resource changes, but not otherwise.

In pseudocode, assuming C(API) is authoritative:

C := GET(nsname, CType)
M := GET(nsname, Mtype)  // both objects use the same NamespacedName

if M.isNotFound:
  // This is a valid state
  return

if C.metadata.generation == getCGeneration(M):
  return

M = map2M(C)  // includes updating M's record of C's generation -- which may be the only change
POST(M)

return

(NB: There may be quirks depending on where we decide to store the generation: annotation or a status field -- see below. I.e. it is possible that two separate updates to M will be needed.)

Contributor Author

Yeah, I think Eric's description is what I had in mind. Do you think we should clarify this further, Dam, or are you happy with what's already included below?
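
For illustration, a controller-runtime rendering of the pseudocode above (for the CAPI-authoritative direction) might look roughly like this; the annotation key follows the enhancement text, while the reconciler shape and the `mapCAPIToMAPI` helper are assumptions, not the actual implementation:

```go
// Hypothetical sketch of the generation-gated sync described above.
package mapisync

import (
	"context"
	"strconv"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	capi "sigs.k8s.io/cluster-api/api/v1beta1"
)

const syncedGenerationAnnotation = "machine.openshift.io/synchronised-generation"

type SyncReconciler struct {
	Client client.Client
}

func (r *SyncReconciler) reconcileCAPIAuthoritative(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	capiMachine := &capi.Machine{}
	if err := r.Client.Get(ctx, req.NamespacedName, capiMachine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	mapiMachine := &machinev1beta1.Machine{}
	if err := r.Client.Get(ctx, req.NamespacedName, mapiMachine); err != nil {
		if apierrors.IsNotFound(err) {
			return ctrl.Result{}, nil // no MAPI mirror: nothing to sync
		}
		return ctrl.Result{}, err
	}

	// No-op when the mirror already reflects this generation.
	if mapiMachine.Annotations[syncedGenerationAnnotation] == strconv.FormatInt(capiMachine.Generation, 10) {
		return ctrl.Result{}, nil
	}

	updated, err := mapCAPIToMAPI(capiMachine, mapiMachine)
	if err != nil {
		return ctrl.Result{}, err // real code would also set a failure condition
	}
	return ctrl.Result{}, r.Client.Update(ctx, updated)
}

// mapCAPIToMAPI stands in for the real conversion; here it only records the
// observed generation of the authoritative resource.
func mapCAPIToMAPI(c *capi.Machine, m *machinev1beta1.Machine) (*machinev1beta1.Machine, error) {
	out := m.DeepCopy()
	if out.Annotations == nil {
		out.Annotations = map[string]string{}
	}
	out.Annotations[syncedGenerationAnnotation] = strconv.FormatInt(c.Generation, 10)
	return out, nil
}
```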

Comment on lines 309 to 316
##### MachineSet scale up

If a MachineSet scales up, it will create a new Machine as it does today.
Because the managed resource in this scenario is the MachineSet the synchronisation controller will ensure that the Machine resources are reflected in both Machine API and Cluster API.

If the MachineSet is a Cluster API MachineSet and there is no Machine API equivalent, this synchronisation will not happen and the machines will be created as Cluster API only.

New Machines created in the scale up will default to the same authoritative API as the MachineSet that creates them.
Member

Just a note here, that MachineSet scaleup behaviour is different in MAPI vs CAPI:
See this past (not-a-bug) issue: https://issues.redhat.com/browse/OCPCLOUD-1540
Not sure this would necessarily be a problem but just something that might be worth remembering.

Contributor Author

Yes, I don't think it will be an issue, the sync controller should handle this, but it is a behavioural change we will likely want to document as part of this, CC @jeana-redhat


To enable a reverse migration, should the customer wish, we will need to enable the user to create a Machine API equivalent of a Cluster API resource if it does not exist.

The user will be able to do this by creating the Machine API equivalent resource and ensuring that the authoritative annotation is present from creation and is set to `cluster.x-k8s.io`.

When the user creates the Machine API equivalent, are they going to populate fields that are ultimately overridden by the sync controller?

Ideally they would create a resource with a name that matches a CAPI-authoritative resource plus the required annotation, and nothing else. In reality, I suspect there are fields that the API currently defines as required that they would have to fill out.


I see this is kind of addressed further down in the table. I still think a description of what fields are required would be helpful.

Contributor Author

Can look that up; at the moment I suspect there are a decent number of required fields. I wonder if there's a way that we can reduce that number. Certainly, for anything that is webhook validated, we can probably skip the validation when the resource is non-authoritative.
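
To make the open question concrete, here is a sketch of the stub a user might create; it is assumptions only — the annotation key/value are those quoted earlier in this review, and unstructured is used precisely because the set of required spec fields is still unknown:

```go
// Hypothetical sketch: a minimal MAPI MachineSet stub whose only purpose is
// to exist with the authoritative annotation pointing at Cluster API, so the
// sync controller can fill in the rest.
package mapistub

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func stubMachineSet(namespace, name string) *unstructured.Unstructured {
	ms := &unstructured.Unstructured{}
	ms.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "machine.openshift.io",
		Version: "v1beta1",
		Kind:    "MachineSet",
	})
	ms.SetNamespace(namespace)
	ms.SetName(name) // must match the existing Cluster API MachineSet name
	ms.SetAnnotations(map[string]string{
		// Key/value as quoted earlier in this review.
		"machines.openshift.io/authoritative-api": "cluster.x-k8s.io",
	})
	// Whatever spec fields the API marks as required would still have to be
	// filled in here; that is exactly the open question in this thread.
	return ms
}
```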

To enable the Cluster API controllers to observe the authoritative annotation, we will need to implement a check at the beginning of the reconcile logic of each controller.

This will look up the current state of the Machine API equivalent resource and determine whether the controller should be reconciling that specific resource or not.
This will hopefully be a relatively small patch that we will need to carry, and should be written so that the logic is reusable and we can reduce duplication of the code.

I'm wondering if this is something that should be split into a library, so that the patches we carry in the controllers are just making calls to the shared logic. I presume this is what you're trying to say here, am I correct?

Contributor Author

Originally I had assumed that this would be library logic that we import, yes. Though I think we should test whether that is easier to maintain (given it will then need go.mod changes) than maintaining the code locally inside each repository without interfering with go modules.
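
For illustration, the shared guard that each carried patch might call at the top of Reconcile could look roughly like this; names and the annotation key are assumptions, and the uncached read mirrors the enhancement text:

```go
// Hypothetical sketch of a shared "should this CAPI controller act?" guard.
package authority

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const authoritativeAPIAnnotation = "machines.openshift.io/authoritative-api"

// ShouldReconcile returns true when the Cluster API controller may act on the
// resource: either no Machine API mirror exists, or the mirror marks Cluster
// API as authoritative. reader should be an uncached client, for example
// manager.GetAPIReader().
func ShouldReconcile(ctx context.Context, reader client.Reader, key client.ObjectKey, mirror client.Object) (bool, error) {
	if err := reader.Get(ctx, key, mirror); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // CAPI-only resource, reconcile as normal
		}
		return false, err
	}
	return mirror.GetAnnotations()[authoritativeAPIAnnotation] == "cluster.x-k8s.io", nil
}
```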


To ensure that we do not break customer clusters, we need to make sure that our logic that converts the resources between Machine API and Cluster API is thoroughly tested.

To ensure we don’t break customers we will write extensive tests for the conversion. Primarily, we expect this to be done via unit testing; however, we will still need to test these conversions in E2E, since the semantics of a configuration option may differ between the two APIs.

Similar to having a library for filtering reconciles, could the logic for conversion reside in a library that provides a wrapper CLI?

This way, we can have a suite of YAML fixtures that we can do conversions on in CI without having to spin up a full cluster every time.

I'm envisioning something like

func MachineSetToMAPI(*capi.MachineSet) *mapi.MachineSet
func MachineSetToCAPI(*mapi.MachineSet) *capi.MachineSet
...
func <Type>To<APIGroup>...

Of course, we should still have extensive e2e tests, but this may be a way to make the inner loop of testing conversion logic faster.

Contributor Author

I had assumed we could add a unit testing suite for all of the conversions which would allow us to internally test these kinds of things. A CLI was floated in the past as an option to allow users to convert resources they have away from clusters. For example gitops style resources.

If we still think there's value in that, I don't think it would be difficult to implement a CLI and package it in a way similar to what CCO does for its ccoctl tool.
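
For illustration, a table-driven round-trip test over YAML fixtures could exercise the conversion helpers without a cluster; `MachineSetToCAPI`/`MachineSetToMAPI` are the assumed signatures floated above, and the fixture path is made up:

```go
// Hypothetical sketch of a fixture-driven round-trip conversion test.
package conversion

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/google/go-cmp/cmp"
	"sigs.k8s.io/yaml"

	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
)

func TestMachineSetRoundTrip(t *testing.T) {
	fixtures, err := filepath.Glob("testdata/machinesets/*.yaml")
	if err != nil {
		t.Fatal(err)
	}

	for _, path := range fixtures {
		t.Run(filepath.Base(path), func(t *testing.T) {
			raw, err := os.ReadFile(path)
			if err != nil {
				t.Fatal(err)
			}
			in := &machinev1beta1.MachineSet{}
			if err := yaml.Unmarshal(raw, in); err != nil {
				t.Fatal(err)
			}

			capiMS, err := MachineSetToCAPI(in) // assumed helper
			if err != nil {
				t.Fatalf("MAPI -> CAPI: %v", err)
			}
			out, err := MachineSetToMAPI(capiMS) // assumed helper
			if err != nil {
				t.Fatalf("CAPI -> MAPI: %v", err)
			}

			if diff := cmp.Diff(in.Spec, out.Spec); diff != "" {
				t.Errorf("round trip changed the spec:\n%s", diff)
			}
		})
	}
}
```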

This will be a fairly major patch that we need to carry.
We need to make sure that the patch will apply cleanly in the future and that we minimise the maintenance burden of this carry as much as possible.

#### Mirroring resources may be confusing for end users

I agree that this will be confusing for users, especially if there are no annotations on the CAPI resources at all.

I think it makes sense for controlling the authoritative API from the MAPI side exclusively. When CAPI resources are mirrored only, perhaps we set an annotation on them that indicates they're not "active"?

Contributor Author

I would personally say we should add a condition to the CAPI resource rather than an annotation. Other controllers should leave the additional condition alone. Will need to be tested though

Contributor Author

We are now proposing to add a Paused condition upstream, do you think this would cover this concern? We intend to use pausing upstream as a way to stop CAPI from reconciling the resources, and with admission control, will prevent the unpausing of resources


Yes, I think pausing on CAPI plus the admission control would help.
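
For illustration, marking the non-authoritative CAPI mirror might combine the existing upstream pause annotation with the proposed Paused condition; the condition below uses generic metav1 conditions and is an assumption (CAPI's own condition types and helpers differ):

```go
// Hypothetical sketch: pause a CAPI mirror while it is non-authoritative.
package pause

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	capi "sigs.k8s.io/cluster-api/api/v1beta1"
)

func markNonAuthoritative(m *capi.Machine, conditions *[]metav1.Condition) {
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	// Existing upstream mechanism: CAPI controllers skip objects annotated as
	// paused (the presence of the key is what matters).
	m.Annotations["cluster.x-k8s.io/paused"] = "true"

	// Illustrative Paused condition for visibility, as discussed above.
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:   "Paused",
		Status: metav1.ConditionTrue,
		Reason: "NonAuthoritativeAPI",
	})
}
```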

Member

@2uasimojo 2uasimojo left a comment

I've spent three days on this and can't seem to get all the way through it, so I'm gonna leave my so-far comments and come back later.


#### Synchronisation of resources

To ensure that resources are synchronised, we will implement a new controller `machine-api-synchronisation-controller`.
Member

choosing a word that won't be misspelled frequently

s/synchroni[sz]ation/mapping/ ?

Then we'll have a capi-mapi-mapper and a mapi-capi-mapper.

And resources can be capi-mapi-mapping-capable and mapi-capi-mapping-capable.

But abbreviated to sync works too :P

This controller will be responsible for translation of Machine API resources to Cluster API resources, and vice-versa.

This controller will be responsible for creating Cluster API versions of Machine API resources, when they do not exist.
It will observe the authoritative annotation described [above][authoritative-annotation] to synchronise between the two resources,
Member

nit: this link isn't rendering




To understand the likelihood of this, we can use existing CCX data [^7] to determine if there will be clashes in this data before we implement the controller.

In cases where conversion fails, the controller will add a condition to the Machine API MachineSet defining the errors present.
Member

Seems like there are at least three failure modes here:

  • The syncer's internal logic is unable to come up with what it considers a sane target resource. This one is easy: update the condition as noted.
  • The syncer thinks it came up with a reasonable target resource, but gets an HTTP error when POSTing it to the server. This comes in at least two flavors:
    • Retryable, e.g. 409 conflict. Requeue and try again next time.
    • Fatal, e.g. rejected by webhook. For this one we probably also update the condition.
  • This is the weird one: k8s accepts the POST, but the object's controller -- if it ran -- would have noticed some internal problem and marked its (non-this-enhancement-related) status conditions to indicate failure. We either:
    • Code the non-authoritative controller to essentially "dry run" the update and set its conditions accordingly.
      • First, this would probably be Hard™, as not all of that logic is likely to be linear or in one place.
      • Second, how does the syncer know when that's done?
    • OR the non-authoritative controller truly no-ops. This is the option that seems correct based on this design. However, then the problem lies dormant and undetectable until a subsequent auth switch happens! Effectively the migration broke a Machine.
      ...but I guess this is okay: it means the migration didn't work, and the admin can/should just flip auth back to the original. And, I guess, iterate?
      It doesn't sound ideal, but I think I've convinced myself that this is the right option.

Contributor Author

Non-authoritative controllers should truly no-op, and yes, there may well be broken machines. This is why we are introducing this initially with a get-out-of-jail card: the ability to switch back.
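
For illustration, the failure modes listed above might be separated along these lines in the syncer (helper shapes are assumptions): conflicts requeue, admission/validation rejections land on the Synchronized condition, and anything else bubbles up:

```go
// Hypothetical sketch of classifying sync write failures.
package syncerrors

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
)

// applyMirror runs the supplied write of the non-authoritative mirror and
// decides whether the failure is retryable or terminal for this generation.
func applyMirror(ctx context.Context, write func(context.Context) error, markFailed func(reason string)) (ctrl.Result, error) {
	err := write(ctx)
	switch {
	case err == nil:
		return ctrl.Result{}, nil
	case apierrors.IsConflict(err):
		// Someone else updated the object: requeue, rebuild from the latest
		// authoritative state and try again.
		return ctrl.Result{Requeue: true}, nil
	case apierrors.IsInvalid(err) || apierrors.IsForbidden(err):
		// Rejected by validation or a webhook: record it on the Synchronized
		// condition and wait for the authoritative resource to change.
		markFailed(err.Error())
		return ctrl.Result{}, nil
	default:
		return ctrl.Result{}, err
	}
}
```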

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2023
@JoelSpeed
Contributor Author

/remove-lifecycle stale

I need to come back to this and address the feedback, planning to implement in the new year so will want to do that soon

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2023
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2023
@JoelSpeed
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2023
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale


There will be a small number of fields that are allowed to be updated by the user:
* `.metadata.labels` - that are not critical to Machine API or Cluster API functionality, and are not in reserved Kubernetes or OpenShift namespaces, and do not conflict with the authoritative resource
* `.metadata.annotations` - that are not critical to Machine API or Cluster API functionality, and are not in reserved Kubernetes or OpenShift namespaces, and do not conflict with the authoritative resource

Do we already have a way to enumerate the disallowed labels/annotations? I don't know exactly how many of these there are, but it seems like it might be difficult to be exhaustive.

Contributor Author

We can regex check for the kube and openshift namespaces, and the others will be those which are used in selectors. Eg, you shouldn't be able to modify the labels/annotations on a resource if it'll stop it matching the selectors.

I think as well we need to make sure that you can't modify the annotations if they are present on the authoritative API, i.e. they would be put back if you removed them.
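
For illustration, the reserved-prefix check described here might be a simple regular expression over the key's domain, combined with a check against the authoritative resource's own metadata; all names below are assumptions:

```go
// Hypothetical sketch of the "reserved namespace" metadata key check.
package metadatafilter

import "regexp"

// Matches keys such as "kubernetes.io/foo", "node.kubernetes.io/bar" or
// "machine.openshift.io/baz" -- anything under a *.kubernetes.io, *.k8s.io
// or *.openshift.io prefix domain.
var reservedKey = regexp.MustCompile(`^([a-z0-9-]+\.)*(kubernetes\.io|k8s\.io|openshift\.io)/`)

// userEditable reports whether a label/annotation key may be edited on the
// non-authoritative resource.
func userEditable(key string, authoritative map[string]string) bool {
	if reservedKey.MatchString(key) {
		return false
	}
	// Keys owned by the authoritative resource would just be written back,
	// so reject edits to those too.
	_, ownedByAuthoritative := authoritative[key]
	return !ownedByAuthoritative
}
```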


The controller will set the `Paused` condition to `True` when the resource is non-authoritative and will not perform any further reconciliation.

#### The Cluster Autoscaler in mixed MachineSet clusters
@nrb nrb Mar 20, 2024

@elmiko Have you had a chance to look over this again for consistency? I know you've looked at ensuring we respect upstream autoscaler values, so hopefully that will help us transition smoothly.


We discussed this some in a standup, and the autoscaler's behavior will be expanded on further in a separate enhancement, so I'd say this doesn't need to be blocking.

Contributor

we talked about the autoscaler usage in our standup yesterday. in general i think we are ok to go forward with this design but there are a few details to keep in mind:

  • today, the autoscaler is configured to watch a specific group and version for the scalable types.
  • given that we will allow per-machineset upgrading into capi, we will most likely need to carry a patch on the autoscaler that will allow us to differentiate between the various resource types we are seeing.
  • there will need to be some design documented about how the autoscaler logic will work in terms of selecting the proper resource to scale. this can be handled in a followup enhancement specifically about the autoscaler.


#### Synchronisation logic will need to be solid before it gets into the hands of customers

To ensure that we do not break customer clusters, we need to make sure that our logic that converts the resources between Machine API and Cluster API is thoroughly tested.

Since we're allowing two different controllers to handle the overall migration process, it's possible we could have the conversion controller shipping and converting things before enabling the authoritative checks, right?

That way we could have telemetry for those customers who have opted in to see what their conversions are looking like, whether they're overall healthy or there are glaring problems that we need to address/add to the test suite.

Contributor Author

Since we're allowing two different controllers to handle the overall migration process, it's possible we could have the conversion controller shipping and converting things before enabling the authoritative checks, right?

In theory yes, we could ship that part and have the conversion in place, but we would need to enforce the authority to always be MAPI without the migration

That way we could have telemetry for those customers who have opted in to see what their conversions are looking like, whether they're overall healthy or there are glaring problems that we need to address/add to the test suite.

Yeah that would be good to know, though I'm not sure the handover part is particularly complicated. I kind of expect that to actually be one of the easiest parts so I'm not sure we'd necessarily bother shipping them separately. Was there a specific concern you had?


### Open Questions

N/A

What is the scope of feature gaps? I think that's something we will have to find out as we implement conversions per platform, it's difficult to answer here.

Contributor Author

Yeah, I'll sync up with you on this later, but, as we go through and write the code out that converts providerSpec to InfraTemplate, I'm expecting we will find gaps.

As we find gaps, if the field is in use, and we can't convert, we will have to return an error which will end up on the status in the Synchronized condition.

Every time we write an error case, a Jira story will be created (assuming disciplined engineers) and that will then need to be resolved. Once resolved, the error case can be removed
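
For illustration, the per-field error pattern described here might look like the following, with every unsupported field producing an explicit error that the sync controller can aggregate into the Synchronized condition message; the provider spec type and field names are made up:

```go
// Hypothetical sketch of the "error per unsupported field" conversion gap.
package convert

import (
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// awsProviderSpec stands in for the real MAPI provider spec type.
type awsProviderSpec struct {
	InstanceType    string
	UnsupportedKnob *string // imagine a field with no CAPI counterpart yet
}

func convertAWSProviderSpec(spec awsProviderSpec) error {
	var errs []error
	if spec.UnsupportedKnob != nil {
		// Each of these branches should have a Jira story tracking its removal.
		errs = append(errs, field.Invalid(
			field.NewPath("providerSpec", "unsupportedKnob"),
			*spec.UnsupportedKnob,
			"this field is not yet supported by the Cluster API conversion"))
	}
	if len(errs) > 0 {
		return fmt.Errorf("converting AWS providerSpec: %w", utilerrors.NewAggregate(errs))
	}
	return nil
}
```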


## Graduation Criteria

The project will initially be introduced under a feature gate `MachineAPIMigration`.

Do we need two feature gates? One for synchronization, one for enabling the authoritative API management controller?

Contributor Author

I had intended initially to use one gate to cover the whole feature; in the future though, I expect we may have per-platform gates as we evolve some of the migrations at a different pace.

When migration between the two authoritative API versions is not possible, the sync controller will add a condition to the resource to indicate the reason for the failure.
However, with the above described workflow, the user is not presented with a synchronous response to the migration request.

To improve the feedback loop, a webhook will be introduced to validate the migration request.

I don't think this meaningfully changes the enhancement other than implementation detail, but could/would a ValidatingAdmissionPolicy be a valid alternative to introducing a webhook?

Contributor Author

Yes, this has crossed my mind, and I think whoever implements this should do some research/decisioning on this
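
For illustration, the check itself is small regardless of whether it ends up in a webhook or as a CEL expression in a ValidatingAdmissionPolicy (the same comparison of annotations against the current generation could be written in CEL). A webhook-style sketch, with annotation keys as quoted elsewhere in this review and everything else assumed:

```go
// Hypothetical sketch of validating an authoritative API change request.
package admission

import (
	"fmt"
	"strconv"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	authoritativeAPIAnnotation = "machines.openshift.io/authoritative-api"
	syncedGenerationAnnotation = "machine.openshift.io/synchronised-generation"
)

// validateAuthoritativeChange rejects a switch of authoritative API while the
// non-authoritative mirror is stale.
func validateAuthoritativeChange(oldObj, newObj client.Object) error {
	oldAPI := oldObj.GetAnnotations()[authoritativeAPIAnnotation]
	newAPI := newObj.GetAnnotations()[authoritativeAPIAnnotation]
	if oldAPI == newAPI {
		return nil // not a migration request, nothing extra to check
	}

	synced := newObj.GetAnnotations()[syncedGenerationAnnotation]
	if synced != strconv.FormatInt(newObj.GetGeneration(), 10) {
		return fmt.Errorf("cannot change authoritative API: synchronisation is not up to date (synced generation %q, current generation %d)", synced, newObj.GetGeneration())
	}
	return nil
}
```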

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 20, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 27, 2024
@JoelSpeed
Contributor Author

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 30, 2024
@deads2k
Contributor

deads2k commented Apr 30, 2024

/approve

leaving lgtm with the team.

Contributor

openshift-ci bot commented Apr 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2024
Contributor

openshift-ci bot commented Apr 30, 2024

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


## Infrastructure Needed

A new repository `github.com/openshift/machine-api-synchronisation-operator` will be needed to complete this project.
@jeana-redhat jeana-redhat May 1, 2024

Just noticed this line. Does that mean the "operator" referred to under ## Graduation Criteria above is a new Operator specifically to handle this?

ETA: it occurs to me that this should probably be machine-api-synchronization-operator if it's an official OCP component 🇺🇸 😅


@jeana-redhat Yeah, there will be an operator to do the synchronization/copy of the data, and one to manage which API is authoritative.


maybe it can use machine-api-sync-operator so no one has to write it out or choose an English variant 😉

Contributor

@elmiko elmiko left a comment

this is lgtm from the cluster autoscaler perspective, i left a little more details in response to @nrb 's question to help highlight challenges we will need to address in the future. i don't think they will impact this work though.



@nrb

nrb commented May 1, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 26a8982 into openshift:master May 1, 2024
2 checks passed
1. The migration controller updates `status.authoritativeAPI` to `ClusterAPI`
1. The Cluster API controller takes over management of the MachineSet going forwards

To migrate back from Cluster API to Machine API, the procedure is the same, however, the value of the field should be set to `MachineAPI`.
Member

Can we save ourselves half the work by disallowing this (e.g. in a webhook/CEL)?
What's the use case for migrating from CAPI->MAPI?


From the goals section:

Allow users to migrate Machines between Machine API and Cluster API to test the new API

However, upon re-reading it, the full motivation isn't spelled out.

The idea for going back is an escape hatch in case something is broken or missing in CAPI that does work/exists in MAPI.

If there was consensus around dropping that it would indeed save us some work, but I think right now we'd like to be super conservative in moving forward, since breakages could render clusters unusable.

Contributor Author

Yeah, we are trying to be very very conservative here so that we don't end up getting anyone stuck in a situation where they are migrated to Cluster API and can't move backwards if there's a bug. It would be preferable to not have to deal with the reverse situation, and we are of course prioritising the forward direction, but I would still be keen to see the reverse direction implemented unless we suddenly become super confident in our migration and that no one would need the reverse

Member

In my head I'm trying to relate this back to specific kinds of things that could go wrong and how the user could get out of them.
If we really messed up and ended up reprovisioning all the Machines that would be bad, but at that point switching back (and potentially reprovisioning them all again, heh) wouldn't be a solution.
Machines aren't supposed to do much once they're already provisioned, so as long as they can just sit there there's not a lot to go wrong.
If we're unable to create new Machines in a MachineSet for whatever reason then the user always has the escape hatch of creating a new MAPI MachineSet rather than reverse-migrating the existing one.
The only thing I can think of is if we're unable to delete a Machine using CAPI (and not because we've thrown away some data that we needed, since we likely can't recover that), then migrating it back to MAPI could be a workaround.

Contributor Author

You're absolutely right about individual Machines. The main reason to go back would be because the deletion doesn't work in the same way. There could also be behavioural differences between Machines that we haven't implemented during runtime. One example is tagging, which I know is updated continuously in MAPI, but may be different for CAPI, on the list to check but also hopefully discover others of these kinds of things.

But I think the reversal scenario is more important for higher level objects like MachineSet and ControlPlaneMachineSet.

So playing through a MachineSet scenario:

  • User has existing MachineSet in MAPI that they have been using for a while
    • They decide they would like to move over to CAPI for new Machines because we have explained to them that there's a bug we won't fix in creation of Machines in MAPI, but it's fixed in CAPI (carrot on a stick)
    • Once they have moved, they decide they would like to change the instance type, so they create a new InfrastructureTemplate and update their MachineSet to point to the new template (see the sketch after this list)
    • The MAPI version of the MachineSet still exists, but is not updated. It now represents an incorrect configuration.
      • Having two representations of the same resource that differ seems bad; must we remove the MAPI version for them? What if some tooling they use is still observing the MAPI resources and reads state from them?
      • What would we suggest for preventing this?
    • They realise that they are not actually ready for using CAPI (some feature is not present that they want to use - or maybe some extension they need to use on top isn't ready for CAPI - or maybe another bug in CAPI that is more serious than the MAPI one)
    • They want to move back to managing Machines via MAPI
    • They cannot reverse the migration
      • They must now create a new MachineSet based on the MAPI version of the resource (which is not up to date because they made changes on the CAPI side)
      • They must scale down the old MachineSet and remove it, because there's no backwards migration, this is unnecessary disruption
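
To make the template switch in the scenario above concrete, here is a trimmed sketch of a Cluster API MachineSet pointed at a new infrastructure template; all names are hypothetical and unrelated fields are elided:

```yaml
# Hypothetical names; trimmed to the fields relevant to the scenario.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
metadata:
  name: worker-us-east-1a
  namespace: openshift-cluster-api
spec:
  clusterName: my-cluster
  replicas: 3
  selector:
    matchLabels:
      machineset: worker-us-east-1a          # hypothetical selector label
  template:
    metadata:
      labels:
        machineset: worker-us-east-1a
    spec:
      clusterName: my-cluster
      bootstrap:
        dataSecretName: worker-user-data     # hypothetical bootstrap secret
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: worker-us-east-1a-xlarge       # new template carrying the changed instance type
```
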

We want to be able to convert the CAPI status back to MAPI status since we intend to eventually remove the MAPI controllers and support the MAPI APIs through the conversion layer. So really, the reverse conversion of the spec is the only part of this we would be dropping, right?

So let's say we didn't have backwards conversion for specs:

  • Once migrated, a resource must be deleted on the MAPI side to avoid the problem of config eventually drifting
    • How would we enforce this? Delete it for them?
  • We still need to be able to write the status from CAPI to MAPI to support getting rid of the MAPI controllers
  • We still need all the admission checks to prevent users writing to CAPI if the resource is not authoritative (a rough sketch of such a check follows this list)
  • If the MAPI MachineSet is still authoritative, then what happens during a scale up?
    • Our intention in the longer term is to use the CAPI MachineSet logic, which would create a CAPI Machine.
      • We then need to create a MAPI Machine to reflect this, but we have no backward logic, so the MAPI MachineSet is now broken since it has fewer Machines than the CAPI MachineSet
    • This would mean we have to keep the MachineSet controller around for MAPI indefinitely, as well as the MAO, which again we had intended to remove, or spend time moving the MachineSet controller elsewhere
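
As a sketch of what the admission check mentioned above could look like (the annotation key is purely hypothetical, standing in for however the sync layer ends up marking a non-authoritative copy):

```yaml
# Illustrative only; a ValidatingAdmissionPolicyBinding (omitted) would also be required.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-writes-to-non-authoritative-capi   # hypothetical policy name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["cluster.x-k8s.io"]
        apiVersions: ["*"]
        operations: ["UPDATE"]
        resources: ["machines", "machinesets"]
  validations:
    - expression: >-
        !has(object.metadata.annotations) ||
        !('example.openshift.io/authoritative-api' in object.metadata.annotations) ||
        object.metadata.annotations['example.openshift.io/authoritative-api'] == 'ClusterAPI'
      message: "this Cluster API resource is not authoritative; edit the Machine API copy instead"
```
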

Member

I think right now we'd like to be super conservative in moving forward

I can't help being reminded of what they used to call the MIT approach.

They must scale down the old MachineSet and remove it, because there's no backwards migration, this is unnecessary disruption

I don't agree with this part; I'm not aware of any rule that you are only allowed to have one MachineSet. So the user need not do this step.

So really, the reverse conversion of the spec is the only part of this we would be dropping, right?

Technically I only asked about forbidding a change of the authoritativeAPI field from CAPI to MAPI.
But you're right that this also calls into question the value of reverse-migrating the spec if nothing is going to use it.

Really though, I'd love to do so much less than that.


## Proposal

We will implement a two-way sync controller that synchronises Machines and related resources between the Machine API and Cluster API equivalents.
Member

I can't help but wonder what would happen if we... didn't.

We don't do this with any other kind of resource that gets deprecated. For example, prior to 4.13 the installer created ImageContentSourcePolicy CRs for the mirror config, but since then it creates ImageDigestMirrorSet CRs instead. You can still create ICSPs and they'll work, and upgraded clusters with those resources continue to be fine, but that's not the default way we manage mirrors any more. If we were to take a similar approach, plus create a tool that does rolling migrations of MachineSets - initially on demand, and eventually compulsorily before upgrade or automatically - what would be the implications of that?
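
For anyone who hasn't seen those two resources side by side, they look roughly like this (field names recalled from memory, so treat the details as illustrative):

```yaml
# Older resource: still honoured on upgraded clusters.
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-mirrors
spec:
  repositoryDigestMirrors:
    - source: registry.example.com/team/app      # hypothetical registries
      mirrors:
        - mirror.example.com/team/app
---
# Newer resource: the default the installer creates since 4.13.
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: example-mirrors
spec:
  imageDigestMirrors:
    - source: registry.example.com/team/app
      mirrors:
        - mirror.example.com/team/app
```
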

Contributor Author

There are several things we want to achieve with this project, one of which is removing the Machine controllers for Machine API, so at some point, we will need something that implements a conversion from Machine API resources to Cluster API resources, since we aren't going to be allowed to remove the API within a major OpenShift version change.

I think switching over to installing CAPI only will happen at some point, but that's orthogonal to the fact that we will need a conversion mechanism somehow, and that conversion mechanism needs to account for not only existing MachineSets, but also control plane nodes, across multiple platforms. This is far more complex and risky than something like the registry mirroring.

If we don't migrate users, we have to maintain both sets of controllers forever, which is not an option.


##### Resource created as Cluster API Resource, wish to migrate to Machine API Resource

To enable a reverse migration, should the customer wish to perform one, we will need to allow the user to create a Machine API equivalent of a Cluster API resource if one does not already exist.
Member

I can't imagine any justification for allowing this.

Also, worth noting that CAPI can support multiple infrastructure providers simultaneously, but MAPI cannot. So at best this is only possible for InfrastructureTemplates that match the platform of the cluster.

Contributor Author

Within the openshift-cluster-api namespace, we expect the resources to be only for the cluster itself and not for any guest clusters.

Since we don't today support multiple platforms for a cluster, there should be no reason to have a non-matching Infrastructure template.

As for the reasons for this: we expect customers are not going to want to deal with two APIs at once for an extended period. If they test the infrastructure migration and decide they've hit a roadblock, we want to allow them to go backwards. Given that other parts of this enhancement allow removing the MAPI equivalent objects for existing CAPI resources, this is the counter to that and the get-out-of-jail-free card.
