This repository has been archived by the owner on Jul 15, 2024. It is now read-only.

Proposal: ApplicationSet Progressive Rollout #61

Open
maruina opened this issue Nov 20, 2020 · 6 comments

@maruina
Contributor

maruina commented Nov 20, 2020

Hi all,
I'm opening this issue because I'd like to discuss with the community a possible solution for what I think is a common issue when doing GitOps.

Consider the following scenario, where you have multiple production clusters across multiple regions. When you use the ApplicationSet, all the Applications are updated at the same time, and with the automated SyncPolicy you are effectively doing a global rollout.

What I would like is to introduce the concept of a progressive rollout, where you can decide how to roll out your application across those production clusters.

I can see two possible implementations of this idea:

  1. Extend the ApplicationSet specification to support the progressive rollout. The ApplicationSet controller will then be responsible for updating the desired Applications in the desired order. For example, you might want to update one region first, or update 10% of all your clusters.
  2. Create a new controller with a new CRD dealing with the progressive rollout. The controller watches for ApplicationSets and does something like "argocd app sync <MYAPP>" and "argocd app wait <MYAPP>".

I think the second approach is much cleaner and doesn't overload the ApplicationSet controller, but I'd like to hear the community's thoughts on this.
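
To make the second option more concrete, here is a minimal Python sketch of the sync-then-wait loop, using only the two CLI commands mentioned above. The function names, the `run` parameter, and the application names are hypothetical, and it assumes the `argocd` CLI is installed and logged in. A real controller would more likely talk to the Argo CD API than shell out.

```python
import subprocess
from typing import Callable, List, Sequence


def run_cli(args: Sequence[str]) -> int:
    """Run a command and return its exit code (assumes argocd is on PATH)."""
    return subprocess.run(list(args), check=False).returncode


def progressive_sync(
    apps: Sequence[str],
    run: Callable[[Sequence[str]], int] = run_cli,
) -> List[str]:
    """Sync applications one by one and stop at the first failure.

    Returns the names of the applications that completed both steps.
    """
    done: List[str] = []
    for app in apps:
        # Trigger the sync, then block until the CLI reports the app as ready.
        if run(["argocd", "app", "sync", app]) != 0:
            break
        if run(["argocd", "app", "wait", app]) != 0:
            break
        done.append(app)
    return done
```

Passing a fake `run` function makes it possible to try the ordering logic without a cluster.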

@RichiCoder1

I'm pretty interested in this, and there was actually some discussion about it in the Slack: https://argoproj.slack.com/archives/C014ZPM32LU/p1600274721058300

It sounded like the conclusion was a new CRD, but one that would possibly be a core part of ArgoCD: argoproj/argo-cd#1283

@maruina
Contributor Author

maruina commented Nov 21, 2020

Thanks, I had a look at the Slack conversation and at the ApplicationSync idea. I think it's a good one, but I'm not sure how it would interact with the ApplicationSet.

I still think that a new controller makes sense, because it will allow you to iterate and move fast in the same way we're doing for the AppSet controller.

We also have quite different requirements in controlling the application rollout. For example:

  • All the clusters in the same region are equal. It doesn't matter if you deploy to clusterA first and then to clusterB.
  • The order of regions matters. You might have the majority of your customers in EMEA, so you might want to roll out a new version of your app to APAC first.
  • The controller needs to be able to skip and requeue a cluster. Imagine that a cluster is briefly under maintenance, drained, or otherwise unavailable: you may not want to fail the whole rollout, just retry the affected cluster later.
  • There should be a concept of bake time, to let your new deployment soak before moving forward. This might be useful to mitigate bugs that appear only after a while or under specific traffic conditions.
  • Support for metric checks, to see if it's safe to proceed.
  • Support for hooks, so you can call a third-party API before moving on.
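
The ordering and skip-and-requeue requirements above can be sketched as follows. This is only an illustration under stated assumptions: `rollout_order`, `is_available`, and `max_attempts` are hypothetical names, the availability check stands in for whatever health signal a controller would use, and bake time, metric checks, and hooks are left out.

```python
from collections import deque
from typing import Callable, Dict, List


def rollout_order(
    regions: List[str],
    clusters_by_region: Dict[str, List[str]],
    is_available: Callable[[str], bool],
    max_attempts: int = 3,
) -> List[str]:
    """Return the order in which clusters would be updated.

    Regions are handled in the given order. Inside a region, an unavailable
    cluster is moved to the back of the queue and checked again later, up to
    max_attempts checks per cluster. A cluster that uses up its attempts
    fails the whole rollout.
    """
    updated: List[str] = []
    for region in regions:
        queue = deque((cluster, 1) for cluster in clusters_by_region.get(region, []))
        while queue:
            cluster, attempt = queue.popleft()
            if is_available(cluster):
                # A real controller would sync the cluster and bake here.
                updated.append(cluster)
            elif attempt < max_attempts:
                # Skip for now and retry after the other clusters in the region.
                # A real controller would also wait for a retry interval.
                queue.append((cluster, attempt + 1))
            else:
                raise RuntimeError(
                    f"{cluster} unavailable after {max_attempts} attempts"
                )
    return updated
```

Because clusters within a region are treated as interchangeable, requeueing one of them changes only the order inside that region and never the order of regions.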

This is why I think a separate controller might make more sense. The CRD could be something like:

kind: ProgressiveRollout
metadata:
  name: myrollout
  namespace: argocd
spec:
  # A Kubernetes object representing the ApplicationSet
  # A change in this object will trigger a new progressive rollout
  applicationSetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    name: myappset
  rollout:
    # the type of the rollout strategy
    strategy: canary
    canary:
      # The ordered list of regions to use as canary
      regions:
        - eu-central-1
        - ap-southeast-1
      # (optional) The number of regions that can be updated in parallel. Defaults to 1.
      parallelRegions: 1
      # (optional) The number of zones that can be updated in parallel. Defaults to 1.
      parallelZones: 1
      # (optional) The number of clusters that can be updated in parallel. Defaults to 1.
      parallelClusters: 1
      # (optional) The maximum number of clusters used as canary, per region. Defaults to 1.
      maxClusters: 1
      # (optional) The time to wait after a region is completed. Defaults to 0.
      bakeTimeRegion: 1h
      # (optional) The time to wait after a zone is completed. Defaults to 0.
      bakeTimeZone: 30m
      # (optional) The time to wait after a cluster is completed. Defaults to 0.
      bakeTimeCluster: 10m
    primary:
      regions:
        - ap-northeast-1
        - eu-west-1
        - eu-central-1
        - ap-southeast-1
      parallelRegions: 2
      parallelZones: 3
      parallelClusters: 1
      bakeTimeRegion: 2h
      bakeTimeZone: 1m
      bakeTimeCluster: 10m
  # (optional)
  retries:
    # (optional) The number of retries per cluster before marking the ProgressiveRollout as failed. Defaults to 1.
    # A value of -1 means infinite retries.
    attempts: 3
    # (optional) The retry interval. Defaults to 10m.
    interval: 30m
  # (optional)
  metrics:
    - name: cluster-drained
      type: pre-deployment-cluster
      thresholdRange:
        min: 1
      interval: 5m
    # minimum req success rate (non 5xx responses)
    - name: region-request-success-rate
      type: post-deployment-region
      # percentage (0-100)
      thresholdRange:
        min: 99
    - name: myapp-custom-check
      type: post-bake-time-region
      templateRef:
        # Name of the MetricTemplate
        name: myapp-custom-check
        # (optional) The namespace where the metric check lives. Default to the operator namespace.
        namespace: mynamespace
      # accepted values
      thresholdRange:
        min: 10
        max: 1000
      # metric query time window
      interval: 5m
  # (optional)
  webhooks:
    - name: "regional load test"
      type: post-bake-time-region
      url: http://load-test-service.example.com
      timeout: 15s
      retries: 3
      metadata:
        cmd: "hey -z 1m -q 5 -c 2 http://myapp.example.com"
  # (optional)
  alerts:
    - name: "on-call Slack"
      severity: error
      providerRef:
        name: on-call-slack
        namespace: rollout-system
    - name: "info Slack"
      severity: info
      providerRef:
        name: info-slack

Note that this heavily mimics Flagger, but tries to extend it.
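
As a rough sketch of how two of the fields in the spec above could be interpreted, the Python below checks a metric value against a `thresholdRange` and splits an ordered region list into groups according to `parallelRegions`. The function names are hypothetical, and treating `min` and `max` as inclusive bounds is an assumption, not something the proposal specifies.

```python
from typing import List, Optional, Sequence


def within_threshold(
    value: float,
    minimum: Optional[float] = None,
    maximum: Optional[float] = None,
) -> bool:
    """Check a metric value against optional min/max bounds (assumed inclusive)."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


def batches(items: Sequence[str], parallel: int = 1) -> List[List[str]]:
    """Split an ordered list (e.g. regions) into groups updated in parallel."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return [list(items[i:i + parallel]) for i in range(0, len(items), parallel)]
```

For example, with the primary regions listed above and `parallelRegions: 2`, `batches` would yield two groups of two regions each, processed one group after the other.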

@OmerKahani
Contributor

@maruina currently ApplicationSet only handles the installation of the Application CRD. Is the solution you are looking for meant for the installation and the first sync stage, or for the day-to-day sync?

@maruina
Contributor Author

maruina commented Nov 22, 2020

Day-to-day sync. Every time the AppSet creates/updates the Application(s), something should take care of their synchronization.

@maruina
Contributor Author

maruina commented Dec 19, 2020

Hi all, I created a PoC for a progressive rollout controller using ApplicationSet. You can find it here: https://github.com/maruina/argocd-progressive-rollout-controller

Any feedback is very much appreciated :)

@ghostsquad

Has any progress been made on this front? I believe this to be a major blocker for adopting ApplicationSet.
