diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md
index 51459c6f1..29e24ea23 100644
--- a/docs/gitbook/SUMMARY.md
+++ b/docs/gitbook/SUMMARY.md
@@ -1,9 +1,16 @@
# Table of contents

* [Introduction](README.md)
+* [How it works](how-it-works.md)

## Install

* [Installing Flagger](install-1/installing-flagger.md)
* [Installing Grafana](install-1/installing-grafana.md)

+## Usage
+
+* [Progressive Delivery](usage/progressive-delivery.md)
+* [Monitoring](usage/monitoring.md)
+* [Alerting](usage/alerting.md)

diff --git a/docs/gitbook/how-it-works.md b/docs/gitbook/how-it-works.md
new file mode 100644
index 000000000..4d244c56f
--- /dev/null
+++ b/docs/gitbook/how-it-works.md
@@ -0,0 +1,147 @@
---
description: Automated canary deployment process
---

# How it works

[Flagger](https://github.com/stefanprodan/flagger) takes a Kubernetes deployment and optionally a horizontal pod autoscaler \(HPA\) and creates a series of objects \(Kubernetes deployments, ClusterIP services and Istio virtual services\) to drive the canary analysis and promotion.

![flagger-canary-hpa](https://github.com/raw/stefanprodan/flagger/master/docs/diagrams/flagger-canary-hpa.png)

### Canary Custom Resource

For a deployment named _**podinfo**_, a canary promotion can be defined using Flagger's custom resource:

```yaml
apiVersion: flagger.app/v1alpha1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # container port
    port: 9898
    # Istio gateways (optional)
    gateways:
    - public-gateway.istio-system.svc.cluster.local
    # Istio virtual service host names (optional)
    hosts:
    - app.istio.weavedx.com
  canaryAnalysis:
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: istio_requests_total
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      threshold: 99
      interval: 1m
    - name: istio_request_duration_seconds_bucket
      # maximum req duration P99
      # milliseconds
      threshold: 500
      interval: 30s
```

### Canary Deployment

![flagger-canary-steps](https://github.com/raw/stefanprodan/flagger/master/docs/diagrams/flagger-canary-steps.png)

Gated canary promotion stages:

* scan for canary deployments
* create the primary deployment if needed
* check that the Istio virtual service routes are mapped to the primary and canary ClusterIP services
* check the primary and canary deployments status
  * halt advancement if a rolling update is underway
  * halt advancement if pods are unhealthy
* increase the canary traffic weight from 0% to 5% \(step weight\)
* check the canary HTTP request success rate and latency
  * halt advancement if any metric is outside the specified threshold and increment the failed checks counter
* check if the number of failed checks reached the threshold
  * route all traffic to the primary
  * scale the canary deployment to zero and mark it as failed
  * wait for the canary deployment to be updated \(revision bump\) and start over
* increase the canary traffic weight by 5% \(step weight\) until it reaches 50% \(max weight\)
  * halt advancement while the canary request success rate is under the threshold
  * halt advancement while the canary request duration P99 is over the threshold
  * halt advancement if the primary or canary deployment becomes unhealthy
  * halt advancement while the canary deployment is being scaled up or down by the HPA
* promote the canary to primary
  * copy the canary deployment spec template over the primary
* wait for the primary rolling update to finish
  * halt advancement if pods are unhealthy
* route all traffic to the primary
* scale the canary deployment to zero
* mark the rollout as finished
* wait for the canary deployment to be updated \(revision bump\) and start over

You can change the canary analysis _max weight_ and the _step weight_ percentage in Flagger's custom resource.
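During the analysis Flagger shifts traffic by adjusting the route weights on the generated Istio virtual service. As an illustration only \(Flagger manages this object for you, and the host names below are assumed from the generated primary and canary ClusterIP services\), a virtual service mid-analysis could look like this:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: podinfo
  namespace: test
spec:
  gateways:
  - public-gateway.istio-system.svc.cluster.local
  hosts:
  - app.istio.weavedx.com
  http:
  - route:
    # 90% of the traffic stays on the primary workload
    - destination:
        host: podinfo-primary
      weight: 90
    # 10% is shifted to the canary under analysis
    - destination:
        host: podinfo-canary
      weight: 10
```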
### Canary Analysis

The canary analysis uses the following PromQL queries:

_HTTP request success rate percentage_

```javascript
sum(
    rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace=~"$namespace",
          destination_workload=~"$workload",
          response_code!~"5.*"
        }[$interval]
    )
)
/
sum(
    rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace=~"$namespace",
          destination_workload=~"$workload"
        }[$interval]
    )
)
```

_HTTP request duration P99 in milliseconds_

```javascript
histogram_quantile(0.99,
  sum(
    irate(
      istio_request_duration_seconds_bucket{
        reporter="destination",
        destination_workload=~"$workload",
        destination_workload_namespace=~"$namespace"
      }[$interval]
    )
  ) by (le)
)
```

diff --git a/docs/gitbook/usage/alerting.md b/docs/gitbook/usage/alerting.md
new file mode 100644
index 000000000..7fd0e2907
--- /dev/null
+++ b/docs/gitbook/usage/alerting.md
@@ -0,0 +1,41 @@
---
description: Slack & Alertmanager
---

# Alerting

### Slack

Flagger can be configured to send Slack notifications:

```bash
helm upgrade -i flagger flagger/flagger \
--namespace=istio-system \
--set slack.url=https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK \
--set slack.channel=general \
--set slack.user=flagger
```

Once configured with a Slack incoming **webhook**, Flagger will post messages when a canary deployment has been initialised, when a new revision has been detected, and when the canary analysis fails or succeeds.

![flagger-slack](https://github.com/raw/stefanprodan/flagger/master/docs/screens/slack-canary-notifications.png)

A canary deployment will be rolled back if the progress deadline is exceeded or if the analysis reaches the maximum number of failed checks:

![flagger-slack-errors](https://github.com/raw/stefanprodan/flagger/master/docs/screens/slack-canary-failed.png)

### Prometheus Alertmanager

Besides Slack, you can use Alertmanager to trigger alerts when a canary deployment fails:

```yaml
  - alert: canary_rollback
    expr: flagger_canary_status > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Canary failed"
      description: "Workload {{ $labels.name }} namespace {{ $labels.namespace }}"
```
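If your Prometheus is managed by the Prometheus Operator, the same rule can be loaded as a PrometheusRule custom resource. This is a minimal sketch; the resource name and the label matched by your Prometheus rule selector are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # hypothetical name, pick one that fits your conventions
  name: flagger-canary
  namespace: istio-system
  labels:
    # must match the ruleSelector of your Prometheus resource
    prometheus: istio
spec:
  groups:
  - name: flagger.rules
    rules:
    - alert: canary_rollback
      expr: flagger_canary_status > 1
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Canary failed"
        description: "Workload {{ $labels.name }} namespace {{ $labels.namespace }}"
```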
diff --git a/docs/gitbook/usage/monitoring.md b/docs/gitbook/usage/monitoring.md
new file mode 100644
index 000000000..04f5d89aa
--- /dev/null
+++ b/docs/gitbook/usage/monitoring.md
@@ -0,0 +1,73 @@
---
description: Metrics & Logging
---

# Monitoring

### Grafana

Flagger comes with a Grafana dashboard made for canary analysis. Install Grafana with Helm:

```bash
helm upgrade -i flagger-grafana flagger/grafana \
--namespace=istio-system \
--set url=http://prometheus:9090 \
--set user=admin \
--set password=admin
```

The dashboard shows the RED and USE metrics for the primary and canary workloads:

![canary dashboard](https://github.com/raw/stefanprodan/flagger/master/docs/screens/grafana-canary-analysis.png)

### Logging

Canary errors and latency spikes are recorded as Kubernetes events and logged by Flagger in JSON format:

```text
kubectl -n istio-system logs deployment/flagger --tail=100 | jq .msg

Starting canary deployment for podinfo.test
Advance podinfo.test canary weight 5
Advance podinfo.test canary weight 10
Advance podinfo.test canary weight 15
Advance podinfo.test canary weight 20
Advance podinfo.test canary weight 25
Advance podinfo.test canary weight 30
Advance podinfo.test canary weight 35
Halt podinfo.test advancement success rate 98.69% < 99%
Advance podinfo.test canary weight 40
Halt podinfo.test advancement request duration 1.515s > 500ms
Advance podinfo.test canary weight 45
Advance podinfo.test canary weight 50
Copying podinfo.test template spec to podinfo-primary.test
Halt podinfo-primary.test advancement waiting for rollout to finish: 1 old replicas are pending termination
Scaling down podinfo.test
Promotion completed! podinfo.test
```

### Metrics

Flagger exposes Prometheus metrics that can be used to determine the canary analysis status and the destination weight values:

```bash
# Canaries total gauge
flagger_canary_total{namespace="test"} 1

# Canary promotion last known status gauge
# 0 - running, 1 - successful, 2 - failed
flagger_canary_status{name="podinfo",namespace="test"} 1

# Canary traffic weight gauge
flagger_canary_weight{workload="podinfo-primary",namespace="test"} 95
flagger_canary_weight{workload="podinfo",namespace="test"} 5

# Seconds spent performing canary analysis histogram
flagger_canary_duration_seconds_bucket{name="podinfo",namespace="test",le="10"} 6
flagger_canary_duration_seconds_bucket{name="podinfo",namespace="test",le="+Inf"} 6
flagger_canary_duration_seconds_sum{name="podinfo",namespace="test"} 17.3561329
flagger_canary_duration_seconds_count{name="podinfo",namespace="test"} 6
```
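To inspect the raw metrics yourself, you can port-forward to the Flagger deployment; this sketch assumes Flagger serves its metrics on the default port 8080:

```bash
# forward the Flagger metrics port to localhost (assumed default: 8080)
kubectl -n istio-system port-forward deployment/flagger 8080:8080 &

# scrape the endpoint and keep only the canary gauges and histograms
curl -s http://localhost:8080/metrics | grep ^flagger_canary
```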
diff --git a/docs/gitbook/usage/progressive-delivery.md b/docs/gitbook/usage/progressive-delivery.md
new file mode 100644
index 000000000..e9dd0a698
--- /dev/null
+++ b/docs/gitbook/usage/progressive-delivery.md
@@ -0,0 +1,187 @@
---
description: 'Automated canary analysis, promotions and rollbacks walkthrough'
---

# Progressive Delivery

This guide shows you how to use Istio and Flagger to automate canary deployments.

Create a test namespace with Istio sidecar injection enabled:

```bash
export REPO=https://github.com/raw/stefanprodan/flagger/master

kubectl apply -f ${REPO}/artifacts/namespaces/test.yaml
```

Create a deployment and a horizontal pod autoscaler:

```bash
kubectl apply -f ${REPO}/artifacts/canaries/deployment.yaml
kubectl apply -f ${REPO}/artifacts/canaries/hpa.yaml
```

Create a canary custom resource \(replace example.com with your own domain\):

```yaml
apiVersion: flagger.app/v1alpha1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # container port
    port: 9898
    # Istio gateways (optional)
    gateways:
    - public-gateway.istio-system.svc.cluster.local
    # Istio virtual service host names (optional)
    hosts:
    - app.example.com
  canaryAnalysis:
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: istio_requests_total
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      threshold: 99
      interval: 1m
    - name: istio_request_duration_seconds_bucket
      # maximum req duration P99
      # milliseconds
      threshold: 500
      interval: 30s
```

Save the above resource as podinfo-canary.yaml and then apply it:

```bash
kubectl apply -f ./podinfo-canary.yaml
```

After a couple of seconds Flagger will create the canary objects:

```bash
# applied
deployment.apps/podinfo
horizontalpodautoscaler.autoscaling/podinfo
canary.flagger.app/podinfo

# generated
deployment.apps/podinfo-primary
horizontalpodautoscaler.autoscaling/podinfo-primary
service/podinfo
service/podinfo-canary
service/podinfo-primary
virtualservice.networking.istio.io/podinfo
```

Trigger a canary deployment by updating the container image:

```bash
kubectl -n test set image deployment/podinfo \
podinfod=quay.io/stefanprodan/podinfo:1.2.1
```

Flagger detects that the deployment revision changed and starts a new rollout:

```text
kubectl -n test describe canary/podinfo

Status:
  Canary Revision:  19871136
  Failed Checks:    0
  State:            finished
Events:
  Type     Reason  Age   From     Message
  ----     ------  ----  ----     -------
  Normal   Synced  3m    flagger  New revision detected podinfo.test
  Normal   Synced  3m    flagger  Scaling up podinfo.test
  Warning  Synced  3m    flagger  Waiting for podinfo.test rollout to finish: 0 of 1 updated replicas are available
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 5
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 10
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 15
  Normal   Synced  2m    flagger  Advance podinfo.test canary weight 20
  Normal   Synced  2m    flagger  Advance podinfo.test canary weight 25
  Normal   Synced  1m    flagger  Advance podinfo.test canary weight 30
  Normal   Synced  1m    flagger  Advance podinfo.test canary weight 35
  Normal   Synced  55s   flagger  Advance podinfo.test canary weight 40
  Normal   Synced  45s   flagger  Advance podinfo.test canary weight 45
  Normal   Synced  35s   flagger  Advance podinfo.test canary weight 50
  Normal   Synced  25s   flagger  Copying podinfo.test template spec to podinfo-primary.test
  Warning  Synced  15s   flagger  Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
  Normal   Synced  5s    flagger  Promotion completed! Scaling down podinfo.test
```
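You can also follow the rollout from the command line; assuming the Canary CRD registers a canaries plural, a watch on the custom resources shows the state and weight transitions \(the printed columns depend on the CRD version you run\):

```bash
watch kubectl -n test get canaries
```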
During the canary analysis you can generate HTTP 500 errors and high latency to test whether Flagger pauses the rollout.

Create a tester pod and exec into it:

```bash
kubectl -n test run tester \
--image=quay.io/stefanprodan/podinfo:1.2.1 \
-- ./podinfo --port=9898

kubectl -n test exec -it tester-xx-xx sh
```

Generate HTTP 500 errors:

```bash
watch curl http://podinfo-canary:9898/status/500
```

Generate latency:

```bash
watch curl http://podinfo-canary:9898/delay/1
```

When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero and the rollout is marked as failed:

```text
kubectl -n test describe canary/podinfo

Status:
  Canary Revision:  16695041
  Failed Checks:    10
  State:            failed
Events:
  Type     Reason  Age   From     Message
  ----     ------  ----  ----     -------
  Normal   Synced  3m    flagger  Starting canary deployment for podinfo.test
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 5
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 10
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 15
  Normal   Synced  3m    flagger  Halt podinfo.test advancement success rate 69.17% < 99%
  Normal   Synced  2m    flagger  Halt podinfo.test advancement success rate 61.39% < 99%
  Normal   Synced  2m    flagger  Halt podinfo.test advancement success rate 55.06% < 99%
  Normal   Synced  2m    flagger  Halt podinfo.test advancement success rate 47.00% < 99%
  Normal   Synced  2m    flagger  (combined from similar events): Halt podinfo.test advancement success rate 38.08% < 99%
  Warning  Synced  1m    flagger  Rolling back podinfo.test failed checks threshold reached 10
  Warning  Synced  1m    flagger  Canary failed! Scaling down podinfo.test
```
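A failed rollout stays parked until the target deployment changes again: bumping the container image starts a fresh canary analysis. For example \(the 1.2.2 tag is hypothetical\):

```bash
# push a new revision to restart the analysis after a failed rollout
kubectl -n test set image deployment/podinfo \
podinfod=quay.io/stefanprodan/podinfo:1.2.2
```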