K8S operator for Azure VMSS/VM for automatic repair and update


Azure Kubernetes Autopilot

Kubernetes service for automatic maintenance of an Azure cluster.

  • auto repair (repair nodes if NotReady; VM and VMSS support)
  • auto update (update VMSS instances automatically to latest model; only VMSS)

Supports Azure AKS and custom Azure Kubernetes clusters.

Supports shoutrrr notifications.

(Successor of azure-k8s-autorepair)


  azure-k8s-autopilot [OPTIONS]

Application Options:
      --log.debug                                                  debug mode [$LOG_DEBUG]
      --log.devel                                                  development mode [$LOG_DEVEL]
      --log.json                                                   Switch log output to json format [$LOG_JSON]
      --dry-run                                                    Dry run (no redeploy triggered) [$DRY_RUN]
      --instance.nodename=                                         Name of node where autopilot is running [$INSTANCE_NODENAME]
      --instance.namespace=                                        Name of namespace where autopilot is running [$INSTANCE_NAMESPACE]
      --instance.pod=                                              Name of pod where autopilot is running [$INSTANCE_POD]
      --azure.environment=                                         Azure environment name (default: AZUREPUBLICCLOUD) [$AZURE_ENVIRONMENT]
      --autoscaler.scaledown-locktime=                             Prevents cluster autoscaler from scaling down the affected node after
                                                                   update and repair (default: 60m) [$AUTOSCALER_SCALEDOWN_LOCKTIME]
      --kube.node.labelselector=                                   Node Label selector which nodes should be checked
      --lease.enable                                               Enable lease (leader election; enabled by default in docker images)
                                                                   [$LEASE_ENABLE]
                                                                   Name of lease lock (default: azure-k8s-autopilot-leader) [$LEASE_NAME]
      --repair.crontab=                                            Crontab of check runs (default: @every 2m) [$REPAIR_CRONTAB]
      --repair.notready-threshold=                                 Threshold (duration) when the automatic repair should be tried (eg.
                                                                   after 10 mins of NotReady state after last successfull heartbeat)
                                                                   (default: 10m) [$REPAIR_NOTREADY_THRESHOLD]
      --repair.concurrency=                                        How many VMs should be redeployed concurrently (default: 1)
      --repair.lock-duration=                                      Duration how long should be waited for another redeploy on the same node (default: 30m)
      --repair.lock-duration-error=                                Duration how long should be waited for another redeploy on the same node in case an error
                                                                   occurred (default: 5m) [$REPAIR_LOCK_DURATION_ERROR]
      [restart|redeploy|reimage|delete]                            Defines the action which should be tried to repair the node (VMSS)
                                                                   (default: redeploy) [$REPAIR_AZURE_VMSS_ACTION]
      [restart|redeploy]                                           Defines the action which should be tried to repair the node (VM)
                                                                   (default: redeploy) [$REPAIR_AZURE_VM_ACTION]
                                                                   Azure VM provisioning states where repair should be tried (eg. avoid
                                                                   repair in "upgrading" state; "*" to accept all states) (default:
                                                                   succeeded, failed) [$REPAIR_AZURE_PROVISIONINGSTATE]
      --repair.lock-annotation=                                    Node annotation for repair lock time (default:
      --update.crontab=                                            Crontab of check runs (default: @every 15m) [$UPDATE_CRONTAB]
      --update.concurrency=                                        How many VMs should be updated concurrently (default: 1)
      --update.lock-duration=                                      Duration how long should be waited for another update on the same node (default: 15m)
      --update.lock-duration-error=                                Duration how long should be waited for another update on the same node in case an error
                                                                   occurred (default: 5m) [$UPDATE_LOCK_DURATION_ERROR]
      --update.lock-annotation=                                    Node annotation for update lock time (default:
      --update.ongoing-annotation=                                 Node annotation for ongoing update lock (default:
      --update.exclude-annotation=                                 Node annotation for excluding node for updates (default:
                                                                   [$UPDATE_EXCLUDE_ANNOTATION]
      [update|update+reimage|delete]                               Defines the action which should be tried to update the node (VMSS)
                                                                   (default: update+reimage) [$UPDATE_AZURE_VMSS_ACTION]
                                                                   Azure VM provisioning states where update should be tried (eg. avoid
                                                                   repair in "upgrading" state; "*" to accept all states) (default:
                                                                   succeeded, failed) [$UPDATE_AZURE_PROVISIONINGSTATE]
      --update.failed-threshold=                                   Failed node threshold when node update is stopped (default: 2)
      --drain.kubectl=                                             Path to kubectl binary (default: kubectl) [$DRAIN_KUBECTL]
      --drain.enable                                               Enable drain handling [$DRAIN_ENABLE]
      --drain.delete-emptydir-data                                 Continue even if there are pods using emptyDir (local emptydir that will
                                                                   be deleted when the node is drained) [$DRAIN_DELETE_EMPTYDIR_DATA]
      --drain.force                                                Continue even if there are pods not managed by a ReplicationController,
                                                                   ReplicaSet, Job, DaemonSet or StatefulSet [$DRAIN_FORCE]
      --drain.grace-period=                                        Period of time in seconds given to each pod to terminate gracefully. If
                                                                   negative, the default value specified in the pod will be used.
      --drain.ignore-daemonsets                                    Ignore DaemonSet-managed pods. [$DRAIN_IGNORE_DAEMONSETS]
      --drain.pod-selector=                                        Label selector to filter pods on the node [$DRAIN_POD_SELECTOR]
      --drain.timeout=                                             The length of time to wait before giving up, zero means infinite
                                                                   (default: 0s) [$DRAIN_TIMEOUT]
      --drain.wait-after=                                          Wait after drain to let Kubernetes detach volumes etc (default: 30s)
      --drain.dry-run                                              Do not drain, uncordon or label any node [$DRAIN_DRY_RUN]
      --drain.disable-eviction                                     Force drain to use delete, even if eviction is supported. This will
                                                                   bypass checking PodDisruptionBudgets, use with caution.
      --drain.retry-without-eviction                               Retry drain without eviction if first drain failed
      --drain.ignore-failure                                       Ignore failed drain and continue with actions [$DRAIN_IGNORE_FAILURE]
      --notification=                                              Shoutrrr url for notifications (
      --server.bind=                                               Server address (default: :8080) [$SERVER_BIND]
                                                                   Server read timeout (default: 5s) [$SERVER_TIMEOUT_READ]
      --server.timeout.write=                                      Server write timeout (default: 10s) [$SERVER_TIMEOUT_WRITE]

Help Options:
  -h, --help                                                       Show this help message
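
The repair options above interact: a node has to be NotReady for longer than `--repair.notready-threshold`, and it must not have been repaired within `--repair.lock-duration`. The following is only a rough sketch of that decision, using the documented defaults (10m and 30m). The function name `should_repair` and its parameters are hypothetical and are not taken from the autopilot source; the real implementation may check more conditions (for example the Azure provisioning state).

```python
from datetime import datetime, timedelta
from typing import Optional

# Defaults copied from the help output above.
NOTREADY_THRESHOLD = timedelta(minutes=10)  # --repair.notready-threshold
LOCK_DURATION = timedelta(minutes=30)       # --repair.lock-duration


def should_repair(
    ready: bool,
    last_heartbeat: datetime,
    last_repair: Optional[datetime],
    now: datetime,
) -> bool:
    """Hypothetical helper: decide whether a node qualifies for a repair attempt."""
    if ready:
        return False
    # The node must have been NotReady for at least the threshold.
    if now - last_heartbeat < NOTREADY_THRESHOLD:
        return False
    # Respect the per-node lock after a previous repair attempt.
    if last_repair is not None and now - last_repair < LOCK_DURATION:
        return False
    return True
```

With these defaults, a node whose last heartbeat is 15 minutes old and that has never been repaired would qualify, while the same node repaired 10 minutes ago would be skipped until the lock expires.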

Azure API authentication is configured using environment variables.

For Kubernetes, the ServiceAccount is discovered automatically (or you can set the KUBECONFIG environment variable to the path of your kubeconfig file).
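
Since every option shown above with a `[$NAME]` suffix can also be set through that environment variable, a trial run outside the cluster can be driven entirely from the environment. The sketch below assumes the `azure-k8s-autopilot` binary is on `PATH`, that boolean variables such as `DRY_RUN` accept the value `true`, and that Azure credentials are already present in the calling environment. The helper names `autopilot_overrides` and `run_autopilot` are made up for this example.

```python
import os
import subprocess
from typing import Dict


def autopilot_overrides(kubeconfig: str, dry_run: bool = True) -> Dict[str, str]:
    """Hypothetical helper: environment overrides for a local trial run."""
    overrides = {
        "KUBECONFIG": kubeconfig,             # path to your kubeconfig file
        "REPAIR_CRONTAB": "@every 2m",        # documented default
        "REPAIR_NOTREADY_THRESHOLD": "10m",   # documented default
        "UPDATE_CRONTAB": "@every 15m",       # documented default
    }
    if dry_run:
        # Assumption: the boolean env var is parsed from "true".
        overrides["DRY_RUN"] = "true"
    return overrides


def run_autopilot(kubeconfig: str, dry_run: bool = True) -> int:
    """Start the binary with the overrides merged into the current environment."""
    env = {**os.environ, **autopilot_overrides(kubeconfig, dry_run)}
    # Assumption: the binary is installed and reachable via PATH.
    return subprocess.run(["azure-k8s-autopilot"], env=env, check=False).returncode
```

Keeping the overrides separate from `os.environ` makes it easy to inspect exactly which settings differ from the defaults before anything is started.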


(see :8080/metrics)

Metric                          Description
autopilot_repair_count          Count of repair actions
autopilot_repair_node_status    Node status
autopilot_repair_duration       Duration of repair task
autopilot_update_count          Count of update actions
autopilot_update_duration       Duration of last exec
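
To look only at the autopilot series, the metrics endpoint can be filtered by the `autopilot_` prefix. This is a minimal sketch, assuming the endpoint serves the usual Prometheus text format without trailing timestamps and that the service is reachable at `http://localhost:8080/metrics` (the host is an assumption; only the port and path come from this README). The sample payload, its values and its `example` label are invented for illustration.

```python
from typing import List, Tuple
from urllib.request import urlopen


def autopilot_samples(text: str) -> List[Tuple[str, float]]:
    """Return (series, value) pairs for lines starting with 'autopilot_'."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments and unrelated metrics.
        if not line or line.startswith("#") or not line.startswith("autopilot_"):
            continue
        # Assumption: no trailing timestamp, so the value is the last token.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


def fetch_autopilot_samples(url: str = "http://localhost:8080/metrics"):
    """Fetch the endpoint (assumed URL) and parse the autopilot series."""
    with urlopen(url, timeout=5) as response:
        return autopilot_samples(response.read().decode("utf-8"))


# Invented sample payload, not real output from the service.
SAMPLE = """# HELP autopilot_repair_count Count of repair actions
autopilot_repair_count 3
autopilot_update_count{example="made up"} 1
go_goroutines 42
"""
```

Calling `autopilot_samples(SAMPLE)` keeps the two `autopilot_` lines and drops the comment and the unrelated `go_goroutines` series.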

AzureTracing metrics

see armclient tracing documentation