This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

backup-operator: Support periodically backup (#1841) #2028

Merged
merged 19 commits on Feb 18, 2019

Conversation

ukinau
Contributor

@ukinau ukinau commented Dec 17, 2018

This PR implements periodic backup as discussed in #1841.
The specification is meant to be almost the same as what #1841 concluded.

After this PR is merged, users can enable periodic backup as follows:

apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdBackup
metadata:
  name: example-etcd-cluster-backup
  namespace: default
spec:
  backupPolicy:
    backupIntervalInSecond: 125  # new field
    maxBackups: 4                # new field
  clientTLSSecret: etcd-backup-cert
  etcdEndpoints:
  - https://<etcd-endpoint>
  s3:
    awsSecret: aws
    endpoint: https://<s3-endpoint>
    path: test/example
  storageType: S3

As you can see above, two new fields are added (a sketch of the extended spec follows the list):

  • backupIntervalInSecond
    • specifies how often a snapshot is taken. The example above takes a snapshot every 125 seconds.
  • maxBackups
    • specifies how many backups to keep. The example above keeps only the latest 4 backups. Once the number of backups exceeds maxBackups, backup-operator deletes the oldest backup files that share the same prefix.
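
Below is a minimal sketch, based on the field names in the YAML above and the commit messages below, of how the BackupPolicy type in pkg/apis/etcd/v1beta2 might be extended; the exact field types are assumptions and the generated types in the PR are authoritative.

package v1beta2

// BackupPolicy (sketch): only the two fields added by this PR are shown.
type BackupPolicy struct {
	// BackupIntervalInSecond is the interval, in seconds, between periodic backups.
	// 0 disables periodic backup and results in a single one-shot backup.
	BackupIntervalInSecond int64 `json:"backupIntervalInSecond,omitempty"`
	// MaxBackups is the maximum number of backup files to keep.
	// 0 means no limit: existing snapshots are never deleted.
	MaxBackups int `json:"maxBackups,omitempty"`
}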

close #1841

This commit adds BackupIntervalInSecond to BackupPolicy, which performs
periodic backup as issue coreos#1841 describes. This commit is part of coreos#1841.
By specifying BackupIntervalInSecond, the user can let etcd-backup-operator
perform periodic backups.

The specification of BackupIntervalInSecond is as follows:
 - the unit is seconds
 - >0 means the backup interval
 - =0 explicitly disables interval backup; only a one-time backup is taken
This commit implements validation of BackupIntervalInSecond.
After this commit, backup-operator will make sure BackupIntervalInSecond
follows these restrictions (see the sketch below):

- <0 is not allowed and fails validation
- 0 is valid and disables periodic backup
- >0 is valid and means the interval
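
A minimal sketch of that validation, assuming a standalone helper; the function name and error wording are illustrative, not the operator's actual code:

package validation

import "fmt"

// validateBackupIntervalInSecond is an illustrative helper enforcing the
// rules above: negative values are rejected, 0 disables periodic backup,
// and positive values are taken as the interval in seconds.
func validateBackupIntervalInSecond(interval int64) error {
	if interval < 0 {
		return fmt.Errorf("spec.backupPolicy.backupIntervalInSecond (%d) cannot be negative", interval)
	}
	return nil
}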
The current backup status is only designed for one-shot snapshots.
It always shows the latest results, but it would be nice if we could record
the last time a snapshot was successfully taken.
This commit adds a MaxBackups attribute, which lets backup-operator delete
older snapshots if the number of snapshots exceeds MaxBackups.

The specification of MaxBackups is as follows (a retention sketch follows the list):

 - <0 is not allowed, which causes validation failure
 - =0 indicates that MaxBackups is infinite; the operator does not delete any existing snapshots
 - >0 indicates the maximum number of snapshots
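
As a rough illustration of that retention rule (hypothetical helper; the real pruning goes through the backup writer, and the key-ordering assumption here is only for the sketch):

package backup

import "sort"

// pruneOldBackups is an illustrative helper: keep the newest maxBackups
// keys and delete the rest. It assumes keys sort oldest-first and takes
// a delete callback instead of a real storage client.
func pruneOldBackups(keys []string, maxBackups int, deleteKey func(string) error) error {
	if maxBackups <= 0 {
		// 0 means infinite: never delete existing snapshots.
		return nil
	}
	sort.Strings(keys)
	for len(keys) > maxBackups {
		if err := deleteKey(keys[0]); err != nil {
			return err
		}
		keys = keys[1:]
	}
	return nil
}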
After supporting periodic backup, backup-operator appended the revision number and
date to the S3 path, as in <bucket name>/<object name>_v<rev>_<date>.
This behaviour was applied even if the backup was a one-shot backup,
and therefore this change broke the existing behaviour.

This commit brings back the original behaviour, which uses the S3 path without
adding anything (<bucket name>/<object name>) if the backup is not periodic.
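
A hedged sketch of the resulting object-key scheme; the helper name and the exact date layout are illustrative, and only the <object name>_v<rev>_<date> shape comes from the commit message above:

package backup

import (
	"fmt"
	"time"
)

// backupObjectKey is an illustrative helper: periodic backups append a
// _v<rev>_<date> suffix, while one-shot backups keep the original path.
func backupObjectKey(basePath string, rev int64, now time.Time, periodic bool) string {
	if !periodic {
		return basePath // original one-shot behaviour: <bucket name>/<object name>
	}
	return fmt.Sprintf("%s_v%d_%s", basePath, rev, now.Format("2006-01-02-15-04-05"))
}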
@etcd-bot
Collaborator

Can one of the admins verify this patch?

@etcd-bot
Collaborator

Can one of the admins verify this patch?

@etcd-bot
Collaborator

Can one of the admins verify this patch?

@hexfusion
Member

@etcd-bot ok to test

Member

@hexfusion hexfusion left a comment


@ukinau thanks for the PR, a few nits. Also, can you update the codegen files? That should fix your failing unit test.

./hack/k8s/codegen/update-generated.sh

pkg/backup/backup_manager.go
pkg/backup/writer/writer.go
pkg/controller/backup-operator/sync.go
@hexfusion
Member

@etcd-bot retest this please

Member

@hexfusion hexfusion left a comment


A few more nits, and we are having multiple failures on CI; if you could look into that I would appreciate it.

pkg/backup/writer/s3_writer.go
pkg/backup/backup_manager.go
pkg/controller/backup-operator/abs_backup.go
pkg/controller/backup-operator/util.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
@hexfusion
Member

@ukinau good progress, thanks! We need to get to the bottom of this CI failure; can you take a look please?

pkg/apis/etcd/v1beta2/zz_generated.deepcopy.go:137:20: in.LastSuccessDate.DeepCopyInto undefined (type time.Time has no field or method DeepCopyInto)

@ukinau ukinau force-pushed the support-periodic-backup branch 3 times, most recently from 06fc1a9 to f84a3d5 on December 18, 2018 at 18:42
After I generated the code based on the k8s object
(zz_generated.deepcopy.go), the build started failing.
This is because all fields of a k8s custom resource should implement
DeepCopyInto, but the time.Time field we added doesn't implement it.
Instead we should have used meta/v1.Time, which implements all the
functions necessary for a k8s object while providing the same
functionality as time.Time.

This commit also includes some refactoring pointed out in code review.
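
A minimal sketch of the fix, assuming the status field is the LastSuccessDate mentioned in the CI error above; surrounding fields are omitted:

package v1beta2

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// BackupStatus (sketch): use metav1.Time rather than time.Time so the
// generated zz_generated.deepcopy.go compiles, since metav1.Time provides
// DeepCopyInto while time.Time does not.
type BackupStatus struct {
	// LastSuccessDate records when the last backup succeeded.
	LastSuccessDate metav1.Time `json:"lastSuccessDate,omitempty"`
}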
@ukinau
Contributor Author

ukinau commented Dec 18, 2018

@hexfusion
Thanks for your quick review, and sorry for taking some time. I fixed the "in.LastSuccessDate.DeepCopyInto undefined" problem and addressed the comments you left.
Now it seems all CI has passed.

@hexfusion
Member

@hasbro17 PTAL

@ukinau
Contributor Author

ukinau commented Jan 8, 2019

@hexfusion Is there anything I can do to get this PR merged faster?

@hexfusion
Member

@ukinau thank you for your patience; I hope to get a final review in by the end of the week.

@ukinau
Contributor Author

ukinau commented Jan 16, 2019

@hexfusion @hasbro17
I added the following:

  • CHANGELOG
  • example manifests
  • e2eslow test for periodic backup

but the CI for the e2eslow test seems to be having some other trouble that is not related to this change and looks like an environment issue. Could you check it out?

There seem to be some untracked files preventing the e2e test from running:

02:35:45 Please move or remove them before you switch branches.
02:35:45 error: The following untracked working tree files would be removed by checkout:
02:35:45  CHANGELOG-1.14.md
02:35:45  cmd/cloud-controller-manager/app/core.go
02:35:45  cmd/controller-manager/app/helper_test.go
02:35:45  hack/.shellcheck_failures
02:35:45  hack/verify-shellcheck.sh
02:35:45  pkg/scheduler/algorithm/priorities/types.go
02:35:45  pkg/scheduler/algorithm/priorities/types_test.go
02:35:45 Please move or remove them before you switch branches.
02:35:45 Aborting
02:35:45 : command failed: [git checkout 371d86631ea24e33427a59c144c0a879bfebca64]: exit status 1

@hexfusion
Member

@ukinau thanks for the update and your patience; I will take a hard look at this soon with the hope of getting it merged before the end of the week.

@hexfusion hexfusion added this to the v0.9.4 milestone Jan 21, 2019
@hexfusion
Member

@etcd-bot retest all

@hexfusion
Member

@etcd-bot retest this please

@hasbro17
Contributor

@ukinau This seems good to me. Apologies for the delay, but we've been having some issues with our Jenkins instances. Once we resolve that, we should be good to run the CI and merge.

@hexfusion We can sync up offline but it's an older issue that we've seen before: our instance is out of memory.

@hexfusion
Member

@hasbro17 I should be able to resolve Jenkins today.

@hexfusion
Member

@etcd-bot retest this please

@@ -78,6 +83,7 @@ func TestBackupAndRestore(t *testing.T) {
s3Path := path.Join(os.Getenv("TEST_S3_BUCKET"), "jenkins", suffix, time.Now().Format(time.RFC3339), "etcd.backup")

testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
Contributor


@ukinau I think I know why the test is failing. Previously we had the two subtests:

testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdRestoreOperatorForS3Source(t, clusterName, s3Path)

The first test creates the backup file at s3Path and the second expects the backup file to be present in order to restore a cluster.

With the periodic backup sub-test now in between the two, we are deleting all the backup files in a defer statement, so testEtcdRestoreOperatorForS3Source will fail because the previous test deleted the backup file.

func testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path) {
        ...
        // create etcdbackup resource
	eb, err := f.CRClient.EtcdV1beta2().EtcdBackups(f.Namespace).Create(backupCR)
	if err != nil {
		t.Fatalf("failed to create etcd back cr: %v", err)
	}
	defer func() {
		if err := f.CRClient.EtcdV1beta2().EtcdBackups(f.Namespace).Delete(eb.Name, nil); err != nil {
			t.Fatalf("failed to delete etcd backup cr: %v", err)
		}
		// cleanup backup files
		allBackups, err = wr.List(context.Background(), backupS3Source.Path)
		if err != nil {
			t.Fatalf("failed to list backup files: %v", err)
		}
		if err := e2eutil.DeleteBackupFiles(wr, allBackups); err != nil {
			t.Fatalf("failed to cleanup backup files: %v", err)
		}
	}()
        ...
}

Can you please move testEtcdBackupOperatorForPeriodicS3Backup to the end so it doesn't interfere with the restore test?

// Backup then restore tests
testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdRestoreOperatorForS3Source(t, clusterName, s3Path)
// Periodic backup test
testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)

@hasbro17
Contributor

@ukinau One more thing I noticed when trying out your PR is that for some reason the periodic runner isn't deleted when we delete the EtcdBackup CR.
With the CR deleted, the periodic runner for that CR will keep on spamming the logs since it's unable to find the CR.

time="2019-01-31T07:39:42Z" level=info msg="getMaxRev: endpoint http://example-etcd-cluster-client:2379 revision (1)"
time="2019-01-31T07:39:52Z" level=info msg="getMaxRev: endpoint http://example-etcd-cluster-client:2379 revision (1)"
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller

Can you please try this out and confirm?

@hasbro17
Contributor

hasbro17 commented Feb 4, 2019

@ukinau Just double-checking if you still have time to work on the changes. If not, I can probably take over this PR.

@ukinau
Contributor Author

ukinau commented Feb 7, 2019

@hasbro17 Sorry for the late response. I will spend more time on this over the weekend.
Actually, I already have some ideas about this error, so let me continue working on it.
Sorry for the incomplete patch.

time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com "example-etcd-cluster-backup" not found)" pkg=controller

The restore test expected the backup file to be present, but the
periodic backup test cleaned up the backup files it created,
so the restore test failed because of that.

That's why we moved the periodic test after the restore test.
@hexfusion
Member

@etcd-bot retest this please

@ukinau
Contributor Author

ukinau commented Feb 14, 2019

It seems everything is fine now.
Could you take a look at it when you get time?
@hasbro17 @hexfusion

var err error
for {
	for i := 1; i < 6; i++ {
		latestEb, err = b.backupCRCli.EtcdV1beta2().EtcdBackups(b.namespace).Get(eb.Name, metav1.GetOptions{})
		if err != nil {
Contributor


@ukinau Sorry for the delay. Yes, I think this should fix the infinite loop.

One more thing that I would improve on this is that we should not retry 5 times if the EtcdBackup CR is deleted or not found because that's when we want to stop the periodic backup. We only want to retry if it's some transient error.

  for i := 1; i < 6; i++ {
    latestEb, err = b.backupCRCli.EtcdV1beta2().EtcdBackups(b.namespace).Get(eb.Name, metav1.GetOptions{})
    if err != nil {
      // Stop backup if CR not found
      if apierrors.IsNotFound(err) {
        b.logger.Infof("Could not find EtcdBackup. Stopping periodic backup for EtcdBackup CR %v",
        eb.Name)
        break
      }
      
      b.logger.Warningf("[Attempt: %d/5] Failed to get latest EtcdBackup %v : (%v)",
        i, eb.Name, err)
      time.Sleep(1)
      continue
    }
    break
  }
  if err == nil {
    // Perform backup
    bs, err = b.handleBackup(&ctx, &latestEb.Spec, true)
  }

But that's just a minor change, and since this PR has been out for a while I can make the above change as a follow-up if you don't get time for it anytime soon.

Contributor Author


Thanks for your review. I'm gonna fix it

Contributor

@hasbro17 hasbro17 left a comment


LGTM other than the retry improvement.

@hasbro17
Contributor

LGTM

We'll need to follow this up by adding a doc to walk through an example of a periodic backup.

Maybe an additional section in https://github.com/coreos/etcd-operator/blob/master/doc/user/walkthrough/backup-operator.md, or a separate doc.

@jurgenweber

jurgenweber commented Jul 1, 2019

I am finding this does not always clean up. I have maxBackups: 60; it got to 60 and has now stopped backing up. Even though the status of the CR says succeeded, there are no new files.

spec:
  backupPolicy:
    backupIntervalInSecond: 3600
    maxBackups: 60
  clientTLSSecret: etcd-client-tls
  etcdEndpoints:
  - https://vault-etcd-cluster-client.secrets.svc:2379
  s3:
    awsSecret: etcd-vault-aws-backup-user
    forcePathStyle: false
    path: au-com-hipages-vault/backup/etcd
  storageType: S3
status:
  etcdRevision: 1050818
  etcdVersion: 3.3.13
  lastSuccessDate: "2019-07-01T21:51:51Z"
  succeeded: true

No errors or anything either:

amazing-dog-etcd-operator-etcd-backup-operator-694cbbdd6d-9899n etcd-backup-operator 2019-07-01T21:51:51.111176983Z time="2019-07-01T21:51:51Z" level=info msg="getMaxRev: endpoint https://vault-etcd-cluster-client.secrets.svc:2379 revision (1050818)"

@jurgenweber

I found my issue: #2099
