This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

backup-operator: Support periodically backup (#1841) #2028

Merged
merged 19 commits on Feb 18, 2019

Conversation

ukinau
Contributor

@ukinau ukinau commented Dec 17, 2018

This PR implements periodic backup as discussed in #1841.
The specification is meant to be almost the same as what #1841 concluded.

After this PR is merged, users can enable periodic backup as follows:

apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdBackup
metadata:
  name: example-etcd-cluster-backup
  namespace: default
spec:
  backupPolicy:
    backupIntervalInSecond: 125  # new field
    maxBackups: 4                # new field
  clientTLSSecret: etcd-backup-cert
  etcdEndpoints:
  - https://<etcd-endpoint>
  s3:
    awsSecret: aws
    endpoint: https://<s3-endpoint>
    path: test/example
  storageType: S3

As you can see above, two new fields are added (a sketch of the extended spec follows the list):

  • backupIntervalInSecond
    • specifies how often a snapshot is taken. The example above takes a snapshot every 125 seconds.
  • maxBackups
    • specifies how many backups to keep. The example above keeps only the latest 4 backups. Once the number of backups exceeds maxBackups, backup-operator deletes the oldest backup files that share the same prefix.
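
Below is a minimal sketch, based on the field names in the YAML above and the commit messages below, of how the BackupPolicy type in pkg/apis/etcd/v1beta2 might be extended; the exact field types are assumptions and the generated types in the PR are authoritative.

package v1beta2

// BackupPolicy (sketch): only the two fields added by this PR are shown.
type BackupPolicy struct {
	// BackupIntervalInSecond is the interval, in seconds, between periodic backups.
	// 0 disables periodic backup and results in a single one-shot backup.
	BackupIntervalInSecond int64 `json:"backupIntervalInSecond,omitempty"`
	// MaxBackups is the maximum number of backup files to keep.
	// 0 means no limit: existing snapshots are never deleted.
	MaxBackups int `json:"maxBackups,omitempty"`
}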

close #1841

This commit adds BackupIntervalInSecond to BackupPolicy, which performs
periodic backup as issue coreos#1841 describes. This commit is part of coreos#1841.
By specifying BackupIntervalInSecond, the user can let etcd-backup-operator
perform periodic backups.

The specification of BackupIntervalInSecond is as follows:
 - the unit is seconds
 - >0 means the backup interval
 - =0 explicitly disables interval backup; only a one-time backup is taken
This commit implements validation of BackupIntervalInSecond.
After this commit, backup-operator will make sure BackupIntervalInSecond
follows these restrictions (see the sketch below):

- <0 is not allowed and fails validation
- 0 is valid and disables periodic backup
- >0 is valid and means the interval
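
A minimal sketch of that validation, assuming a standalone helper; the function name and error wording are illustrative, not the operator's actual code:

package validation

import "fmt"

// validateBackupIntervalInSecond is an illustrative helper enforcing the
// rules above: negative values are rejected, 0 disables periodic backup,
// and positive values are taken as the interval in seconds.
func validateBackupIntervalInSecond(interval int64) error {
	if interval < 0 {
		return fmt.Errorf("spec.backupPolicy.backupIntervalInSecond (%d) cannot be negative", interval)
	}
	return nil
}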
The current backup status is only designed for one-shot snapshots.
It always shows the latest results, but it would be nice if we could record
the last time a snapshot was successfully taken.
This commit adds a MaxBackups attribute, which lets backup-operator delete
older snapshots if the number of snapshots exceeds MaxBackups.

The specification of MaxBackups is as follows (a retention sketch follows the list):

 - <0 is not allowed, which causes validation failure
 - =0 indicates that MaxBackups is infinite; the operator does not delete any existing snapshots
 - >0 indicates the maximum number of snapshots
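
As a rough illustration of that retention rule (hypothetical helper; the real pruning goes through the backup writer, and the key-ordering assumption here is only for the sketch):

package backup

import "sort"

// pruneOldBackups is an illustrative helper: keep the newest maxBackups
// keys and delete the rest. It assumes keys sort oldest-first and takes
// a delete callback instead of a real storage client.
func pruneOldBackups(keys []string, maxBackups int, deleteKey func(string) error) error {
	if maxBackups <= 0 {
		// 0 means infinite: never delete existing snapshots.
		return nil
	}
	sort.Strings(keys)
	for len(keys) > maxBackups {
		if err := deleteKey(keys[0]); err != nil {
			return err
		}
		keys = keys[1:]
	}
	return nil
}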
After supporting periodic backup, backup-operator appended the revision number and
date to the S3 path, as in <bucket name>/<object name>_v<rev>_<date>.
This behaviour was applied even if the backup was a one-shot backup,
and therefore this change broke the existing behaviour.

This commit brings back the original behaviour, which uses the S3 path without
adding anything (<bucket name>/<object name>) if the backup is not periodic.
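
A hedged sketch of the resulting object-key scheme; the helper name and the exact date layout are illustrative, and only the <object name>_v<rev>_<date> shape comes from the commit message above:

package backup

import (
	"fmt"
	"time"
)

// backupObjectKey is an illustrative helper: periodic backups append a
// _v<rev>_<date> suffix, while one-shot backups keep the original path.
func backupObjectKey(basePath string, rev int64, now time.Time, periodic bool) string {
	if !periodic {
		return basePath // original one-shot behaviour: <bucket name>/<object name>
	}
	return fmt.Sprintf("%s_v%d_%s", basePath, rev, now.Format("2006-01-02-15-04-05"))
}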
@etcd-bot
Collaborator

Can one of the admins verify this patch?

@etcd-bot
Collaborator

Can one of the admins verify this patch?

@etcd-bot
Collaborator

Can one of the admins verify this patch?

@hexfusion
Member

@etcd-bot ok to test

Member

@hexfusion hexfusion left a comment


@ukinau thanks for the PR, a few nits. Also, can you update the codegen files? That should fix your failing unit test.

./hack/k8s/codegen/update-generated.sh

pkg/backup/backup_manager.go
pkg/backup/writer/writer.go
pkg/controller/backup-operator/sync.go
@hexfusion
Member

@etcd-bot retest this please

Member

@hexfusion hexfusion left a comment


A few more nits, and we are having multiple failures on CI; if you could look into that I would appreciate it.

pkg/backup/writer/s3_writer.go
pkg/backup/backup_manager.go
pkg/controller/backup-operator/abs_backup.go
pkg/controller/backup-operator/util.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
pkg/controller/backup-operator/sync.go
@hexfusion
Member

@ukinau good progress, thanks! We need to get to the bottom of this CI failure; can you take a look please?

pkg/apis/etcd/v1beta2/zz_generated.deepcopy.go:137:20: in.LastSuccessDate.DeepCopyInto undefined (type time.Time has no field or method DeepCopyInto)

@ukinau ukinau force-pushed the support-periodic-backup branch 3 times, most recently from 06fc1a9 to f84a3d5 on December 18, 2018 at 18:42
After I generated the code based on the k8s object
(zz_generated.deepcopy.go), the build started failing.
This is because all fields of a k8s custom resource should implement
DeepCopyInto, but the time.Time field we added doesn't implement it.
Instead we should have used meta/v1.Time, which implements all the
functions necessary for a k8s object while providing the same
functionality as time.Time.

This commit also includes some refactoring pointed out in code review.
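
A minimal sketch of the fix, assuming the status field is the LastSuccessDate mentioned in the CI error above; surrounding fields are omitted:

package v1beta2

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// BackupStatus (sketch): use metav1.Time rather than time.Time so the
// generated zz_generated.deepcopy.go compiles, since metav1.Time provides
// DeepCopyInto while time.Time does not.
type BackupStatus struct {
	// LastSuccessDate records when the last backup succeeded.
	LastSuccessDate metav1.Time `json:"lastSuccessDate,omitempty"`
}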
@ukinau
Contributor Author

ukinau commented Dec 18, 2018

@hexfusion
Thanks for your quick review, and sorry for taking some time. I fixed the "in.LastSuccessDate.DeepCopyInto undefined" problem and addressed the comments you left.
Now it seems all CI has passed.

@hexfusion
Member

@hasbro17 PTAL

@ukinau
Contributor Author

ukinau commented Jan 8, 2019

@hexfusion Is there anything I can do to get this PR merged faster?

@hexfusion
Member

@ukinau thank you for your patience; I hope to get a final review in by the end of the week.

@ukinau
Contributor Author

ukinau commented Jan 16, 2019

@hexfusion @hasbro17
I added the following:

  • CHANGELOG
  • example manifests
  • e2eslow test for periodic backup

but the CI for the e2eslow test seems to be having some other trouble that is not related to this change and looks like an environment issue. Could you check it out?

There seem to be some untracked files preventing the e2e test from running:

02:35:45 Please move or remove them before you switch branches.
02:35:45 error: The following untracked working tree files would be removed by checkout:
02:35:45  CHANGELOG-1.14.md
02:35:45  cmd/cloud-controller-manager/app/core.go
02:35:45  cmd/controller-manager/app/helper_test.go
02:35:45  hack/.shellcheck_failures
02:35:45  hack/verify-shellcheck.sh
02:35:45  pkg/scheduler/algorithm/priorities/types.go
02:35:45  pkg/scheduler/algorithm/priorities/types_test.go
02:35:45 Please move or remove them before you switch branches.
02:35:45 Aborting
02:35:45 : command failed: [git checkout 371d86631ea24e33427a59c144c0a879bfebca64]: exit status 1

@hexfusion
Member

@ukinau thanks for the update and your patience; I will take a hard look at this soon with the hope of getting it merged before the end of the week.

@hexfusion hexfusion added this to the v0.9.4 milestone Jan 21, 2019
@hexfusion
Member

@etcd-bot retest all

@hexfusion
Member

@etcd-bot retest this please

@hasbro17
Contributor

@ukinau This seems good to me. Apologies for the delay, but we've been having some issues with our Jenkins instances. Once we resolve that, we should be good to run the CI and merge.

@hexfusion We can sync up offline but it's an older issue that we've seen before: our instance is out of memory.

@hexfusion
Member

@hasbro17 I should be able to resolve Jenkins today.

@hexfusion
Member

@etcd-bot retest this please

@@ -78,6 +83,7 @@ func TestBackupAndRestore(t *testing.T) {
s3Path := path.Join(os.Getenv("TEST_S3_BUCKET"), "jenkins", suffix, time.Now().Format(time.RFC3339), "etcd.backup")

testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
Contributor


@ukinau I think I know why the test is failing. Previously we had the two subtests:

testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdRestoreOperatorForS3Source(t, clusterName, s3Path)

The first test creates the backup file at s3Path and the second expects the backup file to be present in order to restore a cluster.

With the periodic backup sub-test now in between the two, we are deleting all the backup files in a defer statement, so testEtcdRestoreOperatorForS3Source will fail because the previous test deleted the backup file.

func testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path) {
        ...
        // create etcdbackup resource
	eb, err := f.CRClient.EtcdV1beta2().EtcdBackups(f.Namespace).Create(backupCR)
	if err != nil {
		t.Fatalf("failed to create etcd back cr: %v", err)
	}
	defer func() {
		if err := f.CRClient.EtcdV1beta2().EtcdBackups(f.Namespace).Delete(eb.Name, nil); err != nil {
			t.Fatalf("failed to delete etcd backup cr: %v", err)
		}
		// cleanup backup files
		allBackups, err = wr.List(context.Background(), backupS3Source.Path)
		if err != nil {
			t.Fatalf("failed to list backup files: %v", err)
		}
		if err := e2eutil.DeleteBackupFiles(wr, allBackups); err != nil {
			t.Fatalf("failed to cleanup backup files: %v", err)
		}
	}()
        ...
}

Can you please move testEtcdBackupOperatorForPeriodicS3Backup to the end so it doesn't interfere with the restore test?

// Backup then restore tests
testEtcdBackupOperatorForS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)
testEtcdRestoreOperatorForS3Source(t, clusterName, s3Path)
// Periodic backup test
testEtcdBackupOperatorForPeriodicS3Backup(t, clusterName, operatorClientTLSSecret, s3Path)

@hasbro17
Contributor

@ukinau One more thing I noticed when trying out your PR is that for some reason the periodic runner isn't deleted when we delete the EtcdBackup CR.
With the CR deleted, the periodic runner for that CR will keep on spamming the logs since it's unable to find the CR.

time="2019-01-31T07:39:42Z" level=info msg="getMaxRev: endpoint http://example-etcd-cluster-client:2379 revision (1)"
time="2019-01-31T07:39:52Z" level=info msg="getMaxRev: endpoint http://example-etcd-cluster-client:2379 revision (1)"
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller
time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com \"example-etcd-cluster-backup\" not found)" pkg=controller

Can you please try this out and confirm?

@hasbro17
Contributor

hasbro17 commented Feb 4, 2019

@ukinau Just double-checking if you still have time to work on the changes. If not, I can probably take over this PR.

@ukinau
Contributor Author

ukinau commented Feb 7, 2019

@hasbro17 Sorry for the late response. I will spend more time on this over the weekend.
Actually, I already have some ideas about this error, so let me continue working on it.
Sorry for the incomplete patch.

time="2019-01-31T07:40:02Z" level=warning msg="Failed to get latest EtcdBackup example-etcd-cluster-backup : (etcdbackups.etcd.database.coreos.com "example-etcd-cluster-backup" not found)" pkg=controller

The restore test expected the backup file to be present, but the
periodic backup test cleaned up the backup files it created,
so the restore test failed because of that.

That's why we moved the periodic test after the restore test.
@hexfusion
Member

@etcd-bot retest this please

@ukinau
Contributor Author

ukinau commented Feb 14, 2019

It seems everything is fine now.
Could you take a look at it when you get time?
@hasbro17 @hexfusion

var err error
for {
	for i := 1; i < 6; i++ {
		latestEb, err = b.backupCRCli.EtcdV1beta2().EtcdBackups(b.namespace).Get(eb.Name, metav1.GetOptions{})
		if err != nil {
Contributor


@ukinau Sorry for the delay. Yes, I think this should fix the infinite loop.

One more thing that I would improve on this is that we should not retry 5 times if the EtcdBackup CR is deleted or not found because that's when we want to stop the periodic backup. We only want to retry if it's some transient error.

  for i := 1; i < 6; i++ {
    latestEb, err = b.backupCRCli.EtcdV1beta2().EtcdBackups(b.namespace).Get(eb.Name, metav1.GetOptions{})
    if err != nil {
      // Stop backup if CR not found
      if apierrors.IsNotFound(err) {
        b.logger.Infof("Could not find EtcdBackup. Stopping periodic backup for EtcdBackup CR %v",
        eb.Name)
        break
      }
      
      b.logger.Warningf("[Attempt: %d/5] Failed to get latest EtcdBackup %v : (%v)",
        i, eb.Name, err)
      time.Sleep(1)
      continue
    }
    break
  }
  if err == nil {
    // Perform backup
    bs, err = b.handleBackup(&ctx, &latestEb.Spec, true)
  }

But that's just a minor change, and since this PR has been out for a while I can make the above change as a follow-up if you don't get time for it anytime soon.

Contributor Author


Thanks for your review. I'm gonna fix it

Contributor

@hasbro17 hasbro17 left a comment


LGTM other than the retry improvement.

@hasbro17
Contributor

LGTM

We'll need to follow this up by adding a doc to walk through an example of a periodic backup.

Maybe an additional section in https://github.com/coreos/etcd-operator/blob/master/doc/user/walkthrough/backup-operator.md, or a separate doc.

@jurgenweber

jurgenweber commented Jul 1, 2019

I am finding this does not always clean up. I have maxBackups: 60; it got to 60 and has now stopped backing up. Even though the status of the CR says succeeded, there are no new files.

spec:
  backupPolicy:
    backupIntervalInSecond: 3600
    maxBackups: 60
  clientTLSSecret: etcd-client-tls
  etcdEndpoints:
  - https://vault-etcd-cluster-client.secrets.svc:2379
  s3:
    awsSecret: etcd-vault-aws-backup-user
    forcePathStyle: false
    path: au-com-hipages-vault/backup/etcd
  storageType: S3
status:
  etcdRevision: 1050818
  etcdVersion: 3.3.13
  lastSuccessDate: "2019-07-01T21:51:51Z"
  succeeded: true

No errors or anything either:

amazing-dog-etcd-operator-etcd-backup-operator-694cbbdd6d-9899n etcd-backup-operator 2019-07-01T21:51:51.111176983Z time="2019-07-01T21:51:51Z" level=info msg="getMaxRev: endpoint https://vault-etcd-cluster-client.secrets.svc:2379 revision (1050818)"

@jurgenweber

I found my issue: #2099
