Save logs per backup #40

Merged: 16 commits merged into vmware-tanzu:master on Sep 11, 2017
Conversation

@ncdc (Contributor) commented Aug 14, 2017

Save logs per backup in object storage.

Add ark backup logs command for retrieval.

Fixes #39

TODO

  • docs on how to use dep
  • fix GetAllBackups to be forgiving of bucket dirs that are missing the metadata file (#66: in GetAllBackups, don't error if a single backup is unreadable)
  • make gc controller delete everything under the backup prefix instead of targeting specific files
  • VolumeSnapshotAction needs to log to the per-backup log

@ncdc changed the title from "Save logs per backup" to "[WIP] Save logs per backup" on Aug 14, 2017

@ncdc (Contributor, Author) commented Aug 14, 2017

EDIT: now using pre-signed URLs from object storage instead of a web server

This adds a web server to the container that serves /v1/backup/logs?backup=<backup name>. At the moment it is HTTP only and does no authentication/authorization checks.

@timothysc any thoughts/suggestions on the URL format?

@mattmoyer should I implement a SubjectAccessReview check to ensure the requester has access to get backups?

@ncdc (Contributor, Author) commented Aug 18, 2017

@jrnt30 this isn't the cleanest, but I've tested that it works at least for minio. I need to test aws/azure/gcp, but that'll have to wait for the weekend or next week. The way the GCP library does pre-signed URLs is a bit odd because it requires the service account email and private key directly, unlike anywhere else we talk to GCP. I need to look around and see if there's a better way, or if switching to this newer/higher-level GCP API makes more sense.

tl;dr still a WIP, needs more unit tests, needs more godoc, needs some cleanup, needs manual e2e testing.
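
To illustrate the quirk described above: with cloud.google.com/go/storage (one of the higher-level GCP libraries), signing a URL is a package-level function that takes the service account email and private key explicitly. A minimal, hypothetical sketch, not this PR's code; the bucket, object, key path, and email are placeholders, and this isn't necessarily the exact library that ends up being used:

package main

import (
	"fmt"
	"io/ioutil"
	"time"

	"cloud.google.com/go/storage"
)

func main() {
	// Unlike the rest of the GCP integration, signing a URL requires handing
	// over the service account's private key and email directly.
	key, err := ioutil.ReadFile("/credentials/private-key.pem") // placeholder path
	if err != nil {
		panic(err)
	}

	url, err := storage.SignedURL("my-ark-bucket", "mybackup/mybackup-logs.gz", &storage.SignedURLOptions{
		GoogleAccessID: "ark@my-project.iam.gserviceaccount.com", // placeholder service account
		PrivateKey:     key,
		Method:         "GET",
		Expires:        time.Now().Add(10 * time.Minute),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(url)
}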

@jrnt30 (Contributor) commented Aug 18, 2017

Thanks, I'll see what merging that into the download backup looks like, and if time permits I'll look at some AWS testing.

@jrnt30 (Contributor) commented Aug 19, 2017

So I took a bit of a look into this. For AWS specifically, there is a max TTL of one week for presigned requests (http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html). I haven't looked at the other providers yet, but I imagine they are somewhat similar.

Depending on how we want to deal with TTLs, we may need to be a bit more explicit about the URLs and go with the BackupDownloadRequest option, but that does complicate the interaction with the CLI in general, since we then have a resource that needs to propagate before we can actually pull the signed URL.

Another possibility is to add additional TTL attributes (similar to the expiry) for the presigned links themselves and then augment the GC controller (or create a new LinkSync controller) to be responsible for upserting that information.

There may be a few other options (like having a child CRD of the Backup that has its own lifecycle for refresh and defaults to the minimum of the requested TTL vs. the provider's limit), but I would need to give that some more thought.

I will work on updating the Download command based on what is here, as I think in general it will be fairly transparent whichever way we go.
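
For context on the one-week cap mentioned above, generating a pre-signed GET URL with aws-sdk-go looks roughly like the sketch below. This is a minimal, hypothetical example rather than this PR's code; the bucket and key are placeholders:

package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := s3.New(sess)

	// Build the GetObject request without sending it, then pre-sign it.
	req, _ := svc.GetObjectRequest(&s3.GetObjectInput{
		Bucket: aws.String("my-ark-bucket"),             // placeholder bucket
		Key:    aws.String("mybackup/mybackup-logs.gz"), // placeholder key
	})

	// SigV4 query-string auth caps the expiry at 7 days, so anything longer
	// has to be handled by re-signing (hence the DownloadRequest discussion).
	url, err := req.Presign(10 * time.Minute)
	if err != nil {
		fmt.Println("failed to presign:", err)
		return
	}
	fmt.Println(url)
}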

@ncdc (Contributor, Author) commented Aug 21, 2017

@jrnt30 I have decided to go with a DownloadRequest type that can be used for backup logs, backup tarballs, and restore logs. I'm working on it now.
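
A rough sketch of the shape such a type could take, pieced together from snippets later in this review (DownloadRequestSpec, DownloadTargetKindBackupLog, the Processed phase). The status fields and exact names here are assumptions, not necessarily the merged API:

package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DownloadTargetKind identifies what kind of artifact to generate a URL for.
type DownloadTargetKind string

const DownloadTargetKindBackupLog DownloadTargetKind = "BackupLog"

// DownloadTarget names the object-storage artifact to pre-sign.
type DownloadTarget struct {
	Kind DownloadTargetKind `json:"kind"`
	Name string             `json:"name"`
}

// DownloadRequestSpec is filled in by the client.
type DownloadRequestSpec struct {
	Target DownloadTarget `json:"target"`
}

// DownloadRequestStatus is filled in by the controller once it has generated
// a pre-signed URL (assumed fields).
type DownloadRequestStatus struct {
	Phase       string      `json:"phase"` // e.g. "", "Processed"
	DownloadURL string      `json:"downloadURL"`
	Expiration  metav1.Time `json:"expiration"`
}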

@ncdc (Contributor, Author) commented Aug 21, 2017

@jrnt30 now with DownloadRequest. PTAL. I've gotta go for now, but I'll keep working on it tonight/tomorrow.


listOptions := metav1.ListOptions{
//TODO: once kube-apiserver http://issue.k8s.io/51046 is fixed, uncomment
//FieldSelector: "metadata.name=" + args[0],

Contributor:
This should probably be dr.ObjectMeta.Name in the comment since you are appending the timestamp

Author (ncdc):
Yes, thanks. I also need to make sure the name of the DownloadRequest received from w.ResultChan matches that too.

@jrnt30 (Contributor) commented Aug 22, 2017

Thanks for the updates. I have started an integration with what you have, with a bit of refactoring to extract some very common things, which you can see at https://github.com/ncdc/ark/compare/backup-logs...jrnt30:backup-download-request?expand=1

It's rough at this point, and I just ran into the issue with Minio not being supported again due to AZ validation, so I am throwing in the towel for tonight.

In general I think this should work pretty well and be extensible. There were a few small test/build things I had to fix, but I'm sure you would have found those as well.

@ncdc force-pushed the backup-logs branch 2 times, most recently from 820c8fd to a7d66d2 on August 22, 2017

@ncdc (Contributor, Author) commented Aug 22, 2017

@jrnt30 the latest code extracts the download request streaming into a helper function (I put it in pkg/cmd/util/downloadrequest because we're also going to use it for restore logs), and I've fixed the test/compile issues. @skriss is wrapping up #43, which will fix the minio issue. I've tested this locally with minio (commenting out the AZ validation) and on AWS. Next up, I'll be testing GCP. @skriss is going to test Azure for us.

@jrnt30 (Contributor) commented Aug 22, 2017

👍 This looks good, the util w/ test is a superior solution.

I did a local rebase and can work in these changes. I'm going to hold off until it stabilizes a bit as the force pushes make the rebase particularly annoying.

@ncdc force-pushed the backup-logs branch 2 times, most recently from d49aea0 to 4262e42 on August 23, 2017
@@ -20,11 +20,13 @@ metadata:
name: default
persistentVolumeProvider:
gcp:
credentialsFile: /credentials/cloud

Member:
is this actually needed for the PVProvider?

Author (ncdc):
Yes and no. The PVProvider doesn't use it directly; it needs to have GOOGLE_APPLICATION_CREDENTIALS set to this path. In my branch now, you could have your backup provider be something other than GCE, but if you want to handle PVs, then you have to specify the credentialsFile for the gcp PVProvider, because the code looks for that and sets the env var for you (so you don't have to put it in the deployment.yaml).

I feel like this is getting messy, and I'm getting more and more in favor of having a config that looks like this, although I know @jbeda doesn't like it:

clouds:
  myCoolGCP:
    credentialsFile: /credentials/cloud
    project: foo
    zone: us-east1-b
  someOtherCloud:
    someDetails: goHere
persistentVolumeProvider:
  cloud: myCoolGCP
backupStorageProvider:
  cloud: someOtherCloud
  bucket: mybucket

This way, we load each cloud provider one time, instead of twice like we're doing now...

Let's discuss when you're in.


import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type DownloadRequestSpec struct {

Member:
this file needs godoc

"reflect"
"sync"
"time"

"golang.org/x/oauth2/google"

Member:
remove blank line

Author (ncdc):
I'm going to change my stance on imports. I don't want us to spend time thinking about them. If the code compiles, that's good enough for me, regardless of order, whitespace, grouping, etc.

return c.downloadRequestClient.DownloadRequests(downloadRequest.Namespace).Delete(downloadRequest.Name, nil)
}

func (c *downloadRequestController) resync() {

Member:
It's slightly confusing that the only way items with Phase = Processed actually get processed is that they're added to the queue in the resync. WDYT about not filtering them out in the informer AddFunc & having an UpdateFunc? Else, it's probably at least worth a comment somewhere.

Author (ncdc):
I did it this way because I intend for the ark client to manage the lifespan of a DownloadRequest:

  1. client creates DownloadRequest
  2. client waits for it to be Processed
  3. controller processes it
  4. client sees updated DownloadRequest and downloads the URL
  5. client deletes the DownloadRequest

The resync is a failsafe that is meant to clean up any DownloadRequests that the client failed to delete. I'm happy to put in a comment to this effect.

Does this make sense?
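
A minimal sketch of the client side of that lifecycle, assuming a standard client-gen-generated clientset (the package/interface names such as arkv1client.DownloadRequestsGetter are assumptions; the downloadrequest.Stream helper mentioned below is the PR's real implementation of this flow):

// fetchDownloadURL creates a DownloadRequest, waits for the controller to
// populate a pre-signed URL, and cleans up afterwards. Assumed imports:
// fmt, time, metav1, the Ark v1 API types, and the generated Ark clientset.
func fetchDownloadURL(client arkv1client.DownloadRequestsGetter, namespace, backupName string, kind v1.DownloadTargetKind, timeout time.Duration) (string, error) {
	dr := &v1.DownloadRequest{
		ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("%s-%d", backupName, time.Now().Unix())},
		Spec: v1.DownloadRequestSpec{
			Target: v1.DownloadTarget{Kind: kind, Name: backupName},
		},
	}

	// 1. create the DownloadRequest
	created, err := client.DownloadRequests(namespace).Create(dr)
	if err != nil {
		return "", err
	}
	// 5. best-effort cleanup once we have (or fail to get) the URL; the
	// controller's resync is the failsafe if this delete never happens.
	defer client.DownloadRequests(namespace).Delete(created.Name, nil)

	// 2./3. wait for the controller to process it
	w, err := client.DownloadRequests(namespace).Watch(metav1.ListOptions{})
	if err != nil {
		return "", err
	}
	defer w.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case e := <-w.ResultChan():
			updated, ok := e.Object.(*v1.DownloadRequest)
			if !ok || updated.Name != created.Name {
				continue // ignore unrelated DownloadRequests
			}
			// 4. the controller filled in the pre-signed URL
			if updated.Status.DownloadURL != "" {
				return updated.Status.DownloadURL, nil
			}
		case <-deadline:
			return "", fmt.Errorf("timed out waiting for DownloadRequest %s", created.Name)
		}
	}
}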

@ncdc changed the title from "[WIP] Save logs per backup" to "Save logs per backup" on Aug 24, 2017
@ncdc changed the title from "Save logs per backup" to "[WIP] Save logs per backup" on Aug 24, 2017
Switch to dep.

Update the following:
- azure-sdk-for-go to 10.2.1-beta
- go-autorest to 8.1.1
- client-go to 4.0.0

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@ncdc changed the title from "[WIP] Save logs per backup" to "Save logs per backup" on Sep 7, 2017

@jrnt30 (Contributor) left a review:
In general LGTM. A few logging things we can probably remove.

I have a few general questions that are more for educating myself. If this isn't a good forum for that, I can take it elsewhere next time.

func (kb *kubernetesBackupper) Backup(backup *api.Backup, data io.Writer) error {
gzw := gzip.NewWriter(data)
defer gzw.Close()
func (kb *kubernetesBackupper) Backup(backup *api.Backup, data, log io.Writer) error {

Contributor:
Slightly picky, but personally, seeing log makes me assume that it's actually the vehicle this func should be using to log its output. Took me a few passes to figure it out via the context. Perhaps results?

Author (ncdc):
I don't really like "results". We could call them "outputFile" and "logFile" if that helps (even though they're io.Writers)?

}

func (l *logger) log(msg string, args ...interface{}) {
// TODO use a real logger that supports writing to files

Contributor:
Since the logger is provided an io.Writer, is this TODO still valid?

Author (ncdc):
Yes, because we'll be switching to logrus for 0.5.0
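
For what it's worth, a minimal sketch of what logging to the per-backup writer could look like after a switch to logrus (hypothetical; logrus is not part of this PR, and the file below is just a stand-in for the gzip writer used in the backup code):

package main

import (
	"os"

	"github.com/sirupsen/logrus"
)

func main() {
	// Stand-in for the per-backup log writer.
	f, err := os.Create("backup.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	log := logrus.New()
	log.Out = f                   // send everything to the per-backup log
	log.Level = logrus.DebugLevel // level mapping deferred to the logrus switch

	log.WithField("resource", "deployments.apps").Debug("skipping resource: duplicate")
	log.Info("backup completed")
}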

@@ -439,7 +444,8 @@ func TestBackupMethod(t *testing.T) {
require.NoError(t, err)

output := new(bytes.Buffer)
err = backupper.Backup(backup, output)
log := new(bytes.Buffer)

Contributor:
More for my own education, but is this somewhere we could have used ioutil.Discard as well?

Author (ncdc):
Yeah that's better. I'll switch.

@@ -240,7 +281,7 @@ func (kb *kubernetesBackupper) backupResource(
} else {
other = appsDeploymentsResource
}
glog.V(4).Infof("Skipping resource %q because it's a duplicate of %q", grString, other)
ctx.log("Skipping resource %q because it's a duplicate of %q", grString, other)

Contributor:
There is a mixture of V(2) and V(4) mappings here. Are we looking to just simplify logging and err towards providing more verbose logging?

Author (ncdc):
We plan to revisit log levels as part of the switch to logrus (another motivating factor).

metadataFileFormatString string = "%s/ark-backup.json"
backupFileFormatString string = "%s/%s.tar.gz"
metadataFileFormatString = "%s/ark-backup.json"
backupFileFormatString = "%s/%s.tar.gz"

Contributor:
Since both of these are actually tar.gz, should we distinguish via the base filename? %s/%s-data.tar.gz & %s/%s-logs.tar.gz?

Author (ncdc):
We could do that. I'm also planning on putting the restore log files in here, probably restore-<restore name>-log.gz.

var errs []error
for _, key := range objects {
glog.V(4).Infof("Trying to delete bucket=%s, key=%s", bucket, key)
fmt.Printf("Trying to delete bucket=%s, key=%s\n", bucket, key)

Contributor:
fmt should be removed.

Author (ncdc):
thx

for _, key := range objects {
glog.V(4).Infof("Trying to delete bucket=%s, key=%s", bucket, key)
fmt.Printf("Trying to delete bucket=%s, key=%s\n", bucket, key)
if err := br.objectStorage.DeleteObject(bucket, key); err != nil {

Contributor:
Would it be worth adding a DeleteDir to the ObjectStorage interface? I know for AWS at least you can add the recursive option so you don't have N+1 calls. I suppose at this point it's such a small number of nested files we may not have an issue

Author (ncdc):
I didn't see a recursive option for aws - where is that?

Contributor:
Looking closer, this didn't make the cut for the Go SDK version, and the others are doing something similar already... this is not relevant, sorry.

@@ -84,6 +92,36 @@ func (op *objectStorageAdapter) ListCommonPrefixes(bucket string, delimiter stri
return ret, nil
}

func (op *objectStorageAdapter) ListObjects(bucket, prefix string) ([]string, error) {
res, err := op.gcs.Objects.List(bucket).Prefix(prefix).Do()

Contributor:
It looks like the max results here is 1k, which is the limit we added paging support to S3 to get around. Do we want to implement something here? Azure looks to have a default of 5k.

Is there a threshold where we just say "N is good enough"?

Author (ncdc):
I'd like to support paging everywhere we can
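
With the google.golang.org/api/storage/v1 client used in this adapter, paging is typically a loop on NextPageToken. A rough sketch of what a paged ListObjects could look like (error handling trimmed; not necessarily the final implementation):

// ListObjects walks all pages of results for the given bucket/prefix.
func (op *objectStorageAdapter) ListObjects(bucket, prefix string) ([]string, error) {
	var ret []string

	call := op.gcs.Objects.List(bucket).Prefix(prefix)
	for {
		res, err := call.Do()
		if err != nil {
			return nil, err
		}
		for _, obj := range res.Items {
			ret = append(ret, obj.Name)
		}
		if res.NextPageToken == "" {
			break // no more pages
		}
		call = call.PageToken(res.NextPageToken)
	}

	return ret, nil
}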

arkClient, err := f.Client()
cmd.CheckError(err)

err = downloadrequest.Stream(arkClient.ArkV1(), args[0], v1.DownloadTargetKindBackupLog, os.Stdout, timeout)

Contributor:
Very useful abstraction, thanks!

return nil
}

const signedURLTTL = 10 * time.Minute

Contributor:
One general question I have, which I can test if you aren't sure: if the TTL expires on the actual signed URL in the middle of an in-progress download (slow connection, large download, etc.), will it proceed or be terminated?

Author (ncdc):
I don't know

@jrnt30 (Contributor) commented Sep 11, 2017

LGTM!

key := fmt.Sprintf(backupFileFormatString, backupName, backupName)
glog.V(4).Infof("Trying to delete bucket=%s, key=%s", bucket, key)
return br.objectStorage.DeleteObject(bucket, key)
func (br *backupService) DeleteBackupDir(bucket, backupName string) error {

Member:
We want to delete metadata file last here (and only if all other deletes succeed), right?

Author (ncdc):
I've waffled on this, trying to think about what happens in different scenarios. If we manage to delete the data tarball but fail to delete a logfile, we're left with a logfile and the metadata file. You can't restore, but you could view the logs (not super useful). I'm wondering if it's worth the effort to special case the metadata file?

Member:
The thinking around deleting the metadata file last was to not orphan any Ark-created object-storage files (or snapshots). As it is now, we could theoretically fail to delete both the tarball and log file, but delete the metadata file, leaving garbage around with no refs to it. I don't think it's a huge deal because the likelihood of this scenario is pretty small IMHO.

Author (ncdc):
Maybe for GC we also need to call ListCommonPrefixes and try to delete anything in object storage where the metadata file is missing?

Member:
let's add an issue and keep moving.

Author (ncdc):
Filed #77
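
For reference, the metadata-last ordering discussed above would look roughly like the sketch below. This is an illustration only, using the backupService/ObjectStorage types from this PR; it is not what was merged, which simply deletes everything under the prefix and defers orphan cleanup to #77:

// DeleteBackupDir deletes every object under the backup's prefix, saving the
// metadata file for last and only deleting it if everything else succeeded,
// so a partially deleted backup stays visible to GC. Assumes
// k8s.io/apimachinery/pkg/util/errors is imported as errors.
func (br *backupService) DeleteBackupDir(bucket, backupName string) error {
	objects, err := br.objectStorage.ListObjects(bucket, backupName+"/")
	if err != nil {
		return err
	}

	metadataKey := fmt.Sprintf(metadataFileFormatString, backupName)

	var errs []error
	for _, key := range objects {
		if key == metadataKey {
			continue // keep the metadata file for last
		}
		if err := br.objectStorage.DeleteObject(bucket, key); err != nil {
			errs = append(errs, err)
		}
	}

	if len(errs) == 0 {
		if err := br.objectStorage.DeleteObject(bucket, metadataKey); err != nil {
			errs = append(errs, err)
		}
	}

	return errors.NewAggregate(errs)
}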

}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {

Member:
should we log the status code here as well?

Author (ncdc):
I don't think it's necessary. Then we'd print out something like "request failed (401 Unauthorized): AccessDenied". WDYT?

Member:
OK, I'm fine with it as-is.

@skriss (Member) commented Sep 11, 2017

@ncdc finished reviewing; the object storage file deletion order/dependency is the only material question I have.

@@ -90,12 +91,13 @@ func NewBlockStorageAdapter(location string, apiTimeout time.Duration) (cloudpro
disksClient := disk.NewDisksClient(cfg[azureSubscriptionIDKey])
snapsClient := disk.NewSnapshotsClient(cfg[azureSubscriptionIDKey])

disksClient.Authorizer = spt
snapsClient.Authorizer = spt
authorizer := autorest.NewBearerAuthorizer(&spt.Token)

Member:
this should just be authorizer := autorest.NewBearerAuthorizer(spt) -- hit this bug when testing Azure

Author (ncdc):
thanks

Author (ncdc):
Is it just this 1 line change? Any other Azure changes?

Member:
tested backup with snapshots, and GC of said backup, with this change -- all looks good

Due to a change in glog, if you don't call flag.Parse, every log line
prints out 'ERROR: logging before flag.Parse'. This works around that
change. Ultimately we need to switch to a different logging library.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
Delete all objects in backup "dir" when deleting a backup, instead of
hard-coding individual file names/types. This way, we'll be able to
delete log files and anything else we add without having to update our
deletion code.

Signed-off-by: Andy Goldstein <andy.goldstein@gmail.com>
@skriss merged commit f438c22 into vmware-tanzu:master on Sep 11, 2017
@ncdc deleted the backup-logs branch on September 14, 2017
jmontleon added a commit to jmontleon/velero that referenced this pull request on Dec 10, 2019: "Update nmap-ncat line to clean up; necessary in some envs"

Successfully merging this pull request may close these issues: Move to dep for dependency management