
Object storage integration (set of object versions) #461

Closed
alaric-dotmesh opened this issue Jun 18, 2018 · 11 comments
@alaric-dotmesh
Contributor

alaric-dotmesh commented Jun 18, 2018

Design document:

https://docs.google.com/document/d/1VFigteB-8QTmNpIIobfPYnu8R9pnKW_zHazA_NAdwM0/edit

@Godley Godley self-assigned this Jun 19, 2018
@Godley
Contributor

Godley commented Jun 19, 2018

Thoughts:

  • you won't be able to dm remote switch to an S3 remote - all the comms for push/pull will be in the dm server, so you need a remote already in order to talk to it
    • this also means there's no dm list or other metadata operations

@alaric-dotmesh
Contributor Author

Agreed. You'll be able to add an S3 remote (dm s3 remote add?), and reference it in dm push, dm pull and dm clone, but not dm remote switch to it, so there's no way for it to be the current remote for dm dot show and friends.

@alaric-dotmesh
Contributor Author

An S3 remote is probably a set of S3 API credentials; then when we do dm clone REMOTE DOT and REMOTE names an S3 remote, DOT names an S3 bucket.

@Godley
Contributor

Godley commented Jun 19, 2018

Plan:

  • start with dm remote add. Decide where this fits in (probably add dm s3 remote add instead of just subbing this into the current command)
  • block clients from switching to the remote somehow (probably including an appropriate error)
  • decide where this fits in with dm remote listing - do we include it as just a remote or have a separate category of output?
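The switch-blocking step above could look something like this minimal Go sketch - Remote, IsS3 and switchRemote are hypothetical names for illustration, not dotmesh's actual config types:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical shape of a remote in the dm client config; IsS3 would be
// set by `dm s3 remote add`.
type Remote struct {
	Name string
	IsS3 bool
}

var errS3Switch = errors.New("cannot switch to an S3 remote: S3 remotes " +
	"only support push, pull and clone")

// switchRemote refuses S3 remotes with an appropriate error, as per the
// plan above.
func switchRemote(r Remote) error {
	if r.IsS3 {
		return errS3Switch
	}
	// ... set r.Name as the current remote in the config file ...
	return nil
}

func main() {
	fmt.Println(switchRemote(Remote{Name: "origin"})) // <nil>
	fmt.Println(switchRemote(Remote{Name: "mys3", IsS3: true}))
}
```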

Server side:

  • start with clone. Add a new RPC call which is s3Transfer or something which will initially fail.
  • Add a call to fsMachine which will trigger the state machine to do the action
  • add a state in the state machine for clone/pull from s3, in a similar vein to the current cloneState/pullState, whatever it is. Will probably need a hand from Alaric/Priya at this point (and earlier on, probably!)
  • similar logic for push/pull depending what's done at this point (clone and pull are similar operations in dotmesh land)
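A hedged sketch of what the new RPC payload ("s3Transfer or something") might carry; every field name here is an assumption about the eventual API, not its actual shape:

```go
package main

import "fmt"

// Hypothetical payload the client could send to kick off a clone/pull
// from S3. A clone is just a pull into a dot that doesn't exist yet.
type S3TransferRequest struct {
	KeyID          string // credentials taken from the S3 remote
	SecretKey      string
	Direction      string // "pull" or "push"
	LocalNamespace string
	LocalName      string // the dot
	RemoteName     string // the S3 bucket
}

func main() {
	req := S3TransferRequest{
		Direction:  "pull",
		LocalName:  "mydot",
		RemoteName: "my-bucket",
	}
	fmt.Printf("%s %s -> %s\n", req.Direction, req.RemoteName, req.LocalName)
}
```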

Godley pushed a commit that referenced this issue Jun 20, 2018
also use them to check we can access S3 and have buckets to talk to.
Godley pushed a commit that referenced this issue Jun 21, 2018
Godley pushed a commit that referenced this issue Jun 21, 2018
Godley pushed a commit that referenced this issue Jun 22, 2018
Godley pushed a commit that referenced this issue Jun 22, 2018
It would be nice to not need this or be able to unmarshal to dmremotes then save the config file as dmremotes not remotes...
Godley pushed a commit that referenced this issue Jun 22, 2018
Godley pushed a commit that referenced this issue Jun 22, 2018
Godley pushed a commit that referenced this issue Jun 22, 2018
Godley pushed a commit that referenced this issue Jun 25, 2018
Godley pushed a commit that referenced this issue Jun 26, 2018
Godley pushed a commit that referenced this issue Jun 26, 2018
@Godley
Contributor

Godley commented Jun 26, 2018

Idea, not for right now: we're specifying that you can't switch to an S3 remote to avoid extra overhead/feature creep/etc., but it'd be quite easy to dm list an S3 remote (just call ListBuckets)
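The ListBuckets idea is easy to sketch behind an interface, so it runs without real AWS credentials - in a real implementation BucketLister would wrap the AWS SDK's ListBuckets call; the names below are assumptions:

```go
package main

import "fmt"

// BucketLister abstracts the one S3 call dm list would need.
type BucketLister interface {
	ListBuckets() ([]string, error)
}

// listDots maps each bucket to a "dot" name - that's all a dm list
// against an S3 remote would have to display.
func listDots(s3 BucketLister) ([]string, error) {
	return s3.ListBuckets()
}

// fakeS3 stands in for a real client in this sketch.
type fakeS3 struct{ buckets []string }

func (f fakeS3) ListBuckets() ([]string, error) { return f.buckets, nil }

func main() {
	dots, _ := listDots(fakeS3{buckets: []string{"my-bucket", "backups"}})
	fmt.Println(dots) // [my-bucket backups]
}
```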

@Godley
Contributor

Godley commented Jun 26, 2018

Testing thoughts:
there's a whole host of s3 mock servers out there, some already containerised, some not:
https://github.com/findify/s3mock
https://github.com/adobe/S3Mock
https://github.com/jserver/mock-s3
https://www.npmjs.com/package/mock-aws-s3

But many of them explicitly state they don't support versioning, which might be a problem.

@Godley
Contributor

Godley commented Jun 26, 2018

Some other random thoughts:

  1. Kate made a point about whether we can do something to automate IAM policy creation etc.; should look into what can be done there. At the moment the expectation is that you go off and create your own key/secret pair, which generally grants access to your whole account.
  2. Should the secret be entered the same way the API key is, i.e. hidden? At minimum it's <keyId>:<secretKey>
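Parsing that <keyId>:<secretKey> form is straightforward; a sketch (parseS3Creds is a hypothetical helper name), using SplitN with n=2 so any ':' characters inside the secret stay intact:

```go
package main

import (
	"fmt"
	"strings"
)

// parseS3Creds splits "<keyId>:<secretKey>" into its two halves.
func parseS3Creds(s string) (keyID, secret string, err error) {
	parts := strings.SplitN(s, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected <keyId>:<secretKey>, got %q", s)
	}
	return parts[0], parts[1], nil
}

func main() {
	k, s, err := parseS3Creds("AKIAEXAMPLE:abc/def:ghi")
	fmt.Println(k, s, err) // secret keeps its embedded ':'
}
```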

Godley pushed a commit that referenced this issue Jul 5, 2018
Godley pushed a commit that referenced this issue Jul 5, 2018
Godley pushed a commit that referenced this issue Jul 6, 2018
Godley pushed a commit that referenced this issue Jul 6, 2018
…k we can't push again before committing something new
Godley pushed a commit that referenced this issue Jul 9, 2018
Godley pushed a commit that referenced this issue Jul 9, 2018
@Godley
Contributor

Godley commented Jul 11, 2018

Left to do:

  1. Saving DM state in S3 in case we destroy the local volume for some reason - apparently we don't want this
  2. Need to be able to pull a subset of keys
  3. Force pull/push based on divergence - we don't support forced pull/push from the client for "normal" remotes, so implementing it just for S3 would mean we'd need to do some work on DM as well, or else explicitly tell users that -f for DM remotes does nothing.
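The subset-of-keys item can be sketched as prefix filtering over a bucket listing - S3 itself can filter server-side via the ListObjects Prefix parameter, so this just shows the equivalent logic; filterKeys is a hypothetical name:

```go
package main

import (
	"fmt"
	"strings"
)

// filterKeys keeps only the keys under the requested prefixes. An empty
// prefix list means no subset was requested, i.e. pull everything.
func filterKeys(keys, prefixes []string) []string {
	if len(prefixes) == 0 {
		return keys
	}
	var out []string
	for _, k := range keys {
		for _, p := range prefixes {
			if strings.HasPrefix(k, p) {
				out = append(out, k)
				break
			}
		}
	}
	return out
}

func main() {
	keys := []string{"logs/a.txt", "logs/b.txt", "data/c.csv"}
	fmt.Println(filterKeys(keys, []string{"logs/"})) // [logs/a.txt logs/b.txt]
}
```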

@Godley
Copy link
Contributor

Godley commented Jul 12, 2018

Hacking my way through subsets at the moment. I think there may be some complications with push/pull - if a user clones a part of an S3 bucket, then pushes the changes to S3, this will delete the files we ignored when we cloned it. Similarly, if a user clones then pulls an S3 bucket, we will inadvertently get all of the files when the user requested only a part of them initially.

What I'm thinking is: where we have defaultRemoteVolumeFor defined, for S3 remotes we add another field, partialSubsets - if this isn't null, then it's a subset volume. The client could then fetch that whenever a user attempts to pull or push an S3-origined volume, and pass it into s3TransferRequest for the server to handle.

We may need to think about an override for this, though, such as dm s3 sync, whereby we push all removals/pull everything down. At that point we clear the partialSubsets entry and the server deliberately does the full-sync behaviour described above.
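The partialSubsets push behaviour sketched above might look like this - when pushing a dot cloned from a subset, only delete remote keys inside the subset prefixes, so keys we never cloned survive the push; an empty partialSubsets means a full dm s3 sync-style push. All names here are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// keysToDelete returns the remote keys a push should remove: those that
// are absent locally AND fall inside the cloned subset. With no subset,
// everything is in scope (full sync).
func keysToDelete(remoteKeys, localKeys, partialSubsets []string) []string {
	local := map[string]bool{}
	for _, k := range localKeys {
		local[k] = true
	}
	inSubset := func(k string) bool {
		if len(partialSubsets) == 0 {
			return true
		}
		for _, p := range partialSubsets {
			if strings.HasPrefix(k, p) {
				return true
			}
		}
		return false
	}
	var out []string
	for _, k := range remoteKeys {
		if !local[k] && inSubset(k) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	remote := []string{"logs/a.txt", "logs/b.txt", "data/c.csv"}
	local := []string{"logs/a.txt"} // user deleted logs/b.txt locally
	// data/c.csv was never cloned, so it must not be deleted.
	fmt.Println(keysToDelete(remote, local, []string{"logs/"}))
}
```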

Godley pushed a commit that referenced this issue Jul 12, 2018
Godley pushed a commit that referenced this issue Jul 12, 2018
Godley pushed a commit that referenced this issue Jul 12, 2018
Godley pushed a commit that referenced this issue Jul 13, 2018
Godley pushed a commit that referenced this issue Jul 13, 2018
Godley pushed a commit that referenced this issue Jul 13, 2018
@Godley Godley closed this as completed Jul 16, 2018