Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for subfolders for aws s3 #1191

Merged
merged 11 commits into from
Apr 11, 2023

Conversation

kenxu95
Copy link
Contributor

@kenxu95 kenxu95 commented Apr 10, 2023

Describe your changes and why you are making these changes

When registering an S3 integration as the storage layer, the user can specify a subfolder that then acts as the root of artifact storage. The vault/ folder is created inside this subfolder. This subfolder can be defined with optional leading/trailing slashes - the backend will sanitize it into the consistent form path/to/dir before writing it to the database.

Related issue number (if any)

ENG-2738

Loom demo (if any)

https://www.loom.com/share/3dbf7fd6fa394117a79d5b77661af3d3

Checklist before requesting a review

  • I have created a descriptive PR title. The PR title should complete the sentence "This PR...".
  • I have performed a self-review of my code.
  • I have included a small demo of the changes. For the UI, this would be a screenshot or a Loom video.
  • If this is a new feature, I have added unit tests and integration tests.
  • I have run the integration tests locally and they are passing.
  • I have run the linter script locally (See python3 scripts/run_linters.py -h for usage).
  • All features on the UI continue to work correctly.
  • Added one of the following CI labels:
    • run_integration_test: Runs integration tests
    • skip_integration_test: Skips integration tests (Should be used when changes are ONLY documentation/UI)

@kenxu95 kenxu95 requested review from cw75 and saurav-c April 10, 2023 03:43
Base automatically changed from add_python_linting_to_scripts to main April 10, 2023 03:44
@kenxu95 kenxu95 requested a review from vsreekanti April 10, 2023 18:07
@kenxu95 kenxu95 force-pushed the eng-2738-add-support-for-subfolders-for-aws-s3 branch from 1483fac to 24499ba Compare April 10, 2023 19:23
@kenxu95
Copy link
Contributor Author

kenxu95 commented Apr 10, 2023

@vsreekanti should this subfolder thing also apply when the uses the integration as a data integration? Eg. should the file paths be relative to the subdirectory also?

Copy link
Contributor

@cw75 cw75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kenxu95 looks good overall, left a few questions and comments.

src/golang/lib/models/shared/storage.go Outdated Show resolved Hide resolved
src/golang/lib/storage/convert.go Show resolved Hide resolved
# Check that any user-supplied root directory exists.
if self.root_dir != "":
# If nothing is returned by this filter call, then the directory does not exist.
if len(list(self.s3.Bucket(self.bucket).objects.filter(Prefix=self.root_dir))) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's a lot of objects in this root directory, this call may take very long to finish. I wonder if there's a lower overhead way of checking if a s3 directory exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, let's discuss in https://aqueducthq.slack.com/archives/C01KF131001/p1681168125986639? We already use this method in our extract implementation when fetching entire directories.

region: str = ""

# Expected to be in the format "path/to/dir/" (without a leading slash, but with a trailing one).
root_dir: str = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used in data connection? I thought it's only relevant for S3 used as storage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has to be here because the model forbids extra fields. I'll add a note that it's actually unused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to add the note.

return bucket, key

if root_dir_path != "":
path_parts += [_sanitize_path(root_dir_path)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if root_dir_path is assumed to be in the format path/to/dir/, why do we need to sanitize it again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is assumed, but we don't check it here. I didn't want to be too strict with the assertions in case there was a codepath I missed. Agree, it's confusing - maybe I can just remove the comment for now and make no assumptions about the root path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to remove the comment.

@kenxu95 kenxu95 requested a review from cw75 April 10, 2023 23:20
Copy link
Contributor

@cw75 cw75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. See reminders inline to add the notes.

return bucket, key

if root_dir_path != "":
path_parts += [_sanitize_path(root_dir_path)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to remove the comment.

region: str = ""

# Expected to be in the format "path/to/dir/" (without a leading slash, but with a trailing one).
root_dir: str = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to add the note.

@kenxu95 kenxu95 added the run_integration_test Triggers integration tests label Apr 10, 2023
@kenxu95 kenxu95 merged commit 17e6ad8 into main Apr 11, 2023
@vsreekanti vsreekanti deleted the eng-2738-add-support-for-subfolders-for-aws-s3 branch April 13, 2023 00:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_integration_test Triggers integration tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants