Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add s3 file loader #599

Merged
merged 11 commits into from
Apr 11, 2023
Merged

Add s3 file loader #599

merged 11 commits into from
Apr 11, 2023

Conversation

jasondotparse
Copy link
Contributor

@jasondotparse jasondotparse commented Apr 4, 2023

Summary

Add a new loader type exactly like this one which already exists in the python repo.

Notes

  • The logic is largely the same as that which is in the python package, except for one difference: the python version benefits from its use of the unstructured python package which we do not have at our disposal in node. Therefore, the best I could do was allow a user to pass in the url of an existing unstructured API endpoint and make use of that within the S3 loader. I'd be open to suggestions if someone thinks a better solution is possible, but for now this seemed like the optimal path forward.
  • I could not find a way to use jest to mock the 'fs' and 'Unstructured' class dependencies of S3Loader for unit testing purposes. After multiple days of trying to mock these imports for a unit test, I finally fell back on just using dependency injection and passing them into the S3Loader class constructor so they could easily be stubbed in the test file.
  • After this work is done I will probably come back and update the class interface for the class to allow S3 access keys or perhaps a single S3 pre-signed URL to be passed into the S3 loader. And also add a S3 directory loader to match the on in the python repo.

@vercel
Copy link

vercel bot commented Apr 4, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchainjs-docs ✅ Ready (Inspect) Visit Preview Apr 11, 2023 9:51am

@vercel
Copy link

vercel bot commented Apr 4, 2023

@jasondotparse is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Collaborator

@nfcampos nfcampos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few comments

langchain/package.json Outdated Show resolved Hide resolved
langchain/src/document_loaders/s3.ts Outdated Show resolved Hide resolved
public async load() {
const { S3Client, GetObjectCommand } = await S3LoaderImports();

const tempDir = this._fs.mkdtempSync(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd want to not use fs and work with a Buffer in memory. This would enable using this outside of Node, eg in Edge environments

I'm going to open a PR very soon to make the Unstructured loader work with Buffers directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I just followed the logic for how the python package implemented the feature so that's why I did it this way. I wouldn't be opposed to coming back to this later and making the change to avoid using fs after the Unstructured class has been updated

Copy link
Collaborator

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@nfcampos nfcampos added the lgtm PRs that are ready to be merged as-is label Apr 11, 2023
@nfcampos nfcampos merged commit f975656 into langchain-ai:main Apr 11, 2023
@jasondotparse jasondotparse deleted the s3-loader branch April 11, 2023 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm PRs that are ready to be merged as-is
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants