Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added Document.from_gcs_multi() method to get multiple wrapped documents from a GCS directory. #223

Closed
wants to merge 9 commits into from

Conversation

holtskinner
Copy link
Member

Fixes #214 🦕

  • Note: from_gcs() takes in a GCS directory, but it only works for a single sharded document from a single input document source.
  • In a GA release, it would be a better practice to have from_gcs() take in any GCS directory and output a list of Wrapped Documents. But this would be a backwards-incompatible change now.
    • Not sure if it's possible/advisable to have two possible return types for from_gcs() and just have it return a list when there are multiple Wrapped Documents?

… documents from a GCS directory.

Fixes #214

- Note: `from_gcs()` takes in a GCS directory, but it only works for a single sharded document from a single input document source.
- In a GA release, it would be a better practice to have `from_gcs()` take in any GCS directory and output a list of Wrapped Documents. But this would be a backwards-incompatible change now.
  - Not sure if it's possible/advisable to have two possible return types for `from_gcs()` and just have it return a list when there are multiple Wrapped Documents?
@holtskinner holtskinner requested review from a team as code owners December 12, 2023 16:56
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Dec 12, 2023
holtskinner and others added 2 commits December 12, 2023 10:58
- Input data doesn't match what the actual `storage.list_blobs()` method outputs.
- Updated output in `print_gcs_document_tree()` to match the test cases, as this was the intended structure.
List[Document]:
A List of documents from gcs.
"""
return [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is assuming that every sub-directory in the parent (bucket+prefix) is acceptable by Document.from_gcs, which I think has some validation checks.

What is the desired behavior here if some of those sub-directories causes Document.from_gcs to fail? Should the whole list_from_gcs call fail, or should it return just whatever is successfully loaded (maybe along with some metadata explaining the failed items)?

(Let's include unit tests for the case where one of the sub-directories contains wrong files.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_document_tree() only outputs directories that contain JSON files. (However, it doesn't directly validate if they are Document JSON format.)

.from_gcs() will throw a ValueError if a file is invalid. Or the documentai.Document.from_json() method will throw an exception.

It would probably make sense for this batch case to return whatever was successfully loaded and just skip files that throw exceptions. (And printing out or returning the failed files.)

@holtskinner holtskinner changed the title feat: Added Document.list_from_gcs() method to get multiple wrapped documents from a GCS directory. feat: Added Document.from_gcs_multi() method to get multiple wrapped documents from a GCS directory. Dec 15, 2023
@holtskinner
Copy link
Member Author

Closing out this PR. I don't think it makes sense to add this method because only the end user will know how exactly the GCS directories are set up. For example, if the documents in each directory are shards of the same document or should be treated as separate documents.

Unless the documents are being imported from the Batch Process output, in that case, the user can use from_batch_process_operation() or from_batch_process_metadata().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow Importing Multiple Documents from a single GCS Path
2 participants