Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GoogleDriveLoader seems to be pulling trashed documents from the folder #5104

Closed
2 tasks done
acetinick opened this issue May 22, 2023 · 5 comments · Fixed by #5220
Closed
2 tasks done

GoogleDriveLoader seems to be pulling trashed documents from the folder #5104

acetinick opened this issue May 22, 2023 · 5 comments · Fixed by #5220

Comments

@acetinick
Copy link

acetinick commented May 22, 2023

System Info

Hi

testing this loader, it looks as tho this is pulling trashed files from folders. I think this should be default to false if anything and be an opt in.

Who can help?

No response

Information

  • The official example notebooks/scripts

Related Components

  • Document Loaders

Reproduction

use GoogleDriveLoader

  1. point to folder
  2. move a file to trash in folder

Reindex

File still can be searched in vector store.

Expected behavior

Should not be searchable

@NickL77
Copy link
Contributor

NickL77 commented May 23, 2023

Do you have more details on reproducing? Are you using google_drive.ipynb?

Looking through the code, if folder_id is provided, the loader only fetches the files after load() is called. This would be surprising behavior unless there is lag on Google's side.

On the other hand, if document_ids or file_ids is provided, it makes sense to me why this would be happening (although unclear if it's desirable), as the google-provided id may stay the same even if the document is moved to the trash.

@Klaudioz
Copy link

It's happening the same to me, so I've recorded a video showing the issue.

This is the script I've used for the test:

from langchain.document_loaders import GoogleDriveLoader

loader = GoogleDriveLoader(
    folder_id="<ADD YOUR FOLDER ID",
    file_types=["sheet", "document", "pdf"],
    recursive=False,
    credentials_path="credentials.json",
    token_path="token.json"
)

docs = loader.load()
print(len(docs))

@Klaudioz
Copy link

It's working well if I go to the trash in Google Drive, and click "Empty Trash".

@NickL77
Copy link
Contributor

NickL77 commented May 24, 2023

It's working well if I go to the trash in Google Drive, and click "Empty Trash".

+1, I just tested this and saw the same behavior. This seems like a Google Drive pattern.

@NickL77
Copy link
Contributor

NickL77 commented May 24, 2023

I found a way to check if a file is trashed, fill send out a PR tonight.

hwchase17 pushed a commit that referenced this issue May 25, 2023
…issue #5104) (#5220)

# Change Default GoogleDriveLoader Behavior to not Load Trashed Files
(issue #5104)

Fixes #5104

If the previous behavior of loading files that used to live in the
folder, but are now trashed, you can use the `load_trashed_files`
parameter:

```
loader = GoogleDriveLoader(
    folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
    recursive=False,
    load_trashed_files=True
)
```

As not loading trashed files should be expected behavior, should we
1. even provide the `load_trashed_files` parameter?
2. add documentation? Feels most users will stick with default behavior

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

DataLoaders
- @eyurtsev

Twitter: [@nicholasliu77](https://twitter.com/nicholasliu77)
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this issue Jun 19, 2023
…issue langchain-ai#5104) (langchain-ai#5220)

# Change Default GoogleDriveLoader Behavior to not Load Trashed Files
(issue langchain-ai#5104)

Fixes langchain-ai#5104

If the previous behavior of loading files that used to live in the
folder, but are now trashed, you can use the `load_trashed_files`
parameter:

```
loader = GoogleDriveLoader(
    folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
    recursive=False,
    load_trashed_files=True
)
```

As not loading trashed files should be expected behavior, should we
1. even provide the `load_trashed_files` parameter?
2. add documentation? Feels most users will stick with default behavior

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:

DataLoaders
- @eyurtsev

Twitter: [@nicholasliu77](https://twitter.com/nicholasliu77)
This was referenced Jun 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants