
Documents are not saved in Opensearch when using textsplitters #2036

Closed
nirajmchauhan opened this issue Jul 21, 2023 · 11 comments
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@nirajmchauhan

I am using CheerioWebBaseLoader to load documents and save them in AWS OpenSearch. I noticed that the documents are not getting saved, and no error is thrown either. But when I remove the text splitter, the document is saved. Here is my code:

import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";
import { Document } from "langchain/document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenSearchVectorStore } from "langchain/vectorstores/opensearch";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

// "client" (an @opensearch-project/opensearch Client) and "link" (the page URL)
// are defined elsewhere.
const loader = new CheerioWebBaseLoader(link);

const rawDocs = (await loader.load()).map((d) => {
  return new Document({
    metadata: { originalSource: link },
    pageContent: d.pageContent,
  });
});

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 100,
  chunkOverlap: 10,
});

const docs = await textSplitter.splitDocuments(rawDocs);

await OpenSearchVectorStore.fromDocuments(
  docs,
  new OpenAIEmbeddings(),
  {
    client,
    indexName: "documents",
  }
);

I cannot drop the text splitter, because if the document is huge, OpenAI returns 400 errors. Is there a solution to this?

@dosubot dosubot bot added the auto:bug label Jul 21, 2023
@dosubot

dosubot bot commented Jul 21, 2023

Answer generated by a 🤖

Answer

I understand that you're having trouble saving documents in AWS OpenSearch when using the TextSplitter in the LangChain framework. Specifically, you're using the CheerioWebBaseLoader to load documents and the RecursiveCharacterTextSplitter to split them. However, when the TextSplitter is used, the documents are not being saved, even though no error is thrown.

The issue might be related to the size of the chunks that your documents are being split into. The RecursiveCharacterTextSplitter is currently set to split your documents into chunks of 100 characters with an overlap of 10 characters. If these chunks are too small, they might not meet the minimum size requirement for indexing in OpenSearch, which could be why the documents are not being saved.

As a potential solution, I suggest increasing the chunkSize and chunkOverlap values when you instantiate the RecursiveCharacterTextSplitter. Here's how you can adjust your code:

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // Increase chunkSize
  chunkOverlap: 200, // Increase chunkOverlap
});

This modification will split your documents into larger chunks, which should meet the indexing requirements of OpenSearch and allow your documents to be saved.

Please note that this is just one potential solution based on the information provided. If this doesn't resolve your issue, or if your experience contradicts this solution, I recommend checking if you're using the latest version of the LangChain framework, as updates often include bug fixes and improvements.

If you find that this solution is correct and you believe this is a bug that could impact other users, we encourage you to make a pull request. Your contribution would be greatly appreciated by the LangChain community.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

@nirajmchauhan
Author

Hey 🤖
I changed the chunk and overlap sizes as you suggested, but the documents are still not getting saved.

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

@nirajmchauhan
Author

OK, after some debugging, I came across this error:

{
  type: "mapper_parsing_exception",
  reason: "failed to parse field [metadata.loc] of type [keyword] in document with id 'bd2f243c-3a51-40a1-8887-d39b327c127d'. Preview of field's value: '{lines={from=1, to=173}}'",
  caused_by: {
    type: "illegal_state_exception",
    reason: "Can't get text on a START_OBJECT at 1:19523",
  },
}
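
For context, this nested loc object is what RecursiveCharacterTextSplitter.splitDocuments attaches to every chunk's metadata, and it is what the existing keyword mapping for metadata.loc cannot parse. A quick way to see it (reusing docs from the snippet in the issue description):

// Log the first chunk's metadata to see the nested object that the
// keyword mapping for metadata.loc fails to index.
console.log(docs[0].metadata);
// e.g. { originalSource: "<link>", loc: { lines: { from: 1, to: 173 } } }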

@dosubot

dosubot bot commented Jul 21, 2023

Answer generated by a 🤖

Answer

I found related issues that might be helpful. I did my best to summarize the solutions, but I recommend looking at the related issues yourself.

Open Issues

Opensearch store does not support object in metadata and fail silently

The suggested solution is:

  1. Do not support object in metadata, similar to the approach taken in the Pinecone vector store. This can be seen in the code here: https://github.com/hwchase17/langchainjs/blob/61e89dbaec07b4b50a55efe0f74faec5b2e701fd/langchain/src/vectorstores/pinecone.ts#L59

  2. Throw an appropriate error so that the client knows what's wrong. This can be implemented following the guide here: https://github.com/opensearch-project/opensearch-js/blob/main/guides/bulk.md#handling-errors

As a workaround until the solution is implemented, you can delete the object in metadata using the following code:

  import { basename } from "path";

  docs.forEach((d) => {
    d.metadata.source = basename(d.metadata.source);
    delete d.metadata.pdf; // drop nested objects that the keyword mapping cannot index
    delete d.metadata.loc;
  });

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

@jacoblee93
Collaborator

The issue seems to be that OpenSearch can't handle certain types of metadata?

I think the bot's response sounds reasonable here - adding a filter step to the vector store itself would be nice too if you feel up for opening a PR?
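
As an interim illustration of that filter step (applied outside the vector store, since the store does not do this itself yet), here is a hedged sketch that strips non-primitive metadata values before calling fromDocuments; the helper name and placement are assumptions, not part of langchain:

import { Document } from "langchain/document";

// Keep only primitive metadata values so OpenSearch's keyword mappings can index them.
function stripObjectMetadata(docs: Document[]): Document[] {
  return docs.map(
    (d) =>
      new Document({
        pageContent: d.pageContent,
        metadata: Object.fromEntries(
          Object.entries(d.metadata).filter(([, value]) =>
            ["string", "number", "boolean"].includes(typeof value)
          )
        ),
      })
  );
}

// Usage: OpenSearchVectorStore.fromDocuments(stripObjectMetadata(docs), embeddings, { client, indexName });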

@nirajmchauhan
Copy link
Author

I ended up fixing this by adding an index mapping for {metadata: {loc: {lines: {start: Int, end: Int}}}}
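
A minimal sketch of what that mapping might look like with the opensearch-js client, assuming the index does not already hold a conflicting keyword mapping for metadata.loc (otherwise it has to be recreated or reindexed first) and using the from/to field names from the error above; the endpoint is a placeholder:

import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://my-opensearch-endpoint" });

// Declare metadata.loc.lines.from/to as integer fields so OpenSearch stops
// trying to parse the nested loc object as a keyword.
await client.indices.putMapping({
  index: "documents",
  body: {
    properties: {
      metadata: {
        properties: {
          loc: {
            properties: {
              lines: {
                properties: {
                  from: { type: "integer" },
                  to: { type: "integer" },
                },
              },
            },
          },
        },
      },
    },
  },
});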

@kiraraty
Copy link

kiraraty commented Aug 8, 2023

I ended up fixing this by adding an index mapping for {metadata: {loc: {lines: {start: Int, end: Int}}}}

What's Int, please?

@nirajmchauhan
Author

It's Integer.

@zhjch05
Copy link

zhjch05 commented Oct 8, 2023

@nirajmchauhan Would you mind sharing how you added the indexing? I appreciate the help! I'm facing the same issue. Is there a way to add a general dynamic mapping for metadata that automatically handles this?

@rushikeshRL
Copy link

@zhjch05
You can update the mapping from the OpenSearch dashboard:
[screenshot of the index mapping in the OpenSearch dashboard]
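
If you would rather check the result programmatically than through the dashboard, a small sketch with the opensearch-js client (same placeholder endpoint and index name as above) that prints the current mapping so you can confirm loc.lines.from/to are mapped as integers:

import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://my-opensearch-endpoint" });

// Fetch and print the index mapping; after the fix, metadata.loc.lines.from/to
// should appear as integer fields instead of a single keyword.
const { body } = await client.indices.getMapping({ index: "documents" });
console.log(JSON.stringify(body, null, 2));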

@zhjch05

zhjch05 commented Feb 6, 2024

@zhjch05 You can update the mapping from the OpenSearch dashboard: [screenshot]

Thank you!

This issue was closed.