split_pdf splits too much, since it does not take into account that different entities might have same type (but different confidence) #336

Open
evekhm opened this issue Jul 11, 2024 · 1 comment

evekhm commented Jul 11, 2024

Here is an example of the entities returned by the splitter:

[text_anchor {
  text_segments {
    end_index: 1424
  }
}
type_: "form1"
confidence: 0.96
page_anchor {
  page_refs {
  }
  page_refs {
    page: 1
  }
  page_refs {
    page: 2
  }
}
, text_anchor {
  text_segments {
    start_index: 1424
    end_index: 6935
  }
}
type_: "form1"
confidence: 0.68
page_anchor {
  page_refs {
    page: 3
  }
  page_refs {
    page: 4
  }
}
]

In this case, all pages are actually of the same type, so the file should not be split. However, document.Document.split_pdf does not detect that: it does not take into account that separate entities can share the same type.
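
To make the behaviour concrete, here is a minimal sketch. It assumes the splitter output above is reduced to plain Python objects and that one output file is produced per entity. `SplitEntity` and `page_ranges_per_entity` are hypothetical names used only for illustration; they are not part of the Document AI Toolbox. The empty `page_refs {}` in the output above is read as page 0, since protobuf omits default values when printing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SplitEntity:
    """Hypothetical stand-in for one entity returned by the splitter."""
    type_: str
    confidence: float
    pages: List[int]  # zero-based page indices taken from page_refs


def page_ranges_per_entity(entities: List[SplitEntity]) -> List[Tuple[str, int, int]]:
    """Return one (type, first_page, last_page) tuple per entity.

    Assumption: each entity becomes its own output file, regardless of
    whether a neighbouring entity has the same type.
    """
    return [(e.type_, min(e.pages), max(e.pages)) for e in entities]


# The two entities from the example above.
entities = [
    SplitEntity("form1", 0.96, [0, 1, 2]),
    SplitEntity("form1", 0.68, [3, 4]),
]

# Two ranges come back even though both entities are "form1".
print(page_ranges_per_entity(entities))
```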

@holtskinner
Member

Ok, this is a bit complicated because the Document AI Custom Splitter specifically detected those two "form1" entries as separate documents.

If we combine them together by default, it could create ambiguity when there are multiple separate documents of the same type in a file.

We could create a parameter like combine_like_document_types or something like that, but I think this issue would be best resolved on the Custom Splitter itself.
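
As a rough sketch of what such a parameter could do, assuming page ranges are represented as plain `(type, first_page, last_page)` tuples: the helper `combine_ranges` below is hypothetical and is not an existing Toolbox function. Only the flag name `combine_like_document_types` comes from the suggestion above.

```python
from typing import List, Tuple

# (document type, first page, last page), zero-based and inclusive
PageRange = Tuple[str, int, int]


def combine_ranges(
    ranges: List[PageRange],
    combine_like_document_types: bool = False,
) -> List[PageRange]:
    """Optionally merge adjacent page ranges that share a document type.

    With the flag off (the default), the ranges are returned unchanged,
    which keeps the one-file-per-entity behaviour.
    """
    if not combine_like_document_types:
        return list(ranges)

    merged: List[PageRange] = []
    for doc_type, start, end in sorted(ranges, key=lambda r: r[1]):
        previous = merged[-1] if merged else None
        # Merge only when the type matches and the pages are contiguous.
        if previous and previous[0] == doc_type and previous[2] + 1 == start:
            merged[-1] = (doc_type, previous[1], end)
        else:
            merged.append((doc_type, start, end))
    return merged


# The example from this issue: two adjacent "form1" ranges.
print(combine_ranges([("form1", 0, 2), ("form1", 3, 4)], combine_like_document_types=True))
```

Keeping the flag off by default would preserve the current output. Turning it on carries the ambiguity mentioned above: two genuinely separate documents of the same type that sit back to back would also be merged into one file.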

@holtskinner changed the title to "split_pdf splits too much, since it does not take into account that different entities might have same type (but different confidence)" on Jul 15, 2024