The scripts/code used to match the PDF miner outputs on documents to the XML representations #20

Open
abirami005 opened this issue Feb 26, 2020 · 7 comments

Comments

@abirami005

Do you provide the scripts/code that you developed to match the PDFMiner outputs on the documents to the XML representation of the PDF page itself? Thanks

@zhxgj
Contributor

zhxgj commented Feb 27, 2020

We cannot open source the code at the moment as it is related to our IP protection.

@bertsky

bertsky commented Mar 2, 2020

> We cannot open source the code at the moment as it is related to our IP protection.

Then how about publishing the alignment data themselves in some form?

@zhxgj
Contributor

zhxgj commented Mar 2, 2020

> > We cannot open source the code at the moment as it is related to our IP protection.
>
> Then how about publishing the alignment data themselves in some form?

Em, I did not think of it before. Let me have a check along our legal approval chain.

@pollyMath

I assume this means that providing only the code for extracting annotations from XML representation is also not possible at the moment?

@zhxgj
Contributor

zhxgj commented Mar 5, 2020

@pollyMath Unfortunately that is what our IP lawyer told us.

@bertsky

bertsky commented Jan 11, 2021

> > > We cannot open source the code at the moment as it is related to our IP protection.
> >
> > Then how about publishing the alignment data themselves in some form?
>
> Em, I did not think of it before. Let me have a check along our legal approval chain.

@zhxgj Did your lawyers reach a verdict regarding the publication of PDF/XML alignment data?

Note: This is relevant to a number of potential applications of this corpus, for which some choices made in the COCO format would be incompatible or suboptimal, e.g.

  • definition/granularity of region classes
  • not annotating headers and footers
  • not including reading order of regions
  • not including text lines (contours / baselines)
  • not including text content (plain) and text style (formatting)
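To make the gap concrete, here is a minimal sketch of what a COCO-style annotation record carries compared with the items listed above. The standard COCO object-detection keys (`bbox`, `segmentation`, `category_id`, and so on) are real; the key names in `WISHED_FOR_KEYS`, the helper `missing_keys`, and the sample record are hypothetical and only stand in for information that PDF/XML alignment data could supply.

```python
# Keys of one annotation record in the standard COCO object-detection format.
COCO_ANNOTATION_KEYS = {
    "id", "image_id", "category_id", "bbox", "segmentation", "area", "iscrowd",
}

# Hypothetical key names (not part of COCO) for the information the
# list above describes as absent from the published annotations.
WISHED_FOR_KEYS = {"reading_order", "text_lines", "text", "text_style"}


def missing_keys(annotation, wanted=WISHED_FOR_KEYS):
    """Return the wanted keys that the annotation record does not contain."""
    return sorted(key for key in wanted if key not in annotation)


# Made-up sample record that uses only the standard COCO keys.
sample = {
    "id": 1,
    "image_id": 10,
    "category_id": 1,
    "bbox": [50.0, 60.0, 200.0, 40.0],  # [x, y, width, height]
    "segmentation": [[50.0, 60.0, 250.0, 60.0, 250.0, 100.0, 50.0, 100.0]],
    "area": 8000.0,
    "iscrowd": 0,
}

assert set(sample) == COCO_ANNOTATION_KEYS
print(missing_keys(sample))
```

A record restricted to the standard keys describes only a region's class and geometry, so none of the four hypothetical fields can be derived from it without the underlying PDF/XML alignment.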

bertsky mentioned this issue Jan 12, 2021
@ajjimeno
Member

ajjimeno commented Jan 12, 2021 via email
