Pipeline to cleanup and normalise content of txt files #210

Closed
2 tasks
Tracked by #196
ktagowski opened this issue Feb 10, 2022 · 1 comment

Comments

@ktagowski
Collaborator

ktagowski commented Feb 10, 2022

We want to prepare cleanup and normalisation methods for any dirty text, for example text produced by OCR.

  • Clean up issues such as: runs of blank lines; sentences split across several lines; stray encoding symbols; words broken into chunks.
  • Consider checking whether the content is written in one language or several, and optionally remove content that is not in the main language. By default, everything that is not in Polish should be removed.
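
A minimal sketch of what the two steps above could look like, using only the Python standard library. The function names (`clean_text`, `polish_ratio`, `keep_polish`), the `POLISH_HINTS` word list, and the `threshold` value are all hypothetical and chosen for illustration; they are not part of this project. The stopword check is a deliberately crude stand-in for language detection, and a real pipeline would more likely rely on a dedicated language-identification model.

```python
import re
import unicodedata

# Hypothetical, tiny list of common Polish function words (illustration only).
POLISH_HINTS = {"i", "w", "z", "na", "się", "nie", "że", "to", "jest", "do", "oraz", "dla"}


def clean_text(raw: str) -> str:
    """Normalise OCR-like text: encoding debris, split words, broken lines."""
    text = unicodedata.normalize("NFC", raw)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control/format characters (Unicode category C*), keeping \n and \t.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Drop the Unicode replacement character left behind by failed decoding.
    text = text.replace("\ufffd", "")
    # Rejoin words hyphenated across a line break ("pierw-\nsze" -> "pierwsze").
    # Caveat: this also merges genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse any run of blank lines into a single paragraph break.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    # Treat a single newline inside a paragraph as a soft wrap.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze repeated spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n\n".join(p.strip() for p in text.split("\n\n") if p.strip())


def polish_ratio(paragraph: str) -> float:
    """Share of words in the paragraph that appear in POLISH_HINTS."""
    words = re.findall(r"\w+", paragraph.lower())
    if not words:
        return 0.0
    return sum(w in POLISH_HINTS for w in words) / len(words)


def keep_polish(text: str, threshold: float = 0.1) -> str:
    """Keep only paragraphs whose hint-word ratio reaches the threshold."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(p for p in paragraphs if polish_ratio(p) >= threshold)


if __name__ == "__main__":
    raw = "To jest pierw-\nsze zdanie,\nktóre zostało\npodzielone.\n\n\n\nThis is English."
    print(keep_polish(clean_text(raw)))
```

Filtering is done per paragraph here so that a mostly Polish document with a few foreign passages keeps its Polish parts; short paragraphs and words such as "to" or "do" that also occur in English will be misclassified by this heuristic.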

Please sync with @laugustyniak or @pedrito87 about cleaning post-OCR texts.

@laugustyniak
Collaborator

Dropping due to inactivity and direction shifts

@laugustyniak closed this as not planned on Feb 13, 2023