Improve C4 filter and dedup #124

Merged on Mar 20, 2024 (22 commits)

@guipenedo (Collaborator) commented Mar 13, 2024

  • heavily refactored sentence dedup
  • performance improvements to sentence dedup for large scale execution
  • added new options for sentence dedup, more similar to the actual C4 code (split on lines instead of paragraphs)
  • extracted config into a new dataclass
  • rewrote C4 style filters based on the official code
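
To illustrate the line-based dedup option and the extracted config mentioned above, here is a minimal sketch. It is not the code from this PR: `DedupConfig`, `split_units`, and `dedup_documents` are hypothetical names, and the single-process in-memory `set` stands in for whatever large-scale hashing the real implementation uses. It assumes the C4-style rule of keeping only the first occurrence of any span of `n_units` consecutive units.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    """Hypothetical config object; field names are illustrative only."""
    n_units: int = 3             # length of the span that gets hashed
    split_on_lines: bool = True  # True: split on "\n"; False: split on blank lines


def split_units(text: str, config: DedupConfig) -> list[str]:
    """Split a document into lines or paragraphs, dropping empty units."""
    sep = "\n" if config.split_on_lines else "\n\n"
    return [u.strip() for u in text.split(sep) if u.strip()]


def span_hash(units: list[str]) -> str:
    """Hash a span of units after light normalization (lowercasing)."""
    joined = "\n".join(u.lower() for u in units)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def dedup_documents(docs: list[str], config: DedupConfig) -> list[str]:
    """Drop every unit that belongs to a span already seen earlier."""
    sep = "\n" if config.split_on_lines else "\n\n"
    seen: set[str] = set()
    result = []
    for text in docs:
        units = split_units(text, config)
        drop: set[int] = set()
        for i in range(len(units) - config.n_units + 1):
            h = span_hash(units[i:i + config.n_units])
            if h in seen:
                # Repeated span: mark all of its units for removal.
                drop.update(range(i, i + config.n_units))
            else:
                seen.add(h)
        result.append(sep.join(u for j, u in enumerate(units) if j not in drop))
    return result
```

With `DedupConfig(n_units=2)`, the documents `"a\nb\nc"` and `"x\na\nb\ny"` come back as `"a\nb\nc"` and `"x\ny"`, because the span `a, b` was already seen in the first document.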

@guipenedo marked this pull request as ready for review March 19, 2024 15:37
@guipenedo merged commit 55c6b1c into main Mar 20, 2024
4 checks passed
@guipenedo deleted the c4-dedup branch March 20, 2024 11:05