Sentence deduplication output #261
Comments
Hi, can you share the script you used? Normally you would add a Writer after the filter and choose where to save the final output yourself.
Yes, sure! I used this script: https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py
which I modified like this:
"""
to run deduplication we need to run three different pipelines
"""
# modify sentence dedup hyper params here
sent_dedup_config = SentDedupConfig(...)
FINDER_WORKERS = 10  # this will speed up/parallelize step 2
# 1. create a signature for each sentence in each doc
# 2. reads all the signatures and loads them to check for duplicates.
# 3. reads a documentpipeline and removes duplicated sentences found before
if __name__ == "__main__":
    ...
You're indeed missing a writer after the filter.
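To illustrate why the filter step needs a writer after it, here is a minimal, self-contained sketch of the three stages from the script's comments (signatures, duplicate search, filtering) followed by an explicit write step. It is a toy model in plain Python, not datatrove's implementation: every function name here is hypothetical, it hashes single sentences rather than spans of several sentences, and it splits naively on periods. In datatrove itself, the corresponding fix would be to append a writer step (for example `JsonlWriter`) after `SentenceDedupFilter` in the third pipeline.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def split_sentences(text):
    """Naive splitter (assumption): every '.' is a sentence boundary."""
    return [s.strip() for s in text.split(".") if s.strip()]


def compute_signatures(docs):
    """Stage 1: one (hash, doc_id, sent_id) tuple per sentence."""
    sigs = []
    for doc_id, text in enumerate(docs):
        for sent_id, sentence in enumerate(split_sentences(text)):
            digest = hashlib.sha1(sentence.lower().encode("utf-8")).hexdigest()
            sigs.append((digest, doc_id, sent_id))
    return sigs


def find_duplicates(sigs):
    """Stage 2: keep the first occurrence of each hash, flag the rest."""
    seen, dups = set(), set()
    for digest, doc_id, sent_id in sorted(sigs):
        if digest in seen:
            dups.add((doc_id, sent_id))
        else:
            seen.add(digest)
    return dups


def filter_documents(docs, dups):
    """Stage 3: yield each document without its flagged sentences."""
    for doc_id, text in enumerate(docs):
        kept = [
            s
            for sent_id, s in enumerate(split_sentences(text))
            if (doc_id, sent_id) not in dups
        ]
        if kept:
            yield ". ".join(kept) + "."


def write_jsonl(docs, path):
    """The writer step: without it, the filtered documents are never saved."""
    with open(path, "w", encoding="utf-8") as f:
        for text in docs:
            f.write(json.dumps({"text": text}) + "\n")


def run_demo():
    docs = ["Hello world. Shared line.", "Shared line. Unique text."]
    dups = find_duplicates(compute_signatures(docs))
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "deduplicated.jsonl"
        write_jsonl(filter_documents(docs, dups), out)
        lines = out.read_text(encoding="utf-8").splitlines()
    return [json.loads(line)["text"] for line in lines]


if __name__ == "__main__":
    print(run_demo())
```

The filter stage only yields documents; nothing reaches disk until `write_jsonl` consumes them. That mirrors the situation in this thread, where the stats were reported but no deduplicated documents were found.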
Thank you! Now I've got the keys to understand the deduplication process 🙏
Suggestion: maybe this should be added to the original script https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py as it is missing there too!
Good catch, I've added it.
Hi,
I've started using datatrove for deduplication. While I managed to understand the minhash_deduplication script, I'm having difficulty understanding the outputs of sentence_deduplication.py.
All I obtain are 'intermediate', 'sent_dups' and 'sent_sigs' folders.
1/ 'sent_sigs' is supposed to contain a signature for each document. I have 15 docs but only 9 output folders in here, each with 3 c4_sig files that I can't read.
2/ 'sent_dups' also contains 9 folders, each with 2 c4_dup files. What do these files contain exactly?
3/ Where is the output of SentenceDedupFilter? The final stats seem to be okay:
"Stats: {total: 15, doc_len: 259 [min=33, max=106, 64.75±35/doc], removed_sentences: 32 [min=2, max=5, 2.91±1/doc], original_sentences: 36 [min=2, max=5, 3.27±1/doc]}"
but I can't figure out how they were produced, since I can't find any new version of the documents with the sentences removed.
Could you provide any help on that?
Thanks
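As a reading aid for the stats line quoted in the question, the short sketch below works through its arithmetic. It assumes each entry has the form `total [min, max, mean±std/doc]` and that the mean is the total divided by the number of documents that reported that stat. That reading of the log format is an assumption, not something confirmed in this thread.

```python
# Values copied from the quoted stats line: name -> (total, mean per doc).
stats = {
    "doc_len": (259, 64.75),
    "removed_sentences": (32, 2.91),
    "original_sentences": (36, 3.27),
}


def implied_doc_count(total, mean):
    """Number of documents implied by total / mean (assumed relationship)."""
    return round(total / mean)


counts = {name: implied_doc_count(t, m) for name, (t, m) in stats.items()}
removed_share = stats["removed_sentences"][0] / stats["original_sentences"][0]

print(counts)
print(f"share of counted sentences removed: {removed_share:.0%}")
```

Under that assumption, the two sentence stats come from 11 documents and doc_len from 4, and 32 of the 36 counted sentences (about 89%) were removed. So the filter did run; its output simply was not written anywhere.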