
Sentence deduplication output #261

Closed
solene-evain opened this issue Jul 30, 2024 · 6 comments
Comments

@solene-evain

Hi,

I started using datatrove for deduplication. While I managed to understand the minhash_deduplication script, I'm having difficulty understanding the outputs of sentence_deduplication.py.

All I obtain are 'intermediate', 'sent_dups' and 'sent_sigs' folders.

1/ 'sent_sigs' is supposed to contain a signature for each document. I've got 15 docs, but only 9 output folders in here, with 3 c4_sig files in each that I can't read.

2/ 'sent_dups' also contains 9 folders, with 2 c4_dup files in each. What exactly do these files contain?

3/ Where is the output of SentenceDedupFilter? The final stats seem to be okay:
" Stats: {total: 15, doc_len: 259 [min=33, max=106, 64.75±35/doc], removed_sentences: 32 [min=2, max=5, 2.91±1/doc], original_sentences: 36 [min=2, max=5, 3.27±1/doc]}"
but I can't figure out how, since I can't find any new version of the documents with the removed sentences.

Could you provide any help on that?
Thanks

@guipenedo
Collaborator

Hi, can you share the script you used? Normally you would add a Writer after the filter and choose yourself where to save the final output.

@solene-evain
Author

Yes sure!

I used this script: https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py

which I modified like this:

```python
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader, WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages
from datatrove.pipeline.writers.disk_base import DiskWriter

"""
example on how to use sentence-deduplication. sentence-deduplication implements deduplication as in:
https://jmlr.org/papers/v21/20-074.html
'To deduplicate the data set, we discarded all but one of any three-sentence span
occurring more than once in the data set.'

to run deduplication we need to run three different pipelines,
pipeline 1:
    implements usual extraction + quality filtering, it ends with SentenceDedupSignature, prepended by a writer.
pipeline 2:
    implements only SentenceFindDedups
pipeline 3:
    implements SentenceDedupFilter prepended by a reader of the same writer-kind used during stage 1.
"""

# modify sentence dedup hyper params here
sent_dedup_config = SentDedupConfig(
    n_sentences=2,
    split_sentences=True,  # set to False to split on \n instead
    only_dedup_in_index=True,
    min_doc_words=1,
)

FINDER_WORKERS = 10  # this will speed up/parallelize step 2


def run_example():
    # 1. create a signature for each sentence in each doc
    pipeline_1 = [
        # WarcReader(data_folder="warc/", limit=1000),
        JsonlReader(data_folder="./", paths_file="path_file.txt"),
        # Trafilatura(),
        # GopherQualityFilter(min_stop_words=0),
        # LanguageFilter(language_threshold=0.5, languages=(Languages.english,)),
        JsonlWriter("sd_out/intermediate/"),
        SentenceDedupSignature(
            output_folder="sd_out/sent_sigs/",
            config=sent_dedup_config,
            language=Languages.french,
            finder_workers=FINDER_WORKERS,
        ),
    ]

    # 2. reads all the signatures and loads them to check for duplicates.
    pipeline_2 = [
        SentenceFindDedups(
            data_folder="sd_out/sent_sigs/",
            output_folder="sd_out/sent_dups/",
            config=sent_dedup_config,
        )
    ]

    # 3. reads a document pipeline and removes duplicated sentences found before
    pipeline_3 = [
        JsonlReader(data_folder="sd_out/intermediate/"),
        SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=4, tasks=4)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=4, tasks=4)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())


if __name__ == "__main__":
    run_example()
```

@guipenedo
Collaborator

You're indeed missing a writer after the filter:

```python
pipeline_3 = [
    JsonlReader(data_folder="sd_out/intermediate/"),
    SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    JsonlWriter("sd_out/final_output/"),
]
```
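With the writer in place, the deduplicated documents land as JSONL shards under the output folder (JsonlWriter gzip-compresses them by default). A minimal sketch for inspecting them, assuming a hypothetical `load_deduped_docs` helper and the `sd_out/final_output/` path used above:

```python
import glob
import gzip
import json


def load_deduped_docs(folder: str) -> list[dict]:
    """Read every gzipped JSONL shard in `folder` and return the parsed documents."""
    docs = []
    for path in sorted(glob.glob(f"{folder}/*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                docs.append(json.loads(line))
    return docs
```

Comparing the `text` fields of these documents against the ones in `sd_out/intermediate/` shows which sentences the filter removed.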

@solene-evain
Author

Thank you! Now I've got the keys to understand the deduplication process 🙏

@solene-evain
Author

Suggestion: maybe this should be added to the original script https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py as it is missing there too!

@guipenedo
Collaborator

Good catch, I've added it
