
Sentence deduplication output #261

Closed
solene-evain opened this issue Jul 30, 2024 · 6 comments
Comments

@solene-evain

Hi,

I started using datatrove for deduplication. While I managed to understand the minhash_deduplication script, I'm having difficulty understanding the outputs of sentence_deduplication.py.

All I obtain are 'intermediate', 'sent_dups' and 'sent_sigs' folders.

1/ 'sent_sigs' is supposed to contain a signature for each document. I've got 15 docs, but only 9 output folders in here, with 3 c4_sig files in each that I can't read.

2/ 'sent_dups' also contains 9 folders, with 2 c4_dup files in each. What exactly do these files contain?

3/ Where is the output of SentenceDedupFilter? The final stats seem to be okay:
" Stats: {total: 15, doc_len: 259 [min=33, max=106, 64.75±35/doc], removed_sentences: 32 [min=2, max=5, 2.91±1/doc], original_sentences: 36 [min=2, max=5, 3.27±1/doc]}"
but I can't figure out how, since I can't find any new version of the documents with the removed sentences.

Could you provide any help on that?
Thanks

@guipenedo
Collaborator

Hi, can you share the script you used? Normally you would add a Writer after the filter and choose yourself where to save the final output.

@solene-evain
Author

Yes sure!

I used this script: https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py

which I modified like this:

```python
from datatrove.executor.base import PipelineExecutor
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup import SentenceDedupFilter, SentenceDedupSignature, SentenceFindDedups
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import JsonlReader, WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages
from datatrove.pipeline.writers.disk_base import DiskWriter

"""
example on how to use sentence-deduplication. sentence-deduplication implements deduplication as in:
https://jmlr.org/papers/v21/20-074.html
'To deduplicate the data set, we discarded all but one of any three-sentence span
occurring more than once in the data set.'

to run deduplication we need to run three different pipelines,
pipeline 1:
    implements usual extraction + quality filtering, it ends with SentenceDedupSignature, prepended by a writer.
pipeline 2:
    implements only SentenceFindDedups
pipeline 3:
    implements SentenceDedupFilter prepended by a reader of the same writer-kind used during stage 1.
"""

# modify sentence dedup hyper params here
sent_dedup_config = SentDedupConfig(
    n_sentences=2,
    split_sentences=True,  # set to False to split on \n instead
    only_dedup_in_index=True,
    min_doc_words=1,
)

FINDER_WORKERS = 10  # this will speed up/parallelize step 2


def run_example():
    # 1. create a signature for each sentence in each doc
    pipeline_1 = [
        # WarcReader(data_folder="warc/", limit=1000),
        JsonlReader(data_folder="./", paths_file="path_file.txt"),
        # Trafilatura(),
        # GopherQualityFilter(min_stop_words=0),
        # LanguageFilter(language_threshold=0.5, languages=(Languages.english,)),
        JsonlWriter("sd_out/intermediate/"),
        SentenceDedupSignature(
            output_folder="sd_out/sent_sigs/",
            config=sent_dedup_config,
            language=Languages.french,
            finder_workers=FINDER_WORKERS,
        ),
    ]

    # 2. reads all the signatures and loads them to check for duplicates.
    pipeline_2 = [
        SentenceFindDedups(
            data_folder="sd_out/sent_sigs/",
            output_folder="sd_out/sent_dups/",
            config=sent_dedup_config,
        )
    ]

    # 3. reads a document pipeline and removes duplicated sentences found before
    pipeline_3 = [
        JsonlReader(data_folder="sd_out/intermediate/"),
        SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    ]

    executor_1: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_1, workers=4, tasks=4)
    executor_2: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_2, workers=1, tasks=FINDER_WORKERS)
    executor_3: PipelineExecutor = LocalPipelineExecutor(pipeline=pipeline_3, workers=4, tasks=4)

    print(executor_1.run())
    print(executor_2.run())
    print(executor_3.run())


if __name__ == "__main__":
    run_example()
```

@guipenedo
Collaborator

You're indeed missing a writer after the filter:

```python
pipeline_3 = [
    JsonlReader(data_folder="sd_out/intermediate/"),
    SentenceDedupFilter(data_folder="sd_out/sent_dups/", config=sent_dedup_config, language=Languages.french),
    JsonlWriter("sd_out/final_output/"),
]
```
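With the writer in place, the deduplicated documents land as JSONL shards under the output folder (JsonlWriter gzip-compresses them by default). A minimal sketch for inspecting them, assuming a hypothetical `load_deduped_docs` helper and the `sd_out/final_output/` path used above:

```python
import glob
import gzip
import json


def load_deduped_docs(folder: str) -> list[dict]:
    """Read every gzipped JSONL shard in `folder` and return the parsed documents."""
    docs = []
    for path in sorted(glob.glob(f"{folder}/*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                docs.append(json.loads(line))
    return docs
```

Comparing the `text` fields of these documents against the ones in `sd_out/intermediate/` shows which sentences the filter removed.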

@solene-evain
Author

Thank you! Now I've got the keys to understand the deduplication process 🙏

@solene-evain
Author

Suggestion: maybe this should be added to the original script https://github.com/huggingface/datatrove/blob/b5443d2b8ef473262bc97b3d7717a217b6eaf1f3/examples/sentence_deduplication.py as it is missing there too!

@guipenedo
Collaborator

Good catch, I've added it
