
Assign more CPU to a single task to speed it up for the local executor? #214

Open
barbara-su opened this issue Jun 11, 2024 · 5 comments

@barbara-su

I am using the local executor. My machine has 48 CPUs and 348 GB of RAM. Any idea how to speed this up? Currently a single task (tasks=1, processing one warc.gz file of ~1 GB) takes half an hour. This is my executor code, borrowed from the fineweb example. Also, I have 200 warc.gz files to process. Is setting tasks=200 the correct way?

# imports added to match the fineweb example this snippet is based on
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
    C4QualityFilter,
    FineWebQualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
    LanguageFilter,
    URLFilter,
)
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter
from datatrove.utils.typeshelper import Languages

main_processing_executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            "tur_subsubset",
            compression="gzip",
            glob_pattern="*.warc.gz",
        ),
        URLFilter(),
        Trafilatura(favour_precision=True, timeout=10),
        # (Languages.turkish) is just a parenthesized string, not a tuple;
        # pass a list (or add a trailing comma) to get the intended container
        LanguageFilter(languages=[Languages.turkish]),
        GopherRepetitionFilter(),
        GopherQualityFilter(),
        C4QualityFilter(filter_no_terminal_punct=False),
        FineWebQualityFilter(),
        JsonlWriter(f"{FILTERING_OUTPUT_PATH}/output/{DUMP_TO_PROCESS}"),
    ],
    tasks=200,
    workers=44,
    logging_dir=f"{MAIN_OUTPUT_PATH}/logs/base_processing/{DUMP_TO_PROCESS}",
)
@guipenedo
Collaborator

guipenedo commented Jun 12, 2024

This should be the most optimized setup, yes. You can optionally increase workers a little (depending on what the remaining 4 CPUs are busy doing).
Do note that we do not recommend using the default (English) values for GopherQualityFilter and FineWebQualityFilter if you are processing Turkish data. You should probably tune/adapt the options of those blocks to your language.
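For illustration, a minimal sketch of what that language-specific tuning could look like. The parameter names (language, stop_words, min_stop_words, line_punct_thr, short_line_length) match recent datatrove releases, but check the filter signatures in your installed version; the Turkish stop word list and the threshold values below are placeholders to tune, not vetted settings:

from datatrove.pipeline.filters import FineWebQualityFilter, GopherQualityFilter
from datatrove.utils.typeshelper import Languages

# placeholder stop word list, for illustration only (not a vetted resource)
turkish_stop_words = ["ve", "bir", "bu", "da", "için", "ile"]

gopher_quality = GopherQualityFilter(
    language=Languages.turkish,     # word-tokenize as Turkish instead of the English default
    stop_words=turkish_stop_words,  # Gopher's built-in stop words are English
    min_stop_words=2,
)
fineweb_quality = FineWebQualityFilter(
    language=Languages.turkish,
    line_punct_thr=0.12,   # English default; re-tune on a sample of Turkish documents
    short_line_length=30,  # likewise
)

These two objects would then replace the plain GopherQualityFilter() and FineWebQualityFilter() entries in the pipeline list above.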

@barbara-su
Author

Thank you!!!!!

@barbara-su
Author

Also, how should I set the tasks parameter? For instance, if I have 10000 files, should I set tasks=10000?

@guipenedo
Collaborator

You can, yes. If tasks > number of files, then the excess tasks will not perform any work, as we do not currently split files.
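For intuition, a minimal sketch of the file-to-task assignment, assuming the round-robin sharding that datatrove's disk readers use internally (get_shard below is an illustrative helper, not the library's API):

# illustrative helper, not datatrove's actual API: shard input files
# across task ranks round-robin, as the disk readers do internally
def get_shard(files: list[str], rank: int, world_size: int) -> list[str]:
    return files[rank::world_size]

files = [f"{i:05d}.warc.gz" for i in range(10)]
for rank in range(4):
    # with tasks=4 every file is assigned: the ranks get 3, 3, 2 and 2 files;
    # with tasks=20, ranks 10-19 would receive an empty shard and do no work
    print(rank, get_shard(files, rank, world_size=4))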

@barbara-su
Author

What about the case where tasks < number of files? Will all the files still be processed? Will execution be faster?
