I have implemented a pipeline to process Common Crawl (CC) data, similar to the FineWeb example in the examples folder. The main issue I'm encountering is that when reading files from CC, the connection sometimes times out, causing execution to stop.
Here is the error message I receive:
```
File "/opt/conda/lib/python3.10/site-packages/aiobotocore/httpsession.py", line 259, in send
    raise ConnectTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://commoncrawl.s3.us-east-1.amazonaws.com/crawl-data/CC-MAIN-2023-50/segments/1700679518883.99/warc/CC-MAIN-20231211210408-20231212000408-00000.warc.gz"
```
Is it possible to change some parameters to mitigate this problem?
Thanks!
This can happen when the Common Crawl bucket is under heavy traffic, or when you yourself are sending a lot of requests. I recommend using a large number of tasks (possibly even 1 task = 1 file): when these errors occur, you can simply relaunch the processing and it will run only the missing tasks/files, without wasting much compute.
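If relaunching still keeps hitting transient timeouts, a client-side retry with exponential backoff can also help. Note that botocore's own `Config` exposes `connect_timeout`, `read_timeout`, and a `retries` dict if your setup lets you pass one through. The sketch below is a generic stdlib wrapper, not datatrove's API; which exceptions you treat as retryable (e.g. botocore's `ConnectTimeoutError`) depends on your setup:

```python
import random
import time


def with_retries(fn, attempts=5, base_delay=1.0, retryable=(OSError, TimeoutError)):
    """Call fn(), retrying transient errors with exponential backoff and jitter.

    `retryable` is illustrative: in a real pipeline you would include the
    exceptions your S3 client raises, e.g. botocore's ConnectTimeoutError.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # exponential backoff plus jitter to avoid thundering-herd retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

You would wrap the per-file read (or download) in `with_retries` so a single timeout does not kill the whole task.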
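The "1 task = 1 file, relaunch only what's missing" idea can be sketched with a per-file completion marker. This is purely illustrative (the marker naming and helper are assumptions, not how datatrove tracks completions internally):

```python
from pathlib import Path


def pending_tasks(warc_paths, done_dir):
    """Return only the WARC files whose completion marker is missing.

    After a task finishes, it would touch `<filename>.done` in `done_dir`;
    on relaunch, only unmarked files are reprocessed.
    """
    done = Path(done_dir)
    return [p for p in warc_paths
            if not (done / (Path(p).name + ".done")).exists()]
```

With one file per task, a relaunch after a crash then costs only the compute for the files that actually failed.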