I would like to get help from Datatrove enthusiasts regarding issues I'm facing while running the example script. #235
Comments
Hi,
Hi, @hynky1999
Then can you check whether s3://data-refine/base_processing/base_processing/output/ contains any folders?
There is no output folder.
Strange, so if you do
It causes an error!
Hello @hynky1999
Hello @hynky1999
Hey, we don't have any community forum as of right now.
Which logs?
I will send all the files.
Hi @hynky1999
Ahh, okay, it seems like none of the files gets through extraction. Could you try increasing the timeout to 1 sec?
I will try. Thank you.
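For context on the timeout suggestion above: the extraction step applies a per-document time limit and drops documents that exceed it, which is why a too-small timeout can make every file fail extraction. Below is a minimal self-contained sketch of that mechanism (illustrative only, not datatrove's actual code; `extract` is a hypothetical stand-in for a real HTML-to-text extractor):

```python
# Sketch of a per-document extraction timeout: each document is extracted
# in a worker, and any result that takes longer than `timeout` seconds is
# skipped rather than crashing the pipeline.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def extract(html: str) -> str:
    # hypothetical stand-in for a real HTML-to-text extractor
    time.sleep(0.01)
    return html.strip()

def extract_with_timeout(docs, timeout=1.0):
    extracted = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for doc in docs:
            future = pool.submit(extract, doc)
            try:
                extracted.append(future.result(timeout=timeout))
            except TimeoutError:
                continue  # document dropped, like a record that fails extraction
    return extracted

print(extract_with_timeout(["  hello ", " world "]))  # ['hello', 'world']
```

With a timeout of, say, 0.001 s every document would be dropped, which matches the "none of the files gets through extraction" symptom; raising it to 1 s gives slow pages a chance to finish.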
Hi @hynky1999
Hello @hynky1999
Hi @hynky1999, I think the server spec is the problem.
Hello @hynky1999
I am good, thank you for asking :)
|
Yeah, we haven't released on PyPI for a while, so we don't have a locked dependency for numpy.
I will try. Thank you.
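If the unlocked numpy dependency is what breaks the install, one common workaround is pinning numpy explicitly in your environment. A hypothetical requirements fragment (the exact version bound is an assumption, not something stated in this thread):

```
# requirements.txt (hypothetical pin; adjust the bound to whatever
# numpy version your datatrove install was last known to work with)
datatrove
numpy<2
```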
Python 3.10+ should be fine.
hello @hynky1999 |
Hi, could you try processing more samples? 10k+? (Set the limit variable in the reader.)
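The `limit` option mentioned above caps how many documents a reader yields; setting it very low (or leaving a small debug value in place) can make later stages produce little or no output. A rough sketch of the assumed behaviour (`read_documents` is a hypothetical stand-in, not a datatrove API):

```python
# Sketch of a reader-side `limit`: yield at most `limit` documents from
# the source, or everything when limit is negative (the usual "no limit"
# convention).
from itertools import islice

def read_documents(source, limit=-1):
    docs = iter(source)
    return docs if limit < 0 else islice(docs, limit)

# limit=3 stops after three documents; limit=-1 reads the whole source.
print(list(read_documents(range(5), limit=3)))  # [0, 1, 2]
```

Raising the limit to 10,000+ samples, as suggested, gives the downstream pipeline stages enough data to produce meaningful output.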
hello @hynky1999 |
Hi, I don't want to resolve this issue anywhere outside of the GitHub issues. What's the state of your problem now? I can't see any logs in the Google Drive folder you sent.
How are you, @hynky1999?
So now you can see the output?
Hello, Datatrove enthusiasts,
Nice to meet you all.
Recently, I've been working with the Datatrove library, and I'm trying to run the sample script process_common_crawl_dump.py from the Datatrove GitHub repository. I've made a couple of changes to the script: I reduced the number of tasks from 8000 to 4 and renamed randomize_start_duration to randomize_start. However, after running the script, I encountered some issues. Here is the accounting history that I received:
Additionally, I believe these logs are stored on my S3:
I was expecting to get output as a result, but there are no output directories or files; I only got log files.
For reference, here is my slurm.conf file:
I've tried running the script multiple times, but I always get the same result. I'm not sure if this is the right place to ask for help, but I would appreciate any assistance from fellow Datatrove lovers.
Thank you!