[BUG]: GPU hangs while computing Jaccard #3926
Comments
@rlratzel FYI
@BradReesWork FYI
Sorry - I was out of the office. Why are you doing
Are you breaking the list of vertices into chunks and then processing a portion at a time? The batch size should be 100K+ or this will be very, very slow. I see that you set it to "10", which is too small.
Yes, otherwise it will go OOM.
I tried, but it fails on larger graphs.
Because it makes me sure that it will not go OOM whatever the graph size. @jnke2016 Do we have any updates on this?
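For context, here is a minimal sketch of the batching idea discussed above - slicing the vertex pairs into large chunks and calling Jaccard once per chunk. The `cugraph.dask.jaccard` entry point and the shape of `vertex_pairs` are assumptions for illustration, not the actual reproducer:

```python
import cudf
import cugraph.dask as dcg

def jaccard_in_batches(G, vertex_pairs: cudf.DataFrame, batch_size: int = 100_000):
    """Compute Jaccard over vertex_pairs one large chunk at a time to bound memory."""
    results = []
    for start in range(0, len(vertex_pairs), batch_size):
        batch = vertex_pairs.iloc[start:start + batch_size]
        # One MG Jaccard call per chunk; a tiny batch_size (e.g. 10) means an
        # enormous number of calls and will be extremely slow.
        results.append(dcg.jaccard(G, batch).compute())
    return cudf.concat(results, ignore_index=True)
```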
@Manoj-red-hat I am still trying to reproduce the error with our C API only. I was able to reproduce it on a single node with 8 GPUs with the Python API.
@BradReesWork He shared the datasets in the description, but I can reproduce the error irrespective of the dataset. In fact, I was able to reproduce it with the karate dataset. I am still looking into this issue and trying to reproduce it with our C API.
I found the datasets in LDBC |
I am wondering if there needs to be a "gc" call within that loop to force Dask/Python to clean up. I have it running on the scale-22 graph with an iteration size of 10 and have not encountered an error - other than it taking a really long time. There is also the option of computing the two-hop vertex pairs in Dask - a simple outer join - and then passing blocks from that list into Jaccard.
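A sketch of the cleanup suggestion - forcing a garbage-collection pass after every batch so Dask/Python release intermediate objects promptly. The surrounding loop mirrors the earlier batching sketch and is equally hypothetical:

```python
import gc
import cudf
import cugraph.dask as dcg

def jaccard_in_batches_with_gc(G, vertex_pairs: cudf.DataFrame, batch_size: int = 100_000):
    results = []
    for start in range(0, len(vertex_pairs), batch_size):
        batch = vertex_pairs.iloc[start:start + batch_size]
        results.append(dcg.jaccard(G, batch).compute())
        del batch      # drop the per-batch reference explicitly
        gc.collect()   # force a collection so intermediates are freed before the next iteration
    return cudf.concat(results, ignore_index=True)
```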
That's one of the first things I tried within the two_hop_neighbor function, and it didn't help. I also added a sleep between calls to give Dask enough time to clean up, and explicitly deleted a list of futures that was not being deleted, but I still hit the bug.
On how many GPUs are you running the reproducer? To reproduce it quite consistently, you need at least 8.
@jnke2016
I'm using two GPUs. Let me jump on an 8-GPU system and try again.
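For anyone following along, standing up an 8-GPU run would look roughly like the following (a sketch; the exact comms options are assumptions):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cugraph.dask.comms import comms as Comms

# One Dask-CUDA worker per GPU; the hang reportedly needs at least 8 of them.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7")
client = Client(cluster)
Comms.initialize(p2p=True)   # set up cugraph's communicator across the workers
```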
@jnke2016 If this only breaks on 8 GPUs, do you think one worker could be starved and crashing?
@BradReesWork So at some point some workers hit 100% utilization, causing the hang.
@Manoj-red-hat Yep, this is exactly what I am doing.
This line is also causing issues: `vid_list.sort()`. Likewise, `print(vid_list)` is not needed, since it will dump 2 million records to the buffer.
I can now get it to crash by setting `topK` and `batchsize` to higher values - trying to resolve the issue.
OK. I can reproduce the error with a batch size of 1 on the karate dataset with the Python API.
That's what makes me think that it is not a dataset-size issue but an actual bug. Furthermore, the error is not reproducible on 2 GPUs, which is why I started investigating the C API to check whether this is a Dask-related issue (I couldn't see anything wrong in Dask so far).
@BradReesWork @Manoj-red-hat I narrowed down the error to this small reproducer on 8 GPUs which is just looping over the
I am mimicking the above with the C API and MPI to see if I can reproduce it. The above reproducer is easier to work with.
Another important observation I made was that if you run |
So I ran several tests on the C API with MPI and they all succeeded. Still investigating.
@BradReesWork @Manoj-red-hat After further digging, I can safely rule out the issue being in the C or C++ API; it is indeed a Dask-related one. I found that when it fails we either have a race condition where only a few Dask workers call the PLC API, causing a hang (100% utilization for those that do reach the call) because all workers should eventually call it, or some Dask workers die and never call the PLC API (causing the other workers to hang waiting on them). To confirm this, simply restrict the number of workers making the call and you can reproduce the same bug on even 4 GPUs. I am still investigating what happened to the missing workers.
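To illustrate the failure mode: the per-worker PLC tasks behave like a collective call, so every worker has to enter it. The snippet below is only a conceptual sketch - `run_plc_jaccard` is a stand-in for the real per-worker call, not an actual cugraph function:

```python
from dask.distributed import Client, wait

def run_plc_jaccard(worker_rank):
    # Stand-in for the per-worker pylibcugraph (PLC) call, which performs
    # collective (NCCL-style) communication across *all* workers.
    ...

client = Client("tcp://scheduler:8786")   # assumed existing MG cluster
all_workers = list(client.scheduler_info()["workers"])

# Expected pattern: one task pinned to each worker, so every rank joins the collective.
futures = [client.submit(run_plc_jaccard, i, workers=[w], pure=False)
           for i, w in enumerate(all_workers)]
wait(futures)

# Failure mode described above: only a subset of workers makes the call, so the
# participating workers spin at 100% utilization waiting for peers that never arrive.
subset = all_workers[:-1]
futures = [client.submit(run_plc_jaccard, i, workers=[w], pure=False)
           for i, w in enumerate(subset)]
wait(futures)   # hangs
```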
It does indeed look like a race condition. I have a fix for it, but it creates a serialization error when running on 4 GPUs or fewer. Not sure why, but I am still looking into it.
This PR leverages `client.map` to simultaneously launch processes in order to avoid hangs. Closes #3926.
Authors:
- Joseph Nke (https://github.com/jnke2016)
- Rick Ratzel (https://github.com/rlratzel)
Approvers:
- Rick Ratzel (https://github.com/rlratzel)
URL: #4080
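A rough sketch of what the `client.map`-based fix looks like conceptually - launching all per-worker tasks in one call instead of submitting them one at a time in a Python loop. Names below are illustrative, not the actual cugraph internals:

```python
from dask.distributed import Client, wait

def run_plc_jaccard(worker_rank):
    # Stand-in for the per-worker PLC Jaccard call.
    ...

client = Client("tcp://scheduler:8786")   # assumed existing MG cluster
all_workers = list(client.scheduler_info()["workers"])

# Before: sequential submits, one per worker, leaving a window where some workers
# enter the collective before the others have even been scheduled.
# futures = [client.submit(run_plc_jaccard, i, workers=[w], pure=False)
#            for i, w in enumerate(all_workers)]

# After: a single client.map launches all per-worker tasks together.
futures = client.map(
    run_plc_jaccard,
    range(len(all_workers)),
    workers=all_workers,
    pure=False,
)
wait(futures)
```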
Version
23.10a Nightly
Which installation method(s) does this occur on?
Conda
Describe the bug.
GPU hangs while computing Jaccard
Minimum reproducible example
Relevant log output
Environment details
No response
Other/Misc.
No response
Code of Conduct