[BUG]: GPU hangs while computing Jaccard #3926

Closed
Manoj-red-hat opened this issue Oct 11, 2023 · 24 comments · Fixed by #4080
Labels: bug (Something isn't working)
@Manoj-red-hat

Manoj-red-hat commented Oct 11, 2023

Version

23.10a Nightly

Which installation method(s) does this occur on?

Conda

Describe the bug.

GPU hangs while computing Jaccard

(screenshot attached in the original issue)

Minimum reproducible example

import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cugraph
import cugraph.dask as dask_cugraph
import cugraph.dask.comms.comms as Comms
from cugraph.generators.rmat import rmat
import csv
import time
import pandas as pd

# Step 1
# Run command on master node in mg_utils: `./run-dask-process.sh scheduler --ucx`
# Step 2
# Run command on master node in mg_utils: `./run-dask-process.sh workers --ucx`
# Step 3
# Run command on slave nodes in mg_utils: `./run-dask-process.sh workers --ucx`

### Change ucx accordingly 
client = Client('ucx://10.141.0.46:8786')
client
Comms.initialize(p2p=True)

# For ldbc 22
graph_number ='22'
input_data_path = '/share/common/'+graph_number+'/graph500-'+graph_number+'.e'
result_data_path = '/share/common/'+graph_number+'/jaccard_cuda_res.csv'

chunksize = dask_cugraph.get_chunksize(input_data_path)
e_list = dask_cudf.read_csv(
    input_data_path,
    delimiter=' ',
    blocksize=chunksize,
    names=['src', 'dst'],
    dtype=['int64', 'int64'],
)

# create graph from input data
G = cugraph.Graph(directed=False)
G.from_dask_cudf_edgelist(e_list, source='src', destination='dst',store_transposed=False)
print(input_data_path)
def compute_jaccard(G, file_name: str, topK: int, vid_list: list = None, batchsize=1, size=2396657):
    headerList = ['first', 'second','jaccard_coeff']
    with open(file_name, 'w') as file:
        csv.DictWriter(file, delimiter=',',fieldnames=headerList).writeheader()

    print(file_name ,topK)
    jaccard_res = None
    if not vid_list:
        print("Computing vids")
        vid_list = G.nodes().compute().to_arrow().to_pylist()
        vid_list.sort()
        print(vid_list)
    size = len(vid_list)
    print(size)
    start_time = time.time()
    write_time = 0
    with open(file_name, 'a') as f:
        for i in range(0, size, batchsize):
            l=vid_list[i:i+batchsize]
            pairs = G.get_two_hop_neighbors(start_vertices=l)
            jaccard_res = dask_cugraph.jaccard(G, pairs)
            jaccard_groupby = jaccard_res.compute().to_pandas().groupby('first')
            

            for j in vid_list[i:i+batchsize]:
                try:
                    df = jaccard_groupby.get_group(j).sort_values('jaccard_coeff', ascending=False)[:topK+1].reset_index(drop=True).drop(0)
                    t = time.time()
                    df.to_csv(f, header=False, index=False)
                    write_time+=time.time()-t
                except KeyError:
                    continue
            print("---Iteration: %s ---" % (i))
    f.close()  # redundant: the with-block above already closed the file
    print("--- %s Write seconds ---" % (write_time))
    print("--- %s seconds ---" % (time.time() - start_time))

compute_jaccard(G,result_data_path,10, vid_list = None)

Data
```
373293056 369098752
373293056 371195906
373293057 371195904
373293057 371195905
377487360 369098752
377487360 371195906
377487360 373293056
377487362 377487360
377487362 377487361
383778816 373293057
383778817 369098752
383778817 371195906
383778817 373293056
383778817 373293057
383778817 377487360
385875968 373293057
385875969 377487360
385875970 377487360
385875970 377487361
385875970 377487362
390070272 377487360
392167424 377487360
392167424 385875970
396361728 373293057
398458880 383778817
400556032 396361728
400556033 373293057
400556033 377487360
400556033 383778816
400556033 383778817
403701760 377487360
403701761 369098752
403701761 371195906
403701761 373293056
403701761 377487360
403701761 383778816
403701761 383778817
403701761 390070272
403701761 403701760
405798912 377487360
405798912 403701761
422576128 373293057
422576128 396361728
422576128 407896064
422576128 416284674
407896064 369098752
407896064 371195904
407896064 371195905
407896064 373293057
407896064 383778816
407896064 385875968
407896064 390070272
407896064 396361728
407896064 398458880
407896064 400556033
412090368 400556032
418381824 373293057
418381824 407896064
416284672 373293057
416284672 407896064
416284673 373293057
416284673 377487360
416284673 398458880
416284673 400556032
416284673 407896064
416284673 412090368
416284674 407896064
424673280 377487360
424673280 377487362
424673280 392167424
430964736 383778817
430964736 407896064
435159040 373293056
435159040 377487360
435159041 383778817
435159041 396361728
435159041 407896064
435159041 412090368
```

Relevant log output

[1697035413.329434] [node047:266058:0]          parser.c:1989 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1697035413.329434] [node047:266058:0]          parser.c:1989 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
/share/common/22/graph500-22.e
/share/common/22/jaccard_cuda_res.csv 10
Computing vids
[369098752, 371195904, 371195905, 371195906, 373293056, 373293057, 377487360, 377487361, 377487362, 383778816, 383778817, 385875968, 385875969, 385875970, 390070272, 392167424, 396361728, 398458880, 400556032, 400556033, 403701760, 403701761, 405798912, 407896064, 412090368, 416284672, 416284673, 416284674, 418381824, 422576128, 424673280, 430964736, 435159040, 435159041]
34
---Iteration: 0 ---
---Iteration: 1 ---
---Iteration: 2 ---
---Iteration: 3 ---
---Iteration: 4 ---
---Iteration: 5 ---
---Iteration: 6 ---
---Iteration: 7 ---
---Iteration: 8 ---
---Iteration: 9 ---
---Iteration: 10 ---
---Iteration: 11 ---
---Iteration: 12 ---
---Iteration: 13 ---
---Iteration: 14 ---
---Iteration: 15 ---
---Iteration: 16 ---
---Iteration: 17 ---
---Iteration: 18 ---
---Iteration: 19 ---
---Iteration: 20 ---
---Iteration: 21 ---
---Iteration: 22 ---
---Iteration: 23 ---
---Iteration: 24 ---
---Iteration: 25 ---
---Iteration: 26 ---
---Iteration: 27 ---
---Iteration: 28 ---
---Iteration: 29 ---
^CTraceback (most recent call last):
  File "/home/tigergraph/tmp.py", line 79, in <module>
    compute_jaccard(G,result_data_path,10, vid_list = None)
  File "/home/tigergraph/tmp.py", line 61, in compute_jaccard
    pairs = G.get_two_hop_neighbors(start_vertices=l)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 855, in get_two_hop_neighbors
    wait(result)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/client.py", line 5421, in wait
    result = client.sync(_wait, fs, timeout=timeout, return_when=return_when)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
    return sync(
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 422, in sync
    wait(10)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 411, in wait
    return e.wait(timeout)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
Exception ignored in: <function Comms.__del__ at 0x7ffaf129cc10>
Traceback (most recent call last):
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/client.py", line 3008, in run
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 382, in sync
RuntimeError: IOLoop is closed
^C

Environment details

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
Manoj-red-hat added the "? - Needs Triage" and "bug" labels on Oct 11, 2023
@Manoj-red-hat
Author

@rlratzel FYI

@Manoj-red-hat
Author

@BradReesWork FYI

BradReesWork removed the "? - Needs Triage" label on Oct 23, 2023
@BradReesWork
Member

Sorry - I was out of the office

Why are you doing

    for i in range(0, size, batchsize):
        l=vid_list[i:i+batchsize]
        pairs = G.get_two_hop_neighbors(start_vertices=l)
        jaccard_res = dask_cugraph.jaccard(G, pairs)
        jaccard_groupby = jaccard_res.compute().to_pandas().groupby('first')

Are you breaking the list of vertices into chunks and then processing a portion at a time? The batch size should be 100K+ or this will be very, very slow. I see that you set it to "10", which is too small.

@Manoj-red-hat
Author

Hi @BradReesWork

Are you breaking the list of vertices into chunks and then processing a portion at a time?

Yes, otherwise it goes OOM.

The batch size should be 100K+ or this will be very slow

I tried that, but it fails on larger graphs.
If we could somehow calculate the memory overhead per batch and graph size, we could configure the batch size dynamically (a rough sketch of the idea is below).

I see that you set it to "10", which is too small

Because that guarantees it will not go OOM regardless of the graph size.

@jnke2016 Do we have any updates on this?
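Purely as an illustration of the dynamic batch-size idea above (this is not something cuGraph provides, and `bytes_per_vertex` is a made-up placeholder that would have to be measured for a real graph), a batch size could be derived from the free device memory, for example:

```python
# Hypothetical helper: size batches from free GPU memory. The per-vertex cost
# below is a guessed placeholder, not a measured value.
import pynvml

def estimate_batchsize(bytes_per_vertex=50_000_000, safety_factor=0.5):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    free_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    pynvml.nvmlShutdown()
    # keep a safety margin, and never go below a batch of 1
    return max(1, int(free_bytes * safety_factor) // bytes_per_vertex)

# e.g. compute_jaccard(G, result_data_path, 10, batchsize=estimate_batchsize())
```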

@jnke2016
Contributor

Do we have any updates on this?

@Manoj-red-hat I am still trying to reproduce the error with our C API only. I was able to reproduce the error on a single node with 8 GPUs with the Python API.

@jnke2016
Contributor

@BradReesWork He shared the dataset in the description, but I can reproduce the error irrespective of their dataset. In fact, I was able to reproduce it with the karate dataset. I am still looking into this issue and trying to reproduce it with our C API.

@BradReesWork
Member

I found the datasets in LDBC

@BradReesWork
Member

Wondering if there needs to be a "gc" call within that loop to force Dask/Python to clean up.

I have it running on the scale-22 graph with an iteration size of 10 and have not encountered an error, other than it taking a really long time.

There is also the option to compute the two-hop vertex pairs directly in Dask with a simple outer join; then you could pass blocks from that list into Jaccard (see the sketch after this comment).
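A rough sketch of that idea, assuming the `e_list` dask_cudf edge list from the reproducer above. The comment mentions an outer join; this sketch uses an inner self-merge on the shared middle vertex, which is one way to enumerate candidate two-hop pairs (direct neighbors are not filtered out here):

```python
import dask_cudf

# symmetrize so every undirected edge appears in both directions
rev = e_list.rename(columns={"src": "dst", "dst": "src"})
sym = dask_cudf.concat([e_list, rev])

# join on the shared middle vertex: src -> mid -> dst gives candidate two-hop pairs
two_hop = sym.merge(sym, left_on="dst", right_on="src", suffixes=("_a", "_b"))
two_hop = two_hop[["src_a", "dst_b"]].rename(
    columns={"src_a": "first", "dst_b": "second"}
)
two_hop = two_hop[two_hop["first"] != two_hop["second"]].drop_duplicates()

# blocks of `two_hop` could then be passed to dask_cugraph.jaccard(G, block)
```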

@jnke2016
Contributor

Wondering if there needs to be a "gc" call within that loop to force Dask/Python to clean up.

That's one of the first things I tried within the two_hop_neighbors function and it didn't help. I also added a sleep between calls to give Dask enough time to clean up, and explicitly deleted a list of futures that was not being deleted, but the bug remained (the sketch below shows the kind of cleanup that was tried).
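For reference, a minimal sketch of that kind of cleanup, applied to the batching loop from the original reproducer (illustrative only, not the actual patch that was tested; `G`, `vid_list`, `size`, `batchsize`, and `dask_cugraph` are assumed from that script):

```python
import gc
import time

for i in range(0, size, batchsize):
    batch = vid_list[i:i + batchsize]
    pairs = G.get_two_hop_neighbors(start_vertices=batch)
    jaccard_res = dask_cugraph.jaccard(G, pairs)
    result = jaccard_res.compute().to_pandas()

    # drop references to the distributed objects, then force a collection
    del pairs, jaccard_res
    gc.collect()
    time.sleep(1)  # give the scheduler a moment to release resources
```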

@jnke2016
Contributor

I have it running on the scale-22 graph with an iteration size of 10 and have not encountered an error, other than it taking a really long time.

On how many GPUs are you running the reproducer? To reproduce it fairly consistently you need at least 8.

@Manoj-red-hat
Author

@jnke2016
I am curious about your approach to replicating this with the C API. Will you be simulating the two-hop neighbors using CUDA MPI across 8 GPUs? I attempted this, but I couldn't locate any references for executing the native C API on a multi-node GPU configuration.

@BradReesWork
Member

I'm using two GPUs. Let me jump on an 8-GPU system and try again.

@BradReesWork
Member

@jnke2016 If this only breaks on 8 GPUs, do you think one worker could be starved and crashing?

@jnke2016
Contributor

jnke2016 commented Oct 27, 2023

If this only breaks on 8 GPUs, do you think one worker could be starved and crashing?

@BradReesWork So at some point some workers hit 100% utilization, causing the hang.

@jnke2016
Contributor

Will you be simulating the two-hop neighbors using CUDA MPI across 8 GPUs?

@Manoj-red-hat Yep, this is exactly what I am doing.

@BradReesWork
Member

BradReesWork commented Oct 27, 2023

This line is also causing issues: `vid_list.sort()`. There really is no need to sort, and the sorted list is just being dumped to the screen.

Likewise, the `print(vid_list)` is also not needed, since it will dump 2 million records to the buffer.

@BradReesWork
Member

I can now get it to crash by setting topK and batchsize to higher values; trying to resolve the issue.

@jnke2016
Contributor

OK. I can reproduce the error with a batch size of 1 on the karate dataset with the Python API.

@jnke2016
Contributor

jnke2016 commented Oct 27, 2023

That's what makes me think it is not a dataset-size issue but an actual bug. Furthermore, the error is not reproducible on 2 GPUs, which is why I started investigating the C API, to check whether this is really a Dask-related issue (as I couldn't see anything wrong on the Dask side so far).

@jnke2016
Contributor

@BradReesWork @Manoj-red-hat I narrowed the error down to this small reproducer on 8 GPUs, which just loops over get_two_hop_neighbors on the karate dataset with 5 random seed vertices; it eventually fails.

import cugraph
from cugraph.testing.mg_utils import start_dask_client, stop_dask_client
import dask_cudf
import cugraph.dask as dask_cugraph


if __name__ == "__main__":
    setup_objs = start_dask_client()

    client = setup_objs[0]

    worker_list = list(client.scheduler_info()['workers'].keys())
    print("the number of workers = ", len(worker_list))

    input_data_path = "/home/nfs/jnke/debug_jaccard/cugraph/datasets/karate.csv"

    chunksize = dask_cugraph.get_chunksize(input_data_path)
    e_list = dask_cudf.read_csv(
        input_data_path,
        delimiter=' ',
        blocksize=chunksize,
        names=['src', 'dst'],
        dtype=['int32', 'int32'],
    )

    # create graph from input data
    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(e_list, source='src', destination='dst',store_transposed=False)

    for i in range(200):
        print("iteration - ", i)
        seed = i
        start_vertices = G.select_random_vertices(
            random_state=seed, num_vertices=5).compute().reset_index(drop=True)

        pairs = G.get_two_hop_neighbors(start_vertices=start_vertices)

    stop_dask_client(*setup_objs)

I am mimicking the above with the C API and MPI to see if I can reproduce it. The above reproducer is easier to work with.

@jnke2016
Contributor

jnke2016 commented Oct 27, 2023

Another important observation I made is that if you run two_hop_neighbors (the above reproducer) without seeds, which gives you all possible pairs that are two hops apart and is therefore even more expensive, it succeeds. This reinforces my belief that the bug is in the C or C++ API (a no-seed variation of the loop is sketched below).
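For reference, the no-seed variant of the loop looks like the sketch below, assuming the `G` from the reproducer above and that `start_vertices` may simply be omitted:

```python
# Same loop as the reproducer, but with no seed vertices: this computes all
# two-hop pairs and, per the observation above, did not hang.
for i in range(200):
    print("iteration - ", i)
    pairs = G.get_two_hop_neighbors()
```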

@jnke2016
Contributor

So I ran several tests on the C API with MPI and they all succeeded. Still investigating.

@jnke2016
Contributor

@BradReesWork @Manoj-red-hat After further digging, I can safely rule out the issue being in the C or C++ API; it is indeed a Dask-related one. I found that when it fails we either have a race condition where only a few Dask workers call the PLC API, causing a hang (100% utilization for those that do reach it), because all workers are expected to eventually call it; or some Dask workers die and never call the PLC API, causing the other workers to hang waiting on them.

To confirm what is stated above, simply restrict the number of workers making the call and you can reproduce the same bug on even 4 GPUs (see the sketch after this comment).

I am still investigating what happened to the missing workers.
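A generic Dask snippet showing the kind of restriction described above: pinning work to a subset of workers with the `workers=` argument. The actual confirmation involved modifying cuGraph internals, which is not shown here; `scheduler_address` and `some_collective_step` are placeholders.

```python
from dask.distributed import Client

client = Client(scheduler_address)  # assumes an already-running cluster

all_workers = list(client.scheduler_info()["workers"].keys())
subset = all_workers[:2]  # deliberately leave some workers out

def some_collective_step(x):
    # placeholder for a step that every worker is expected to run
    return x

# only the workers in `subset` are allowed to run these tasks
futures = [
    client.submit(some_collective_step, i, workers=subset, pure=False)
    for i in range(len(all_workers))
]
results = client.gather(futures)
```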

@jnke2016
Contributor

I found that when it fails we either have a race condition where only a few Dask workers call the PLC API, causing a hang (100% utilization for those that do reach it), because all workers are expected to eventually call it

It does indeed look like a race condition. I have a fix for it, but the fix creates a serialization error when running on 4 GPUs or fewer. I'm not sure why, but I am still looking into it.

rapids-bot bot pushed a commit that referenced this issue Jan 20, 2024
This PR leverages `client.map` to simultaneously launch processes in order to avoid hangs (a generic sketch of the `client.map` pattern follows at the end of this thread).

closes #3926

Authors:
  - Joseph Nke (https://github.com/jnke2016)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #4080
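For context, a generic (non-cuGraph) sketch of the pattern the fix relies on: submitting the per-worker calls with a single `client.map` so they are all scheduled together, rather than one at a time in a Python loop. The function and payload below are placeholders, not the actual cuGraph internals; `scheduler_address` is assumed to point at a running cluster.

```python
from dask.distributed import Client, wait

def per_worker_step(rank):
    # placeholder for the call that every worker must enter together
    return rank

client = Client(scheduler_address)  # assumes an already-running cluster
workers = list(client.scheduler_info()["workers"].keys())

# as many futures as workers, all submitted at once, so no call is left
# waiting behind a sequential submission loop
futures = client.map(
    per_worker_step,
    range(len(workers)),
    workers=workers,
    pure=False,
)
wait(futures)
results = client.gather(futures)
```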