[BUG]: GPU hangs while computing Jaccard #3926

Closed
Manoj-red-hat opened this issue Oct 11, 2023 · 24 comments · Fixed by #4080
Labels: bug (Something isn't working)
@Manoj-red-hat

Manoj-red-hat commented Oct 11, 2023

Version

23.10a Nightly

Which installation method(s) does this occur on?

Conda

Describe the bug.

GPU hangs while computing Jaccard

(screenshot attached in the original issue)

Minimum reproducible example

import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cugraph
import cugraph.dask as dask_cugraph
import cugraph.dask.comms.comms as Comms
from cugraph.generators.rmat import rmat
import csv
import time
import pandas as pd

# Step 1
# Run command on master node in mg_utils: `./run-dask-process.sh scheduler --ucx`
# Step 2
# Run command on master node in mg_utils: `./run-dask-process.sh workers --ucx`
# Step 3
# Run command on slave nodes in mg_utils: `./run-dask-process.sh workers --ucx`

### Change ucx accordingly 
client = Client('ucx://10.141.0.46:8786')
client
Comms.initialize(p2p=True)

# For ldbc 22
graph_number ='22'
input_data_path = '/share/common/'+graph_number+'/graph500-'+graph_number+'.e'
result_data_path = '/share/common/'+graph_number+'/jaccard_cuda_res.csv'

chunksize = dask_cugraph.get_chunksize(input_data_path)
e_list = dask_cudf.read_csv(
    input_data_path,
    delimiter=' ',
    blocksize=chunksize,
    names=['src', 'dst'],
    dtype=['int64', 'int64'],
)

# create graph from input data
G = cugraph.Graph(directed=False)
G.from_dask_cudf_edgelist(e_list, source='src', destination='dst',store_transposed=False)
print(input_data_path)
def compute_jaccard(G, file_name: str, topK: int, vid_list: list = None, batchsize=1, size=2396657):
    headerList = ['first', 'second','jaccard_coeff']
    with open(file_name, 'w') as file:
        csv.DictWriter(file, delimiter=',',fieldnames=headerList).writeheader()

    print(file_name ,topK)
    jaccard_res = None
    if not vid_list:
        print("Computing vids")
        vid_list = G.nodes().compute().to_arrow().to_pylist()
        vid_list.sort()
        print(vid_list)
    size = len(vid_list)
    print(size)
    start_time = time.time()
    write_time = 0
    with open(file_name, 'a') as f:
        for i in range(0, size, batchsize):
            l=vid_list[i:i+batchsize]
            pairs = G.get_two_hop_neighbors(start_vertices=l)
            jaccard_res = dask_cugraph.jaccard(G, pairs)
            jaccard_groupby = jaccard_res.compute().to_pandas().groupby('first')
            

            for j in vid_list[i:i+batchsize]:
                try:
                    df = jaccard_groupby.get_group(j).sort_values('jaccard_coeff', ascending=False)[:topK+1].reset_index(drop=True).drop(0)
                    t = time.time()
                    df.to_csv(f, header=False, index=False)
                    write_time+=time.time()-t
                except KeyError:
                    continue
            print("---Iteration: %s ---" % (i))
    f.close()  # redundant: the with-block above already closed the file
    print("--- %s Write seconds ---" % (write_time))
    print("--- %s seconds ---" % (time.time() - start_time))

compute_jaccard(G,result_data_path,10, vid_list = None)

Data
```
373293056 369098752
373293056 371195906
373293057 371195904
373293057 371195905
377487360 369098752
377487360 371195906
377487360 373293056
377487362 377487360
377487362 377487361
383778816 373293057
383778817 369098752
383778817 371195906
383778817 373293056
383778817 373293057
383778817 377487360
385875968 373293057
385875969 377487360
385875970 377487360
385875970 377487361
385875970 377487362
390070272 377487360
392167424 377487360
392167424 385875970
396361728 373293057
398458880 383778817
400556032 396361728
400556033 373293057
400556033 377487360
400556033 383778816
400556033 383778817
403701760 377487360
403701761 369098752
403701761 371195906
403701761 373293056
403701761 377487360
403701761 383778816
403701761 383778817
403701761 390070272
403701761 403701760
405798912 377487360
405798912 403701761
422576128 373293057
422576128 396361728
422576128 407896064
422576128 416284674
407896064 369098752
407896064 371195904
407896064 371195905
407896064 373293057
407896064 383778816
407896064 385875968
407896064 390070272
407896064 396361728
407896064 398458880
407896064 400556033
412090368 400556032
418381824 373293057
418381824 407896064
416284672 373293057
416284672 407896064
416284673 373293057
416284673 377487360
416284673 398458880
416284673 400556032
416284673 407896064
416284673 412090368
416284674 407896064
424673280 377487360
424673280 377487362
424673280 392167424
430964736 383778817
430964736 407896064
435159040 373293056
435159040 377487360
435159041 383778817
435159041 396361728
435159041 407896064
435159041 412090368
```

Relevant log output

[1697035413.329434] [node047:266058:0]          parser.c:1989 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1697035413.329434] [node047:266058:0]          parser.c:1989 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
/share/common/22/graph500-22.e
/share/common/22/jaccard_cuda_res.csv 10
Computing vids
[369098752, 371195904, 371195905, 371195906, 373293056, 373293057, 377487360, 377487361, 377487362, 383778816, 383778817, 385875968, 385875969, 385875970, 390070272, 392167424, 396361728, 398458880, 400556032, 400556033, 403701760, 403701761, 405798912, 407896064, 412090368, 416284672, 416284673, 416284674, 418381824, 422576128, 424673280, 430964736, 435159040, 435159041]
34
---Iteration: 0 ---
---Iteration: 1 ---
---Iteration: 2 ---
---Iteration: 3 ---
---Iteration: 4 ---
---Iteration: 5 ---
---Iteration: 6 ---
---Iteration: 7 ---
---Iteration: 8 ---
---Iteration: 9 ---
---Iteration: 10 ---
---Iteration: 11 ---
---Iteration: 12 ---
---Iteration: 13 ---
---Iteration: 14 ---
---Iteration: 15 ---
---Iteration: 16 ---
---Iteration: 17 ---
---Iteration: 18 ---
---Iteration: 19 ---
---Iteration: 20 ---
---Iteration: 21 ---
---Iteration: 22 ---
---Iteration: 23 ---
---Iteration: 24 ---
---Iteration: 25 ---
---Iteration: 26 ---
---Iteration: 27 ---
---Iteration: 28 ---
---Iteration: 29 ---
^CTraceback (most recent call last):
  File "/home/tigergraph/tmp.py", line 79, in <module>
    compute_jaccard(G,result_data_path,10, vid_list = None)
  File "/home/tigergraph/tmp.py", line 61, in compute_jaccard
    pairs = G.get_two_hop_neighbors(start_vertices=l)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 855, in get_two_hop_neighbors
    wait(result)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/client.py", line 5421, in wait
    result = client.sync(_wait, fs, timeout=timeout, return_when=return_when)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
    return sync(
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 422, in sync
    wait(10)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 411, in wait
    return e.wait(timeout)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
Exception ignored in: <function Comms.__del__ at 0x7ffaf129cc10>
Traceback (most recent call last):
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/client.py", line 3008, in run
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 359, in sync
  File "/home/tigergraph/miniconda3/envs/rapids-23.10/lib/python3.10/site-packages/distributed/utils.py", line 382, in sync
RuntimeError: IOLoop is closed
^C

Environment details

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
Manoj-red-hat added the "? - Needs Triage" and "bug" labels on Oct 11, 2023
@Manoj-red-hat
Author

@rlratzel FYI

@Manoj-red-hat
Author

@BradReesWork FYI

BradReesWork removed the "? - Needs Triage" label on Oct 23, 2023
@BradReesWork
Member

Sorry - I was out of the office

Why are you doing

    for i in range(0, size, batchsize):
        l=vid_list[i:i+batchsize]
        pairs = G.get_two_hop_neighbors(start_vertices=l)
        jaccard_res = dask_cugraph.jaccard(G, pairs)
        jaccard_groupby = jaccard_res.compute().to_pandas().groupby('first')

Are you breaking the list of vertices into chunks and then processing a portion at a time? The batch size should be 100K+ or this will be very, very slow. I see that you set it to "10", which is too small.

@Manoj-red-hat
Author

Hi @BradReesWork

Are you breaking the list of vertices into chunks and then processing a portion at a time?

Yes, otherwise it goes OOM.

The batch size should be 100K+ or this will be very slow

I tried that, but it fails on larger graphs.
If we could somehow calculate the memory overhead per batch and graph size, we could configure the batch size dynamically (a rough sketch of the idea is below).

I see that you set it to "10", which is too small

Because that guarantees it will not go OOM regardless of the graph size.

@jnke2016 Do we have any updates on this?
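Purely as an illustration of the dynamic batch-size idea above (this is not something cuGraph provides, and `bytes_per_vertex` is a made-up placeholder that would have to be measured for a real graph), a batch size could be derived from the free device memory, for example:

```python
# Hypothetical helper: size batches from free GPU memory. The per-vertex cost
# below is a guessed placeholder, not a measured value.
import pynvml

def estimate_batchsize(bytes_per_vertex=50_000_000, safety_factor=0.5):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    free_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    pynvml.nvmlShutdown()
    # keep a safety margin, and never go below a batch of 1
    return max(1, int(free_bytes * safety_factor) // bytes_per_vertex)

# e.g. compute_jaccard(G, result_data_path, 10, batchsize=estimate_batchsize())
```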

@jnke2016
Contributor

Do we have any updates on this?

@Manoj-red-hat I am still trying to reproduce the error with our C API only. I was able to reproduce the error on a single node with 8 GPUs with the Python API.

@jnke2016
Contributor

@BradReesWork He shared the dataset in the description, but I can reproduce the error irrespective of their dataset. In fact, I was able to reproduce it with the karate dataset. I am still looking into this issue and trying to reproduce it with our C API.

@BradReesWork
Member

I found the datasets in LDBC

@BradReesWork
Member

Wondering if there needs to be a "gc" call within that loop to force Dask/Python to clean up.

I have it running on the scale-22 graph with an iteration size of 10 and have not encountered an error, other than it taking a really long time.

There is also the option to compute the two-hop vertex pairs directly in Dask with a simple outer join; then you could pass blocks from that list into Jaccard (see the sketch after this comment).
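A rough sketch of that idea, assuming the `e_list` dask_cudf edge list from the reproducer above. The comment mentions an outer join; this sketch uses an inner self-merge on the shared middle vertex, which is one way to enumerate candidate two-hop pairs (direct neighbors are not filtered out here):

```python
import dask_cudf

# symmetrize so every undirected edge appears in both directions
rev = e_list.rename(columns={"src": "dst", "dst": "src"})
sym = dask_cudf.concat([e_list, rev])

# join on the shared middle vertex: src -> mid -> dst gives candidate two-hop pairs
two_hop = sym.merge(sym, left_on="dst", right_on="src", suffixes=("_a", "_b"))
two_hop = two_hop[["src_a", "dst_b"]].rename(
    columns={"src_a": "first", "dst_b": "second"}
)
two_hop = two_hop[two_hop["first"] != two_hop["second"]].drop_duplicates()

# blocks of `two_hop` could then be passed to dask_cugraph.jaccard(G, block)
```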

@jnke2016
Contributor

Wondering if there needs to be a "gc" call within that loop to force Dask/Python to clean up.

That's one of the first things I tried within the two_hop_neighbors function and it didn't help. I also added a sleep between calls to give Dask enough time to clean up, and explicitly deleted a list of futures that was not being deleted, but the bug remained (the sketch below shows the kind of cleanup that was tried).
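For reference, a minimal sketch of that kind of cleanup, applied to the batching loop from the original reproducer (illustrative only, not the actual patch that was tested; `G`, `vid_list`, `size`, `batchsize`, and `dask_cugraph` are assumed from that script):

```python
import gc
import time

for i in range(0, size, batchsize):
    batch = vid_list[i:i + batchsize]
    pairs = G.get_two_hop_neighbors(start_vertices=batch)
    jaccard_res = dask_cugraph.jaccard(G, pairs)
    result = jaccard_res.compute().to_pandas()

    # drop references to the distributed objects, then force a collection
    del pairs, jaccard_res
    gc.collect()
    time.sleep(1)  # give the scheduler a moment to release resources
```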

@jnke2016
Contributor

I have it running on the scale-22 graph with an iteration size of 10 and have not encountered an error, other than it taking a really long time.

On how many GPUs are you running the reproducer? To reproduce it fairly consistently you need at least 8.

@Manoj-red-hat
Author

@jnke2016
I am curious about your approach to replicating this with the C API. Will you be simulating the two-hop neighbors using CUDA MPI across 8 GPUs? I attempted this, but I couldn't locate any references for executing the native C API on a multi-node GPU configuration.

@BradReesWork
Member

I'm using two GPUs. Let me jump on an 8-GPU system and try again.

@BradReesWork
Member

@jnke2016 If this only breaks on 8 GPUs, do you think one worker could be starved and crashing?

@jnke2016
Contributor

jnke2016 commented Oct 27, 2023

If this only breaks on 8 GPUs, do you think one worker could be starved and crashing?

@BradReesWork So at some point some workers hit 100% utilization, causing the hang.

@jnke2016
Contributor

Will you be simulating the two-hop neighbors using CUDA MPI across 8 GPUs?

@Manoj-red-hat Yep, this is exactly what I am doing.

@BradReesWork
Member

BradReesWork commented Oct 27, 2023

This line is also causing issues: `vid_list.sort()`. There really is no need to sort, and the sorted list is just being dumped to the screen.

Likewise, the `print(vid_list)` is also not needed, since it will dump 2 million records to the buffer.

@BradReesWork
Member

I can now get it to crash by setting topK and batchsize to higher values; trying to resolve the issue.

@jnke2016
Contributor

OK. I can reproduce the error with a batch size of 1 on the karate dataset with the Python API.

@jnke2016
Contributor

jnke2016 commented Oct 27, 2023

That's what makes me think it is not a dataset-size issue but an actual bug. Furthermore, the error is not reproducible on 2 GPUs, which is why I started investigating the C API, to check whether this is really a Dask-related issue (as I couldn't see anything wrong on the Dask side so far).

@jnke2016
Contributor

@BradReesWork @Manoj-red-hat I narrowed the error down to this small reproducer on 8 GPUs, which just loops over get_two_hop_neighbors on the karate dataset with 5 random seed vertices; it eventually fails.

import cugraph
from cugraph.testing.mg_utils import start_dask_client, stop_dask_client
import dask_cudf
import cugraph.dask as dask_cugraph


if __name__ == "__main__":
    setup_objs = start_dask_client()

    client = setup_objs[0]

    worker_list = list(client.scheduler_info()['workers'].keys())
    print("the number of workers = ", len(worker_list))

    input_data_path = "/home/nfs/jnke/debug_jaccard/cugraph/datasets/karate.csv"

    chunksize = dask_cugraph.get_chunksize(input_data_path)
    e_list = dask_cudf.read_csv(
        input_data_path,
        delimiter=' ',
        blocksize=chunksize,
        names=['src', 'dst'],
        dtype=['int32', 'int32'],
    )

    # create graph from input data
    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(e_list, source='src', destination='dst',store_transposed=False)

    for i in range(200):
        print("iteration - ", i)
        seed = i
        start_vertices = G.select_random_vertices(
            random_state=seed, num_vertices=5).compute().reset_index(drop=True)

        pairs = G.get_two_hop_neighbors(start_vertices=start_vertices)

    stop_dask_client(*setup_objs)

I am mimicking the above with the C API and MPI to see if I can reproduce it. The above reproducer is easier to work with.

@jnke2016
Contributor

jnke2016 commented Oct 27, 2023

Another important observation I made is that if you run two_hop_neighbors (the above reproducer) without seeds, which gives you all possible pairs that are two hops apart and is therefore even more expensive, it succeeds. This reinforces my belief that the bug is in the C or C++ API (a no-seed variation of the loop is sketched below).
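For reference, the no-seed variant of the loop looks like the sketch below, assuming the `G` from the reproducer above and that `start_vertices` may simply be omitted:

```python
# Same loop as the reproducer, but with no seed vertices: this computes all
# two-hop pairs and, per the observation above, did not hang.
for i in range(200):
    print("iteration - ", i)
    pairs = G.get_two_hop_neighbors()
```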

@jnke2016
Contributor

So I ran several tests on the C API with MPI and they all succeeded. Still investigating.

@jnke2016
Contributor

@BradReesWork @Manoj-red-hat After further digging, I can safely rule out the issue being in the C or C++ API; it is indeed a Dask-related one. I found that when it fails we either have a race condition where only a few Dask workers call the PLC API, causing a hang (100% utilization for those that do reach it), because all workers are expected to eventually call it; or some Dask workers die and never call the PLC API, causing the other workers to hang waiting on them.

To confirm what is stated above, simply restrict the number of workers making the call and you can reproduce the same bug on even 4 GPUs (see the sketch after this comment).

I am still investigating what happened to the missing workers.
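A generic Dask snippet showing the kind of restriction described above: pinning work to a subset of workers with the `workers=` argument. The actual confirmation involved modifying cuGraph internals, which is not shown here; `scheduler_address` and `some_collective_step` are placeholders.

```python
from dask.distributed import Client

client = Client(scheduler_address)  # assumes an already-running cluster

all_workers = list(client.scheduler_info()["workers"].keys())
subset = all_workers[:2]  # deliberately leave some workers out

def some_collective_step(x):
    # placeholder for a step that every worker is expected to run
    return x

# only the workers in `subset` are allowed to run these tasks
futures = [
    client.submit(some_collective_step, i, workers=subset, pure=False)
    for i in range(len(all_workers))
]
results = client.gather(futures)
```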

@jnke2016
Contributor

I found that when it fails we either have a race condition where only a few Dask workers call the PLC API, causing a hang (100% utilization for those that do reach it), because all workers are expected to eventually call it

It does indeed look like a race condition. I have a fix for it, but the fix creates a serialization error when running on 4 GPUs or fewer. I'm not sure why, but I am still looking into it.

rapids-bot bot pushed a commit that referenced this issue Jan 20, 2024
This PR leverages `client.map` to simultaneously launch processes in order to avoid hangs (a generic sketch of the `client.map` pattern follows at the end of this thread).

closes #3926

Authors:
  - Joseph Nke (https://github.com/jnke2016)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #4080
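For context, a generic (non-cuGraph) sketch of the pattern the fix relies on: submitting the per-worker calls with a single `client.map` so they are all scheduled together, rather than one at a time in a Python loop. The function and payload below are placeholders, not the actual cuGraph internals; `scheduler_address` is assumed to point at a running cluster.

```python
from dask.distributed import Client, wait

def per_worker_step(rank):
    # placeholder for the call that every worker must enter together
    return rank

client = Client(scheduler_address)  # assumes an already-running cluster
workers = list(client.scheduler_info()["workers"].keys())

# as many futures as workers, all submitted at once, so no call is left
# waiting behind a sequential submission loop
futures = client.map(
    per_worker_step,
    range(len(workers)),
    workers=workers,
    pure=False,
)
wait(futures)
results = client.gather(futures)
```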