[BUG]: Graph Creation Failure at Scale #4076

Closed · 2 tasks done

jnke2016 opened this issue Jan 3, 2024 · 1 comment
Labels: bug (Something isn't working)

jnke2016 (Contributor) commented Jan 3, 2024

Version

23.12

Which installation method(s) does this occur on?

Docker, Conda, Pip, Source

Describe the bug.

A failure occurs when creating a graph past scale 27, which corresponds to 2.1+ billion edges, and it happens irrespective of the cluster size. Clusters ranging from 16 up to 256 GPUs attempted to create the graph generated with RMAT, but none succeeded.
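
As a rough check on why scale 27 is the tipping point, the arithmetic below (back-of-the-envelope only, assuming the usual RMAT relation num_edges = 2^scale * edgefactor, with the scale and edgefactor values from the reproducer) shows the edge count landing just past the signed 32-bit range:

# Edge-count arithmetic for scale 27, edgefactor 16 (values from the reproducer below).
scale = 27
edgefactor = 16
num_edges = (2 ** scale) * edgefactor  # 134,217,728 * 16 = 2,147,483,648
int32_max = 2 ** 31 - 1                # 2,147,483,647
print(num_edges, num_edges > int32_max)  # 2147483648 True -> just past the int32 range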

Minimum reproducible example

import ctypes

import cugraph
import cugraph.dask as dask_cugraph

# NOTE: helper import paths are assumed here; these MG test/benchmark helpers ship
# with cugraph's testing utilities and may live elsewhere depending on the install.
from cugraph.testing.mg_utils import (
    start_dask_client,
    stop_dask_client,
    generate_edgelist_rmat,
)


def trim_memory() -> int:
    # Ask glibc to return freed heap memory to the OS (used to rule out
    # host-memory pressure on the workers).
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

if __name__ == "__main__":
    print("setting up the cluster", flush=True)
    setup_objs = start_dask_client()
    client = setup_objs[0]
    print("Done setting up the cluster", flush=True)

    scale = 27
    edgefactor = 16
    seed = 2

    dask_df = generate_edgelist_rmat(
        scale=scale, edgefactor=edgefactor, seed=seed, unweighted=True, mg=True
    )

    dask_df = dask_df.to_backend("pandas").persist()
    dask_df = dask_df.to_backend("cudf") # delayed

    dask_df = dask_df.astype('int64')

    directed = False

    G = cugraph.Graph(directed=directed)

    #client.run(trim_memory)
    
    G.from_dask_cudf_edgelist(
        dask_df, source="src", destination="dst"
    )

    print("the number of nodes = ", G.number_of_nodes(), flush=True)
    
    print("dask_df = \n", dask_df.head())

    result_louvain, mod_score = dask_cugraph.louvain(G)


    print("result louvain = \n", result_louvain.head())
    print("mod score = ", mod_score)


    stop_dask_client(*setup_objs)

Relevant log output

terminate called after throwing an instance of 'raft::logic_error'
  what():  NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=303: 


NOTE: One worker failed with the error below, and the failure appears to happen during graph creation.
[16500106 rows x 2 columns]], <pylibcugraph.graph_properties.GraphProperties object at 0x14d82c5da1d0>, 'src', 'dst', False, dtype('int32'), None, None, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from cugraph_mg_graph_create(): CUGRAPH_UNKNOWN_ERROR NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=500: ')"

Note: unusually high host memory usage
2023-12-15 05:53:33,945 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 44.19 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:53:44,012 - distributed.utils_perf - INFO - full garbage collection released 2.13 GiB from 371 reference cycles (threshold: 9.54 MiB)
2023-12-15 05:53:50,484 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 44.15 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:00,618 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 48.13 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:10,773 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 48.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:13,125 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 50.65 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:19,048 - distributed.worker.memory - WARNING - Worker is at 78% memory usage. Resuming worker. Process memory: 49.69 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:20,843 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 49.44 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:23,891 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 50.67 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:27,048 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 50.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:28,736 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 50.53 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:29,005 - distributed.worker.memory - WARNING - Worker is at 79% memory usage. Resuming worker. Process memory: 50.32 GiB -- Worker memory limit: 62.97 GiB
2023-12-15 05:54:29,208 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 50.56 GiB -- Worker memory limit: 62.97 GiB

Environment details

23.12 MNMG Nightly squash file

Other/Misc.

Several attempts were made to identify or isolate the issue, all without success:

  1. Running the client in a separate process
  2. Moving the edgelist created by the RMAT generator to host memory, with the goal of freeing device memory
  3. Checking for integer overflow
  4. Trimming memory to reduce host memory usage (see the sketch after this list); notably, very high host memory usage was observed on the failing node prior to the failure: https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os
  5. Testing raft PR 1928, which ensures that NCCL identifies the correct rank
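
For reference, the trim attempt in item 4 (the commented-out client.run(trim_memory) call in the reproducer) looked roughly like the sketch below. trim_all_workers is a hypothetical helper name, and client is assumed to be the already-connected dask.distributed.Client returned by start_dask_client():

import ctypes

from dask.distributed import Client


def trim_memory() -> int:
    # Ask glibc to return freed heap memory to the OS on the worker.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


def trim_all_workers(client: Client) -> dict:
    # Client.run executes the function once on every worker process and
    # returns a dict of per-worker results.
    return client.run(trim_memory)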

NOTE: Only one node fails, causing the other nodes to wait for its completion and thereby hang. The node with rank 0 is always the one that fails.

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@jnke2016 jnke2016 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 3, 2024
@ChuckHastings ChuckHastings removed the ? - Needs Triage Need team to review and classify label Jan 31, 2024
rlratzel (Contributor) commented Feb 5, 2024

Closing now that the related PRs are merged.

@rlratzel rlratzel closed this as completed Feb 5, 2024