Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] cudaErrorIllegalAddress in CAGRA graph::optimize on 4GB+ graph #375

Open
achirkin opened this issue Oct 2, 2024 · 0 comments
Open
Assignees
Labels
bug Something isn't working

Comments

@achirkin
Copy link
Contributor

achirkin commented Oct 2, 2024

Describe the bug
CAGRA fails with cudaErrorIllegalAddress in cuvs::neighbors::cagra::detail::graph::optimize when the total (internal) graph size exceeds the uint32_t range.

Steps/Code to reproduce bug
This is observed on 100M+ subset of BIGANN dataset (but passes to completion with 10M subset), CAGRA-build with all default settings

./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH --build --benchmark_min_time=1x --benchmark_min_warmup_time=0 --benchmark_counters_tabular --override_kv=graph_build_algo:"IVF_PQ" --override_kv=ivf_pq_search_refine_ratio:1  --override_kv=dataset_memory_type:"host" bigann-100M.json

Debug output and error trace

[I] [18:00:08.800354] Using the dataset file '<...>/bigann-1B/base.1B.u8bin'
2024-10-02T18:00:09+02:00
Running <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH
Run on (32 X 5272.66 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 6.81, 3.30, 2.37
command_line: <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH --build --benchmark_min_time=1x --benchmark_min_warmup_time=0 --benchmark_counters_tabular --override_kv=graph_build_algo:"IVF_PQ" --override_kv=ivf_pq_search_refine_ratio:1 --data_prefix=<...> --override_kv=dataset_memory_type:"host" ../raft/bench/ann/conf/bigann-100M.json
dataset: bigann-100M
dim: 128
distance: euclidean
gpu_driver_version: 12.4
gpu_gpuDirectRDMASupported: 0
gpu_hostNativeAtomicSupported: 0
gpu_mem_bus_width: 384
gpu_mem_freq: 9751000000.000000
gpu_mem_global_size: 25183584256
gpu_mem_shared_size: 102400
gpu_name: NVIDIA GeForce RTX 3090
gpu_pageableMemoryAccess: 1
gpu_pageableMemoryAccessUsesHostPageTables: 0
gpu_runtime_version: 12.2
gpu_sm_count: 82
gpu_sm_freq: 1695000000.000000
host_cores_used: 16
host_cpu_freq_max: 3400000000
host_cpu_freq_min: 2200000000
host_pagesize: 4096
host_processors_sysconf: 32
host_processors_used: 32
host_total_ram_size: 67320037376
host_total_swap_size: 134216675328
n_records: 100000000
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
using ivf_pq::index_params nrows 100000000, dim 128, n_lits 10000, pq_dim 32
[I] [18:03:23.457163] Calling raft::matrix::sample_rows(100000000, 128)
[I] [18:07:18.101338] raft::matrix::sample_rows - done
[I] [18:23:23.577272] optimizing graph
[I] [18:23:23.580016] # Pruning kNN graph (size=100000000, degree=128)

[I] [18:23:47.529341] # Pruning kNN Graph on GPUs
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
raft_cagra.default/process_time/real_time ERROR OCCURRED: 'CUDA error encountered at: file=<...>/raft/cpp/include/raft/util/cudart_utils.hpp line=148: call='cudaMemcpyAsync(dst, src, len * sizeof(Type), cudaMemcpyDefault, stream)', Reason=cudaErrorIllegalAddress:an illegal memory access was encountered
Obtained 21 stack frames
#1 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH: raft::exception::collect_call_stack() +0x38 [0x5f72e63904c8]
#2 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x56 [0x5f72e63ff0d6]
#3 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH: void raft::copy<unsigned int>(unsigned int*, unsigned int const*, unsigned long, rmm::cuda_stream_view) +0x196 [0x5f72e63ff296]
#4 in <...>/cuvs/cpp/build/libcuvs.so: cuvs::neighbors::cagra::detail::device_matrix_view_from_host<unsigned int, long>::device_matrix_view_from_host(raft::resources const&, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >) +0x19e [0x7cd5a01459ae]
#5 in <...>/cuvs/cpp/build/libcuvs.so: void cuvs::neighbors::cagra::detail::graph::optimize<unsigned int, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >(raft::resources const&, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, bool) +0x853 [0x7cd5a01571a3]
#6 in <...>/cuvs/cpp/build/libcuvs.so: cuvs::neighbors::cagra::index<unsigned char, unsigned int> cuvs::neighbors::cagra::detail::build<unsigned char, unsigned int, raft::host_device_accessor<std::experimental::default_accessor<unsigned char const>, (raft::memory_type)0> >(raft::resources const&, cuvs::neighbors::cagra::index_params const&, std::experimental::mdspan<unsigned char const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned char const>, (raft::memory_type)0> >) +0x550 [0x7cd5a025a4d0]
#7 in <...>/cuvs/cpp/build/libcuvs.so: cuvs::neighbors::cagra::build(raft::resources const&, cuvs::neighbors::cagra::index_params const&, std::experimental::mdspan<unsigned char const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned char const>, (raft::memory_type)0> >) +0x2f [0x7cd5a024453f]
#8 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xe64aa) [0x5f72e641a4aa]
#9 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xbe630) [0x5f72e63f2630]
#10 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x7b3d5) [0x5f72e63af3d5]
#11 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x122e40) [0x5f72e6456e40]
#12 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x1099bc) [0x5f72e643d9bc]
#13 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x10afb4) [0x5f72e643efb4]
#14 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x10bbf1) [0x5f72e643fbf1]
#15 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xedf6d) [0x5f72e6421f6d]
#16 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xeee9f) [0x5f72e6422e9f]
#17 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xef008) [0x5f72e6423008]
#18 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0xbb737) [0x5f72e63ef737]
#19 in /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7cd59f22a1ca]
#20 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0x8b [0x7cd59f22a28b]
#21 in <...>/cuvs/./cpp/build/bench/ann/CUVS_CAGRA_ANN_BENCH(+0x4d73b) [0x5f72e638173b]
@achirkin achirkin added the bug Something isn't working label Oct 2, 2024
@achirkin achirkin self-assigned this Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: In Progress
Development

No branches or pull requests

1 participant