segfault with gpu enabled on python3.9 and ubuntu 21 #10210

Closed
mic-p opened this issue Aug 15, 2021 · 2 comments
Labels: stat:awaiting response (Waiting on input from the contributor)

mic-p commented Aug 15, 2021

Hi all,
I'm trying to get TensorFlow working on my laptop with:

  • Ubuntu 21 desktop, freshly installed
  • Python 3.9, TensorFlow installed with pip3 install tensorflow (v2.6.0)
  • cuDNN just downloaded from NVIDIA (8.2.2, July 6th 2021, for CUDA 11.4)
  • GPU: product: GM108M [GeForce MX130]

root@mic:~# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   48C    P0    N/A /  N/A |    419MiB /  2004MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

but when I try to use the GPU, I get a segfault.
The same code, run on the same machine but with the GPU disabled (export CUDA_VISIBLE_DEVICES="" ; python3 ts_yah.py), works like a charm.
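
For reference, here is a minimal sketch along the lines of what I'm running (not the original ts_yah.py, just an illustration): it lists the visible devices and fits a tiny LSTM, which on GPU goes through the CudnnRNN kernel that shows up in the backtrace below.

import tensorflow as tf

# With CUDA_VISIBLE_DEVICES="" this prints an empty list and the model runs on CPU.
print(tf.config.list_physical_devices('GPU'))

# A tiny LSTM; on GPU this exercises the cuDNN RNN path (the CudnnRNN kernel).
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

x = tf.random.normal((64, 10, 8))
y = tf.random.normal((64, 1))
model.fit(x, y, epochs=1)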

I tried to debug the script with gdb; here is the bt output:

Thread 38 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffeacff9640 (LWP 3124)]
0x00007ffff7fd6ec0 in ?? () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0 0x00007ffff7fd6ec0 in ?? () from /lib64/ld-linux-x86-64.so.2
#1 0x00007ffff7fdef96 in ?? () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7d4e288 in __GI__dl_catch_exception (exception=0x7ffeacff72a0, operate=0x7ffff7fdece0, args=0x7ffeacff72c0) at dl-error-skeleton.c:208
#3 0x00007ffff7fde6ed in ?? () from /lib64/ld-linux-x86-64.so.2
#4 0x00007ffff7fa634c in dlopen_doit (a=a@entry=0x7ffeacff74f0) at dlopen.c:66
#5 0x00007ffff7d4e288 in __GI__dl_catch_exception (exception=exception@entry=0x7ffeacff7490, operate=0x7ffff7fa62f0 <dlopen_doit>, args=0x7ffeacff74f0) at dl-error-skeleton.c:208
#6 0x00007ffff7d4e353 in __GI__dl_catch_error (objname=0x7ffe90007200, errstring=0x7ffe90007208, mallocedp=0x7ffe900071f8, operate=, args=) at dl-error-skeleton.c:227
#7 0x00007ffff7fa6b89 in _dlerror_run (operate=operate@entry=0x7ffff7fa62f0 <dlopen_doit>, args=args@entry=0x7ffeacff74f0) at dlerror.c:170
#8 0x00007ffff7fa63d8 in __dlopen (file=, mode=) at dlopen.c:87
#9 0x00007fff4c2e974b in cudnnCreate () from /usr/lib/cuda/lib64/libcudnn.so.8
#10 0x00007fff8129f770 in cudnnCreate () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#11 0x00007fff8126b7c2 in stream_executor::gpu::CudnnSupport::Init() () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#12 0x00007fff8126c2d7 in stream_executor::initialize_cudnn()::{lambda(stream_executor::internal::StreamExecutorInterface*)#1}::operator()(stream_executor::internal::StreamExecutorInterface*) const [clone .isra.587] ()
from /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#13 0x00007fff8613b283 in stream_executor::gpu::GpuExecutor::CreateDnn() () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007fff91c4d189 in stream_executor::StreamExecutor::AsDnn() () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007fff91c4d361 in stream_executor::StreamExecutor::createRnnDescriptor(int, int, int, int, int, stream_executor::dnn::RnnInputMode, stream_executor::dnn::RnnDirectionMode, stream_executor::dnn::RnnMode, stream_executor::dnn::DataType, stream_executor::dnn::AlgorithmConfig const&, float, unsigned long, stream_executor::ScratchAllocator*, bool) () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#16 0x00007fff8adb2f13 in tensorflow::Status tensorflow::CudnnRNNKernelCommon::GetCachedRnnDescriptor(tensorflow::OpKernelContext*, tensorflow::(anonymous namespace)::CudnnRnnModelShapes const&, stream_executor::dnn::RnnInputMode const&, stream_executor::dnn::AlgorithmConfig const&, tensorflow::gtl::FlatMap<std::pair<tensorflow::(anonymous namespace)::CudnnRnnModelShapes, absl::lts_20210324::optional<stream_executor::dnn::AlgorithmDesc> >, tensorflow::(anonymous namespace)::RnnScratchSpace, tensorflow::(anonymous namespace)::CudnnRnnConfigHasher, tensorflow::(anonymous namespace)::CudnnRnnConfigComparator>, stream_executor::dnn::RnnDescriptor**, bool) [clone .constprop.477] ()
from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#17 0x00007fff8adb3791 in tensorflow::CudnnRNNForwardOp<Eigen::GpuDevice, float>::ComputeAndReturnAlgorithm(tensorflow::OpKernelContext*, stream_executor::dnn::AlgorithmConfig*, bool, bool, int) ()
from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#18 0x00007fff8adabb96 in tensorflow::CudnnRNNForwardOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#19 0x00007fff8081a3b9 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#20 0x00007fff80910b73 in tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode, long) ()
from /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#21 0x00007fff85dfa1b1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#22 0x00007fff85df6ec3 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /usr/local/lib/python3.9/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#23 0x00007fff80dd9665 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) () from /usr/local/lib/python3.9/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#24 0x00007ffff7ded450 in start_thread (arg=0x7ffeacff9640) at pthread_create.c:473
#25 0x00007ffff7d0dd53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

without gdb:
2021-08-15 17:07:58.551144: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:58.556926: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:58.557276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2021-08-15 17:07:59.388972: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-15 17:07:59.389485: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:59.389844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:59.390101: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:59.856256: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:59.856605: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:59.856890: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-15 17:07:59.857142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1258 MB memory: -> device: 0, name: NVIDIA GeForce MX130, pci bus id: 0000:01:00.0, compute capability: 5.0
2021-08-15 17:08:00.342971: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-08-15 17:08:00.343008: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-08-15 17:08:00.343036: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-08-15 17:08:00.499252: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-08-15 17:08:00.501123: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
2021-08-15 17:08:00.560773: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/300
Segmentation fault (core dumped)

What now?
Thanks

ymodak (Contributor) commented Aug 17, 2021

TF 2.6 prebuilt binaries support CUDA 11.2 and cuDNN 8.1:
https://www.tensorflow.org/install/source#gpu
Have you built TF from source for CUDA 11.4? If you are using the pip install, can you please switch back to CUDA 11.2 and check whether the issue persists? Thanks!
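
As a quick check (a sketch using the tf.sysconfig API, assuming a pip-installed wheel), you can print the CUDA/cuDNN versions your TF binary was built against; the TF 2.6 wheel should report CUDA 11.2 and cuDNN 8:

import tensorflow as tf

# Versions the installed TensorFlow binary was compiled against;
# these must match the CUDA/cuDNN libraries installed on the system.
info = tf.sysconfig.get_build_info()
print(info['cuda_version'], info['cudnn_version'])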

ymodak added the stat:awaiting response (Waiting on input from the contributor) label Aug 17, 2021
mic-p (Author) commented Aug 18, 2021

Hi,
thanks for your comment.
I installed all the packages as prebuilt binaries: pip in the case of TF, CUDA from the Ubuntu archives, and cuDNN from the NVIDIA site.

I hadn't noticed that TF is only compatible with cuDNN 8.1, and I had installed the latest version I found on the NVIDIA website.
Now, with the right version 8.1 (CUDA 11.2), everything flies like a Concorde ;)
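
In case it helps anyone else, here is the quick sanity check I'd suggest after switching to the matching versions (a minimal sketch, not my actual script):

import tensorflow as tf

# The MX130 should show up here and the matmul should run on it without a segfault.
print(tf.config.list_physical_devices('GPU'))
with tf.device('/GPU:0'):
    a = tf.random.normal((1024, 1024))
    print(tf.reduce_sum(tf.matmul(a, a)).numpy())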

Thanks a lot and please close the issue

Michele

ymodak closed this as completed Aug 19, 2021