CUDA OOM error not caught with auto microbatching #3397

JAEarly · 2024-06-12T15:07:45Z

When using device_train_microbatch_size="auto", I get a CUDA OOM error that is not caught, but I believe it should be.

The _is_cuda_oom method in trainer/trainer.py detects CUDA OOM errors using if 'CUDA out of memory' in str(e).
The error I got has the message CUDA error: out of memory, so it is not caught by the _is_cuda_oom check.
I suspect this depends on the CUDA version: it worked on one system with v12.1 but failed on another with v12.4.

Potential fixes:

1. Change _is_cuda_oom to use if 'out of memory' in str(e).
2. Change _is_cuda_oom to check for both CUDA out of memory and CUDA error: out of memory.

Happy to open an MR for either of the above fixes.
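
To make the difference between the two fixes concrete, here is a minimal sketch. The function names below are hypothetical stand-ins written for illustration, not Composer's actual code; only the two message strings and the substring check come from the description above. The "unrelated" error message is made up to show how the broader match in option 1 behaves.

```python
# Hypothetical stand-ins for the check described in this issue;
# these are NOT Composer's real functions.

# The two message variants mentioned in this issue.
OOM_MESSAGES = ('CUDA out of memory', 'CUDA error: out of memory')


def is_cuda_oom_current(e: RuntimeError) -> bool:
    """Check as described in the issue: matches only the older message."""
    return 'CUDA out of memory' in str(e)


def is_cuda_oom_option_1(e: RuntimeError) -> bool:
    """Option 1: match any message containing 'out of memory'."""
    return 'out of memory' in str(e)


def is_cuda_oom_option_2(e: RuntimeError) -> bool:
    """Option 2: match either of the two known CUDA OOM messages."""
    return any(msg in str(e) for msg in OOM_MESSAGES)


old_style = RuntimeError('CUDA out of memory. Tried to allocate 20.00 MiB')
new_style = RuntimeError('CUDA error: out of memory')
# Made-up message, used only to show that option 1 is broader.
unrelated = RuntimeError('host ran out of memory while loading data')

for name, err in [('old', old_style), ('new', new_style), ('unrelated', unrelated)]:
    print(name, is_cuda_oom_current(err), is_cuda_oom_option_1(err), is_cuda_oom_option_2(err))
```

Option 1 is shorter but would also treat non-CUDA errors that mention "out of memory" as CUDA OOMs; option 2 is explicit about which messages count, at the cost of needing an update if another variant appears.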

Environment

Where it failed:

---- COMPOSER ENV ----

Collecting system information...
---------------------------------
System Environment Report
Created: 2024-06-12 14:14:28 UTC
---------------------------------

PyTorch information
-------------------
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.217-205.860.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               48
On-line CPU(s) list:                  0-47
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             7
BogoMIPS:                             4999.98
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            768 KiB (24 instances)
L1i cache:                            768 KiB (24 instances)
L2 cache:                             24 MiB (24 instances)
L3 cache:                             35.8 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-47
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnx==1.16.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.3.0
[pip3] torch-optimizer==0.3.0
[pip3] torch-tb-profiler==0.4.3
[pip3] torchaudio==2.3.0
[pip3] torchlibrosa==0.1.0
[pip3] torchmetrics==1.3.2
[pip3] torchvision==0.18.0
[pip3] triton==2.3.0
[conda] Could not collect


Composer information
--------------------
Composer version: 0.22.0
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Host processor core count: 24
Number of nodes: 1
Accelerator model name: Tesla T4
Accelerators per node: 1
CUDA Device Count: 1


---- NVCC ----
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Where it worked:

---- COMPOSER ENV ----
Collecting system information...
---------------------------------
System Environment Report
Created: 2024-06-12 13:59:33 UTC
---------------------------------

PyTorch information
-------------------
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.24.1
Libc version: glibc-2.35

Python version: 3.10.14 (main, Apr 12 2024, 10:51:19) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.192-183.736.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB

Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family:                         6
Model:                              79
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           1
CPU max MHz:                        3000.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           4600.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
Hypervisor vendor:                  Xen
Virtualization type:                full
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           4 MiB (16 instances)
L3 cache:                           45 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] hs-infrastructure-flake8-plugins==0.2.0
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.3.0
[pip3] torch-optimizer==0.3.0
[pip3] torch-tb-profiler==0.4.3
[pip3] torchaudio==2.3.0
[pip3] torchlibrosa==0.1.0
[pip3] torchmetrics==1.3.2
[pip3] torchvision==0.18.0
[pip3] triton==2.3.0
[conda] Could not collect


Composer information
--------------------
Composer version: 0.22.0
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Host processor core count: 16
Number of nodes: 1
Accelerator model name: Tesla V100-SXM2-16GB
Accelerators per node: 1
CUDA Device Count: 2


---- NVCC ----
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

mvpatel2000 · 2024-06-12T15:17:50Z

@JAEarly thanks for flagging this! I would prefer option (2), which is a bit more explicit. We'd love a PR

JAEarly · 2024-06-13T09:57:55Z

> @JAEarly thanks for flagging this! I would prefer option (2), which is a bit more explicit. We'd love a PR

Please see #3400

JAEarly · 2024-06-14T13:17:44Z

Closing as merged in #3400

JAEarly added the bug (Something isn't working) label Jun 12, 2024

JAEarly mentioned this issue Jun 13, 2024

Check for 'CUDA error: out of memory' when auto-microbatching #3400

Merged

JAEarly closed this as completed Jun 14, 2024
