
CUDA out of Memory with low Memory Utilization (CUDA error: device-side assert triggered) #3114

Open
emilwallner opened this issue Apr 25, 2024 · 5 comments


@emilwallner

🐛 Describe the bug

Hey,

First of all, thanks for creating such a fantastic open-source production server.

I'm reaching out due to an unexpected issue I can't solve. I've been running a TorchServe server in production for over a year (several million requests per week) and it has been working great. However, a few weeks ago it started crashing every 1-5 days.

I enabled export CUDA_LAUNCH_BLOCKING=1, and it gives me CUDA error: device-side assert triggered and CUDA out of memory when I move my data to the GPU. I also log torch.cuda.max_memory_allocated() and torch.cuda.memory_allocated().

I suspected that some unique edge case caused a memory leak, mismatched shapes or NaN values when moving data to the GPU, or too much memory being allocated. However, the models use only 6180 MiB of 23028 MiB, and torch.cuda.max_memory_allocated() reports around 366 MB.
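For reference, the memory numbers above are collected with calls along these lines (a sketch; the exact logging code in the handler isn't shown in this issue):

import logging

import torch

logger = logging.getLogger(__name__)

def log_gpu_memory(tag):
    # Report currently allocated and peak allocated GPU memory in MB.
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    logger.info("%s: allocated=%.0f MB, peak=%.0f MB", tag, allocated_mb, peak_mb)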

When I SSH into an instance that has crashed it looks like this:

[Video attachment: Screen.Recording.2024-04-25.at.22.35.24.mov]

The memory is at 6180 MiB, the GPU utilization flickers between 0% and 16%, and it gives me CUDA error: device-side assert triggered and CUDA out of memory.

Unfortunately, I can't find a way to reproduce the error; it happens at random every 1-5 days, and I have to reset the server and allocate a new instance. I've done everything I can think of to validate the data before moving it to the GPU and to reduce any memory overhead or potential memory leak.

Error logs

[Screenshots of the error logs: 2024-04-25 at 22:28:59 and 22:29:11]

Installation instructions

torchserve==0.10.0

Docker image: nvcr.io/nvidia/pytorch:22.12-py3

Ubuntu 20.04 including Python 3.8
NVIDIA CUDA® 11.8.0
NVIDIA cuBLAS 11.11.3.6
NVIDIA cuDNN 8.7.0.84
NVIDIA NCCL 2.15.5 (optimized for NVIDIA NVLink®)
NVIDIA RAPIDS™ 22.10.01 (For x86, only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph.)
Apex
rdma-core 36.0
NVIDIA HPC-X 2.13
OpenMPI 4.1.4+
GDRCopy 2.3
TensorBoard 2.9.0
Nsight Compute 2022.3.0.0
Nsight Systems 2022.4.2.1
NVIDIA TensorRT™ 8.5.1
Torch-TensorRT 1.1.0a0
NVIDIA DALI® 1.20.0
MAGMA 2.6.2
JupyterLab 2.3.2 including Jupyter-TensorBoard
TransformerEngine 0.3.0

Model Packaging


    def create_pil_image(self, image_data):
        try:
            image = Image.open(io.BytesIO(image_data)).convert("RGB")
            return image
        except IOError as e:
            # If the image data is not valid or not provided, create a blank image.
            width, height = 776, 776  # Set desired width and height for the blank image
            color = (255, 255, 255)  # Set desired color for the blank image (white in this case)
            image = Image.new("RGB", (width, height), color)
            return image

    def preprocess_and_stack_images(self, images):
        preprocessed_images = []
        for i, img in enumerate(images):
            try:
                preprocessed_img = self.resize_tensor(img)
                if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1:
                    # Log information about the image that doesn't meet the requirements
                    logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                    preprocessed_img = torch.zeros((3, 768, 768))
            except Exception as e:
                # Log the error message and load a blank image
                logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
            preprocessed_images.append(preprocessed_img)

        images_batch = torch.stack(preprocessed_images, dim=0)
        if len(images_batch.shape) == 3:
            images_batch = images_batch.unsqueeze(0)
        return images_batch
    
    def preprocess(self, data):

        images = []
        fns = []
        texts = []
        size = []
        merges = []
        org_images = []
        watermarks = []
        white_balance_list = []
        auto_color_list = []
        temperature_list = []
        saturation_list = []

        for row in data:

            image = row["image"]
            fn = self.decode_field(row["fn"])
            text = self.decode_field(row["text"])
            merged = self.decode_field(row["merged"])
            merged = True if merged.lower() == 'true' else False
            resolution = self.decode_field(row["resolution"])
            
            white_balance = self.decode_field(row["white_balance"])
            auto_color = self.decode_field(row["auto_color"])
            temperature = float(self.decode_field(row["temperature"]))
            saturation = float(self.decode_field(row["saturation"]))
            
            auto_color = True if auto_color == 'true' else False
            white_balance = True if white_balance == 'true' else False
            watermark = True if 'watermarked' in resolution else False
            
            if isinstance(image, str):
                logger.info(f"Image data should not be a string. Please provide the image data as bytes.")
                width, height = 224, 224  # Set desired width and height for the blank image
                color = (255, 255, 255)  # Set desired color for the blank image (white in this case)
                image = Image.new("RGB", (width, height), color)
            if isinstance(image, (bytearray, bytes)):
                image = self.create_pil_image(image)
                image = self.resize_image(image, resolution)
            
            org_images.append(image)
            texts.append(text)
            images.append(image)
            fns.append(fn)
            merges.append(merged)
            watermarks.append(watermark)
            white_balance_list.append(white_balance)
            temperature_list.append(temperature)
            saturation_list.append(saturation)
            auto_color_list.append(auto_color)
        
        texts_raw = self.tokenizer(texts) #type(torch.int32)
        texts = self.token_embedding(texts_raw).type(torch.float16) 
        texts = texts + self.positional_embedding.type(torch.float16)
        
        images_batch = self.preprocess_and_stack_images(images)

The error occurs when I move images_batch to the GPU.
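For context, the failing transfer is roughly the sketch below (self.device and the .detach() call match the handler code discussed later in the thread; the synchronize calls are added purely as a debugging aid, not part of the production handler). Because CUDA launches are asynchronous, the assert reported at the .to() call may actually have been raised by an earlier kernel.

    def move_batch_to_device(self, images_batch):
        # Flush pending kernels so that an error from earlier work surfaces here, not later.
        torch.cuda.synchronize(self.device)
        batch = images_batch.to(self.device).detach()
        # Force any assert raised by this copy (or preceding kernels) to be reported now.
        torch.cuda.synchronize(self.device)
        return batch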

config.properties

inference_address=http://0.0.0.0:8510
management_address=http://0.0.0.0:8511
metrics_address=https://0.0.0.0:8512
number_of_netty_threads=8
netty_client_threads=8
async_logging=true
enable_metrics_api=false
default_workers_per_model=1
max_request_size=20000000
max_response_size=20000000
job_queue_size=100
model_store=./model_store
load_models=all
models={
"palette_caption": {
"1.0": {
"defaultVersion": true,
"marName": "palette_caption.mar",
"minWorkers": 1,
"maxWorkers": 3,
"batchSize": 4,
"maxBatchDelay": 20,
"responseTimeout": 180
}
},
"palette_colorizer": {
"1.0": {
"defaultVersion": true,
"marName": "palette_colorizer.mar",
"minWorkers": 2,
"maxWorkers": 4,
"batchSize": 4,
"maxBatchDelay": 20,
"responseTimeout": 120
}
},
"palette_ref_colorizer": {
"1.0": {
"defaultVersion": true,
"marName": "palette_ref_colorizer.mar",
"minWorkers": 1,
"maxWorkers": 2,
"batchSize": 4,
"maxBatchDelay": 20,
"responseTimeout": 120
}
}
}

Versions

Pip freeze:

absl-py==1.3.0
aiohttp==3.8.4
aiosignal==1.3.1
aniso8601==9.0.1
annoy==1.17.1
ansi2html==1.9.1
anyio==4.3.0
apex==0.1
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==22.1.0
audioread==3.0.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
blinker==1.7.0
blis==0.7.9
cachetools==5.2.0
catalogue==2.0.8
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
cmake==3.24.1.1
comm==0.1.2
confection==0.0.3
contourpy==1.0.6
cuda-python @ file:///rapids/cuda_python-11.7.0%2B0.g95a2041.dirty-cp38-cp38-linux_x86_64.whl
cudf @ file:///rapids/cudf-22.10.0a0%2B316.gad1ba132d2.dirty-cp38-cp38-linux_x86_64.whl
cugraph @ file:///rapids/cugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl
cuml @ file:///rapids/cuml-22.10.0a0%2B56.g3a8dea659.dirty-cp38-cp38-linux_x86_64.whl
cupy-cuda118 @ file:///rapids/cupy_cuda118-11.0.0-cp38-cp38-linux_x86_64.whl
cycler==0.11.0
cymem==2.0.7
Cython==0.29.32
dask @ file:///rapids/dask-2022.9.2-py3-none-any.whl
dask-cuda @ file:///rapids/dask_cuda-22.10.0a0%2B23.g62a1ee8-py3-none-any.whl
dask-cudf @ file:///rapids/dask_cudf-22.10.0a0%2B316.gad1ba132d2.dirty-py3-none-any.whl
debugpy==1.6.4
decorator==5.1.1
defusedxml==0.7.1
distributed @ file:///rapids/distributed-2022.9.2-py3-none-any.whl
entrypoints==0.4
exceptiongroup==1.0.4
execnet==1.9.0
executing==1.2.0
expecttest==0.1.3
fastapi==0.110.1
fastjsonschema==2.16.2
fastrlock==0.8.1
Flask==3.0.3
Flask-RESTful==0.3.10
fonttools==4.38.0
frozenlist==1.4.1
fsspec==2022.11.0
ftfy==6.1.1
google-auth==2.15.0
google-auth-oauthlib==0.4.6
graphsurgeon @ file:///workspace/TensorRT-8.5.1.7/graphsurgeon/graphsurgeon-0.4.6-py2.py3-none-any.whl
grpcio==1.51.1
gunicorn==20.1.0
h11==0.14.0
HeapDict==1.0.1
httptools==0.6.1
hypothesis==5.35.1
idna==3.4
importlib-metadata==5.1.0
importlib-resources==5.10.1
iniconfig==1.1.1
intel-openmp==2021.4.0
ipykernel==6.19.2
ipython==8.7.0
ipython-genutils==0.2.0
itsdangerous==2.2.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.10
jsonschema==4.17.3
jupyter-tensorboard @ git+https://github.com/cliffwoolley/jupyter_tensorboard.git@ffa7e26138b82549453306e06b535a9ac36db17a
jupyter_client==7.4.8
jupyter_core==5.1.0
jupyterlab==2.3.2
jupyterlab-pygments==0.2.2
jupyterlab-server==1.2.0
jupytext==1.14.4
kiwisolver==1.4.4
kornia==0.7.2
kornia_rs==0.1.3
langcodes==3.3.0
librosa==0.9.2
llvmlite==0.39.1
locket==1.0.0
Markdown==3.4.1
markdown-it-py==2.1.0
MarkupSafe==2.1.1
matplotlib==3.6.2
matplotlib-inline==0.1.6
mdit-py-plugins==0.3.3
mdurl==0.1.2
mistune==2.0.4
mkl==2021.1.1
mkl-devel==2021.1.1
mkl-include==2021.1.1
mock==4.0.3
mpmath==1.2.1
msgpack==1.0.4
multidict==6.0.5
murmurhash==1.0.9
nbclient==0.7.2
nbconvert==7.2.6
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==2.6.3
notebook==6.4.10
numba==0.56.4
numpy==1.22.2
nvgpu==0.9.0
nvidia-dali-cuda110==1.20.0
nvidia-pyindex==1.0.9
nvtx==0.2.5
oauthlib==3.2.2
onnx @ file:///opt/pytorch/pytorch/third_party/onnx
opencv @ file:///opencv-4.6.0/modules/python/package
packaging==22.0
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
partd==1.3.0
pathy==0.10.1
pexpect==4.8.0
pickleshare==0.7.5
pillow==10.2.0
pillow-avif-plugin==1.4.2
pillow-heif==0.14.0
pkgutil_resolve_name==1.3.10
platformdirs==2.6.0
pluggy==1.0.0
polygraphy==0.43.1
pooch==1.6.0
preshed==3.0.8
prettytable==3.5.0
prometheus-client==0.15.0
prompt-toolkit==3.0.36
protobuf==3.20.1
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow @ file:///rapids/pyarrow-9.0.0-cp38-cp38-linux_x86_64.whl
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.10.1
pycocotools @ git+https://github.com/nvidia/cocoapi.git@8b8fd68576675c3ee77402e61672d65a7d826ddf#subdirectory=PythonAPI
pycparser==2.21
pydantic==1.9.2
Pygments==2.13.0
pylibcugraph @ file:///rapids/pylibcugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl
pylibraft @ file:///rapids/pylibraft-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl
pynvml==11.4.1
pyparsing==3.0.9
pyrsistent==0.19.2
pytest==7.2.0
pytest-rerunfailures==10.3
pytest-shard==0.1.2
pytest-xdist==3.1.0
python-dateutil==2.8.2
python-dotenv==1.0.1
python-hostlist==1.22
python-multipart==0.0.5
pytorch-quantization==2.1.2
pytz==2022.6
PyYAML==6.0
pyzmq==24.0.1
raft-dask @ file:///rapids/raft_dask-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
resampy==0.4.2
rmm @ file:///rapids/rmm-22.10.0a0%2B38.ge043158.dirty-cp38-cp38-linux_x86_64.whl
rsa==4.9
scikit-learn @ file:///rapids/scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl
scipy==1.6.3
Send2Trash==1.8.0
six==1.16.0
smart-open==6.3.0
sniffio==1.3.1
sortedcontainers==2.4.0
soundfile==0.11.0
soupsieve==2.3.2.post1
spacy==3.4.4
spacy-legacy==3.0.10
spacy-loggers==1.0.4
sphinx-glpi-theme==0.3
srsly==2.4.5
stack-data==0.6.2
starlette==0.37.2
sympy==1.11.1
tabulate==0.9.0
tbb==2021.7.1
tblib==1.7.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorrt @ file:///workspace/TensorRT-8.5.1.7/python/tensorrt-8.5.1.7-cp38-none-linux_x86_64.whl
termcolor==2.4.0
terminado==0.17.1
thinc==8.1.5
threadpoolctl==3.1.0
tinycss2==1.2.1
tinydb==4.7.0
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
torch==1.14.0a0+410ce96
torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/py/dist/torch_tensorrt-1.3.0a0-cp38-cp38-linux_x86_64.whl
torchserve==0.10.0
torchtext @ git+https://github.com/pytorch/text@fae8e8cabf7adcbbc2f09c0520216288fd53f33b
torchvision @ file:///opt/pytorch/vision
tornado==6.1
tqdm==4.64.1
traitlets==5.7.1
transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@73166c4e3f6cf0e754045ba22ff461ef96453aeb
treelite @ file:///rapids/treelite-2.4.0-py3-none-manylinux2014_x86_64.whl
treelite-runtime @ file:///rapids/treelite_runtime-2.4.0-py3-none-manylinux2014_x86_64.whl
typer==0.7.0
types-python-dateutil==2.9.0.20240316
typing_extensions==4.11.0
ucx-py @ file:///rapids/ucx_py-0.27.0a0%2B29.ge9e81f8-cp38-cp38-linux_x86_64.whl
uff @ file:///workspace/TensorRT-8.5.1.7/uff/uff-0.6.9-py2.py3-none-any.whl
urllib3==1.26.13
uvicorn==0.20.0
uvloop==0.19.0
wasabi==0.10.1
watchfiles==0.21.0
wcwidth==0.2.5
webencodings==0.5.1
websockets==12.0
Werkzeug==3.0.2
xdoctest==1.0.2
xgboost @ file:///rapids/xgboost-1.6.2-cp38-cp38-linux_x86_64.whl
yarl==1.9.4
zict==2.2.0
zipp==3.11.0

Repro instructions

Unfortunately, I can't find a way to reproduce the error; it randomly appears every 1-5 days.

Possible Solution

There are a few things that are a bit odd about this issue:

  • The server ran fine for over a year. I only made a few updates a few months back, and all of a sudden it started crashing frequently.
  • I thought some edge case was crashing the server, but it only crashes some of the running instances.
  • It happens randomly every 1-5 days, which is why I assumed a memory leak, but I can't find any evidence of one.
  • I get device-side assert triggered and CUDA out of memory errors, yet there seems to be plenty of available memory, and I check for NaN values and wrong shapes before placing data on the GPU.

I've run out of ideas; any thoughts or feedback would be much appreciated.

@mreso
Collaborator

mreso commented Apr 25, 2024

Hi @emilwallner,
thanks for the extensive issue report.

My thoughts on this are:

  1. You're looking at the server after the crash, right? Meaning that the worker process has died and been restarted, so memory is back to normal.
  2. I can't find the line from your stack trace in your code, but I assume it's basically the next line after your snippet. detach() does not create a copy of the data, so you should still have only a single batch on the device.
  3. You're resizing the images with a resolution coming from the requests and then re-resizing the tensor in preprocess_and_stack_images to (3,768,768). If the stacking ends up joining the images along the channel dimension, creating e.g. (6,768,768) before you add a batch dimension with unsqueeze, your model might do something funky when it gets (1,6,768,768) instead of (2,3,768,768) (see the shape sketch below).
  4. What is your batch size? Did you try using batch_size=1 for some time?
  5. In the video there are multiple processes on the GPU, do you use multiple workers for the same model?
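For reference, a quick check of the shapes the two stacking variants produce (a standalone sketch, with dummy tensors standing in for the preprocessed images):

import torch

imgs = [torch.zeros(3, 768, 768), torch.zeros(3, 768, 768)]

stacked = torch.stack(imgs, dim=0)   # adds a new batch dimension -> (2, 3, 768, 768)
joined = torch.cat(imgs, dim=0)      # concatenates along channels -> (6, 768, 768)

print(stacked.shape)                 # torch.Size([2, 3, 768, 768])
print(joined.unsqueeze(0).shape)     # torch.Size([1, 6, 768, 768])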

That's all I have for now, but I'm happy to keep spitballing and iterating on this until you find a solution!

Best
Matthias

@emilwallner
Author

Really, really appreciate your input, @mreso!

  1. The worker crashes, returns 507, and doesn't recover.
  2. Yeah, I added detach to make sure requires_grad is set to False.
  3. Yeah, that could be it.
  4. I switched the batch size to 1 following your suggestion. I also check that the batch has the correct dtype and final shape.
  5. Yes, multiple workers per model.

I also realized that CUDA_LAUNCH_BLOCKING=1 reduces performance by about 70%, so I'll turn it off for now.

Here's my updated check:

    def preprocess_and_stack_images(self, images):
        preprocessed_images = []

        for i, img in enumerate(images):
            try:
                preprocessed_img = self.resize_tensor(img)

                if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1 or preprocessed_img.dtype != torch.float32:
                    # Log information about the image that doesn't meet the requirements
                    logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                    preprocessed_img = torch.zeros((3, 768, 768))
            except Exception as e:
                # Log the error message and load a blank image
                logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))

            preprocessed_images.append(preprocessed_img)

        images_batch = torch.stack(preprocessed_images, dim=0)

        if len(images_batch.shape) == 3:
            images_batch = images_batch.unsqueeze(0)

        # Second check: the final batch shape must be (1, 3, 768, 768)
        if images_batch.shape != (1, 3, 768, 768):
            # Log information about the batch that doesn't meet the requirements
            logger.info(f"Batch shape {images_batch.shape} does not match the required shape (1, 3, 768, 768). Replacing with a blank batch.")
            images_batch = torch.zeros((1, 3, 768, 768))

        return images_batch

Again, really appreciate the brainstorming — let’s keep at it until we crack this!

@mreso
Collaborator

mreso commented Apr 29, 2024

Yeah, performance will suffer significantly from CUDA_LAUNCH_BLOCKING, as kernels will no longer run asynchronously. So only activate it when it's really necessary for debugging.

You could try to run the model in a notebook with a (1,6,768,768) input and observe the memory usage compared to (2,3,768,768). I'm wondering why this actually seems to work in the first place.
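A rough notebook sketch for that comparison could look like the following (how the model is loaded, the single-tensor call signature, and the float16 input dtype are assumptions; adapt to your setup):

import torch

def peak_forward_memory_mb(model, shape, device="cuda"):
    # Run a single forward pass with a dummy input and report peak allocated memory in MB.
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.zeros(shape, device=device, dtype=torch.float16)
    with torch.no_grad():
        model(x)
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1024**2

# e.g. compare peak_forward_memory_mb(model, (2, 3, 768, 768))
#      with    peak_forward_memory_mb(model, (1, 6, 768, 768))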

@emilwallner
Author

I haven't tried the (1,6,768,768) input yet, but since our model expects three channels, it should throw an error during execution.

Now I double-check the shape (1,3,768,768) and dtype, and ensure the values are in the correct range. Despite that, I'm still hitting CUDA error: device-side assert triggered when moving the batch with images_batch = images_batch.to(self.device).detach().

Got any more suggestions on what might be causing this?

@ptrblck

ptrblck commented May 16, 2024

Cross-post from here with a stack trace pointing to a real indexing error.
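An out-of-range index reaching an embedding lookup is a common trigger for this kind of device-side assert. A defensive check along these lines in the handler's preprocess step would surface it before the GPU kernel fires (a sketch only: it assumes the tokenizer returns an integer tensor and that token_embedding is an nn.Embedding exposing num_embeddings):

    vocab_size = self.token_embedding.num_embeddings  # assumption: token_embedding is an nn.Embedding
    if texts_raw.min() < 0 or texts_raw.max() >= vocab_size:
        logger.error(f"Token ids out of range [0, {vocab_size}): "
                     f"min={texts_raw.min().item()}, max={texts_raw.max().item()}")
        texts_raw = texts_raw.clamp(0, vocab_size - 1)  # or replace the offending request with a blank one
    texts = self.token_embedding(texts_raw).type(torch.float16)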
