
Production issues with Pytorch 2.0.1-1.5.9 #1409

Closed
jxtps opened this issue Aug 31, 2023 · 9 comments


jxtps commented Aug 31, 2023

I've been working on upgrading from "pytorch-platform" % "1.10.2-1.5.7" to "pytorch-platform" % "2.0.1-1.5.9" and ran into some snags. Everything worked great when just testing one or two images, but:

  1. After <10 minutes of production load with 2.0.1, the process would abruptly exit. Unfortunately I don't have core dumps, and there were no fatal error logs generated. 1.10.2 has been running reasonably smoothly (occasional crashes) over the last year.
  2. I then created a stress test where 3 threads continuously make calls to pytorch (no production load, fully artificial). On 2.0.1 the instance started swapping to disk & ground to a barely responsive crawl after just a few minutes. On 1.10.2 it ran smoothly for 3+ hours (until I stopped the test manually).

My calling code and overall environment are the same for both versions:

  • Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1040-aws x86_64)
  • OpenJDK 64-Bit Server VM (build 17.0.8+7-Ubuntu-120.04.2, mixed mode, sharing)
  • AWS g5.2xlarge instance
  • I'm doing image processing, so it's megapixel image in, megapixel image out type pytorch payloads.

I'm using libtorch, so https://download.pytorch.org/libtorch/cu117/libtorch-shared-with-deps-2.0.1%2Bcu117.zip for 2.0.1 and https://download.pytorch.org/libtorch/cu113/libtorch-shared-with-deps-1.10.2%2Bcu113.zip for 1.10.2, with the contents of libtorch/lib unzipped directly into /usr/lib/jni, LD_LIBRARY_PATH=/usr/lib/jni:... set, and System.setProperty("org.bytedeco.javacpp.pathsFirst", "true") called.

I'm happy to produce a minimal test case if that would help, but I'm not sure whether the changes in #1360 make this issue obsolete.

HGuillemet (Collaborator) commented:

Could you test with the latest snapshot? And if the behavior is the same, share the code of your stress test?


jxtps commented Aug 31, 2023

It seems like there's an off-java-heap memory leak.

When I run the stress test with 2.0.1 locally (Windows 10, Java 8) and monitor it with JVisualVM, the Java heap usage looks great, but in the Task Manager the process memory just keeps climbing, and I eventually get an OutOfMemoryError: "Physical memory usage is too high: physicalBytes (45055M) > maxPhysicalBytes (29128M)".

The physicalBytes jumped from ~28 gigs to ~45 gigs suddenly at the end, unclear why.

When I run the stress test with 1.10.2 locally, the Task Manager memory doesn't go beyond 15 gigs.

All my ephemeral torch JavaCPP objects (= org.bytedeco.pytorch.Tensors in & out of torch) are created inside

try (PointerScope scope = new PointerScope(null)) {
    ...
    IValue ivalue = module.forward(inputs);
    ...
}

which I thought guaranteed collection at the end of the block. Did something change with the PointerScope from 1.5.7 to 1.5.9? Is e.g. the return value from module.forward() no longer tracked by PointerScope?
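
For reference, this is the behavior I'm relying on. A minimal standalone sketch with plain JavaCPP pointers (not my production code), where I'd expect the native allocation to be released as soon as the scope closes:

import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.PointerScope;

public class ScopeCheck {
    public static void main(String[] args) {
        long before = Pointer.totalBytes();
        try (PointerScope scope = new PointerScope()) {
            // Any Pointer created here is registered with the innermost open scope.
            FloatPointer scratch = new FloatPointer(1_000_000);
        }
        // The scope deallocates everything it tracked when it closes, so
        // totalBytes() should be back to roughly its starting value here.
        System.out.println("before=" + before + ", after=" + Pointer.totalBytes());
    }
}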


jxtps commented Aug 31, 2023

> Could you test with the latest snapshot? And if the behavior is the same, share the code of your stress test?

Ran locally with "pytorch-platform" % "2.0.1-1.5.10-SNAPSHOT" and the process memory still grows unbounded.

The stress test is somewhat project-specific: I'm currently testing against three different DL network chains, where each chain can do several operations (scale the image, process it, take the result and do something with it, then send it to another network...). Conceptually it's:

  1. Create three threads
  2. Each thread does an infinite loop of: create random input & call one of three network chains

Then just let those threads run; a rough skeleton is sketched below.
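
Roughly, as a sketch (processWithChain is a hypothetical stand-in for the real, project-specific chains):

import java.util.Random;

public class StressTest {

    // Hypothetical stand-in for one of the three project-specific network chains:
    // build tensors, transfer to the GPU, call module.forward(...), fetch results.
    static void processWithChain(int chainIndex, float[] input) {
        // ...
    }

    public static void main(String[] args) {
        for (int t = 0; t < 3; t++) {
            final long seed = t;
            new Thread(() -> {
                Random rnd = new Random(seed);
                for (;;) { // each thread loops forever
                    float[] input = new float[4 * 512 * 512]; // random megapixel-ish payload
                    for (int i = 0; i < input.length; i++) input[i] = rnd.nextFloat();
                    processWithChain(rnd.nextInt(3), input); // pick one of the three chains
                }
            }).start();
        }
    }
}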


jxtps commented Aug 31, 2023

I ran just a single input with System.setProperty("org.bytedeco.javacpp.logger.debug", "true"); and 1.10.2 allocates & releases one extra org.bytedeco.pytorch.TypeMeta compared to the snapshot, but otherwise their allocations & releases appear the same.

???


jxtps commented Aug 31, 2023

Is PointerScope thread-safe in 1.5.7 but not in 1.5.9 / 1.5.10-SNAPSHOT?

My calling code is actually more like this:

try (PointerScope scope = new PointerScope(null)) {
    ... // Some JavaCPP allocations happen here
    device.forwardSemaphore.acquireUninterruptibly();
    try {
        IValue ivalue = module.forward(inputs);
        ...
    } finally {
        device.forwardSemaphore.release();
    }
}

I transfer the inputs to the GPU before acquiring the semaphore on the (somewhat tested) assumption that it can be done in parallel with whatever computations are happening on the GPU.

Then I acquire the semaphore just before calling module.forward() and fetching the results (the computation is async and module.forward() returns quickly, but fetching the results back to the host effectively waits for both the computation and the transfer).

This means that all three threads will likely have some active JavaCPP allocations while two of them wait for the semaphore to be released by the third.
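
Put together, the pattern is roughly this (a sketch; device here is my own wrapper holding the semaphore, not org.bytedeco.pytorch.Device, and module, shape and the *Optional arguments to torch.empty are created elsewhere):

try (PointerScope scope = new PointerScope(null)) {
    // Build the input directly on the GPU before taking the semaphore, so the
    // host-to-device work can overlap with whatever another thread is computing.
    Tensor input = torch.empty(shape, dtypeNetworkOpt, layoutOpt, deviceOpt, boolOpt, mfContiguous);
    IValueVector inputs = new IValueVector();
    inputs.push_back(new IValue(input));

    device.forwardSemaphore.acquireUninterruptibly();
    try {
        // forward() returns quickly because the CUDA work is asynchronous; copying
        // the result back to the host is what actually waits for the computation.
        IValue result = module.forward(inputs);
        // ... read the output tensor and copy it back to the host ...
    } finally {
        device.forwardSemaphore.release();
    }
}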


jxtps commented Aug 31, 2023

That's not it. I tried a single-threaded stress test and process memory still grows unbounded with "pytorch-platform" % "2.0.1-1.5.10-SNAPSHOT".

HGuillemet (Collaborator) commented:

> Did something change with the PointerScope from 1.5.7 to 1.5.9? Is e.g. the return value from module.forward() no longer tracked by PointerScope?

Nothing that I'm aware of.

Is your stress test structured this way:

for (;;) { // Big loop
  try (PointerScope scope = new PointerScope(null)) {
    ....
  }
}

with nothing inside the loop but outside the pointer scope?

Could you trace, at each iteration, Pointer.totalCount(), Pointer.totalBytes() and Pointer.physicalBytes()?
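
Something like this at the end of each iteration, for example (just a sketch of what I mean):

// Inside the big loop, after the try-with-resources has closed the PointerScope:
System.out.printf("%,d pointers, %,d tracked bytes, %,d physical bytes%n",
        Pointer.totalCount(), Pointer.totalBytes(), Pointer.physicalBytes());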

Is your code using callbacks (Java code called from libtorch)?


jxtps commented Aug 31, 2023

Ok, I've narrowed it down, and it appears to be a pytorch (not JavaCPP) memory leak when using variable-sized inputs. Steps to reproduce:

First, run this Python code to export the network (or just download it here: test_2_0_1_cu117.zip):

import torch

class TestModel(torch.nn.Sequential):
    def __init__(self, in_chans=4, ch64: int = 64):
        super().__init__(
            torch.nn.Conv2d(in_chans, ch64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(ch64, ch64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(ch64, in_chans, 3, padding=1),
        )

if __name__ == '__main__':
    torch.jit.script(TestModel()).save(f'test_{torch.__version__.replace("+", "_").replace(".", "_")}.pt')

Then create a Java project and run:

package misc;

import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.PointerScope;
import org.bytedeco.pytorch.*;
import org.bytedeco.pytorch.global.torch;

import java.util.Random;

public class TestPytorch {

    public static void main(String[] args) {
        System.setProperty("org.bytedeco.javacpp.pathsFirst", "true");
        // System.setProperty("org.bytedeco.javacpp.logger.debug", "true");

        String fn = "store/models_torch/test_2_0_1_cu117.pt";
        JitModule module = torch.load(fn);

        Random rnd = new Random(42);
        long[] shape = {1, 4, 512, 512};

        Device device = new Device(torch.DeviceType.CUDA, (byte) 0);
        ScalarTypeOptional dtypeNetworkOpt = new ScalarTypeOptional(torch.ScalarType.Float);
        LayoutOptional layoutOpt = new LayoutOptional();
        DeviceOptional deviceOpt = new DeviceOptional(device);
        BoolOptional boolOpt = new BoolOptional();
        MemoryFormatOptional mfContiguous = new MemoryFormatOptional(torch.MemoryFormat.Contiguous);
        module.eval();
        module.to(device);

        for (; ; ) {
            try (PointerScope scope = new PointerScope(null);
                 NoGradGuard no_grad = new NoGradGuard()) {
                IValueVector inputs = new IValueVector();
                shape[2] = rnd.nextInt(512 - 128) + 128; // <-- Comment out to avoid the memory leak.
                shape[3] = rnd.nextInt(512 - 128) + 128; // <-- Comment out to avoid the memory leak.
                Tensor tensor = torch.empty(shape, dtypeNetworkOpt, layoutOpt, deviceOpt, boolOpt, mfContiguous);
                inputs.push_back(new IValue(tensor));
                module.forward(inputs);
            }
            System.out.printf("%,4d pointers, %,4d total bytes, %,16d physical bytes, %,16d java bytes%n",
                    Pointer.totalCount(), Pointer.totalBytes(), Pointer.physicalBytes(), java.lang.Runtime.getRuntime().totalMemory());
        }
    }
}

The physical bytes grow unbounded, but the Java heap bytes stay flat.

If you comment out the shape[2] and shape[3] reassignments, both stay basically flat; there's some wobble in the physical bytes, but nothing like when the shape is randomized.


jxtps commented Aug 31, 2023

And indeed it's already been reported in pytorch:

  1. Memory leak in Conv1d pytorch/pytorch#98688
  2. System memory leak when using different input size of torch.nn.Conv3d pytorch/pytorch#104701
  3. This is a concise reproduction of my code, when the input batch is not the same size as before the batch, it will lead to memory leaks, the actual scenario is ocr accepting images of different sizes as input pytorch/pytorch#101921

Partial (?) fix, 2 weeks ago as of this writing (so will presumably be in the pending 2.1.0): pytorch/pytorch#104369

Interim workaround: set TORCH_CUDNN_V8_API_DISABLED=1, since the memory leak is caused by the cuDNN v8 execution plans.
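
If it helps anyone applying the workaround: the variable has to be exported in the environment that launches the JVM (plain Java can't modify the process environment after startup), so a quick sanity check from inside the process could look like this (just a sketch):

// Checks that TORCH_CUDNN_V8_API_DISABLED actually made it into the process
// environment; it must be exported by the launch script / service unit itself.
if (!"1".equals(System.getenv("TORCH_CUDNN_V8_API_DISABLED"))) {
    System.err.println("TORCH_CUDNN_V8_API_DISABLED is not set to 1; "
            + "variable-sized inputs may keep leaking via cuDNN v8 execution plans.");
}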
