
Production issues with Pytorch 2.0.1-1.5.9 #1409

Closed
jxtps opened this issue Aug 31, 2023 · 9 comments


jxtps commented Aug 31, 2023

I've been working on upgrading from "pytorch-platform" % "1.10.2-1.5.7" to "pytorch-platform" % "2.0.1-1.5.9" and ran into some snags. Everything worked great when just testing one or two images, but:

  1. After <10 minutes of production load with 2.0.1, the process would abruptly exit. Unfortunately I don't have core dumps, and there were no fatal error logs generated. 1.10.2 has been running reasonably smoothly (occasional crashes) over the last year.
  2. I then created a stress test where 3 threads continuously make calls to pytorch (no production load, fully artificial). On 2.0.1 the instance started swapping to disk & ground to a barely responsive crawl after just a few minutes. On 1.10.2 it ran smoothly for 3+ hours (until I stopped the test manually).

My calling code and overall environment are the same for both versions:

  • Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1040-aws x86_64)
  • OpenJDK 64-Bit Server VM (build 17.0.8+7-Ubuntu-120.04.2, mixed mode, sharing)
  • AWS g5.2xlarge instance
  • I'm doing image processing, so it's megapixel image in, megapixel image out type pytorch payloads.

I'm using libtorch, so https://download.pytorch.org/libtorch/cu117/libtorch-shared-with-deps-2.0.1%2Bcu117.zip for 2.0.1 and https://download.pytorch.org/libtorch/cu113/libtorch-shared-with-deps-1.10.2%2Bcu113.zip for 1.10.2, with the contents of libtorch/lib unzipped directly into /usr/lib/jni, LD_LIBRARY_PATH=/usr/lib/jni:... set, and System.setProperty("org.bytedeco.javacpp.pathsFirst", "true") called.

I'm happy to produce a minimal test case if that would help, but I'm not sure whether the changes in #1360 make this issue obsolete.

HGuillemet (Collaborator) commented:

Could you test with the latest snapshot? And if the behavior is the same, share the code of your stress test?


jxtps commented Aug 31, 2023

It seems like there's an off-java-heap memory leak.

When I run the stress test with 2.0.1 locally (Windows 10, Java 8) and monitor it with JVisualVM, the Java heap usage looks great, but in the Task Manager the process memory just keeps climbing, and I eventually get an OutOfMemoryError: "Physical memory usage is too high: physicalBytes (45055M) > maxPhysicalBytes (29128M)".

The physicalBytes jumped from ~28 gigs to ~45 gigs suddenly at the end, unclear why.

When I run the stress test with 1.10.2 locally, the Task Manager memory doesn't go beyond 15 gigs.

All my ephemeral torch JavaCPP objects (= org.bytedeco.pytorch.Tensors in & out of torch) are created inside

try (PointerScope scope = new PointerScope(null)) {
    ...
    IValue ivalue = module.forward(inputs);
    ...
}

which I thought guaranteed collection at the end of the block. Did something change with the PointerScope from 1.5.7 to 1.5.9? Is e.g. the return value from module.forward() no longer tracked by PointerScope?
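
For reference, this is the behavior I'm relying on. A minimal standalone sketch with plain JavaCPP pointers (not my production code), where I'd expect the native allocation to be released as soon as the scope closes:

import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.PointerScope;

public class ScopeCheck {
    public static void main(String[] args) {
        long before = Pointer.totalBytes();
        try (PointerScope scope = new PointerScope()) {
            // Any Pointer created here is registered with the innermost open scope.
            FloatPointer scratch = new FloatPointer(1_000_000);
        }
        // The scope deallocates everything it tracked when it closes, so
        // totalBytes() should be back to roughly its starting value here.
        System.out.println("before=" + before + ", after=" + Pointer.totalBytes());
    }
}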


jxtps commented Aug 31, 2023

> Could you test with the latest snapshot? And if the behavior is the same, share the code of your stress test?

Ran locally with "pytorch-platform" % "2.0.1-1.5.10-SNAPSHOT" and the process memory still grows unbounded.

The stress test is somewhat project-specific: I'm currently testing against three different DL network chains, where each chain can do several operations (scale the image, process it, take the result and do something with it, then send it to another network...). Conceptually it's:

  1. Create three threads
  2. Each thread does an infinite loop of: create random input & call one of three network chains

Then just let those threads run; a rough skeleton is sketched below.
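
Roughly, as a sketch (processWithChain is a hypothetical stand-in for the real, project-specific chains):

import java.util.Random;

public class StressTest {

    // Hypothetical stand-in for one of the three project-specific network chains:
    // build tensors, transfer to the GPU, call module.forward(...), fetch results.
    static void processWithChain(int chainIndex, float[] input) {
        // ...
    }

    public static void main(String[] args) {
        for (int t = 0; t < 3; t++) {
            final long seed = t;
            new Thread(() -> {
                Random rnd = new Random(seed);
                for (;;) { // each thread loops forever
                    float[] input = new float[4 * 512 * 512]; // random megapixel-ish payload
                    for (int i = 0; i < input.length; i++) input[i] = rnd.nextFloat();
                    processWithChain(rnd.nextInt(3), input); // pick one of the three chains
                }
            }).start();
        }
    }
}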


jxtps commented Aug 31, 2023

I ran just a single input with System.setProperty("org.bytedeco.javacpp.logger.debug", "true"); and 1.10.2 allocates & releases one extra org.bytedeco.pytorch.TypeMeta compared to the snapshot, but otherwise their allocations & releases appear the same.

???


jxtps commented Aug 31, 2023

Is PointerScope thread-safe in 1.5.7 but not in 1.5.9 / 1.5.10-SNAPSHOT?

My calling code is actually more like this:

try (PointerScope scope = new PointerScope(null)) {
    ... // Some JavaCPP allocations happen here
    device.forwardSemaphore.acquireUninterruptibly();
    try {
        IValue ivalue = module.forward(inputs);
        ...
    } finally {
        device.forwardSemaphore.release();
    }
}

I transfer the inputs to the GPU before acquiring the semaphore on the (somewhat tested) assumption that it can be done in parallel with whatever computations are happening on the GPU.

Then I acquire the semaphore just before calling module.forward() and fetching the results (the computation is async and module.forward() returns quickly, but fetching the results back to the host effectively waits for both the computation and the transfer).

This means that all three threads will likely have some active JavaCPP allocations while two of them wait for the semaphore to be released by the third.
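
Put together, the pattern is roughly this (a sketch; device here is my own wrapper holding the semaphore, not org.bytedeco.pytorch.Device, and module, shape and the *Optional arguments to torch.empty are created elsewhere):

try (PointerScope scope = new PointerScope(null)) {
    // Build the input directly on the GPU before taking the semaphore, so the
    // host-to-device work can overlap with whatever another thread is computing.
    Tensor input = torch.empty(shape, dtypeNetworkOpt, layoutOpt, deviceOpt, boolOpt, mfContiguous);
    IValueVector inputs = new IValueVector();
    inputs.push_back(new IValue(input));

    device.forwardSemaphore.acquireUninterruptibly();
    try {
        // forward() returns quickly because the CUDA work is asynchronous; copying
        // the result back to the host is what actually waits for the computation.
        IValue result = module.forward(inputs);
        // ... read the output tensor and copy it back to the host ...
    } finally {
        device.forwardSemaphore.release();
    }
}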


jxtps commented Aug 31, 2023

That's not it. I tried a single-threaded stress test and process memory still grows unbounded with "pytorch-platform" % "2.0.1-1.5.10-SNAPSHOT".

HGuillemet (Collaborator) commented:

> Did something change with the PointerScope from 1.5.7 to 1.5.9? Is e.g. the return value from module.forward() no longer tracked by PointerScope?

Nothing that I'm aware of.

Is your stress test structured this way:

for (;;) { // Big loop
  try (PointerScope scope = new PointerScope(null)) {
    ....
  }
}

with nothing inside the loop but outside the pointer scope?

Could you trace, at each iteration, Pointer.totalCount(), Pointer.totalBytes() and Pointer.physicalBytes()?
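
Something like this at the end of each iteration, for example (just a sketch of what I mean):

// Inside the big loop, after the try-with-resources has closed the PointerScope:
System.out.printf("%,d pointers, %,d tracked bytes, %,d physical bytes%n",
        Pointer.totalCount(), Pointer.totalBytes(), Pointer.physicalBytes());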

Is your code using callbacks (Java code called from libtorch)?


jxtps commented Aug 31, 2023

Ok, I've narrowed it down, and it appears to be a pytorch (not JavaCPP) memory leak when using variable-sized inputs. Steps to reproduce:

First, run this Python code to export the network (or just download it here: test_2_0_1_cu117.zip):

import torch

class TestModel(torch.nn.Sequential):
    def __init__(self, in_chans=4, ch64: int = 64):
        super().__init__(
            torch.nn.Conv2d(in_chans, ch64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(ch64, ch64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(ch64, in_chans, 3, padding=1),
        )

if __name__ == '__main__':
    torch.jit.script(TestModel()).save(f'test_{torch.__version__.replace("+", "_").replace(".", "_")}.pt')

Then create a Java project and run:

package misc;

import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.PointerScope;
import org.bytedeco.pytorch.*;
import org.bytedeco.pytorch.global.torch;

import java.util.Random;

public class TestPytorch {

    public static void main(String[] args) {
        System.setProperty("org.bytedeco.javacpp.pathsFirst", "true");
        // System.setProperty("org.bytedeco.javacpp.logger.debug", "true");

        String fn = "store/models_torch/test_2_0_1_cu117.pt";
        JitModule module = torch.load(fn);

        Random rnd = new Random(42);
        long[] shape = {1, 4, 512, 512};

        Device device = new Device(torch.DeviceType.CUDA, (byte) 0);
        ScalarTypeOptional dtypeNetworkOpt = new ScalarTypeOptional(torch.ScalarType.Float);
        LayoutOptional layoutOpt = new LayoutOptional();
        DeviceOptional deviceOpt = new DeviceOptional(device);
        BoolOptional boolOpt = new BoolOptional();
        MemoryFormatOptional mfContiguous = new MemoryFormatOptional(torch.MemoryFormat.Contiguous);
        module.eval();
        module.to(device);

        for (; ; ) {
            try (PointerScope scope = new PointerScope(null);
                 NoGradGuard no_grad = new NoGradGuard()) {
                IValueVector inputs = new IValueVector();
                shape[2] = rnd.nextInt(512 - 128) + 128; // <-- Comment out to avoid the memory leak.
                shape[3] = rnd.nextInt(512 - 128) + 128; // <-- Comment out to avoid the memory leak.
                Tensor tensor = torch.empty(shape, dtypeNetworkOpt, layoutOpt, deviceOpt, boolOpt, mfContiguous);
                inputs.push_back(new IValue(tensor));
                module.forward(inputs);
            }
            System.out.printf("%,4d pointers, %,4d total bytes, %,16d physical bytes, %,16d java bytes%n",
                    Pointer.totalCount(), Pointer.totalBytes(), Pointer.physicalBytes(), java.lang.Runtime.getRuntime().totalMemory());
        }
    }
}

The physical bytes grow unbounded, but the Java heap bytes stay flat.

If you comment out the shape[2] and shape[3] reassignments, both stay basically flat; there's some wobble in the physical bytes, but nothing like when the shape is randomized.


jxtps commented Aug 31, 2023

And indeed it's already been reported in pytorch:

  1. Memory leak in Conv1d pytorch/pytorch#98688
  2. System memory leak when using different input size of torch.nn.Conv3d pytorch/pytorch#104701
  3. This is a concise reproduction of my code, when the input batch is not the same size as before the batch, it will lead to memory leaks, the actual scenario is ocr accepting images of different sizes as input pytorch/pytorch#101921

Partial (?) fix, 2 weeks ago as of this writing (so will presumably be in the pending 2.1.0): pytorch/pytorch#104369

Interim workaround: set TORCH_CUDNN_V8_API_DISABLED=1, since the memory leak is caused by the cuDNN v8 execution plans.
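
If it helps anyone applying the workaround: the variable has to be exported in the environment that launches the JVM (plain Java can't modify the process environment after startup), so a quick sanity check from inside the process could look like this (just a sketch):

// Checks that TORCH_CUDNN_V8_API_DISABLED actually made it into the process
// environment; it must be exported by the launch script / service unit itself.
if (!"1".equals(System.getenv("TORCH_CUDNN_V8_API_DISABLED"))) {
    System.err.println("TORCH_CUDNN_V8_API_DISABLED is not set to 1; "
            + "variable-sized inputs may keep leaking via cuDNN v8 execution plans.");
}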
