Production issues with Pytorch 2.0.1-1.5.9 #1409
Could you test with the latest snapshot? And if the behavior is the same, share the code of your stress test?
It seems like there's an off-java-heap memory leak. When I run the stress test with 2.0.1, the physicalBytes jumped from ~28 gigs to ~45 gigs suddenly at the end, unclear why; when I run the stress test with 1.10.2 there is no comparable growth.

All my ephemeral allocations happen inside

try (PointerScope scope = new PointerScope(null)) {
    ...
    IValue ivalue = module.forward(inputs);
    ...
}

which I thought guaranteed collection at the end of the block - did something change with the PointerScope semantics?
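For reference, a minimal self-contained sketch of the behavior being relied on here (plain JavaCPP, no libtorch; the class name and the 64 MB buffer are purely illustrative): every Pointer created inside the innermost PointerScope is registered with it and deallocated when the scope closes, which Pointer.totalBytes() should reflect.

import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.PointerScope;

public class ScopeCheck {
    public static void main(String[] args) {
        System.out.printf("before scope: %,d tracked bytes%n", Pointer.totalBytes());
        try (PointerScope scope = new PointerScope()) {
            // Registered with the innermost scope at construction time.
            BytePointer scratch = new BytePointer(64L * 1024 * 1024); // 64 MB native allocation
            System.out.printf("inside scope: %,d tracked bytes%n", Pointer.totalBytes());
        } // close() deallocates every Pointer the scope collected
        System.out.printf("after scope:  %,d tracked bytes%n", Pointer.totalBytes());
    }
}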
Ran locally with the latest snapshot. The stress test is somewhat project specific - I'm right now testing against three different DL network chains, where each chain can do several operations (scaling image, process, take the result & do something with it, then send it to another network...). Conceptually it's one thread per chain, each looping its sequence of operations - roughly the skeleton sketched below. Then just let those threads run.
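A rough, hypothetical skeleton of that setup (the chain count, runChainOnce(), and the per-chain work are placeholders, not the actual project code):

import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.PointerScope;

public class StressTest {
    static final int CHAIN_COUNT = 3; // one thread per DL network chain

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < CHAIN_COUNT; i++) {
            final int chain = i;
            Thread worker = new Thread(() -> {
                for (;;) {
                    try (PointerScope scope = new PointerScope()) {
                        runChainOnce(chain); // scale image -> forward -> post-process -> hand off
                    }
                }
            }, "chain-" + chain);
            worker.setDaemon(true);
            worker.start();
        }
        for (;;) { // periodically report what the native side is holding
            Thread.sleep(10_000);
            System.out.printf("%,d pointers, %,d tracked bytes, %,d physical bytes%n",
                    Pointer.totalCount(), Pointer.totalBytes(), Pointer.physicalBytes());
        }
    }

    static void runChainOnce(int chain) {
        // placeholder: build inputs, call module.forward(inputs), consume the result
    }
}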
I ran just a single input with ???
My calling code is actually more like this:

try (PointerScope scope = new PointerScope(null)) {
    ... // Some JavaCPP allocations happen here
    device.forwardSemaphore.acquireUninterruptibly();
    try {
        IValue ivalue = module.forward(inputs);
        ...
    } finally {
        device.forwardSemaphore.release();
    }
}

I transfer the inputs to the GPU before acquiring the semaphore, on the (somewhat tested) assumption that it can be done in parallel with whatever computations are happening on the GPU. Then I acquire the semaphore just before calling forward(). This means that all three threads will likely have some active JavaCPP allocations while two of them wait for the semaphore to be released by the third.
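A hypothetical, self-contained version of that pattern (the class and method names, the Semaphore parameter standing in for device.forwardSemaphore, and the use of torch.empty as in the repro further down are assumptions, not the actual project code):

import org.bytedeco.javacpp.PointerScope;
import org.bytedeco.pytorch.*;
import org.bytedeco.pytorch.global.torch;

import java.util.concurrent.Semaphore;

public class SerializedForward {
    static void runOnce(JitModule module, Semaphore forwardSemaphore, long[] shape,
                        ScalarTypeOptional dtype, LayoutOptional layout, DeviceOptional deviceOpt,
                        BoolOptional pinMemory, MemoryFormatOptional contiguous) {
        try (PointerScope scope = new PointerScope()) {
            // Materialize the input on the GPU before taking the semaphore, so the
            // host->device work can overlap with another thread's running forward().
            Tensor input = torch.empty(shape, dtype, layout, deviceOpt, pinMemory, contiguous);
            IValueVector inputs = new IValueVector();
            inputs.push_back(new IValue(input));

            forwardSemaphore.acquireUninterruptibly(); // only one forward() per device at a time
            try {
                IValue output = module.forward(inputs);
                // ... consume/copy the result while still inside the scope ...
            } finally {
                forwardSemaphore.release();
            }
        } // scope closes: every Pointer registered to it is deallocated
    }
}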
That's not it - tried a single-threaded stress test and process memory still grows unbounded with 2.0.1.
Nothing that I'm aware of. Is your stress test structured this way:

for (;;) { // Big loop
    try (PointerScope scope = new PointerScope(null)) {
        ....
    }
}

with nothing outside the pointer scope and inside the loop? Could you trace Pointer.totalBytes() and Pointer.physicalBytes() at each iteration? Is your code using callbacks (Java code called from libtorch)?
Ok, I've narrowed it down, and it appears to be a pytorch (not JavaCPP) memory leak when using variable size input. Steps to reproduce:

First run this python code to export the network (or just download it here: test_2_0_1_cu117.zip):

import torch

class TestModel(torch.nn.Sequential):
    def __init__(self, in_chans=4, ch64: int = 64):
        super().__init__(
            torch.nn.Conv2d(in_chans, ch64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(ch64, ch64, 3, padding=1),
            torch.nn.ReLU(inplace=True),
            torch.nn.Conv2d(ch64, in_chans, 3, padding=1),
        )

if __name__ == '__main__':
    torch.jit.script(TestModel()).save(f'test_{torch.__version__.replace("+", "_").replace(".", "_")}.pt')

Then create a java project and run:

package misc;
import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.PointerScope;
import org.bytedeco.pytorch.*;
import org.bytedeco.pytorch.global.torch;

import java.util.Random;

public class TestPytorch {
    public static void main(String[] args) {
        System.setProperty("org.bytedeco.javacpp.pathsFirst", "true");
        // System.setProperty("org.bytedeco.javacpp.logger.debug", "true");
        String fn = "store/models_torch/test_2_0_1_cu117.pt";
        JitModule module = torch.load(fn);
        Random rnd = new Random(42);
        long[] shape = {1, 4, 512, 512};
        Device device = new Device(torch.DeviceType.CUDA, (byte) 0);
        ScalarTypeOptional dtypeNetworkOpt = new ScalarTypeOptional(torch.ScalarType.Float);
        LayoutOptional layoutOpt = new LayoutOptional();
        DeviceOptional deviceOpt = new DeviceOptional(device);
        BoolOptional boolOpt = new BoolOptional();
        MemoryFormatOptional mfContiguous = new MemoryFormatOptional(torch.MemoryFormat.Contiguous);
        module.eval();
        module.to(device);
        for (; ; ) {
            try (PointerScope scope = new PointerScope(null);
                 NoGradGuard no_grad = new NoGradGuard()) {
                IValueVector inputs = new IValueVector();
                shape[2] = rnd.nextInt(512 - 128) + 128; // <-- Comment out to avoid the memory leak.
                shape[3] = rnd.nextInt(512 - 128) + 128; // <-- Comment out to avoid the memory leak.
                Tensor tensor = torch.empty(shape, dtypeNetworkOpt, layoutOpt, deviceOpt, boolOpt, mfContiguous);
                inputs.push_back(new IValue(tensor));
                module.forward(inputs);
            }
            System.out.printf("%,4d pointers, %,4d total bytes, %,16d physical bytes, %,16d java bytes%n",
                    Pointer.totalCount(), Pointer.totalBytes(), Pointer.physicalBytes(), java.lang.Runtime.getRuntime().totalMemory());
        }
    }
}

The physical bytes grow unbounded, but the java bytes stay flat. If you comment out the shape[2] & shape[3] reassignments, then they both stay basically flat - there's some wobble in the physical bytes, but nothing like when the shape is randomized.
And indeed it's already been reported in pytorch:
Partial (?) fix, merged 2 weeks ago as of this writing (so will presumably be in the pending 2.1.0): pytorch/pytorch#104369. Interim workaround: set
I've been working on upgrading from "pytorch-platform" % "1.10.2-1.5.7" to "pytorch-platform" % "2.0.1-1.5.9" and ran into some snags. Everything worked great when just testing one or two images, but:

- With 2.0.1, the process would abruptly exit. Unfortunately I don't have core dumps, and there were no fatal error logs generated. 1.10.2 has been running reasonably smoothly (occasional crashes) over the last year.
- With 2.0.1 the instance started swapping to disk & ground to a barely responsive crawl after just a few minutes. On 1.10.2 it ran smoothly for 3+ hours (until I stopped the test manually).

My calling code and overall environment are the same for both versions: I'm using libtorch, so https://download.pytorch.org/libtorch/cu117/libtorch-shared-with-deps-2.0.1%2Bcu117.zip for 2.0.1 and https://download.pytorch.org/libtorch/cu113/libtorch-shared-with-deps-1.10.2%2Bcu113.zip for 1.10.2, with the libtorch/lib contents unzipped directly to /usr/lib/jni, plus LD_LIBRARY_PATH=/usr/lib/jni:... and System.setProperty("org.bytedeco.javacpp.pathsFirst", "true");
I'm happy to produce a minimal testcase if that would help, but not sure if the changes in #1360 make this issue obsolete?
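For completeness, a minimal sketch of that wiring (the class name is illustrative; Loader.load is the standard JavaCPP entry point): the pathsFirst property has to be set before the first JavaCPP-backed class is touched, otherwise the libraries bundled in the jars are extracted and loaded instead of the system libtorch found via LD_LIBRARY_PATH.

import org.bytedeco.javacpp.Loader;
import org.bytedeco.pytorch.global.torch;

public class Boot {
    public static void main(String[] args) {
        // Prefer libraries on the system library path (/usr/lib/jni via LD_LIBRARY_PATH)
        // over the ones packaged inside the javacpp-presets jars.
        System.setProperty("org.bytedeco.javacpp.pathsFirst", "true");
        // First use of org.bytedeco.pytorch triggers the native load; print what was picked up.
        System.out.println("Loaded: " + Loader.load(torch.class));
    }
}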