
Fix: Resolves memory leak caused by using CRAFT detector with detect() or readtext(). #1278

Open
wants to merge 3 commits into master

Conversation

daniellovera

This fix enables garbage collection to work properly when

def test_net(canvas_size, mag_ratio, net, image, text_threshold, link_threshold, low_text, poly, device, estimate_num_chars=False):

returns, by deleting the objects we moved to the GPU once the forward-pass results have been copied back to the CPU.

See https://pytorch.org/blog/understanding-gpu-memory-2/#why-doesnt-automatic-garbage-collection-work for more detail.

Running torch.cuda.empty_cache() in test_net() before returning allows nvidia-smi to report GPU memory usage accurately.
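For clarity, here's a simplified sketch of the pattern (not the exact diff; run_detector() is a hypothetical stand-in for the relevant part of test_net()):

```python
import torch

def run_detector(net, image_tensor, device):
    # Sketch of the pattern in test_net(): push to the GPU, run the forward
    # pass, copy results to the CPU, then drop the GPU references.
    x = image_tensor.unsqueeze(0).to(device)   # input batch on the GPU

    with torch.no_grad():
        y, feature = net(x)                    # CRAFT returns (y, feature)

    # Copy what we need back to the CPU *before* dropping the GPU references.
    score_text = y[0, :, :, 0].cpu().numpy()
    score_link = y[0, :, :, 1].cpu().numpy()

    # Drop the Python references to the GPU tensors so the caching allocator
    # can actually free those blocks...
    del x, y, feature

    # ...and return the cached blocks to the driver so nvidia-smi is accurate.
    torch.cuda.empty_cache()

    return score_text, score_link
```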

Interestingly, nvidia-smi showed GPU memory usage per process at 204 MiB after reader initialization. It would increase to 234 MiB or 288 MiB after running easyocr.reader.detect(), but then not grow beyond that point, and in some cases it dropped back down to 234 MiB. I think this has something to do with how PyTorch's caching allocator reuses memory it has already reserved.

One note: I tested this on a single-GPU machine where I changed

net = torch.nn.DataParallel(net).to(device)
to net = net.to(device), removing DataParallel. There's no reason this shouldn't work on multi-GPU machines, but note that it wasn't tested on one.

I also only tested this on the CRAFT detector, not DBNet.

Relevant package versions:
easyocr 1.7.1
torch 2.2.1+cu121
torchvision 0.17.1+cu121

Hope this helps!

@daniellovera
Author

I should clarify: this resolves the GPU VRAM memory leak. It does not resolve any CPU RAM memory leaks.

@daniellovera
Author

Corrected to only call empty_cache() if the device in use is cuda.
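In other words, the call is now guarded by something along these lines (a sketch; the exact device check in the diff may differ):

```python
# Only touch the CUDA allocator when we are actually running on a CUDA device.
if str(device).startswith("cuda"):
    torch.cuda.empty_cache()
```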

@jonashaag

The del stuff can't possibly work. It just removes the Python variable from the scope (the function) but doesn't actually remove anything from the GPU/CPU.

@daniellovera
Author

The del stuff can't possibly work. It just removes the Python variable from the scope (the function) but doesn't actually remove anything from the GPU/CPU.

@jonashaag did you attempt to replicate my results? It'll take you less than 15 minutes to give it a whirl and see whether it works or not.

Because it did work for me, and the pytorch.org blog post I linked explains exactly why it works. I'll quote it here:

Why doesn’t automatic garbage collection work?
The automatic garbage collection works well when there is a lot of extra memory as is common on CPUs because it amortizes the expensive garbage collection by using Generational Garbage Collection. But to amortize the collection work, it defers some memory cleanup making the maximum memory usage higher, which is less suited to memory constrained environments. The Python runtime also has no insights into CUDA memory usage, so it cannot be triggered on high memory pressure either. It’s even more challenging as GPU training is almost always memory constrained because we will often raise the batch size to use any additional free memory.

The CPython’s garbage collection frees unreachable objects held in reference cycles via the mark-and-sweep. The garbage collection is automatically run when the number of objects exceeds certain thresholds. There are 3 generations of thresholds to help amortize the expensive costs of running garbage collection on every object. The later generations are less frequently run. This would explain why automatic collections will only clear several tensors on each peak, however there are still tensors that leak resulting in the CUDA OOM. Those tensors were held by reference cycles in later generations.

I'm not going to claim that I think it SHOULD work this way. But this isn't the first time weird garbage-collection and scoping behavior across CPU and GPU has caused problems.

Again, try it and let us all know whether it actually works for you.
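If you want to see the cycle/GC behaviour the blog post describes without touching EasyOCR, here's a tiny standalone toy example (my own repro, any CUDA machine should do):

```python
import gc
import torch

gc.disable()                             # make the demo deterministic: no automatic collections

def leak_one():
    t = torch.randn(1024, 1024, device="cuda")   # ~4 MiB on the GPU
    holder = []
    holder.append((t, holder))           # reference cycle: holder -> tuple -> holder
    # no explicit del: when the function returns, the cycle keeps `t` alive
    # until a cyclic garbage-collection pass runs

for _ in range(10):
    leak_one()

print(torch.cuda.memory_allocated())     # ~40 MiB still allocated: tensors held by cycles

gc.collect()                             # break the cycles; refcounting then frees the tensors
torch.cuda.empty_cache()                 # return the now-unused cached blocks to the driver
print(torch.cuda.memory_allocated())     # back down to ~0
```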

@jonashaag

Sorry, maybe I misunderstood the reason why del is used here. Is it so that the call to empty_cache() can remove the tensors x, y, feature from GPU memory? That might work unless there are other references to the tensors that those variables reference.
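Something along these lines is what I had in mind (a hypothetical standalone example, not EasyOCR code):

```python
import torch

t = torch.randn(1024, 1024, device="cuda")   # ~4 MiB on the GPU
alias = t                                    # a second reference to the same tensor

del t
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())         # still ~4 MiB: `alias` keeps the tensor alive

del alias
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())         # now 0: nothing references the tensor anymore
```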

@daniellovera
Author

Sorry, maybe I misunderstood the reason why del is used here. Is it so that the call to empty_cache() can remove the tensors x, y, feature from GPU memory? That might work unless there are other references to the tensors that those variables reference.

I don't think I understand it well enough to explain it better. I also call torch.cuda.empty_cache() and torch.cuda.reset_peak_memory_stats() after the function returns. It's possible the empty_cache() call inside the function isn't actually doing anything, since the GC doesn't run until the function goes out of scope. I probably should have double-checked that, but I was less concerned with nvidia-smi being accurate than with not getting CUDA OOM errors.
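For reference, my calling pattern looks roughly like this (my own wrapper code around EasyOCR, not part of this PR; image_paths is just my list of files):

```python
import easyocr
import torch

reader = easyocr.Reader(["en"], gpu=True)

for path in image_paths:                    # image_paths: my own list of image files
    horizontal_list, free_list = reader.detect(path)

    # Housekeeping between calls: release cached blocks and reset the peak
    # counters so each image's peak memory is easy to compare.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```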

I'm far from an expert, but I do know that these changes stopped the memory leaks I was having, and I haven't had a CUDA OOM error since.

My best suggestion: since action produces information, give it a whirl and let us know whether it works. If it doesn't work for you, it's valuable for me to know how your machine differs from mine, so I can make further changes to avoid these errors if I scale up or swap machines.

@daniellovera
Author

@jonashaag Hey, I'd love to know whether del worked, if you tried it.

@jonashaag

Sorry, I've switched to another engine (macOS Live Text) because it's better and much faster.

I feel a bit bad for having left such a smart-ass comment initially without contributing anything of substance here :-/

@daniellovera
Author

It's all good. Are you using Live Text natively on the devices, or can it be hosted in a way that lets it replace EasyOCR for serving a website that isn't running on an Apple device?

@jonashaag

Yes, we run a Mac mini in production (via Scaleway).

If you are interested, I can share some code.
