Problem of RAM memory exhaustion for datasets with unlimited axis? #1168

Open

jerabaul29 opened this issue May 31, 2022 · 9 comments

@jerabaul29

First, credit to @BrazhnikovDmitry for finding this; I am only writing up the issue, but he should get the credit for pointing it out :) .

It seems that reading from a dataset with an unlimited dimension can exhaust all available RAM and crash. For example, I have a file with an unlimited dimension of size:

time = UNLIMITED ; // (3235893 currently)

The data file is relatively big for use on my local machine (a laptop), but not huge: 1.6 GB in total. My machine has 16 GB of RAM, of which more than 8 GB are completely free.

When reading a small slice of a variable from the dataset (the first index is an "instrument ID", the second index is the unlimited time dimension):

[ins] In [1]: import netCDF4 as nc4

[ins] In [2]: file_path = "wave_data_ICEX2018.nc"

[ins] In [3]: with nc4.Dataset(file_path, "r", format="NETCDF4") as nc4_fh:
         ...:     time_gps = nc4_fh["timeIMU"][0, 0:1000]
         ...: 

all goes well.

But when trying to open the whole field:

[ins] In [5]: with nc4.Dataset(file_path, "r", format="NETCDF4") as nc4_fh:
         ...:     data_lat = nc4_fh["timeIMU"][0, :]
Killed

all the RAM gets exhausted (even though I had over 8 GB free when starting the command; RAM use seems to grow roughly linearly over a few seconds until it is gone), and the process gets killed automatically (which is actually great: my whole system freezes once all the RAM is used, so it is nice that the process gets killed and system responsiveness is restored :) ).
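
In case it helps with diagnosing, here is a minimal sketch (assuming the same file and variable names as above) that prints the dimension sizes and the HDF5 chunking of the variable in question:

import netCDF4 as nc4

with nc4.Dataset("wave_data_ICEX2018.nc", "r") as nc4_fh:
    var = nc4_fh["timeIMU"]
    print(var.dimensions, var.shape)                  # dimension names and current sizes
    for name in var.dimensions:
        dim = nc4_fh.dimensions[name]
        print(name, len(dim), "unlimited:", dim.isunlimited())
    print("chunking:", var.chunking())                # 'contiguous' or a list of chunk sizes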

The interesting thing is that when the exact same data is repackaged with a fixed dimension size, the whole field can be opened with the same [0, :] slice without any issue, using just a few hundred MB of RAM.
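
For illustration, a minimal sketch of such a repackaging (the output file name and the block size are hypothetical; it assumes the 2-D layout described above and copies the data in blocks to keep memory use bounded):

from netCDF4 import Dataset

src = Dataset("wave_data_ICEX2018.nc", "r")
dst = Dataset("wave_data_fixed.nc", "w", format="NETCDF4")   # hypothetical output file

# recreate every dimension with its current size, i.e. fixed instead of unlimited
for name, dim in src.dimensions.items():
    dst.createDimension(name, len(dim))

var = src["timeIMU"]
out = dst.createVariable("timeIMU", var.dtype, var.dimensions)

# copy in blocks along the (formerly unlimited) time axis
step = 100000
for start in range(0, var.shape[1], step):
    out[:, start:start + step] = var[:, start:start + step]

dst.close()
src.close()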

  • any idea why this happens?
  • is this really a bug? (I can understand that an unlimited dimension may be less efficient than a statically sized one, but not so inefficient that such a crash occurs)

version and system information

  • OS: Ubuntu 20.04, fully updated

  • ipython:

Python 3.8.10 (default, Mar 15 2022, 12:22:08) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

[ins] In [1]: import netCDF4

[ins] In [2]: netCDF4.__version__
Out[2]: '1.5.3'
@jswhit2
Contributor

jswhit2 commented May 31, 2022

Please post the data file somewhere and put the link in this issue.

@jswhit2
Contributor

jswhit2 commented Jun 3, 2022

@jerabaul29 we really can't make any progress on diagnosing the problem without having access to the data file. Is there any problem with providing access to the dataset?

@BrazhnikovDmitry

Hi @jswhit2! I initially encountered the problem with the excessive memory use. The example dataset can be found here: https://www.dropbox.com/s/zk6js1cmt6p2tj9/wave_data_bad.nc?dl=0
If necessary, I can provide the code used to generate the nc-file.

@jerabaul29
Author

Many thanks for uploading your example file @BrazhnikovDmitry :) . @jswhit2, sorry for the lack of response on my side; I was traveling and had some backlog. I can now confirm that I get the error on the exact file @BrazhnikovDmitry uploaded :) .

@jswhit2
Contributor

jswhit2 commented Jun 7, 2022

OK, I've got the file now, thanks. Just curious why you decided to make the 'time' unlimited dimension the rightmost dimension (last in the list of dimensions for that variable). Typically the unlimited dimension is defined as the leftmost (slowest varying) dimension. I bet that if you had done it that way accessing the data along the unlimited dimension would be much faster.
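
A minimal sketch of that layout (file name, dimension names and sizes are hypothetical; only the ordering matters):

from netCDF4 import Dataset

nc = Dataset("wave_data_reordered.nc", "w", format="NETCDF4")   # hypothetical file name
nc.createDimension("time", None)        # unlimited, defined as the leftmost (slowest varying) axis
nc.createDimension("instrument", 4)     # hypothetical fixed size
# 'time' comes first in the dimension tuple, so records grow along the leading axis
time_imu = nc.createVariable("timeIMU", "f8", ("time", "instrument"))
nc.close()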

@jswhit2
Contributor

jswhit2 commented Jun 7, 2022

On macOS with the latest GitHub master for both netcdf4-python and netcdf-c I don't see this problem. Here's my simple test script:

from netCDF4 import Dataset
import tracemalloc, time

def read_data():
    # read the full slice that triggered the problem
    nc = Dataset('wave_data_bad.nc')
    data = nc["timeIMU"][0, :]
    nc.close()

tracemalloc.start()
# time the read
t1 = time.perf_counter()
read_data()
t2 = time.perf_counter()
print('time = %s secs' % str(t2-t1))
# display the peak traced memory
print('peak memory = %s bytes' % tracemalloc.get_traced_memory()[1])
# stop tracing
tracemalloc.stop()

>> time = 110.724687782 secs
>> peak memory = 51784442 bytes    

I'm pretty sure nothing has changed in the Python module that would impact this, so perhaps it's something that could be remedied by updating the netCDF and HDF5 C libraries?
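
If it helps, here is a quick sketch to check which C library versions the installed module was built against, using the version attributes exposed by netCDF4-python:

import netCDF4

print(netCDF4.__version__)              # netcdf4-python version
print(netCDF4.__netcdf4libversion__)    # netcdf-c version the module is linked against
print(netCDF4.__hdf5libversion__)       # HDF5 version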

@jerabaul29
Author

OK, interesting. I saw it on Ubuntu 20.04, fully up to date, as previously mentioned. Just curious, @BrazhnikovDmitry, which OS and version are you using? :)

@BrazhnikovDmitry

BrazhnikovDmitry commented Jun 7, 2022

It was my thought as well. I have not had time to update to the latest netcdf library and check. The file was created with 4.7.4. According to Unidata/netcdf-c#1913 they fixed some memory leaks in 4.8.0.

@agkphysics

I've also encountered this issue with unlimited dimensions, but I solved it similarly to #859, by increasing the chunk size of the unlimited dimension.
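
A minimal sketch of that workaround when creating the file (file name, dimension names, sizes and the chunk sizes are hypothetical):

from netCDF4 import Dataset

nc = Dataset("wave_data_chunked.nc", "w", format="NETCDF4")   # hypothetical file name
nc.createDimension("instrument", 4)     # hypothetical fixed size
nc.createDimension("time", None)        # unlimited
# explicit chunksizes: the default chunk size along an unlimited dimension can be very
# small, which makes reading the whole record axis slow and memory-hungry
time_imu = nc.createVariable("timeIMU", "f8", ("instrument", "time"),
                             chunksizes=(1, 100000))
nc.close()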
