Option for closing files with scipy backend #524

Closed
wants to merge 2 commits

Conversation

rabernat
Contributor

This is the same as #468, which was accidentally closed. I have just copied and pasted my comment below.

This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit on open file descriptors. I tried to find a solution in which the underlying netCDF file objects are kept closed by default and only reopened "when needed".

I ended up subclassing scipy.io.netcdf_file and overriding its variables attribute with a property that first checks whether the file is open or closed and reopens it if needed. That was the easy part. The hard part was figuring out when to close the files. The problem is that several different parts of the code (e.g. each individual variable and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when during initialization the variables were actually being read, and added calls to close() in various places. It is relatively easy to close the files at the end of initialization, but it was much harder to make sure that the whole set of files is never open at the same time. I also had to disable mmap when this option is active.
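As a rough sketch of what that approach looks like (my own illustration, not the PR's actual code; the class name, the release() helper, and the assumption that the file was opened from a filename string with mmap disabled are all mine):

```python
from scipy.io import netcdf_file


class ReopeningNetCDFFile(netcdf_file):
    """Sketch only: a scipy netcdf_file whose handle can be released and is
    transparently reopened the next time ``.variables`` is accessed."""

    def __init__(self, filename, mode='r', **kwargs):
        kwargs['mmap'] = False  # mmap requires the handle to stay open
        super(ReopeningNetCDFFile, self).__init__(filename, mode, **kwargs)

    def release(self):
        # Close only the OS-level handle; keep the parsed metadata in memory.
        if not self.fp.closed:
            self.fp.close()

    @property
    def variables(self):
        # Reopen the handle on demand before handing out the variables dict.
        # netcdf_file overrides __setattr__ to write straight into __dict__,
        # so read and write __dict__ directly rather than using a setter.
        if self.fp.closed:
            self.__dict__['fp'] = open(self.filename, 'rb')
        return self.__dict__['variables']
```

The datastore would then call release() once initialization is done, and every later access to .variables pays for an extra open() call, which is consistent with the slowdown described below.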

This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks.

This option is accessed via the close_files keyword, which I added to the api.
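The count_open_files() helper used in the timing snippets below is not shown in the PR; here is a minimal stand-in (my assumption of what it does, using psutil):

```python
import psutil


def count_open_files():
    # Report how many file handles the current process holds, so the
    # effect of close_files on the open-file count is visible.
    n = len(psutil.Process().open_files())
    print('%d open files' % n)
```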

Timing for loading and doing a calculation with close_files=True:

```
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s
Wall time: 27.7 s
2 open files
0.0055650632367
CPU times: user 649 ms, sys: 974 ms, total: 1.62 s
Wall time: 633 ms
2 open files
```

Timing for loading and doing a calculation with close_files=False (the default, which should preserve the old behavior):

```
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms
Wall time: 291 ms
22 open files
0.0055650632367
CPU times: user 174 ms, sys: 141 ms, total: 315 ms
Wall time: 56 ms
22 open files
```

This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...

@jhamman
Member

jhamman commented Oct 13, 2015

@shoyer and @rabernat - what do we want to do with this?

@rabernat
Contributor Author

I considered this more of a proof of concept than a serious PR. It was incredibly slow, but it circumvented the ulimit restriction. Probably not the best solution.

@pwolfram
Contributor

pwolfram commented Apr 4, 2016

@rabernat, thanks for sharing. I think this will come in handy for #798.

@rabernat
Contributor Author

rabernat commented Apr 4, 2016

Really? I figured this was a doomed concept.

@pwolfram it looks like you are doing the hard work that I did not have the time or expertise for. This will pay off big time once the distributed backend is working.

@pwolfram
Contributor

pwolfram commented Apr 4, 2016

@rabernat, feel free to check out my comment at the end of #798. Please let me know if you see any problems with the outline. It is quite possible I am missing something, and I'd rather know sooner rather than later. The basic idea is that we can use an LRU-cached dict to limit the number of open files, which should also limit the number of opens/closes required. If no one can poke a conceptual hole in the outline, I'll implement it.
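To make the outline concrete, here is a bare-bones sketch of that LRU idea (my own illustration with hypothetical names, not whatever #798 eventually became):

```python
from collections import OrderedDict


class FileCache(object):
    """Keep at most ``maxsize`` files open, closing the least recently
    used one when the limit is reached."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._cache = OrderedDict()  # filename -> open file-like object

    def acquire(self, filename, opener):
        # ``opener`` is any zero-argument callable that (re)opens the file,
        # e.g. lambda: scipy.io.netcdf_file(filename, mmap=False).
        if filename in self._cache:
            # Mark as most recently used.
            self._cache.move_to_end(filename)
            return self._cache[filename]
        if len(self._cache) >= self.maxsize:
            # Evict and close the least recently used file.
            _, oldest = self._cache.popitem(last=False)
            oldest.close()
        f = opener()
        self._cache[filename] = f
        return f
```

Every file access would go through acquire(), so at most maxsize handles are open at any time, at the cost of reopening files that get evicted.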

@shoyer
Member

shoyer commented Jun 24, 2016

Something like this (the LRU cache) is still worth exploring, but this isn't quite the right solution, so I'm closing this PR just to clean up the list of outstanding PRs :).

shoyer closed this Jun 24, 2016