Option for closing files with scipy backend #524

Closed
wants to merge 2 commits

Conversation

rabernat
Contributor

This is the same as #468, which was accidentally closed. I have just copied and pasted my comment below.

This addresses issue #463, in which open_mfdataset failed when trying to open a list of files longer than my system's ulimit on open file descriptors. I tried to find a solution in which the underlying netCDF file objects are kept closed by default and only reopened "when needed".

I ended up subclassing scipy.io.netcdf_file and overriding its variables attribute with a property that first checks whether the file is open or closed and reopens it if needed. That was the easy part. The hard part was figuring out when to close the files. The problem is that several different parts of the code (e.g. each individual variable and also the datastore object itself) keep references to the netcdf_file object. In the end I used the debugger to find out when during initialization the variables were actually being read, and added calls to close() in various places. It is relatively easy to close the files at the end of initialization, but it was much harder to make sure that the whole set of files is never open at the same time. I also had to disable mmap when this option is active.
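As a rough sketch of what that approach looks like (my own illustration, not the PR's actual code; the class name, the release() helper, and the assumption that the file was opened from a filename string with mmap disabled are all mine):

```python
from scipy.io import netcdf_file


class ReopeningNetCDFFile(netcdf_file):
    """Sketch only: a scipy netcdf_file whose handle can be released and is
    transparently reopened the next time ``.variables`` is accessed."""

    def __init__(self, filename, mode='r', **kwargs):
        kwargs['mmap'] = False  # mmap requires the handle to stay open
        super(ReopeningNetCDFFile, self).__init__(filename, mode, **kwargs)

    def release(self):
        # Close only the OS-level handle; keep the parsed metadata in memory.
        if not self.fp.closed:
            self.fp.close()

    @property
    def variables(self):
        # Reopen the handle on demand before handing out the variables dict.
        # netcdf_file overrides __setattr__ to write straight into __dict__,
        # so read and write __dict__ directly rather than using a setter.
        if self.fp.closed:
            self.__dict__['fp'] = open(self.filename, 'rb')
        return self.__dict__['variables']
```

The datastore would then call release() once initialization is done, and every later access to .variables pays for an extra open() call, which is consistent with the slowdown described below.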

This solution is messy and, moreover, extremely slow. There is a factor of ~100 performance penalty during initialization for reopening and closing the files all the time (but only a factor of 10 for the actual calculation). I am sure this could be reduced if someone who understands the code better found some judicious points at which to call close() on the netcdf_file. The loss of mmap also sucks.

This option is accessed via the close_files keyword, which I added to the api.
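The count_open_files() helper used in the timing snippets below is not shown in the PR; here is a minimal stand-in (my assumption of what it does, using psutil):

```python
import psutil


def count_open_files():
    # Report how many file handles the current process holds, so the
    # effect of close_files on the open-file count is visible.
    n = len(psutil.Process().open_files())
    print('%d open files' % n)
```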

Timing for loading and doing a calculation with close_files=True:

```
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=True)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 11.1 s, sys: 17.5 s, total: 28.5 s
Wall time: 27.7 s
2 open files
0.0055650632367
CPU times: user 649 ms, sys: 974 ms, total: 1.62 s
Wall time: 633 ms
2 open files
```

Timing for loading and doing a calculation with close_files=False (the default, which should preserve the old behavior):

```
count_open_files()
%time mfds = xray.open_mfdataset(ddir + '/dt_global_allsat_msla_uv_2014101*.nc', engine='scipy', close_files=False)
count_open_files()
%time print float(mfds.variables['u'].mean())
count_open_files()
```

output:

```
3 open files
CPU times: user 264 ms, sys: 85.3 ms, total: 349 ms
Wall time: 291 ms
22 open files
0.0055650632367
CPU times: user 174 ms, sys: 141 ms, total: 315 ms
Wall time: 56 ms
22 open files
```

This is not a very serious pull request, but I spent all day on it, so I thought I would share. Maybe you can see some obvious way to improve it...

@jhamman
Member

jhamman commented Oct 13, 2015

@shoyer and @rabernat - what do we want to do with this?

@rabernat
Contributor Author

I considered this more of a proof of concept than a serious PR. It was incredibly slow, but it circumvented the ulimit restriction. Probably not the best solution.

@pwolfram
Contributor

pwolfram commented Apr 4, 2016

@rabernat, thanks for sharing. I think this will come in handy for #798.

@rabernat
Contributor Author

rabernat commented Apr 4, 2016

Really? I figured this was a doomed concept.

@pwolfram it looks like you are doing the hard work that I did not have the time or expertise for. This will pay off big time once the distributed backend is working.

@pwolfram
Contributor

pwolfram commented Apr 4, 2016

@rabernat, feel free to check out my comment at the end of #798. Please let me know if you see any problems with the outline. It is quite possible I am missing something, and I'd rather know sooner rather than later. The basic idea is that we can use an LRU-cached dict to limit the number of open files, which should also limit the number of opens/closes required. If no one can poke a conceptual hole in the outline, I'll implement it.
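To make the outline concrete, here is a bare-bones sketch of that LRU idea (my own illustration with hypothetical names, not whatever #798 eventually became):

```python
from collections import OrderedDict


class FileCache(object):
    """Keep at most ``maxsize`` files open, closing the least recently
    used one when the limit is reached."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._cache = OrderedDict()  # filename -> open file-like object

    def acquire(self, filename, opener):
        # ``opener`` is any zero-argument callable that (re)opens the file,
        # e.g. lambda: scipy.io.netcdf_file(filename, mmap=False).
        if filename in self._cache:
            # Mark as most recently used.
            self._cache.move_to_end(filename)
            return self._cache[filename]
        if len(self._cache) >= self.maxsize:
            # Evict and close the least recently used file.
            _, oldest = self._cache.popitem(last=False)
            oldest.close()
        f = opener()
        self._cache[filename] = f
        return f
```

Every file access would go through acquire(), so at most maxsize handles are open at any time, at the cost of reopening files that get evicted.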

@shoyer
Member

shoyer commented Jun 24, 2016

Something like this (the LRU cache) is still worth exploring, but this isn't quite the right solution, so I'm closing this PR just to clean up the list of outstanding PRs :).

shoyer closed this Jun 24, 2016