
shallow copies become deep copies when pickling #1058

Closed
crusaderky opened this issue Oct 23, 2016 · 10 comments
@crusaderky
Contributor

crusaderky commented Oct 23, 2016

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays.
This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

>>> import numpy, pickle
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so xarray auto-assigns an incremental integer coord.
Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I also have to dump all intermediate steps for audit purposes. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2MB worth of coord, then creates 3000 views of it, which the moment they're pickled expand to several GBs, as they become 3000 independent copies.

I see a few possible solutions to this:

  1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall in my very specific use case won't benefit from it.
  2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
  3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; they would get converted to numpy several times, and other issues. Again it wouldn't solve the general problem.
  4. Fix the issue upstream in numpy. I didn't look into it yet and it's definitely worth investigating, but I found about it as early as 2012, so I suspect there might be some pretty good reason why it works like that...
  5. Whenever xarray performs a shallow copy, take the numpy array instead of creating a view.

I implemented (5) as a workaround in my __getstate__ method.
Before:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s

Workaround:

def get_base(array):
    """Return array.base if array is a same-dtype, same-shape view of it,
    otherwise return array itself."""
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        # Dataset.variables is a mapping, not a method
        for var in v.variables.values():
            var.data = get_base(var.data)

After:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
@shoyer
Member

shoyer commented Oct 23, 2016

The plan is to stop making default indexes with np.arange. See #1017, which is my top priority for the next major release.

I'm not confident that your workaround will work properly. At the very least, you should also check strides; otherwise get_base(array[::-1]) would return array.
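To illustrate the strides problem (a minimal sketch in plain NumPy, not xarray code): a reversed view passes every check in get_base above, yet returning its base would silently reverse the data:

```python
import numpy as np

a = np.arange(10)
b = a[::-1]  # reversed view: same dtype and shape as its base

# All of get_base's checks pass for b...
assert b.base is a and b.dtype == a.dtype and b.shape == a.shape

# ...but only the strides reveal the reversal, and the data differs:
print(a.strides, b.strides)  # e.g. (8,) (-8,) on a 64-bit platform
print((b == a).all())        # False: get_base(b) would return reversed data
```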

If it would really help, I'm open to making Variable(dims, array) reuse the same numpy array instead of creating a view (see as_compatible_data).

@max-sixty
Collaborator

If I'm understanding you correctly @crusaderky, I think this is a tough problem, and one much broader than xarray. When pickling something with a reference, do you want to save the object, or the reference? If you pickle the reference, how can you guarantee to have the object available when unpickling? How would you codify the reference (memory location?)?

Is that right? Or am I misunderstanding your problem?

On this narrow case, I think not having indexes at all should solve this, though

@crusaderky
Contributor Author

@MaximilianR, if you pickle 2 plain python objects A and B together, and one of the attributes of B is a reference to A, A does not get duplicated.

In this case there must be some specific __getstate__ code preventing this, and/or something in the C implementation of the class.
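A minimal illustration of that point (the Node class is hypothetical, used only for demonstration): pickle's memo serializes a shared plain-Python object once and preserves the reference on load.

```python
import pickle

class Node:
    pass

a = Node()
b = Node()
b.ref = a  # b holds a reference to a

# pickle's memo stores a only once; the shared reference survives the round trip:
a2, b2 = pickle.loads(pickle.dumps((a, b)))
print(b2.ref is a2)  # True
```

It is ndarray's own pickling (which serializes each array's data buffer independently, with no notion of .base) that breaks this for views.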

@max-sixty
Collaborator

@crusaderky right, I see. All those views are in the same pickle object, and so shouldn't be duplicated. That is frustrating.

As per @shoyer, the easiest way is to just not have the data in the first place. So not needing indexes at all should solve your case.

@shoyer
Member

shoyer commented Oct 25, 2016

I answered the StackOverflow question:
https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling/40247761#40247761

This was a tricky puzzle to figure out!
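The idea in that answer, roughly, is to pickle the base array once plus only the view's metadata, and rebuild the view on load. A hedged sketch of that idea (the actual answer's code differs; dump_with_view/load_with_view are illustrative names):

```python
import pickle
import numpy as np

def dump_with_view(base, view):
    # Serialize the base once, plus only the view's metadata.
    offset = (view.__array_interface__['data'][0]
              - base.__array_interface__['data'][0])
    meta = (view.dtype, view.shape, view.strides, offset)
    return pickle.dumps((base, meta))

def load_with_view(data):
    base, (dtype, shape, strides, offset) = pickle.loads(data)
    # Reconstruct the view on top of the freshly unpickled base.
    view = np.ndarray(shape, dtype=dtype, buffer=base,
                      strides=strides, offset=offset)
    return base, view

a = np.arange(2**20)
b = a[100:200]
a2, b2 = load_with_view(dump_with_view(a, b))
print(np.shares_memory(a2, b2))  # True: the view is restored as a view
```

Compared with pickle.dumps((a, b)), the payload carries the base's data only once.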

@crusaderky
Contributor Author

Confirmed that #1017 fixes my specific issue, thanks!
Leaving the ticket open as other people (particularly those that work on large arrays without dask) will still be affected.

@shoyer
Member

shoyer commented Jan 17, 2017

I think this is fixed about as well as we can hope given how pickle works for NumPy by #1128.

So I'm closing this now, but feel free to open another issue for any follow-up concerns.

@shoyer shoyer closed this as completed Jan 17, 2017
@crusaderky
Contributor Author

Actually, I am very much still facing the problem.
The biggest issue is now when I need to invoke xarray.broadcast. In my use case, I'm broadcasting together

  • a scalar array with numpy backend, shape=(), chunks=None
  • a 1D array with dask backend, shape=(2**19,), chunks=(2**15,)

What broadcast does is transform the scalar array into a numpy array of 2**19 elements. This is actually a view of the original 0-d array, so it has negligible RAM requirements. But after pickling and unpickling, it becomes a real 2**19-element array. Add up a few hundred of them, and I am facing GBs of wasted RAM.
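This can be reproduced with plain NumPy (a sketch of the mechanism, not xarray's internals): np.broadcast_to produces a stride-0 view that costs almost nothing in RAM, but pickle materializes it in full:

```python
import pickle
import numpy as np

scalar = np.array(1.5)                   # 0-d array: 8 bytes of data
big = np.broadcast_to(scalar, (2**19,))  # stride-0 view: still ~8 bytes

print(big.strides)  # (0,): every element aliases the single scalar
# Pickling copies the view to a contiguous buffer: ~4 MiB for float64
print(len(pickle.dumps(big)) / 2**20)
```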

A solution would be to change broadcast() to convert to dask before broadcasting, and then broadcast directly to the proper chunk size.

@shoyer
Member

shoyer commented Feb 5, 2017

@crusaderky Yes, I think it could be reasonable to unify array types when you call broadcast() or align(), either as optional behavior or by changing the default.

If your scalar array is the result of an expensive dask calculation, this also might be a good use case for dask's new .persist() method (dask/dask#1908), which we could add to xarray as an alternative to .compute().

@shoyer
Member

shoyer commented Feb 5, 2017

Alternatively, it could make sense to change pickling upstream in NumPy to special-case arrays with a stride of 0 along some dimension.
