
shallow copies become deep copies when pickling #1058

Closed
crusaderky opened this issue Oct 23, 2016 · 10 comments
@crusaderky
Contributor

crusaderky commented Oct 23, 2016

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays.
This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array:

>>> import numpy, pickle
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so xarray auto-assigns an incremental integer coord.
Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle. However, I also have to dump all intermediate steps for audit purposes. This means that xarray invokes numpy.arange to create (500k * 4 bytes) ~ 2MB worth of coord, then creates 3000 views of it, which the moment they're pickled expand to several GBs, as they become 3000 independent copies.

I see a few possible solutions to this:

  1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall in my very specific use case won't benefit from it.
  2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
  3. Force the coord to be a dask.array.arange. Actually supporting unconverted dask arrays as coordinates would take a considerable amount of work; they would get converted to numpy several times, and other issues. Again it wouldn't solve the general problem.
  4. Fix the issue upstream in numpy. I didn't look into it yet and it's definitely worth investigating, but I found about it as early as 2012, so I suspect there might be some pretty good reason why it works like that...
  5. Whenever xarray performs a shallow copy, take the numpy array instead of creating a view.

I implemented (5) as a workaround in my __getstate__ method.
Before:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s

Workaround:

def get_base(array):
    """Return array.base if array is a same-dtype, same-shape view of it,
    otherwise return array itself."""
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        # Dataset.variables is a mapping, not a method
        for var in v.variables.values():
            var.data = get_base(var.data)

After:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
@shoyer
Member

shoyer commented Oct 23, 2016

The plan is to stop making default indexes with np.arange. See #1017, which is my top priority for the next major release.

I'm not confident that your workaround will work properly. At the very least, you should also check strides; otherwise get_base(array[::-1]) would return array.
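To illustrate the strides problem (a minimal sketch in plain NumPy, not xarray code): a reversed view passes every check in get_base above, yet returning its base would silently reverse the data:

```python
import numpy as np

a = np.arange(10)
b = a[::-1]  # reversed view: same dtype and shape as its base

# All of get_base's checks pass for b...
assert b.base is a and b.dtype == a.dtype and b.shape == a.shape

# ...but only the strides reveal the reversal, and the data differs:
print(a.strides, b.strides)  # e.g. (8,) (-8,) on a 64-bit platform
print((b == a).all())        # False: get_base(b) would return reversed data
```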

If it would really help, I'm open to making Variable(dims, array) reuse the same numpy array instead of creating a view (see as_compatible_data).

@max-sixty
Collaborator

If I'm understanding you correctly @crusaderky, I think this is a tough problem, and one much broader than xarray. When pickling something with a reference, do you want to save the object, or the reference? If you pickle the reference, how can you guarantee to have the object available when unpickling? How would you codify the reference (memory location?)?

Is that right? Or am I misunderstanding your problem?

On this narrow case, I think not having indexes at all should solve this, though

@crusaderky
Contributor Author

@MaximilianR, if you pickle 2 plain python objects A and B together, and one of the attributes of B is a reference to A, A does not get duplicated.

In this case there must be some specific __getstate__ code preventing this, and/or something in the C implementation of the class.
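A minimal illustration of that point (the Node class is hypothetical, used only for demonstration): pickle's memo serializes a shared plain-Python object once and preserves the reference on load.

```python
import pickle

class Node:
    pass

a = Node()
b = Node()
b.ref = a  # b holds a reference to a

# pickle's memo stores a only once; the shared reference survives the round trip:
a2, b2 = pickle.loads(pickle.dumps((a, b)))
print(b2.ref is a2)  # True
```

It is ndarray's own pickling (which serializes each array's data buffer independently, with no notion of .base) that breaks this for views.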

@max-sixty
Collaborator

@crusaderky right, I see. All those views are in the same pickle object, and so shouldn't be duplicated. That is frustrating.

As per @shoyer, the easiest way is to just not have the data in the first place. So not needing indexes at all should solve your case.

@shoyer
Member

shoyer commented Oct 25, 2016

I answered the StackOverflow question:
https://stackoverflow.com/questions/13746601/preserving-numpy-view-when-pickling/40247761#40247761

This was a tricky puzzle to figure out!
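The idea in that answer, roughly, is to pickle the base array once plus only the view's metadata, and rebuild the view on load. A hedged sketch of that idea (the actual answer's code differs; dump_with_view/load_with_view are illustrative names):

```python
import pickle
import numpy as np

def dump_with_view(base, view):
    # Serialize the base once, plus only the view's metadata.
    offset = (view.__array_interface__['data'][0]
              - base.__array_interface__['data'][0])
    meta = (view.dtype, view.shape, view.strides, offset)
    return pickle.dumps((base, meta))

def load_with_view(data):
    base, (dtype, shape, strides, offset) = pickle.loads(data)
    # Reconstruct the view on top of the freshly unpickled base.
    view = np.ndarray(shape, dtype=dtype, buffer=base,
                      strides=strides, offset=offset)
    return base, view

a = np.arange(2**20)
b = a[100:200]
a2, b2 = load_with_view(dump_with_view(a, b))
print(np.shares_memory(a2, b2))  # True: the view is restored as a view
```

Compared with pickle.dumps((a, b)), the payload carries the base's data only once.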

@crusaderky
Contributor Author

Confirmed that #1017 fixes my specific issue, thanks!
Leaving the ticket open as other people (particularly those that work on large arrays without dask) will still be affected.

@shoyer
Member

shoyer commented Jan 17, 2017

I think this is fixed about as well as we can hope given how pickle works for NumPy by #1128.

So I'm closing this now, but feel free to open another issue for any follow-up concerns.

@shoyer shoyer closed this as completed Jan 17, 2017
@crusaderky
Contributor Author

Actually, I am very much still facing the problem.
The biggest issue is now when I need to invoke xarray.broadcast. In my use case, I'm broadcasting together

  • a scalar array with numpy backend, shape=(), chunks=None
  • a 1D array with dask backend, shape=(2**19,), chunks=(2**15,)

What broadcast does is transform the scalar array into a numpy array of 2**19 elements. This is actually a view of the original 0-d array, so it has negligible RAM requirements. But after pickling and unpickling, it becomes a real 2**19-element array. Add up a few hundred of them, and I am facing GBs of wasted RAM.
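This can be reproduced with plain NumPy (a sketch of the mechanism, not xarray's internals): np.broadcast_to produces a stride-0 view that costs almost nothing in RAM, but pickle materializes it in full:

```python
import pickle
import numpy as np

scalar = np.array(1.5)                   # 0-d array: 8 bytes of data
big = np.broadcast_to(scalar, (2**19,))  # stride-0 view: still ~8 bytes

print(big.strides)  # (0,): every element aliases the single scalar
# Pickling copies the view to a contiguous buffer: ~4 MiB for float64
print(len(pickle.dumps(big)) / 2**20)
```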

A solution would be to change broadcast() to convert to dask before broadcasting, and then broadcast directly to the proper chunk size.

@shoyer
Member

shoyer commented Feb 5, 2017

@crusaderky Yes, I think it could be reasonable to unify array types when you call broadcast() or align(), either as optional behavior or by changing the default.

If your scalar array is the result of an expensive dask calculation, this also might be a good use case for dask's new .persist() method (dask/dask#1908), which we could add to xarray as an alternative to .compute().

@shoyer
Member

shoyer commented Feb 5, 2017

Alternatively, it could make sense to change pickling upstream in NumPy to special-case arrays with a stride of 0 along some dimension.
