Keep attrs by default? (keep_attrs) #3891

Open
max-sixty opened this issue Mar 25, 2020 · 14 comments
Labels
topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments

@max-sixty
Collaborator

I've held this view in low confidence for a while and wanted to socialize it to see whether there's something to it: Should we keep attrs in operations by default?

Advantages:

  • I think most of the time people want to keep attrs after operations
    • Is that right? Are there cases where it wouldn't be a reasonable default? e.g. good points here for not always keeping coords around
  • It's easy to remove them with a (currently unimplemented) drop_attrs method when people do want to remove them

Disadvantages:

  • Backward incompatible change with an expensive deprecate cycle (would be impractical to have a deprecation warning every time someone ran a function on an object with attrs I think? At least without adding a once filter warning)
  • ?
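
The "once" idea in the first disadvantage could look roughly like this. It is a minimal sketch, not xarray code: `warn_keep_attrs_change`, `reduce_with_attrs` and the module-level flag are hypothetical names, and it uses a manual guard instead of the `warnings` module's own `once` filter so the behaviour does not depend on the user's filter settings.

```python
import warnings

# Hypothetical module-level guard; not part of xarray.
_already_warned = False


def warn_keep_attrs_change() -> None:
    """Emit the (hypothetical) default-change warning at most once per process."""
    global _already_warned
    if _already_warned:
        return
    _already_warned = True
    warnings.warn(
        "attrs will be kept by default in a future version; "
        "pass keep_attrs=False to retain the current behaviour.",
        FutureWarning,
        stacklevel=2,
    )


def reduce_with_attrs(values, attrs):
    """Toy reduction that drops attrs today, warning only if there were any to drop."""
    if attrs:
        warn_keep_attrs_change()
    return sum(values), {}
```

Even with this guard, every user whose objects carry attrs would see the warning once per session, which is the cost being weighed above.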

Here are some existing relevant discussions:

I think this is an easy situation to get into:

  • We make an incorrect-but-insignificant design decision; e.g. some methods don't keep attrs
  • We want to change that, but avoid breaking backward-compatibility
  • So we add kwargs and eventually a global config
  • But now we have a global config that requires global context and lots of kwargs! :(

I'm up for leaning towards breaking changes if it makes the library better: I think xarray will grow immensely, and so the narrow immediate pain is worth the broader future positive impact. Clearly if the immediate pain stops xarray growing, then it's not a good tradeoff.

@crusaderky
Contributor

Why would you want a .drop_attrs() method? .attrs.clear() will do just fine.
I fully agree we should keep attrs by default.

@max-sixty
Collaborator Author

Why would you want a .drop_attrs() method? .attrs.clear() will do just fine.

Yes that's fine if people are happy with .attrs.clear(). A method that returns the dataset object is useful for "fluent" method chaining.
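
To illustrate the chaining point, here is a small sketch using a toy `Labeled` class as a stand-in for a DataArray or Dataset. The class and its `scale` method are made up for this example; only the names `drop_attrs` and `attrs.clear()` come from the discussion.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Labeled:
    """Toy stand-in for a DataArray/Dataset: some values plus an attrs dict."""

    values: tuple
    attrs: dict = field(default_factory=dict)

    def scale(self, factor):
        # Keeps attrs, i.e. the "keep by default" behaviour under discussion.
        scaled = tuple(v * factor for v in self.values)
        return replace(self, values=scaled, attrs=dict(self.attrs))

    def drop_attrs(self):
        # Returns a new object, so it can sit in the middle of a method chain.
        return replace(self, attrs={})


obj = Labeled((1, 2, 3), {"units": "m"})

# Fluent: one expression, and the original object is left untouched.
chained = obj.scale(2).drop_attrs().scale(10)

# With attrs.clear() the chain has to be broken: dict.clear() mutates in place
# and returns None, so nothing can be chained after it.
step = obj.scale(2)
cleared = step.attrs.clear()
```

Here `cleared` is `None`, which is why `.attrs.clear()` cannot appear mid-chain.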

@shoyer
Member

shoyer commented Mar 26, 2020

See #1614 for related discussion.

I'm happy to set aside backwards compatibility concerns for now and ponder what the ideal policy would be. The original choices here were not made in a super careful way.

My longest-standing concern here is about units. One common use case for attrs is to mark the units of an array, and those aren't always preserved by naive arithmetic. But perhaps this is less of a concern now that you can use pint with xarray?

The other concern is how to combine attrs in operations that involve multiple arrays. Currently we just copy attrs from the first object, but that probably is not the most consistent (e.g., ideally arithmetic should be reflexive).

@crusaderky
Contributor

@shoyer to me it would make the most sense to do a union of the inputs:

  • if a key is present only in one input, it goes to the output
  • if a key is present in multiple inputs, always take the leftmost

Note how this would be different from how scalar coords are treated; scalar coords are discarded when they arrive from multiple inputs and are mismatched. The reason I don't think it's wise to do the same with attrs is that it could be uncontrollably expensive to compute equality, depending on what people loaded in them. I've personally seen them used as back-references to the whole application framework. Also there's no guarantee that they implement __eq__ or that it returns a bool; e.g. you can't compare two data structures that somewhere inside contain numpy arrays.
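
The union rule above can be sketched in a few lines of plain Python. `union_attrs_leftmost` and `NoEq` are hypothetical names for illustration, not xarray functions; the point is that only keys are inspected, so no `__eq__` is ever called on the attr values.

```python
def union_attrs_leftmost(*attrs_dicts):
    """Union of attrs dicts; on a key clash the leftmost input wins.

    Values are never compared, so no __eq__ is called on them.
    """
    merged = {}
    for attrs in attrs_dicts:
        for key, value in attrs.items():
            # setdefault only looks at the key, never at the value.
            merged.setdefault(key, value)
    return merged


class NoEq:
    """Value whose equality check is unusable, like some objects found in attrs."""

    def __eq__(self, other):
        raise TypeError("comparison not supported")

    __hash__ = None  # unhashable, which is fine for a dict value


left = {"units": "m", "source": NoEq()}
right = {"units": "s", "history": "loaded from disk"}
merged = union_attrs_leftmost(left, right)
```

Swapping the arguments changes which `units` survives, so this rule is cheap and predictable but not symmetric.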

@TomNicholas added the topic-metadata (Relating to the handling of metadata, i.e. attrs and encoding) label Apr 5, 2020
@TomNicholas
Member

TomNicholas commented Apr 5, 2020

I think this is a good question @max-sixty, and I have some opinions based on my experience with xBOUT.

Firstly, I agree with you that, for those users who use xarray as a convenience wrapper or for whom it's useful but not critical, it makes more sense to keep attrs by default. "Drop by default because otherwise they might become inconsistent with your data" never really made sense to me, because if you care that much about attrs being consistent with data then you really need well-defined rules for how they are propagated in all cases, which we don't (yet) offer. In all other cases you would rather keep them and have to deal with the edge cases (which is why I wanted #2482).

As a concrete usage example of wanting to preserve attrs while not being overly-concerned if they sometimes get dropped: in xBOUT, our data requires carting around some regions attributes so that we know how to plot it later. One day this could maybe be handled by custom indexes in xBOUT, but there are probably other communities whose attrs requirements couldn't be.

After the casual wrapper case, the most important cases are:

  • Units, which IMO becomes much less relevant once pint integration is complete,
  • Data provenance,
  • CF conventions
  • Other domain-specific types of grids (like the xBOUT case, or staggered grids etc.)

At the risk of repeating what's in #1614, I would like to see some hybrid approach, which gives a simple global default along the lines of what @crusaderky suggests, but also allows a plugin which takes over and rigorously specifies the behaviour for the users who do care. Then we can outsource the work of the complex logic to e.g. the community that actually has to preserve CF conventions, or a separate data provenance package.

(Also I made a new metadata issue label for these discussions)

@max-sixty
Collaborator Author

Great, thanks @TomNicholas, appreciate the thoughtful reply.

One thing we could do (NB: I don't think we should do this right now, but building on the points above as ideation) is to defer to the attrs themselves. For example, in an operation dividing one dataarray by another, if they both share an attr which has a __div__ method, we call that and put the returned value on the resulting dataarray. That way, even ex-pint integration, Unit('m') / Unit('s') could evaluate to Unit('m/s'). And where units want to be dropped, they could use those methods to return None.
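
As a rough sketch of what "deferring to the attrs" might mean for division: `Unit` and `divide_attrs` below are hypothetical, not pint or xarray. In Python 3 the `/` operator dispatches to `__truediv__` rather than `__div__`, so the sketch uses that. It also assumes that attrs present in only one input, or that do not support division, are dropped; the comment above does not specify this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical unit attr; not pint and not part of xarray."""

    symbol: str

    def __truediv__(self, other):
        if not isinstance(other, Unit):
            return NotImplemented
        if self.symbol == other.symbol:
            return None  # dimensionless result: signal "drop this attr"
        return Unit(f"{self.symbol}/{other.symbol}")


def divide_attrs(left, right):
    """Attrs for `left / right`: delegate shared keys to the attr values themselves."""
    out = {}
    for key in left.keys() & right.keys():
        a, b = left[key], right[key]
        if not hasattr(type(a), "__truediv__"):
            continue  # this attr does not opt in (assumption: drop it)
        try:
            result = a / b
        except TypeError:
            continue  # the two values cannot be divided (assumption: drop it)
        if result is not None:
            out[key] = result
    return out


speed_attrs = divide_attrs(
    {"units": Unit("m"), "title": "distance"},
    {"units": Unit("s"), "title": "time"},
)
```

With these inputs `speed_attrs` keeps only `units`, now `Unit("m/s")`; the plain-string `title` has no division hook and is dropped.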

Re next steps on setting the default to be True, what are people's thoughts? Would we take a PR for 0.16? Would we want a deprecation warning on any operation with an attr?

@TomNicholas
Member

TomNicholas commented Apr 11, 2020

For example, in an operation dividing one dataarray by another, if they both share an attr which has a __div__ method, we call that and put the returned value on the resulting dataarray.

I agree that this would be very powerful, and allow users to implement all the things they want (provenance, units handling etc.), but this also seems like a big undertaking. In order to have well-defined handling of attrs through operations like merge, concat, and ufuncs, wouldn't the attr-handling interface have to be almost as complicated as xarray's actual interface? Not saying we shouldn't do it, but what's the minimum set of attr-handling hooks that would have to be defined (and implemented and tested)?

Do you think it would be useful to get input from someone who actually wants this for a complex use case? I think the most hardcore one will be data provenance, because that (a) will need complicated underlying logic, (b) ideally needs to be pretty fault-tolerant, and (c) won't be made redundant by pint or duck-array integration. There was someone on #1614 who was asking about this IIRC.

Would we want a deprecation warning on any operation with an attr?

That would be almost every operation wouldn't it?

@TomNicholas
Member

I'm trying to imagine what the approach that delegated the largest fraction of the work to an attrs-handling plugin would be. Would it be to give the attrs plugin the input, and the name of the function/method that was being called, and let the plugin completely decide the output attrs? Or would that be under-specified?
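
One way to picture that maximal delegation is sketched below. It assumes a plugin signature of `(attrs of each input, operation name) -> output attrs`; the signature and the names `apply_op`, `default_plugin` and `provenance_plugin` are hypothetical, and operands are simple `(value, attrs)` pairs standing in for labelled arrays.

```python
import operator
from typing import Any, Callable, Mapping, Sequence

Attrs = Mapping[str, Any]
# Hypothetical plugin signature: (attrs of each input, operation name) -> output attrs.
AttrsPlugin = Callable[[Sequence[Attrs], str], dict]


def default_plugin(inputs, op_name):
    """Fallback policy: keep the first input's attrs unchanged."""
    return dict(inputs[0]) if inputs else {}


def provenance_plugin(inputs, op_name):
    """Example plugin: keep the first input's attrs and record the operation."""
    out = dict(inputs[0]) if inputs else {}
    history = out.get("history", "")
    out["history"] = f"{history}; {op_name}" if history else op_name
    return out


def apply_op(op_name, func, operands, plugin=default_plugin):
    """Run `func` on the operand values and let `plugin` decide the output attrs."""
    values = [value for value, _ in operands]
    attrs = [a for _, a in operands]
    return func(*values), plugin(attrs, op_name)


a = (6.0, {"units": "m", "history": "loaded"})
b = (3.0, {"units": "s"})
value, out_attrs = apply_op("truediv", operator.truediv, [a, b], provenance_plugin)
```

Passing only the operation name is probably under-specified for reductions, where a plugin would also need the dimensions and keyword arguments; that is the gap a richer context object would fill.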

@max-sixty
Collaborator Author

Would we want a deprecation warning on any operation with an attr?

That would be almost every operation wouldn't it?

Right, anything involving an object with attrs... hence my reluctance. Do we think it's OK to do this on a major version without a warning?

@shoyer
Member

shoyer commented Apr 11, 2020

I think it would probably be OK to start propagating more attrs by default as a breaking change. There's no easy way to roll this out incrementally, and I doubt too many users are relying upon metadata disappearing when they do xarray operations, given the somewhat inconsistent state of the current rules.

@keewis
Collaborator

keewis commented Jan 25, 2021

I did not think this through carefully, but I wonder if we should extend merge_attrs to also take a function with a list of attrs as its only parameter and move towards something like combine_attrs instead of keep_attrs: setting keep_attrs seems to choose between combine_attrs="drop" and combine_attrs="override".

@keewis
Collaborator

keewis commented Feb 5, 2021

if I remember correctly, we decided to allow passing a user-provided function to combine_attrs and to extend keep_attrs to accept a bool, a str or a function.

Something to keep in mind is that not all strategies make sense for operations that involve only a single variable, like isnull, but I guess for those all string options except drop mean "keep the attributes".
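
A simplified sketch of how a bool, a string or a function could be normalised into one merging function follows. It is not xarray's implementation: `resolve_keep_attrs` is a hypothetical helper, the callable takes only a list of attrs dicts as described above, and only three string strategies are modelled (`drop` and `override` from the comments, plus `drop_conflicts` as an illustrative third).

```python
def _drop(attrs_list):
    return {}


def _override(attrs_list):
    # Copy attrs from the first object, ignoring the rest.
    return dict(attrs_list[0]) if attrs_list else {}


def _drop_conflicts(attrs_list):
    # Union of all attrs, dropping any key whose values disagree.
    merged, conflicted = {}, set()
    for attrs in attrs_list:
        for key, value in attrs.items():
            if key in conflicted:
                continue
            if key in merged and merged[key] != value:
                del merged[key]
                conflicted.add(key)
            else:
                merged[key] = value
    return merged


_STRATEGIES = {"drop": _drop, "override": _override, "drop_conflicts": _drop_conflicts}


def resolve_keep_attrs(keep_attrs):
    """Normalise bool / str / callable into a function over a list of attrs dicts."""
    if callable(keep_attrs):
        return keep_attrs
    if keep_attrs is True:
        keep_attrs = "override"
    elif keep_attrs is False:
        keep_attrs = "drop"
    try:
        return _STRATEGIES[keep_attrs]
    except KeyError:
        raise ValueError(f"unknown attrs strategy: {keep_attrs!r}") from None


x = {"units": "m", "title": "a"}
y = {"units": "m", "title": "b", "extra": 1}
```

For a single-input operation such as `isnull`, every strategy here except `drop` returns the input attrs unchanged, which matches the observation above.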

@dcherian
Contributor

and to extend keep_attrs to accept a bool, a str or a function.

If we allow keep_attrs to be a custom function, then we could move towards some of the ideas in here: #988. If that custom function received something like the UfuncContext in that issue, then an external library could implement data provenance handling like the history attribute, and set things like cell_methods. The context manager idea seems a little complex but doing something like

xr.set_options(keep_attrs=cf_xarray.attrs_handler)

could be OK, where all decisions are left up to the external package (here cf_xarray).

(Though what's stopping us from directly adding cell_methods attributes now for reductions, weighted, and coarsen?)
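
To make the handler idea concrete, here is a sketch of what such an external function might do if it were given the attrs of the inputs plus a small context object. `OpContext` and `cf_like_handler` are hypothetical; they are not the UfuncContext from #988 or anything provided by cf_xarray, and the `cell_methods` strings follow the CF "dim: method" form only loosely.

```python
from dataclasses import dataclass


@dataclass
class OpContext:
    """Hypothetical context: the name of the operation and the dims it reduced over."""

    func: str
    dims: tuple = ()


# Assumed mapping from operation names to CF-style method names.
_CF_METHOD_NAMES = {"mean": "mean", "sum": "sum", "max": "maximum", "min": "minimum"}


def cf_like_handler(attrs_list, context):
    """Hypothetical external handler: keep attrs, append history, set cell_methods."""
    out = dict(attrs_list[0]) if attrs_list else {}

    # Provenance: append a one-line record of the call to any existing history.
    call = f"{context.func}(dim={list(context.dims)})"
    out["history"] = "\n".join(filter(None, [out.get("history"), call]))

    # Reductions: record how each reduced dimension was collapsed.
    method = _CF_METHOD_NAMES.get(context.func)
    if method and context.dims:
        cells = " ".join(f"{dim}: {method}" for dim in context.dims)
        out["cell_methods"] = " ".join(filter(None, [out.get("cell_methods"), cells]))
    return out


first = cf_like_handler(
    [{"units": "K", "history": "opened file"}], OpContext("mean", ("time",))
)
second = cf_like_handler([first], OpContext("max", ("lat", "lon")))
```

All of the policy lives in the handler; the library's only job would be to call it with enough context.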

@max-sixty
Collaborator Author

Moving from #8205 (I had searched for keep_attrs in issues, w underscore...)


@keewis writes:

It is true that it is more common to just globally set keep_attrs=True, so we might want to consider changing the default (see #3891, which you opened a long time ago).

We also (partially) changed keep_attrs to accept the names of builtin attrs merging functions and user functions (callables), which is important for operations that take more than a single variable. Though I guess in that case we could also just rename to combine_attrs.

I think the distinction of combining objects vs. a function that operates on a single object is important.

In the case I was working on, this was just a transformation of a single object, so I don't see much of a downside of separating out the ability to drop attrs into a different function.

Would there be any interest in:

  • changing the default to True
  • not adding keep_attrs for new functions
  • potentially soft-deprecating for functions which operate on a single object (the vast majority of functions)
    • i.e. remove from docs, without raising warnings everywhere. Like we did for .drop, which seems to have gone quite well...

@dcherian changed the title from "Keep attrs by default?" to "Keep attrs by default? (keep_attrs)" Sep 22, 2023
max-sixty added a commit to max-sixty/xarray that referenced this issue Sep 30, 2023
max-sixty added a commit that referenced this issue Jul 11, 2024
* Add a `.drop_attrs` method

Part of #3891

* Add tests

* Add explicit coords test

* Use `._replace` for half the method

* .

* Add a `deep` kwarg (default `True`?)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* api

* Update xarray/core/dataarray.py

Co-authored-by: Michael Niklas  <mick.niklas@gmail.com>

* Update xarray/core/dataset.py

Co-authored-by: Michael Niklas  <mick.niklas@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Michael Niklas <mick.niklas@gmail.com>
6 participants