Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies between python/cython groupby.agg behavior #35928

Closed
wants to merge 26 commits into from

Conversation

jbrockmendel
Copy link
Member

This is pretty ugly, but tentatively is sufficient to make #34997 pass.

The upshot is that we have two problems:

  1. in libreduction setting setattr(cached_ityp, '_index_data', islider.buf) silently does the wrong thing for EA-backed indexes
  2. when we go through the non-libreduction path, we do things slightly differently, which requires more patches to get tests passing.

cc @WillAyd

jbrockmendel and others added 20 commits August 20, 2020 21:19

index = Index(sorted(result), name=self.grouper.names[0])
if isinstance(index, (DatetimeIndex, TimedeltaIndex)):
# TODO: do we _always_ want to do this?
# shouldnt this be done later in eg _wrap_aggregated_output?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd it looks like this whole block L288-L307 can be replaced by setting index = self.grouper.result_index without breaking any tests. That de-facto works for existing tests, but I'd like to confirm that it works in general. do you have a good read on if it does?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say go for it if it passes the tests - it's certainly a lot simpler

result = self._aggregate_maybe_named(func, *args, **kwargs)

index = self.grouper.result_index
assert index.name == self.grouper.names[0]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this would fail if we had a name that was NA

@jbrockmendel jbrockmendel changed the title WIP: inconsistencies between python/cython groupby.agg behavior Inconsistencies between python/cython groupby.agg behavior Sep 3, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, this is not pretty.

does this have a test that is added that doesn't work now?

@jreback jreback added the Groupby label Sep 5, 2020
@jbrockmendel
Copy link
Member Author

you are right, this is not pretty.

it started off even worse.

does this have a test that is added that doesn't work now?

No. The motivation is to get #34997 working (which is needed in order to upgrade to cython3 when its eventually released)

@jbrockmendel
Copy link
Member Author

This is also prerequisite to killing off the _index_data kludge

@jbrockmendel jbrockmendel mentioned this pull request Sep 12, 2020
2 tasks
@jreback
Copy link
Contributor

jreback commented Sep 13, 2020

if u can rebase will have a look

@jbrockmendel
Copy link
Member Author

rebased per request

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have tests that show the inconsistency now? and how does this fix it?

ret = create_series_with_explicit_dtype(
result, index=index, dtype_if_empty=object
)
ret.name = self._selected_obj.name # test_metadata_propagation_indiv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pass the name in L284

what libreduction does.
"""
try:
return self._aggregate_named(func, *args, named=True, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this actually doing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to track down an answer to why i did this, may take a bit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK now i recall. in e.g. test_apply_columns_multilevel we do:

        cols = pd.MultiIndex.from_tuples([("A", "a", "", "one"), ("B", "b", "i", "two")])
        ind = date_range(start="2017-01-01", freq="15Min", periods=8)
        df = DataFrame(np.array([0] * 16).reshape(8, 2), index=ind, columns=cols)
        agg_dict = {col: (np.sum if col[3] == "one" else np.mean) for col in df.columns}
        result = df.resample("H").apply(lambda x: agg_dict[x.name](x))

so the function we are passing to apply is lambda x: agg_dict[x.name](x) depends on the Series name x.name but what we're actually doing ATM (and since this is tested, i guess it means we support it on purpose?) is patching the Series name to match the group name so that we end up applying a different function to each group.

But the user could also pass a function that depends on the non-patched Series name, so we end up guessing which regime we're in, which this is doing.

If it were up to me I'd shoot this name-patching thing into the sun.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grr, this is really complex.

@jbrockmendel
Copy link
Member Author

do we have tests that show the inconsistency now? and how does this fix it?

This does not fix anything that is broken in master. But under #34997 (which is needed for cy3 compat), some paths that currently go through cython will instead go through python, and would then fail.

@jreback
Copy link
Contributor

jreback commented Sep 15, 2020

do we have tests that show the inconsistency now? and how does this fix it?

This does not fix anything that is broken in master. But under #34997 (which is needed for cy3 compat), some paths that currently go through cython will instead go through python, and would then fail.

maybe i am not following everything, but do we need to use cy3 in 3.9?

@jbrockmendel
Copy link
Member Author

maybe i am not following everything, but do we need to use cy3 in 3.9?

i hope not; cy3 isnt even out yet

@jreback
Copy link
Contributor

jreback commented Sep 15, 2020

maybe i am not following everything, but do we need to use cy3 in 3.9?

i hope not; cy3 isnt even out yet

so we don't need this rn then?

@jbrockmendel
Copy link
Member Author

so we don't need this rn then?

correct AFAIK

@jbrockmendel
Copy link
Member Author

closing in favor of #34997

@jbrockmendel jbrockmendel deleted the catch-less branch September 17, 2020 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants