Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cumulative Methods with mixed frames casts to object #19296

Closed
TomAugspurger opened this issue Jan 18, 2018 · 7 comments · Fixed by #29872
Closed

Cumulative Methods with mixed frames casts to object #19296

TomAugspurger opened this issue Jan 18, 2018 · 7 comments · Fixed by #29872
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 18, 2018

Looks like we cast the entire frame to object dtype, and then perform the cumulative functions:

In [18]: x = pd.DataFrame({
    ...:     "A": [1, 2, 3],
    ...:     "B": [1, 2, 3.],
    ...:     "C": [True, False, False],
    ...: })

In [19]: x.cumsum()
Out[19]:
   A  B     C
0  1  1  True
1  3  3     1
2  6  6     1

In [20]: x.cumsum().dtypes
Out[20]:
A    object
B    object
C    object
dtype: object

In [21]: x.cummin().dtypes
Out[21]:
A    object
B    object
C    object
dtype: object

I think it'd be better to do these block-wise? The possible downside I see is that

In [24]: x[['A', 'B']].cumsum()
Out[24]:
     A    B
0  1.0  1.0
1  3.0  3.0
2  6.0  6.0

Will now be slower (presumably) since we'll have two cumsums to apply instead of one (after upcasting), but I think that'd be worth it for preserving the correct dtypes.

@TomAugspurger TomAugspurger added Numeric Operations Arithmetic, Comparison, and Logical operations Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jan 18, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Jan 18, 2018
@TomAugspurger
Copy link
Contributor Author

cc @jbrockmendel this may interest you.

@jbrockmendel
Copy link
Member

Looks like the casting is done in generic._make_cum_function line y = _values_from_object(self).copy()

I'd be down for working towards making these apply block-wise, avoid some casting (and likely copying).

@TomAugspurger
Copy link
Contributor Author

Cool, thanks.

This of course came up in the context of the extension array / block stuff. We need some kind of way of knowing

  1. if an external array type supports a given operation
  2. If an external array type has an implementation of a given operation, or if they want to fall back to being cast to object ndarray and using NumPy's.

@jbrockmendel
Copy link
Member

We need some kind of way of knowing [...]

So I should check an implicit assumption I've been making: does your game-plan involve the Block subclasses inheriting from analogous Array subclasses? The problem posed would be simpler if the method lookup can be done at the Block level, with the implementation at the Array level.

I've been leaving the DataFrame case for last so I wouldn't have to figure out the dispatching logic. Between core.ops, BlockManager.eval, and BlockManager.apply there's a lot of redirection that's tough to track down. So I'm thinking some step of the approach here should involve untangling the dispatch logic.

@jreback
Copy link
Contributor

jreback commented Jan 19, 2018

these should be dispatched like all other operations using BlockManager.apply. e.g. look up .replace/.fillna for many many other operations in generic. this pattern is already well established.

@jreback jreback modified the milestones: 0.23.0, Next Major Release Jan 19, 2018
@jreback
Copy link
Contributor

jreback commented Jan 19, 2018

don't label things for 0.23 until they are actually merged.

@jbrockmendel
Copy link
Member

@TomAugspurger side-note question w.r.t Categoricals: Series arithmetic methods are defined in core.ops:

    def na_op(x, y):
        import pandas.core.computation.expressions as expressions
        try:
            result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
        except TypeError:
            if isinstance(y, (np.ndarray, ABCSeries, pd.Index)):
                dtype = find_common_type([x.dtype, y.dtype])
                result = np.empty(x.size, dtype=dtype)
                mask = notna(x) & notna(y)
                result[mask] = op(x[mask], com._values_from_object(y[mask]))
            elif isinstance(x, np.ndarray):
                result = np.empty(len(x), dtype=x.dtype)
                mask = notna(x)
                result[mask] = op(x[mask], y)
            else:
                raise TypeError("{typ} cannot perform the operation "
                                "{op}".format(typ=type(x).__name__,
                                              op=str_rep))

The x here correponds to self.values. My hunch is that the expressions.evaluate is going to raise whenever x is a Categorical. Do you know if this is correct?

@jreback jreback modified the milestones: Contributions Welcome, 1.0 Dec 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants