FEAT-#7047: Add range-partitioning implementation for '.pivot_table()' #7048
@@ -245,3 +245,260 @@ def mean_reduce(dfgb, **kwargs):
"skew": GroupbyReduceImpl._build_skew_impl(),
"sum": ("sum", "sum", lambda grp, *args, **kwargs: grp.sum(*args, **kwargs)),
}
class PivotTableImpl:
.pivot_table() is literally a groupby + fancy post-processing, so I decided to put it into groupby.py
cls, qc, unique_keys, drop_column_level, pivot_kwargs
): # noqa: PR01
"""Compute 'pivot_table()' using full-column-axis implementation."""
index, columns, values = (
the logic was copied from qc.pivot_table()
-------
pandas.DataFrame
"""
if df.index.nlevels > 1 and to_unstack is not None:
the logic was copied from PandasQueryCompiler._pivot_table_tree_reduce()
to_aggregate : PandasQueryCompiler
keys_to_group : PandasQueryCompiler
"""
if values is None:
the logic was copied from PandasQueryCompiler.pivot_table
@dchigarev this confuses me a little, because as far as I understand
Right, the order is the following:
Agree that the comment is a bit confusing, rephrased it
Could you also rebase on master, to be sure that the new tests pass?
…pivot_table()' Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
LGTM!
What do these changes do?
This PR adds a range-partitioning implementation for the .pivot_table() method. A pivot table is literally a groupby aggregation + fancy post-processing of the result. The new implementation uses range-partitioning groupby to perform the aggregation at the first stage and then applies make_pivot_table() to the reduced result.
The range-partitioning implementation seems to outperform the old full-column implementation on normal-size data. That's why I decided to replace the old full-column implementation with range-partitioning wherever possible:
script to measure
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
.pivot_table() #7047
tests added and are passing
docs/development/architecture.rst is up-to-date
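The description's claim that a pivot table is a groupby aggregation plus post-processing can be illustrated with a minimal sketch in plain pandas. This is not Modin's implementation and does not use range-partitioning; the data, the column names and the single unstack step standing in for make_pivot_table() are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "store": ["a", "a", "b", "b", "a"],
        "item": ["x", "y", "x", "y", "x"],
        "sales": [1, 2, 3, 4, 5],
    }
)

# Stage 1: groupby aggregation over both the index key and the column key.
reduced = df.groupby(["store", "item"])["sales"].sum()

# Stage 2: post-processing that moves the column key from the row index
# into the columns (a simplified stand-in for the PR's make_pivot_table()).
pivot_via_groupby = reduced.unstack("item")

# Reference result computed directly by pandas.
expected = df.pivot_table(
    index="store", columns="item", values="sales", aggfunc="sum"
)

print(pivot_via_groupby)
print(expected)
```

In this sketch the expensive part is the groupby in stage 1, which is the step the PR moves to range-partitioning; the reshaping in stage 2 then runs on the already reduced result.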