FEAT-#7047: Add range-partitioning implementation for '.pivot_table()' #7048

dchigarev · 2024-03-11T12:46:49Z

What do these changes do?

This PR adds a range-partitioning implementation for the .pivot_table() method. A pivot table is essentially a groupby aggregation plus fancy post-processing of the result.
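To illustrate the "groupby + post-processing" claim, here is a minimal sketch in plain pandas (not Modin internals). The column names mirror the benchmark script in this PR; the tiny data set is made up for illustration. It shows that, for a single value column, a pivot table matches a groupby aggregation followed by an unstack.

```python
import pandas as pd

# Small made-up frame; float values are used so that dtypes match on both paths.
df = pd.DataFrame(
    {
        "index1": [0, 0, 1, 1, 0, 1],
        "col1": ["a", "b", "a", "b", "a", "b"],
        "value0": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# The one-call version.
pivoted = df.pivot_table(
    index="index1", columns="col1", values="value0", aggfunc="mean"
)

# The same result built by hand: aggregate per (index, column) key pair,
# then move the 'col1' level from the row index to the columns.
manual = df.groupby(["index1", "col1"])["value0"].mean().unstack("col1")

print(pivoted)
print(manual)
```

Only the groupby step touches every row of the input; the unstack works on the already reduced result, which is why speeding up the groupby stage matters most.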

The new implementation uses range-partitioning groupby to perform the aggregation at the first stage and then applies make_pivot_table() to the reduced result.
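For readers unfamiliar with the term, the general idea behind a range-partitioning groupby can be sketched in single-process pandas as below. This is an illustration under assumptions, not Modin's code: the helper name `range_partition_groupby_sum` is hypothetical, the split points are taken from plain quantiles of the key column, and the buckets are processed sequentially rather than in parallel.

```python
import numpy as np
import pandas as pd


def range_partition_groupby_sum(df, key, npartitions):
    """Hypothetical helper: groupby-sum computed bucket by bucket."""
    # Pick split points from key quantiles so buckets are roughly balanced.
    quantiles = np.linspace(0, 1, npartitions + 1)[1:-1]
    pivots = np.unique(
        df[key].quantile(quantiles, interpolation="lower").to_numpy()
    )
    # Assign every row to a bucket; equal keys always get the same bucket id,
    # so each group lives entirely inside one bucket.
    bucket_ids = np.searchsorted(pivots, df[key].to_numpy(), side="right")
    # Aggregate each bucket independently (the step a parallel engine
    # could run concurrently).
    parts = [part.groupby(key).sum() for _, part in df.groupby(bucket_ids)]
    # Buckets cover disjoint, increasing key ranges, so concatenating the
    # partial results gives the final answer without a second reduction.
    return pd.concat(parts)


sample = pd.DataFrame(
    {"index1": np.tile(np.arange(10), 5), "value0": np.arange(50)}
)
result = range_partition_groupby_sum(sample, "index1", npartitions=3)
expected = sample.groupby("index1").sum()
```

Because no group is split across buckets, the per-bucket results never need to be merged with each other, which is what distinguishes this approach from a map-reduce style groupby.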

The range-partitioning implementation seems to outperform the old full-column implementation on normal-size data. That's why I decided to replace the old full-column impl with range-partitioning wherever possible:

script to measure

import pandas
import modin.pandas as pd
import numpy as np
from timeit import default_timer as timer

import modin.config as cfg
cfg.CpuCount.put(44)
from modin.utils import execute

nrows = [100_000, 1_000_000, 2_500_000, 5_000_000, 10_000_000]
ncols = 34
values = ["value0", [f"value{i}" for i in range(5)]]
ngroups = [10, 1_000]
impl = ["full_axis", "map_reduce", "range_part"]

def get_num_vals(val):
    # Number of value columns selected by the given 'values' argument
    return (
        len(val)
        if isinstance(val, list)
        else (1 if isinstance(val, str) else ncols - 3)
    )

columns = pandas.MultiIndex.from_product(
    [
        [
            get_num_vals(val)
            for val in values
        ],
        ngroups,
        impl,
    ],
    names=["num_values", "num_groups", "impl"],
)
total_res = pandas.DataFrame(index=nrows, columns=columns)

i = 0
total_its = len(nrows) * len(values) * len(ngroups) * len(impl)

for nrow in nrows:
    for val in values:
        for ngroup in ngroups:
            data = {
                "index1": np.tile(np.arange(ngroup), nrow // ngroup),
                "index2": np.tile(np.arange(ngroup), nrow // ngroup),
                "col1": np.tile([f"val{i}" for i in range(ngroup)], nrow // ngroup),
                **{f"value{i}": np.arange(nrow) for i in range(ncols - 3)},
            }

            for imp in impl:
                print(f"{round((i / total_its) * 100, 2)}%")
                i = i + 1
                df = pd.DataFrame(data)
                execute(df)

                t1 = timer()
                res = df.pivot_table(
                    index=["index1", "index2"],
                    columns=["col1"],
                    values=val,
                    # requires a hack in 'pivot_table()' that would dispatch to a proper implementation
                    # depending on this parameter
                    margins_name=imp,
                )
                execute(res)
                tm = timer() - t1
                print(f"{nrow=}; {val=}; {ngroup=}; {imp=}: {tm}; {res.shape}")
                total_res.loc[nrow, (get_num_vals(val), ngroup, imp)] = tm
                total_res.to_excel("pivot.xlsx")

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Add range-partitioning implementation for .pivot_table() #7047
tests ~~added and~~ are passing
module layout described at docs/development/architecture.rst is up-to-date

dchigarev · 2024-03-12T12:10:37Z

modin/core/storage_formats/pandas/groupby.py

@@ -245,3 +245,260 @@ def mean_reduce(dfgb, **kwargs):
    "skew": GroupbyReduceImpl._build_skew_impl(),
    "sum": ("sum", "sum", lambda grp, *args, **kwargs: grp.sum(*args, **kwargs)),
 }
+
+
+class PivotTableImpl:


.pivot_table() is essentially a groupby plus fancy post-processing, so I decided to put it into groupby.py

dchigarev · 2024-03-12T12:11:59Z

modin/core/storage_formats/pandas/groupby.py

+        cls, qc, unique_keys, drop_column_level, pivot_kwargs
+    ):  # noqa: PR01
+        """Compute 'pivot_table()' using full-column-axis implementation."""
+        index, columns, values = (


the logic was copied from qc.pivot_table()

dchigarev · 2024-03-12T12:13:07Z

modin/core/storage_formats/pandas/groupby.py

+        -------
+        pandas.DataFrame
+        """
+        if df.index.nlevels > 1 and to_unstack is not None:


the logic was copied from PandasQueryCompiler._pivot_table_tree_reduce()

dchigarev · 2024-03-12T12:13:48Z

modin/core/storage_formats/pandas/groupby.py

+        to_aggregate : PandasQueryCompiler
+        keys_to_group : PandasQueryCompiler
+        """
+        if values is None:


the logic was copied from PandasQueryCompiler.pivot_table

anmyachev · 2024-03-12T15:53:44Z

This PR adds a range-partitioning implementation for .pivot_table() method and enables it by default.

@dchigarev this confuses me a little, because as far as I understand MapReduce implementation is by default, right?

dchigarev · 2024-03-13T08:24:13Z

This PR adds a range-partitioning implementation for .pivot_table() method and enables it by default.

@dchigarev this confuses me a little, because as far as I understand MapReduce implementation is by default, right?

Right, the order is the following:

1. Try the MapReduce implementation.
2. If MapReduce can't be used, use the range-partitioning impl.
3. If range-partitioning can't be used, use the full-column impl.

Agreed that the comment is a bit confusing; I rephrased it.
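The fallback order described in this reply can be sketched as a small selection function. This is a hypothetical illustration only: the function name and its two boolean parameters are assumptions and do not correspond to actual Modin code; the returned labels reuse the ones from the benchmark script in the PR description.

```python
def choose_pivot_table_impl(can_use_map_reduce: bool, can_use_range_part: bool) -> str:
    """Return the label of the first applicable implementation (hypothetical helper)."""
    # 1. MapReduce is tried first.
    if can_use_map_reduce:
        return "map_reduce"
    # 2. Range-partitioning is the next choice.
    if can_use_range_part:
        return "range_part"
    # 3. The full-column implementation is the last resort.
    return "full_axis"


print(choose_pivot_table_impl(True, True))
print(choose_pivot_table_impl(False, True))
print(choose_pivot_table_impl(False, False))
```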

anmyachev

Could you also rebase on master, to be sure that the new tests pass?

anmyachev

LGTM!

dchigarev marked this pull request as ready for review March 12, 2024 12:21

dchigarev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev and a team as code owners March 12, 2024 12:21

anmyachev reviewed Mar 13, 2024

anmyachev previously approved these changes Mar 13, 2024

dchigarev added 7 commits March 14, 2024 10:53

FEAT-modin-project#7047: Add range-partitioning implementation for '.pivot_table()'

1e6cdc4

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

fix failing tests

1e1fd4e

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

enable range-part impl by default

b0179c0

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

move common logic to qc level

02cd0c5

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

remove unnecessary things

3f83de4

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

raise 'no group keys' when needed

af81713

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

add more tests

fa316e1

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev dismissed anmyachev’s stale review via fa316e1 March 14, 2024 10:50

dchigarev force-pushed the pivot_rg branch from 7566ebe to fa316e1 Compare March 14, 2024 10:50

anmyachev approved these changes Mar 14, 2024

YarShev approved these changes Mar 14, 2024

YarShev merged commit 93b4e2a into modin-project:master Mar 14, 2024
37 checks passed
