FEAT-#7283: Introduce MinRowPartitionSize and MinColumnPartitionSize #7284

YarShev · 2024-05-21T14:40:41Z

What do these changes do?

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Introduce MinRowPartitionSize and MinColumnPartitionSize to make partitioning more flexible #7283
tests passing
module layout described at docs/development/architecture.rst is up-to-date

modin/config/envvars.py

devin-petersohn

LGTM, thanks @YarShev

anmyachev

Overall it looks good, only one design question and a couple of small ones left.

anmyachev · 2024-05-22T19:41:37Z

modin/config/envvars.py

@@ -654,6 +654,14 @@ def get(cls) -> int:
        -------
        int
        """
+        from modin.error_message import ErrorMessage


One potential solution could be to change the class type to tuple, this would localize the partition data in one place and save us from an extra class.

By doing this we force the user to specify both sizes at a time - per row and col, while the user may want to specify/adjust a single axis only. I would leave as is.

By changing this value, the user probably already knows the default values. How else will he start from something to select new values? There should be no problem specifying two values.

The user is not forced to know the default values - this is mainly for us to get ultimate performance in general. The user may know the number of rows/cols in advance and may specify fine-grained or coarse-grained partitioning. To me, specifying an additional number looks like extra burden if I don't care about it.

To me, specifying an additional number looks like extra burden if I don't care about it.

In each particular place, depending on the context, you thought about which of the two classes to use (this is the burden of choice). Although you can transfer the selection of the desired dimension to a lower-level function, which will make the selection automatically based on the axis parameter.

Changing one of the dimensions might look like this (a bit longer, but still simple):

sizes = MinPartitionSizes.get() MinPartitionSizes.put((new_value, sizes[1]))

We should keep the things exposed to users as much as simple. I think very few people are going to dig into the internals.

Without knowing this information, it is not clear how to use these configurations, so I still believe that this is no more complicated than the current approach.

Even with a single environment variable, it's hard to imagine the user specifying a value without looking at the one we use by default. It turns out that user need to look in both situations and in this sense they are equivalent in complexity.

Also default values can be found in https://modin.readthedocs.io/en/latest/flow/modin/config.html#modin-configs-list

The main reason by which I think we should keep both MinRow/ColumnPartitionSize is that the user doesn't have to know how to distribute data along a single axis and leave this to Modin itself. Consider a case with a narrow or wide dataframe. If the user has a narrow df, say 10 columns, he/she can set MinColumnPartitionSize to 1 and leave distribution along rows to Modin, which automatically finds a better MinRowPartitionSize if the default one doesn't suit. I stick to keep the current approach.

I stick to keep the current approach.

@YarShev I still think another solution would be better, but since the majority here is for your solution, please update the PR and we'll merge it.

anmyachev · 2024-05-22T19:49:13Z

modin/core/storage_formats/pandas/parsers.py

+        min_block_size=(
+            MinRowPartitionSize.get() if axis == 0 else MinColumnPartitionSize.get()
+        ),


Looks more natural, doesn't it?

Suggested change

min_block_size=(

MinRowPartitionSize.get() if axis == 0 else MinColumnPartitionSize.get()

),

min_block_sizes=MinPartitionSizes.get(),

We split a df along a single axis. I don't understand why we have to pass a tuple of row and col axis sizes.

The function has the value of the axis parameter; the function itself will select the desired value from the tuple using it.

modin/tests/core/storage_formats/pandas/test_internals.py

modin/conftest.py

…artitionSize Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>

YarShev requested review from devin-petersohn, mvashishtha, RehanSD, vnlitvinov, anmyachev, dchigarev and a team as code owners May 21, 2024 14:40

anmyachev mentioned this pull request May 21, 2024

Increase the default value for MinPartitionSize variable #6384

Open

anmyachev reviewed May 22, 2024

View reviewed changes

modin/config/envvars.py Outdated Show resolved Hide resolved

devin-petersohn previously approved these changes May 22, 2024

View reviewed changes

anmyachev reviewed May 22, 2024

View reviewed changes

YarShev dismissed devin-petersohn’s stale review via 30f2b62 May 23, 2024 13:57

YarShev and others added 3 commits June 14, 2024 18:29

FEAT-modin-project#7283: Introduce MinRowPartitionSize and MinColumnP…

3958190

…artitionSize Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

Address comment

dc3e380

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>

Update modin/tests/core/storage_formats/pandas/test_internals.py

6767481

Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>

YarShev force-pushed the dev/yigoshev-issue7283 branch from 30f2b62 to 6767481 Compare June 14, 2024 18:30

anmyachev approved these changes Jun 14, 2024

View reviewed changes

anmyachev merged commit 061a934 into modin-project:main Jun 14, 2024
37 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FEAT-#7283: Introduce MinRowPartitionSize and MinColumnPartitionSize #7284

FEAT-#7283: Introduce MinRowPartitionSize and MinColumnPartitionSize #7284

YarShev commented May 21, 2024

devin-petersohn left a comment

anmyachev left a comment

anmyachev May 22, 2024

YarShev May 23, 2024 •

edited

Loading

anmyachev May 23, 2024

YarShev May 23, 2024

anmyachev May 23, 2024

YarShev May 23, 2024

anmyachev May 24, 2024

YarShev May 24, 2024

anmyachev Jun 14, 2024

YarShev Jun 14, 2024

anmyachev May 22, 2024

YarShev May 23, 2024

anmyachev May 23, 2024

FEAT-#7283: Introduce MinRowPartitionSize and MinColumnPartitionSize #7284

FEAT-#7283: Introduce MinRowPartitionSize and MinColumnPartitionSize #7284

Conversation

YarShev commented May 21, 2024

What do these changes do?

devin-petersohn left a comment

Choose a reason for hiding this comment

anmyachev left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

YarShev May 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

YarShev May 23, 2024 •

edited

Loading