Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#7283: Introduce MinRowPartitionSize and MinColumnPartitionSize #7284

Merged
merged 3 commits into from
Jun 14, 2024

Conversation

YarShev
Copy link
Collaborator

@YarShev YarShev commented May 21, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Introduce MinRowPartitionSize and MinColumnPartitionSize to make partitioning more flexible #7283
  • tests passing
  • module layout described at docs/development/architecture.rst is up-to-date

modin/config/envvars.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @YarShev

Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good, only one design question and a couple of small ones left.

@@ -654,6 +654,14 @@ def get(cls) -> int:
-------
int
"""
from modin.error_message import ErrorMessage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential solution could be to change the class type to tuple, this would localize the partition data in one place and save us from an extra class.

Copy link
Collaborator Author

@YarShev YarShev May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing this we force the user to specify both sizes at a time - per row and col, while the user may want to specify/adjust a single axis only. I would leave as is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing this value, the user probably already knows the default values. How else will he start from something to select new values? There should be no problem specifying two values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user is not forced to know the default values - this is mainly for us to get ultimate performance in general. The user may know the number of rows/cols in advance and may specify fine-grained or coarse-grained partitioning. To me, specifying an additional number looks like extra burden if I don't care about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, specifying an additional number looks like extra burden if I don't care about it.

In each particular place, depending on the context, you thought about which of the two classes to use (this is the burden of choice). Although you can transfer the selection of the desired dimension to a lower-level function, which will make the selection automatically based on the axis parameter.

Changing one of the dimensions might look like this (a bit longer, but still simple):

sizes = MinPartitionSizes.get()
MinPartitionSizes.put((new_value, sizes[1]))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep the things exposed to users as much as simple. I think very few people are going to dig into the internals.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without knowing this information, it is not clear how to use these configurations, so I still believe that this is no more complicated than the current approach.

Even with a single environment variable, it's hard to imagine the user specifying a value without looking at the one we use by default. It turns out that user need to look in both situations and in this sense they are equivalent in complexity.

Also default values can be found in https://modin.readthedocs.io/en/latest/flow/modin/config.html#modin-configs-list

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason by which I think we should keep both MinRow/ColumnPartitionSize is that the user doesn't have to know how to distribute data along a single axis and leave this to Modin itself. Consider a case with a narrow or wide dataframe. If the user has a narrow df, say 10 columns, he/she can set MinColumnPartitionSize to 1 and leave distribution along rows to Modin, which automatically finds a better MinRowPartitionSize if the default one doesn't suit. I stick to keep the current approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stick to keep the current approach.

@YarShev I still think another solution would be better, but since the majority here is for your solution, please update the PR and we'll merge it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +122 to +124
min_block_size=(
MinRowPartitionSize.get() if axis == 0 else MinColumnPartitionSize.get()
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks more natural, doesn't it?

Suggested change
min_block_size=(
MinRowPartitionSize.get() if axis == 0 else MinColumnPartitionSize.get()
),
min_block_sizes=MinPartitionSizes.get(),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We split a df along a single axis. I don't understand why we have to pass a tuple of row and col axis sizes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function has the value of the axis parameter; the function itself will select the desired value from the tuple using it.

modin/tests/core/storage_formats/pandas/test_internals.py Outdated Show resolved Hide resolved
modin/conftest.py Show resolved Hide resolved
YarShev and others added 3 commits June 14, 2024 18:29
…artitionSize

Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
@anmyachev anmyachev merged commit 061a934 into modin-project:main Jun 14, 2024
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce MinRowPartitionSize and MinColumnPartitionSize to make partitioning more flexible
4 participants