FEAT-#5394: Reduce amount of remote calls for Map operator #7136

Retribution98 · 2024-03-28T13:21:24Z

What do these changes do?

This PR implements the simple method proposed in the issue:
before every Map call, check the partitioning; if there are too many partitions, apply the function along the row or column axis instead, so that the number of remote calls equals the number of row or column partitions (fewer than the total number of partitions).
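To make the saving concrete, here is a minimal sketch of the remote-call count for each way of applying the function. The name `remote_calls` and the mode labels are hypothetical and are not part of Modin's API; the sketch only does the arithmetic described above.

```python
def remote_calls(n_row_parts: int, n_col_parts: int, mode: str) -> int:
    """Count the remote tasks needed to map a function over a partition grid.

    Hypothetical helper for illustration only; not a Modin function.
    """
    if mode == "block":
        # One task per individual partition.
        return n_row_parts * n_col_parts
    if mode == "row":
        # Partitions in the same row are joined: one task per row partition.
        return n_row_parts
    if mode == "column":
        # Partitions in the same column are joined: one task per column partition.
        return n_col_parts
    raise ValueError(f"unknown mode: {mode!r}")


# A square-like 8 x 8 grid of partitions.
block_calls = remote_calls(8, 8, "block")
row_calls = remote_calls(8, 8, "row")
column_calls = remote_calls(8, 8, "column")
```

For an 8 x 8 grid this is 64 block-wise calls against 8 calls along either axis.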

However, this approach is slow when the DataFrame has few column partitions and many row partitions (many more than the CPU count), because Modin then uses only a few remote tasks, possibly just one.
To solve this, a second strategy was implemented: when column partitions are used to reduce the number of remote tasks, we try to split them so that all processors are kept busy. If such a split is possible, the new implementation is used; otherwise we fall back to the simple method.
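The second strategy can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `plan_column_chunks`, its return shape, and the rule `cpu_count // n_col_parts` for deciding how many pieces each column partition is cut into are all hypothetical.

```python
import math
from typing import List, Optional, Tuple


def plan_column_chunks(
    n_row_parts: int, n_col_parts: int, cpu_count: int
) -> Optional[List[Tuple[int, int, int]]]:
    """Plan remote tasks as (column, row_start, row_stop) chunks.

    Returns None when splitting the column partitions would not help,
    signalling a fallback to the simple axis-wise method.
    Hypothetical helper for illustration only; not a Modin function.
    """
    # Assumption: aim for roughly one task per CPU.
    column_splits = cpu_count // n_col_parts
    if column_splits <= 1:
        return None
    # Number of row partitions joined together inside one task.
    step = math.ceil(n_row_parts / column_splits)
    return [
        (col, start, min(start + step, n_row_parts))
        for col in range(n_col_parts)
        for start in range(0, n_row_parts, step)
    ]


# 32 row partitions, 2 column partitions, 8 CPUs.
tasks = plan_column_chunks(32, 2, 8)
# As many column partitions as CPUs: no room to split, so fall back.
fallback = plan_column_chunks(32, 8, 8)
```

With 32 x 2 partitions and 8 CPUs, the plain column-wise method would issue 2 tasks, while this plan issues 8 tasks of 8 row partitions each.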

- first commit message and PR title follow format outlined here
  - NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- Resolves Reduce amount of remote calls for square-like dataframes #5394
- tests added and passing
- module layout described at docs/development/architecture.rst is up-to-date

YarShev · 2024-04-10T11:09:13Z

Is this PR ready for review?

anmyachev · 2024-05-02T10:03:32Z

@Retribution98 I see your graphs above, but I don’t really understand what the axes mean. Please label them.

Retribution98 · 2024-05-02T16:17:41Z

> @Retribution98 I see your graphs above, but I don’t really understand what the axes mean. Please label them.

@anmyachev Thanks, updated it.

anmyachev

LGTM!

anmyachev · 2024-05-02T18:52:44Z

modin/tests/core/storage_formats/pandas/test_internals.py

+        nrows = MinPartitionSize.get() * CpuCount.get() * 2
+        data = {f"col{i}": np.ones(nrows) for i in range(ncols)}
+        df = pd.DataFrame(data)
+        partitions = df._query_compiler._modin_frame._partitions


Instead of partition_manager_class?

Suggested change

-        partitions = df._query_compiler._modin_frame._partitions
+        partitions = df._query_compiler._modin_frame._partitions
+        partition_mgr_cls = df._query_compiler._modin_frame._partition_mgr_cls

anmyachev · 2024-05-02T18:56:57Z

modin/tests/core/storage_formats/pandas/test_internals.py

+
+
+def test_map_partitions_joined_by_column():
+    # Set the config to 'True' inside of the context-manager


What does it mean?

anmyachev · 2024-05-02T19:00:35Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

+        kw = {
+            "num_splits": step,
+        }
+        result = np.empty(partitions.shape, dtype=cls._partition_class)


These are equivalent actions, but let's make it more explicit.

Suggested change

-        result = np.empty(partitions.shape, dtype=cls._partition_class)
+        result = np.empty(partitions.shape, dtype=object)
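The reviewer's point that the two calls are equivalent can be checked with a small NumPy snippet. `Partition` here is a hypothetical stand-in for a plain Python class and is not Modin's partition class; the sketch assumes the class defines no `dtype` attribute of its own.

```python
import numpy as np


class Partition:
    """Hypothetical stand-in for a partition class (not Modin's)."""


# NumPy has no native dtype for an arbitrary Python class,
# so it falls back to the generic object dtype.
by_class = np.empty((2, 3), dtype=Partition)
by_object = np.empty((2, 3), dtype=object)

same_dtype = by_class.dtype == by_object.dtype
```

Passing `dtype=object` directly gives the same array type and makes the intent visible to the reader.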

anmyachev · 2024-05-03T10:24:08Z

@YarShev any more comments?

YarShev

@Retribution98, LGTM, thanks!

Retribution98 requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev, dchigarev and a team as code owners March 28, 2024 13:21

Retribution98 marked this pull request as draft March 28, 2024 13:21

YarShev reviewed Mar 28, 2024

modin/core/dataframe/pandas/dataframe/dataframe.py (outdated, resolved)

YarShev reviewed Mar 28, 2024

modin/core/dataframe/pandas/dataframe/dataframe.py (outdated, resolved)

github-advanced-security bot found potential problems Apr 4, 2024

check_perfomance.py (fixed)

github-advanced-security bot found potential problems Apr 4, 2024

check_perfomance.py (fixed)

Retribution98 marked this pull request as ready for review April 10, 2024 11:25

Retribution98 commented Apr 10, 2024

modin/core/dataframe/pandas/partitioning/partition_manager.py (outdated, resolved)

Retribution98 force-pushed the feat_5394 branch 2 times, most recently from 3c2f61e to 8917466 Compare April 11, 2024 08:43

arunjose696 reviewed Apr 11, 2024

modin/core/dataframe/pandas/partitioning/partition_manager.py (outdated, resolved)

arunjose696 reviewed Apr 11, 2024

modin/core/dataframe/pandas/dataframe/dataframe.py (resolved)

arunjose696 reviewed Apr 11, 2024

modin/core/dataframe/pandas/dataframe/dataframe.py (outdated, resolved)

arunjose696 reviewed Apr 11, 2024

modin/core/dataframe/pandas/partitioning/partition_manager.py (outdated, resolved)

Retribution98 added 6 commits April 16, 2024 02:57

FEAT-modin-project#5394: Reduce amount of remote calls for Map operator

950826e

Signed-off-by: Kirill Suvorov <kirill.suvorov@intel.com>

Check new implementation

cb45c0a

some updates

f144e4e

close files

fcd37c2

Prepare changes for code review

c5d2ca7

Fix tests

5aa374c

YarShev reviewed Apr 16, 2024

Retribution98 force-pushed the feat_5394 branch from 8917466 to 5aa374c Compare April 16, 2024 12:03

Retribution98 changed the base branch from master to main April 29, 2024 12:29

Retribution98 force-pushed the feat_5394 branch from 7874189 to a6dd5e7 Compare April 29, 2024 12:30

YarShev reviewed Apr 29, 2024

modin/tests/pandas/dataframe/test_map_metadata.py (outdated, resolved)

anmyachev reviewed Apr 29, 2024

Retribution98 added 2 commits April 30, 2024 14:30

create internal test

0ba20ea

remove extra comment

85597cc

YarShev reviewed Apr 30, 2024

a litle refactoring

f195174

anmyachev previously approved these changes May 2, 2024

Apply suggestion

5ae086c

Retribution98 dismissed anmyachev’s stale review via 5ae086c May 3, 2024 10:42

anmyachev approved these changes May 3, 2024

YarShev approved these changes May 3, 2024

YarShev merged commit f8bf5b4 into modin-project:main May 3, 2024
38 checks passed

YarShev mentioned this pull request May 3, 2024

Reduce amount of remote calls for square-like dataframes #5394

Closed

Retribution98 mentioned this pull request May 8, 2024

FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators #7245

Merged

7 tasks

This was referenced Aug 15, 2024

FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators furwellness/modin#26

Closed

FEAT-#5394: Reduce amount of remote calls for Map operator furwellness/modin#32

Closed

This was referenced Aug 15, 2024

FEAT-#5394: Reduce amount of remote calls for TreeReduce and GroupByReduce operators furwellness/modin#61

Closed

FEAT-#5394: Reduce amount of remote calls for Map operator furwellness/modin#67

Closed
