-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#6879: Convert the right DF to single partition before broadcasting in query_compiler.merge #6880
FIX-#6879: Convert the right DF to single partition before broadcasting in query_compiler.merge #6880
Conversation
Few issues I have with the approach
Can I have some suggestions on how to improve this? |
b06563a
to
69725c1
Compare
5331555
to
0a872b3
Compare
Might it be possible to call |
To call right.to_pandas on workers we would need to still send the right modin dataframe to worker. Wouldnt it still increase the memory consumption same as that in this case, why would it be better compared to the current approach? |
@arunjose696, I think @anmyachev means calling def force_materialization(self) -> "PandasDataframe":
row_partitions = self._partition_mgr_cls.row_partitions(self._partitions)
col_partition = self._partition_mgr_cls.column_partitions(row_partitions)
new_frame = np.array([col_partition[0].apply(lambda df: df, num_splits=1)])
return new_frame |
I tried this approach. by making a small change, it converts to a single partiton df . However the the memory consumption is increasing for several workers during the force_materialization even though only one remote worker is utilized. I tried checking the memory consumption of workers with below script in which I log the memory consumption in workers before and after the force materialization call. The memory consumption in multiple workers have gone up. force_materialization.py
|
As far as I remember, with this approach it is possible that intermediate partitions are created using method modin/modin/core/dataframe/pandas/partitioning/axis_partition.py Lines 131 to 139 in fe19363
I would like to consider the possibility of creating a pandas dataframe in a worker process, without creating intermediate objects. The closest implementation to what I think is the best solution here is the following code:
|
3880889
to
07522fb
Compare
I have done an implementation which makes use of to_pandas and calling it in remote function, could you check this once. |
7813297
to
d6e62ba
Compare
ff5639d
to
942a2a9
Compare
942a2a9
to
9f216d2
Compare
Signed-off-by: arunjose696 <arunjose696@gmail.com>
9f216d2
to
70cb788
Compare
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
f15250d
to
57b254d
Compare
09f5d6c
to
b6c2f3b
Compare
b6c2f3b
to
25c0904
Compare
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
@anmyachev, any comments? |
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date