-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#6935: Fix Merge failed when right operand is an empty dataframe #6941
Conversation
@@ -1120,6 +1120,9 @@ def to_pandas_remote(df, partition_shape, *dfs): | |||
(df,) + dfs, partition_shape, called_from_remote=True | |||
) | |||
|
|||
if partitions.size == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arunjose696 please add also a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote the test, the fix is slightly complicated as the broadcasted right dataframes being empty will not be reconstructed as the partitions would be empty (Creating a new empty datatframe does not work as the column information is not available).
I assume I would need to modify broadcast_apply_full_axis
logic as other modin operations eg. df.compare(df2)
which use broadcast_apply_full_axis would also fail if the second dataframe is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of empty partitions, could we also send data about the indexes that we store at the dataframe level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried an approach with creating a new partition (with indexes and columns from dataframe level)in broadcast_apply_full axis if the dataframe does not have partitions, this seems to work for the test case.
3fbce22
to
f2198db
Compare
@@ -427,6 +427,23 @@ def empty(cls): | |||
""" | |||
return cls.put(pandas.DataFrame(), 0, 0) | |||
|
|||
@classmethod | |||
def partition_from_data(cls, df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty clear to use the old put
method I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this for verbosity , will remove.
98a3d7c
to
1202f53
Compare
@arunjose696 please rebase |
…ty dataframe Signed-off-by: arunjose696 <arunjose696@gmail.com>
1202f53
to
408032d
Compare
empty_partition = self._partition_mgr_cls.create_partition_from_data( | ||
pandas.DataFrame(index=df.index, columns=df.columns) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should make this as a part of combine
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave out create_partition_from_data. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean leave it as is? Then this will only work for broadcast_apply_full
function; if we want to use this approach with other functions, we will have to duplicate this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to leave the create_partition_from_data
function as is. The issue is indeed in broadcast_apply_full_axis
as we dont send any data to remote function for empty partition dataframe, so I assume the issue would be better addressed in this function itself.
The combine
function is called only in merge()/join() so even if we use the logic there to create a empty partition(with index and columns), the broadcast_apply_full_axis
function would fail for other operations eg df.compare(df2)
,
To use this approach in other functions, the alternative I think would be to make get_partitions() a method of dataframe(and let it handle the case if partitions are empty) class would this be fine ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To use this approach in other functions, the alternative I think would be to make get_partitions() a method of dataframe(and let it handle the case if partitions are empty) class would this be fine ?
This sounds good to me. We could do get_partitions
(I would name it as _extract_partitions) as a method of the PandasDataframe class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
empty_partition = self._partition_mgr_cls.create_partition_from_data( | ||
pandas.DataFrame(index=df.index, columns=df.columns) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty_partition = self._partition_mgr_cls.create_partition_from_data( | |
pandas.DataFrame(index=df.index, columns=df.columns) | |
) | |
return np.array([[self._partition_mgr_cls._partition_class.put( | |
pandas.DataFrame(index=df.index, columns=df.columns) | |
]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
committed this, for now but I think the readability would be better with using empty_partition
or empty_data_partition
as a variable and returning the variable. Is it a better practice to not create a variable and return directly as this would be slightly confusing?
Also wouldnt using self._partition_mgr_cls._partition_class be using the private attributes in a different class, wouldnt this be a bad practice?
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
@@ -1120,6 +1137,9 @@ def to_pandas_remote(df, partition_shape, *dfs): | |||
(df,) + dfs, partition_shape, called_from_remote=True | |||
) | |||
|
|||
if partitions.size <= 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put this at the beggining of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -175,6 +175,23 @@ def preprocess_func(cls, map_func): | |||
|
|||
# END Abstract Methods | |||
|
|||
@classmethod | |||
def create_partition_from_data(cls, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I keep this function as is or else we would have to use self._partition_mgr_cls._partition_class.put
which would be calling a private function from a private attribute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename this method to create_partition_from_metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have renamed and changed arguments to take in a metadata dict as the name would suggest it accepts metadata.
5238961
to
0cc6963
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @arunjose696
0cc6963
to
11eb8ab
Compare
11eb8ab
to
82607a5
Compare
…ty dataframe (modin-project#6941) Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: arunjose696 <arunjose696@gmail.com>
…ty dataframe (modin-project#6941) Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: arunjose696 <arunjose696@gmail.com>
What do these changes do?
Fixing corner case when partitions are empty for merge.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date