-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#6227: Make sure Series.unique()
with pyarrow dtype returns ArrowExtensionArray
#7042
Conversation
…e returns 'ArrowExtensionArray' Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@@ -1936,7 +1936,10 @@ def str_split(self, pat=None, n=-1, expand=False, regex=None): | |||
def unique(self, keep="first", ignore_index=True, subset=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev when combining functions drop_duplicates
and unique
into one, did you take into account that they return different data types? It seems to me that because of this, using qc.unique
in its current form is inconvenient. From my side, the best solution seems to be to split this function into several. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when #7091 drop_duplicates and unique into one, did you take into account that they return different data types
Initially, I haven't seen a problem with it, since the result of both .unique()
and .drop_duplicates()
was wrapped into a pandas dataframe anyway inside of partition class. I didn't want to add another unique-like qc method in order not to bloat QCs api (which is already quite big and sometimes redundant).
An ideal solution I see is to continue wrapping the results with a pandas dataframe inside of partitions and then make the front-end handle the conversion to the desired output type (for example, .drop_duplicates()
returns a DataFrame(result_qc)
and .unique()
returns result_qc.to_values()
). Arrow extension arrays seem to round-trip to pandas with no problems:
>>> arr = pd.array([1, 1, None], dtype="int64[pyarrow]")
>>> type(arr)
<class 'pandas.core.arrays.arrow.array.ArrowExtensionArray'>
>>> arr is pd.DataFrame(arr).squeeze().values
True
subset is None | ||
and ignore_index | ||
and len(self.columns) == 1 | ||
and keep is not False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev If we do not make this change, then the string is stored in the variable (for example, "first"
). Did you do this on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, it's a bug, thanks for the fix!
res = self.to_pandas().squeeze(axis=1) | ||
return res.unique() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev I measured the new approach on small data (using your script from #7091). It seems to work faster than the previous one. Do you have the ability to measure on big sizes? (maybe you still have your env)
Env: windows 11, ray, 8 cores
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will always work faster, since the old full-column approach is basically '.to_pandas() + splitting + putting splits to ray storage', and in your approach it's only .to_pandas().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it will always work faster, since the old full-column approach is basically '.to_pandas() + splitting + putting splits to ray storage', and in your approach it's only .to_pandas().
Yes, but I compare with range-partitioning implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running the same scripts on linux gives different results (range-partitioning is still faster in a lot of cases):
I've been able to reproduce your results though when tried to run the same on windows:
I don't think that the main reasons for the difference is OS here. I think that it's a hardware difference, since for windows runs I've used my own laptop instead of a dedicated machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev thank you very much for running the benchmark!
I don't think that the main reasons for the difference is OS here. I think that it's a hardware difference, since for windows runs I've used my own laptop instead of a dedicated machine.
Interesting detail.
ArrowExtensionArray
Series.unique()
with pyarrow dtype returns ArrowExtensionArray
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
modin/tests/pandas/test_series.py
Outdated
modin_result = modin_series.unique() | ||
pandas_result = pandas_series.unique() | ||
|
||
df_equals(modin_result, pandas_result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use eval_general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewritten
return ( | ||
self.__constructor__(query_compiler=self._query_compiler.unique()) | ||
.modin.to_pandas() | ||
._values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a public method that returns the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I searched but didn't find it (values
and array
are not suitable).
# return self.to_pandas().squeeze(axis=1).unique() works faster | ||
# but returns pandas type instead of query compiler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to perform this in a separate issue? I guess so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date