-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#7100: Add range-partitioning impl for 'nunique()' #7101
Conversation
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@pytest.mark.parametrize("axis", axis_values, ids=axis_keys) | ||
@pytest.mark.parametrize("data", test_data_values, ids=test_data_keys) | ||
@pytest.mark.parametrize("dropna", [True, False]) | ||
def test_nunique(data, axis, dropna): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests for df.nunique()
were missing
MODIN_RANGE_PARTITIONING_GROUPBY=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_groupby.py | ||
MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_series.py -k "test_nunique" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section becomes a common place for range-partitioning testing since #7091
unsupported_message = "" | ||
if axis != 0: | ||
unsupported_message += ( | ||
"Range-partitioning 'nunique()' is only supported for 'axis=0'.\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it difficult to support axis=1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation of range-partitioning mechanism, it's impossible. Rewriting the whole mechanism just for this method doesn't seem reasonable
if len(unsupported_message) > 0: | ||
message = ( | ||
f"Can't use range-partitioning implementation for 'nunique' because:\n{unsupported_message}" | ||
+ "Falling back to a full-axis implementation." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ "Falling back to a full-axis implementation." | |
+ "Falling back to a Reduce implementation." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed 'full-axis' -> 'full-axis reduce'
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Would this be a problem if there were identical rows in different bins? |
It would be a problem, but this can't happen by design. Identical rows (not identical values of a column) will always be in the same bin. |
What do these changes do?
Adds range-partitioning implementation for
.nunique()
method. Unfortunately, range-partitioning in.nunique()
can be used only for 1D data (pd.Series
and 1-column DataFrames). The reason is that we need each column to be the 'key' in range-partitioning, which is not possible even if we use several columns to build bins:This is not a problem for other methods, since there we want to extract unique rows rather than unique values in each column independently.
perf measurements
script to measure
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
.nunique()
#7100docs/development/architecture.rst
is up-to-date