SNOW-1458130 Implement Index.sort_values #1901

Merged
sfc-gh-vbudati merged 8 commits into main from vbudati/SNOW-1458130-index-sort-values
Jul 16, 2024

Contributor

@sfc-gh-vbudati sfc-gh-vbudati commented Jul 10, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1458130

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

    Added support for Index.sort_values. Currently, the key parameter is not supported in Snowpark pandas.

One notable feature of Index.sort_values is the return_indexer option. When it is set, the method also returns a NumPy array that holds, for each position in the sorted Index, the position that value had in the original Index.

I reused the sort_index implementation in the query compiler, since the logic is identical, and added a new parameter, include_indexer, to retrieve the indexer used to sort the Index values. It is important to note that pandas uses quicksort (which Snowpark pandas does not support yet), while Snowpark pandas uses a stable sort. The two can order equal values differently, so the indexer returned by Snowpark pandas is not guaranteed to match the one returned by native pandas.

Examples:

>>> res = pd.Index([1, 3, 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 3]).sort_values(return_indexer=True)
>>> res[0]
Index([1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype='int64')
>>> res[1]
array([ 0,  2,  3,  4,  6,  7,  8,  9, 10, 11,  1,  5, 12])
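
For illustration only, the indexer in the example above can be reproduced in plain Python, whose built-in sorted() is stable. This is a minimal sketch of what a stable-sort indexer looks like, not the query compiler implementation; the helper name stable_sort_indexer is hypothetical.

```python
def stable_sort_indexer(values):
    """Return the original positions of `values` in ascending sorted order.

    Hypothetical helper for illustration. Python's sorted() is stable, so
    equal values keep their original relative order.
    """
    return sorted(range(len(values)), key=lambda i: values[i])


values = [1, 3, 1, 1, 1, 3, 1, 1, 1, 2, 2, 2, 3]
indexer = stable_sort_indexer(values)

# Taking the values in indexer order yields the sorted sequence.
sorted_values = [values[i] for i in indexer]

print(indexer)
print(sorted_values)
```

With a stable sort, the positions of tied values always appear in increasing order, which is why the three 3s map back to positions 1, 5, 12 in that order.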

>>> idx = pd.Index([1, 2, 3, 2, 3, 5, 6, 7, 8, 4, 4, 5, 6, 7, 1, 2, 1, 2, 3, 4, 3, 4, 5, 6, 7])
>>> res = idx.sort_values(return_indexer=True)
# Both Snowpark pandas and native pandas return the correct Index.
>>> res[0]
Index([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8], dtype='int64')
# But the indexer order returned in this example differs since Snowpark pandas uses stable sort
# while native pandas uses quicksort.
>>> res[1]
# Snowpark pandas
array([0, 14, 16, 1, 3, 15, 17, 2, 4, 18, 20, 9, 10, 19, 21, 5, 11, 22, 6, 12, 23, 7, 13, 24, 8])
                  ^ differs
# Native pandas
array([0, 14, 16, 15, 17, 3, 1, 18, 2, 4, 20, 21, 9, 10, 19, 5, 22, 11, 12, 23, 6, 7, 13, 24, 8])
                   ^ differs
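
Since the two indexers above differ only in how ties are ordered, one way to compare them without depending on tie order is to check that each is a permutation that sorts the original values. The sketch below uses plain Python and a hypothetical helper, is_valid_sort_indexer; it is not taken from this PR's test suite.

```python
def is_valid_sort_indexer(values, indexer):
    """Return True if `indexer` is a permutation of positions that sorts `values`.

    Hypothetical helper for illustration; assumes ascending order.
    """
    # The indexer must use every position exactly once.
    if sorted(indexer) != list(range(len(values))):
        return False
    # Taking values in indexer order must give the sorted values.
    return [values[i] for i in indexer] == sorted(values)


idx_values = [1, 2, 3, 2, 3, 5, 6, 7, 8, 4, 4, 5, 6, 7, 1, 2, 1, 2, 3, 4, 3, 4, 5, 6, 7]
snowpark_indexer = [0, 14, 16, 1, 3, 15, 17, 2, 4, 18, 20, 9, 10, 19, 21,
                    5, 11, 22, 6, 12, 23, 7, 13, 24, 8]
native_indexer = [0, 14, 16, 15, 17, 3, 1, 18, 2, 4, 20, 21, 9, 10, 19,
                  5, 22, 11, 12, 23, 6, 7, 13, 24, 8]

both_valid = (is_valid_sort_indexer(idx_values, snowpark_indexer)
              and is_valid_sort_indexer(idx_values, native_indexer))
print(both_valid, snowpark_indexer == native_indexer)
```

Both indexers pass the check even though they are not equal element by element.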

Contributor

@sfc-gh-nkumar sfc-gh-nkumar left a comment


Looks pretty clean. Thanks for adding descriptive comments.

tests/integ/modin/index/test_sort_values.py (outdated, resolved)
src/snowflake/snowpark/modin/plugin/extensions/index.py (outdated, resolved)
@@ -159,7 +159,8 @@ Methods
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+
 | ``searchsorted`` | N | | |
 +-----------------------------+---------------------------------+----------------------------------+----------------------------------------------------+
-| ``sort_values`` | N | | |
+| ``sort_values`` | P | key | Snowpark pandas currently uses stable sort when |
+| | | | sorting the index values. Pandas uses quicksort. |
Contributor


Does this comment also apply to Series.sort_values and DataFrame.sort_values? If so, then we should update their docs too. It can be in a separate PR though.

Contributor Author


Yeah! Should apply there as well, I'll make a separate PR for this

Contributor

@sfc-gh-helmeleegy sfc-gh-helmeleegy left a comment


LGTM. Thanks, Varnika.

@sfc-gh-vbudati sfc-gh-vbudati merged commit ed7c83a into main Jul 16, 2024
35 checks passed
@sfc-gh-vbudati sfc-gh-vbudati deleted the vbudati/SNOW-1458130-index-sort-values branch July 16, 2024 20:46
@github-actions github-actions bot locked and limited conversation to collaborators Jul 16, 2024
3 participants