Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#7039: Pass scalar dtype as is to astype query compiler #7152

Merged
merged 5 commits into from
Apr 11, 2024

Conversation

arunjose696
Copy link
Collaborator

@arunjose696 arunjose696 commented Apr 5, 2024

What do these changes do?

Currently in master, In astype method if the datatype passed is a scalar, the scalar is converted to a dictionary with the keys as column names and values the scalar dtype value for all keys. This causes the exception message to be different in pandas vs modin as the function passed to remote function applies astype with the dictionary instead of scalar.

This PR passes the scalar dtypes as is to QC layer and the scalar is used in remote function for astype. This would also avoid unnecessarily iterating over the column names if dtype is a scalar value.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves test_series.py::test_astype failed due to different exception messages #7039?
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@arunjose696
Copy link
Collaborator Author

arunjose696 commented Apr 8, 2024

Performance comparison for astype with scalar dtype

import modin.pandas as pd
import numpy as np
import string
from modin.utils import execute
import time
# Define the number of columns and rows
num_columns = 100000
num_rows = 5
data = {}
for i in range(num_columns):
    data[f'col_{i}'] = np.random.randint(0, 100, num_rows)

# Create DataFrame
df = pd.DataFrame(data)
execute(df)
start = time.time()
df.astype("str")
execute(df)
end = time.time()
print(end-start)

For Master 4.155833005905151 seconds
For Current 0.03567790985107422 seconds

modin/pandas/test/test_series.py Outdated Show resolved Hide resolved
modin/core/dataframe/pandas/dataframe/dataframe.py Outdated Show resolved Hide resolved
modin/pandas/base.py Outdated Show resolved Hide resolved
if isinstance(new_dtype, pandas.CategoricalDtype):
new_dtypes[column] = LazyProxyCategoricalDtype._build_proxy(
new_dtypes[:] = LazyProxyCategoricalDtype._build_proxy(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to use the same object in multiple elements of a series?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it would be the same value for dtype for all elements of the series, What would be an issue using same object?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we store a reference to the same object in several elements of the series, and if so, then changing one element will entail changing all elements, which can easily lead to errors. However, if this is not the case, then there is no problem.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we only store dtype values to new_dtypes[:] I think it should be okay as the cases where values of dtypes itself being mutated inplace are very unlikely. If we reassign the value for dtypes eg with new_dtypes[0]="str" this would also work fine so I think we can keep this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering also that this class does not have inplace operations other than materialization, then ok.

# Actual parent will substitute `None` at `.set_dtypes_cache`
parent=None,
column_name=column,
column_name=new_dtypes.index,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument can only be a string, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this should take only string have made, have made changes to use apply() function and pass index of the column as a string.

modin/pandas/base.py Outdated Show resolved Hide resolved
modin/core/dataframe/pandas/dataframe/dataframe.py Outdated Show resolved Hide resolved
@arunjose696 arunjose696 force-pushed the astype branch 2 times, most recently from 7d8a42e to bc7ce11 Compare April 10, 2024 10:28
@YarShev
Copy link
Collaborator

YarShev commented Apr 10, 2024

CI failed, please recheck.

@arunjose696 arunjose696 force-pushed the astype branch 2 times, most recently from b52daf1 to 9621963 Compare April 10, 2024 12:53
@YarShev
Copy link
Collaborator

YarShev commented Apr 10, 2024

CI is failing because of #7167 so let's wait for it to be merged.

@YarShev
Copy link
Collaborator

YarShev commented Apr 10, 2024

@anmyachev, any comments?

@YarShev
Copy link
Collaborator

YarShev commented Apr 11, 2024

@arunjose696, please rebase on current master to make CI pass.

arunjose696 and others added 2 commits April 11, 2024 07:55
…iler

Signed-off-by: arunjose696 <arunjose696@gmail.com>
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
handling categorical type

Signed-off-by: arunjose696 <arunjose696@gmail.com>
@YarShev
Copy link
Collaborator

YarShev commented Apr 11, 2024

rebased

YarShev
YarShev previously approved these changes Apr 11, 2024
@YarShev
Copy link
Collaborator

YarShev commented Apr 11, 2024

@anmyachev, any comments?

YarShev and others added 2 commits April 11, 2024 15:18
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@YarShev YarShev merged commit ad057fa into modin-project:master Apr 11, 2024
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_series.py::test_astype failed due to different exception messages
3 participants