
UserWarning: Distributing <class 'NoneType'> object. This may take some time. #6574

Closed
anmyachev opened this issue Sep 18, 2023 · 11 comments · Fixed by #7323
Labels
Code Quality 💯 Improvements or issues to improve quality of codebase good first issue 🔰 Good for newcomers P2 Minor bugs or low-priority feature requests

Comments

@anmyachev
Collaborator

I think that we should not give this warning if the data that will be placed in the distribution storage is small in size.

@anmyachev anmyachev added Code Quality 💯 Improvements or issues to improve quality of codebase P2 Minor bugs or low-priority feature requests labels Sep 18, 2023
@mvashishtha mvashishtha added the good first issue 🔰 Good for newcomers label Nov 7, 2023
@Jayson729
Contributor

Hello, I'm interested in taking this as my first issue. Would that be okay? I'm assuming I would just need to add a size check in dataframe.py and series.py where the warning is issued; is that correct? Also, what would be a good size to check for?

@anmyachev
Collaborator Author

Hi @Jayson729! That would be great!

I'm assuming I would just need to add a size check in dataframe.py and series.py where the warning is, is that correct?

Correct.

Also, what would be a good size to check for?

We can assume that if constructing the pandas DataFrame itself took less than half a second, then placing it into distributed storage will not take long enough to be worth informing the user about. This way we also avoid having to model how the execution time of the from_pandas function depends on the DataFrame's shape and dtypes.

That is, the warning should be created between the construction of the pandas DataFrame and the call to the from_pandas function, and only when constructing the DataFrame takes more than half a second. For the measurement you can use the time function (from time import time).

Note: half a second is an empirical value that we can change if necessary, but it reflects my current experience.
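The idea above could be sketched roughly as follows. This is a minimal standalone illustration, not Modin's actual code: construct_with_warning and the warning text mirror the thread's description, and the half-second threshold is the empirical value mentioned here.

```python
from time import time
import warnings

import pandas

SLOW_INIT_THRESHOLD = 0.5  # empirical half-second value from this thread


def construct_with_warning(data, **kwargs):
    # Time the pandas constructor; if construction was slow, the later
    # distribution step is likely to be slow too, so warn up front.
    start = time()
    pandas_df = pandas.DataFrame(data, **kwargs)
    if time() - start > SLOW_INIT_THRESHOLD:
        warnings.warn(
            f"Distributing {type(data)} object. This may take some time."
        )
    # ... Modin would call from_pandas(pandas_df) here ...
    return pandas_df
```

For small inputs the constructor finishes well under the threshold, so no warning is emitted.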

@Jayson729
Contributor

Hey @anmyachev,

I just want to make sure I understand everything correctly. Your idea for the timing is to put the check in this location (dataframe.py#L255), correct?

If so, then from some testing it seems the time to create the DataFrame varies wildly with the datatype of the data passed. With NumPy ndarrays, most of the time goes into converting the pandas DataFrame to a Modin DataFrame, while with Python lists most of the time goes into creating the pandas DataFrame itself.

This is what I used to test this (at dataframe.py#L255):

from time import time

start = time()
pandas_df = pandas.DataFrame(
    data=data, index=index, columns=columns, dtype=dtype, copy=copy
)
print(f'Time creating DataFrame: {time() - start}')

start = time()
self._query_compiler = from_pandas(pandas_df)._query_compiler
print(f'Time converting Pandas DataFrame to Modin: {time() - start}')

and then in another file this is how the DataFrames were created, with some other tests not included in this:

print('30000 x 30000 ndarray')
pd.DataFrame(np.random.rand(30000, 30000))

print('10000 x 10000 python list')
pd.DataFrame([[randint(-10000, 10000) for _ in range(10000)] for _ in range(10000)])

and then here's what was printed:

30000 x 30000 ndarray
Time creating DataFrame: 0.00010061264038085938
Time converting Pandas DataFrame to Modin: 28.473838806152344

10000 x 10000 python list
Time creating DataFrame: 19.381755113601685
Time converting Pandas DataFrame to Modin: 0.745363712310791

Because of this, I don't think construction time is a good proxy for size. Is my thought process correct? If so, what would be a better metric for checking size? If not, I'd appreciate you letting me know what I'm misunderstanding. Thanks a lot for your help!

@anmyachev
Collaborator Author

Thanks for the numbers! Clearly construction time is not a reliable signal.

Let's now try using the shape of the created pandas DataFrame, without paying attention to the dtypes (even those that do not support zero-copy). Maybe df.shape < (10_000, 16) as a starting point. We can check cases where almost all data is numeric, and the edge case where all columns are strings (of reasonable length, say no more than 250 characters).
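One caveat worth spelling out: in Python, comparing tuples such as df.shape < (10_000, 16) is lexicographic, which is probably not the intent here. A hypothetical helper (is_small_shape is not part of Modin, just an illustration) that checks each dimension separately might look like this:

```python
def is_small_shape(shape, max_rows=10_000, max_cols=16):
    """Elementwise smallness check for a pandas shape tuple.

    Note: a plain tuple comparison like ``shape < (10_000, 16)`` is
    lexicographic, so e.g. ``(5, 1_000_000) < (10_000, 16)`` is True
    because 5 < 10_000; comparing each dimension avoids that surprise.
    """
    nrows, ncols = shape
    return nrows < max_rows and ncols < max_cols
```

So a wide-but-short frame like (5, 1_000_000) is correctly treated as not small.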

@Jayson729
Contributor

Doesn't this run into a problem if creating the DataFrame itself takes a significant amount of time (such as the Python list example from my last message)? I also tried sys.getsizeof() on the data before it's passed to the DataFrame, but that also seemed fairly unreliable. Currently I'm thinking we either need different size checks per data type, or some way to detect the running time of creating the pandas DataFrame, maybe with another thread. Am I mistaken somewhere, or do you have any other ideas?

@anmyachev
Collaborator Author

Doesn't this run into a problem if the creating the DataFrame itself takes a significant amount of time (such as the python list example from my last message)?

This is definitely not a problem, since that is the time of the pandas constructor itself, and users already have expectations about how long it takes in different cases. My view is that we should only warn when the additional step (putting the data into distributed storage) takes up a significant share of the total time.

@Jayson729
Contributor

Jayson729 commented Jun 6, 2024

That makes complete sense, thanks for explaining.

I tested with both df.shape and df.size. df.size > 100_000_000 seems to be a good threshold on my computer (about 1 second regardless of data type). The growth also seems to be linear: 100_000_000 elements take about 1 second and 400_000_000 take about 4.

I also did some testing in series.py, but distributing a Series rarely took a significant amount of time (a size of 500_000_000 took about 1 second).

I figure we could use df.size < 25_000_000 for dataframe.py and df.size < 100_000_000 in series.py to take into account slower computers. Do these bounds seem reasonable?
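Put together, the check being proposed could look something like the sketch below. This is an illustration of the thread's proposal, not the merged Modin code: maybe_warn_before_distributing is a hypothetical name, the thresholds are the values discussed here, and the threshold parameter exists only to make the sketch testable.

```python
import warnings

import pandas

# Thresholds proposed in this thread; empirical values, may need tuning.
DATAFRAME_SIZE_THRESHOLD = 25_000_000
SERIES_SIZE_THRESHOLD = 100_000_000


def maybe_warn_before_distributing(pandas_obj, threshold=None):
    # Warn only when the object is large enough that putting it into
    # distributed storage is expected to take noticeable time.
    if threshold is None:
        threshold = (
            SERIES_SIZE_THRESHOLD
            if isinstance(pandas_obj, pandas.Series)
            else DATAFRAME_SIZE_THRESHOLD
        )
    if pandas_obj.size >= threshold:
        warnings.warn(
            f"Distributing {type(pandas_obj)} object. This may take some time."
        )
```

Small frames then pass through silently, while anything at or above the threshold triggers the existing warning message.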

@anmyachev
Collaborator Author

I figure we could use df.size < 25_000_000 for dataframe.py and df.size < 100_000_000 in series.py to take into account slower computers. Do these bounds seem reasonable?

At first glance, it looks good!
Just to be sure: how long does the from_pandas function take for these cases?

pd.Series(np.random.randint(1_000_000, size=(100_000_000)), dtype=str)
and
pd.DataFrame(np.random.randint(1_000_000, size=(1_000_000, 25)), dtype=str)

@Jayson729
Contributor

Good catch, these took significantly longer than their integer counterparts. The Series initialization took 17.2 seconds and the DataFrame took 6.2 seconds. A size of 10_000_000 for Series and a size of 5_000_000 for DataFrames both took about a second on my system. So for bounds when dtype=str we could probably use df.size > 2_500_000 for Series and df.size > 1_000_000 for DataFrames. Does this also seem reasonable, and are there any other data types you think we should specifically check for?

Also, sorry for the delay on responding, I was away from my computer so I couldn't test.

@anmyachev
Collaborator Author

So for bounds when dtype=str we could probably use df.size > 2_500_000 for Series and df.size > 1_000_000 for DataFrames.

Based on the timing numbers you shared (thanks!), I'd like to start with these thresholds for all types and see how users like it, and then relax them for some types (higher for numeric types, since they support faster serialization) if the feedback is positive.

@Jayson729
Contributor

Sounds good, I'll try to get the pull request in today. Thanks for all your help!

anmyachev pushed a commit that referenced this issue Jun 19, 2024
…e small (#7323)

Signed-off-by: Jayson Willey <91502167+Jayson729@users.noreply.github.com>
arunjose696 pushed a commit to arunjose696/modin that referenced this issue Jul 11, 2024
…DataFrames are small (modin-project#7323)

Signed-off-by: Jayson Willey <91502167+Jayson729@users.noreply.github.com>