
UserWarning: Distributing <class 'NoneType'> object. This may take some time. #6574

Closed
anmyachev opened this issue Sep 18, 2023 · 11 comments · Fixed by #7323
Labels
Code Quality 💯 Improvements or issues to improve quality of codebase good first issue 🔰 Good for newcomers P2 Minor bugs or low-priority feature requests

Comments

@anmyachev
Collaborator

I think that we should not give this warning if the data that will be placed in the distribution storage is small in size.

@anmyachev anmyachev added Code Quality 💯 Improvements or issues to improve quality of codebase P2 Minor bugs or low-priority feature requests labels Sep 18, 2023
@mvashishtha mvashishtha added the good first issue 🔰 Good for newcomers label Nov 7, 2023
@Jayson729
Contributor

Hello, I'm interested in taking this as my first issue. Would that be okay? I'm assuming I would just need to add a size check in dataframe.py and series.py where the warning is issued; is that correct? Also, what would be a good size to check for?

@anmyachev
Collaborator Author

Hi @Jayson729! That would be great!

I'm assuming I would just need to add a size check in dataframe.py and series.py where the warning is, is that correct?

Correct.

Also, what would be a good size to check for?

We can assume that if constructing the pandas DataFrame itself took less than half a second, then placing it into distributed storage will not take long enough to be worth informing the user about. This way we also avoid having to model how the execution time of the from_pandas function depends on the DataFrame's shape and dtypes.

That is, the warning should be created between the construction of the pandas DataFrame and the call to the from_pandas function, and only when constructing the DataFrame takes more than half a second. For the measurement you can use the time function (from time import time).

Note: half a second is an empirical value that we can change if necessary, but it reflects my current experience.
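The idea above could be sketched roughly as follows. This is a minimal standalone illustration, not Modin's actual code: construct_with_warning and the warning text mirror the thread's description, and the half-second threshold is the empirical value mentioned here.

```python
from time import time
import warnings

import pandas

SLOW_INIT_THRESHOLD = 0.5  # empirical half-second value from this thread


def construct_with_warning(data, **kwargs):
    # Time the pandas constructor; if construction was slow, the later
    # distribution step is likely to be slow too, so warn up front.
    start = time()
    pandas_df = pandas.DataFrame(data, **kwargs)
    if time() - start > SLOW_INIT_THRESHOLD:
        warnings.warn(
            f"Distributing {type(data)} object. This may take some time."
        )
    # ... Modin would call from_pandas(pandas_df) here ...
    return pandas_df
```

For small inputs the constructor finishes well under the threshold, so no warning is emitted.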

@Jayson729
Contributor

Hey @anmyachev,

I just want to make sure I understand everything correctly. Your idea for the timing is to put the check in this location (dataframe.py#L255), correct?

If so, then from some testing it seems the time to create the DataFrame varies wildly with the datatype of the data passed. With NumPy ndarrays, most of the time goes into converting the pandas DataFrame to a Modin DataFrame, while with Python lists most of the time goes into creating the pandas DataFrame itself.

This is what I used to test this (at dataframe.py#L255):

from time import time

start = time()
pandas_df = pandas.DataFrame(
    data=data, index=index, columns=columns, dtype=dtype, copy=copy
)
print(f'Time creating DataFrame: {time() - start}')

start = time()
self._query_compiler = from_pandas(pandas_df)._query_compiler
print(f'Time converting Pandas DataFrame to Modin: {time() - start}')

and then in another file this is how the DataFrames were created, with some other tests not included in this:

print('30000 x 30000 ndarray')
pd.DataFrame(np.random.rand(30000, 30000))

print('10000 x 10000 python list')
pd.DataFrame([[randint(-10000, 10000) for _ in range(10000)] for _ in range(10000)])

and then here's what was printed:

30000 x 30000 ndarray
Time creating DataFrame: 0.00010061264038085938
Time converting Pandas DataFrame to Modin: 28.473838806152344

10000 x 10000 python list
Time creating DataFrame: 19.381755113601685
Time converting Pandas DataFrame to Modin: 0.745363712310791

Because of this, I don't think construction time is a good proxy for size. Is my thought process correct? If so, what would be a better metric for checking size? If not, I'd appreciate you letting me know what I'm misunderstanding. Thanks a lot for your help!

@anmyachev
Collaborator Author

Thanks for the numbers! Clearly construction time is not a reliable signal.

Let's now try using the shape of the created pandas DataFrame, without paying attention to the dtypes (even those that do not support zero-copy). Maybe df.shape < (10_000, 16) as a starting point. We can check cases where almost all data is numeric, and the edge case where all columns are strings (of reasonable length, say no more than 250 characters).
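One caveat worth spelling out: in Python, comparing tuples such as df.shape < (10_000, 16) is lexicographic, which is probably not the intent here. A hypothetical helper (is_small_shape is not part of Modin, just an illustration) that checks each dimension separately might look like this:

```python
def is_small_shape(shape, max_rows=10_000, max_cols=16):
    """Elementwise smallness check for a pandas shape tuple.

    Note: a plain tuple comparison like ``shape < (10_000, 16)`` is
    lexicographic, so e.g. ``(5, 1_000_000) < (10_000, 16)`` is True
    because 5 < 10_000; comparing each dimension avoids that surprise.
    """
    nrows, ncols = shape
    return nrows < max_rows and ncols < max_cols
```

So a wide-but-short frame like (5, 1_000_000) is correctly treated as not small.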

@Jayson729
Contributor

Doesn't this run into a problem if creating the DataFrame itself takes a significant amount of time (such as the Python list example from my last message)? I also tried sys.getsizeof() on the data before it's passed to the DataFrame, but that also seemed fairly unreliable. Currently I'm thinking we either need different size checks per data type, or some way to detect the running time of creating the pandas DataFrame, maybe with another thread. Am I mistaken somewhere, or do you have any other ideas?

@anmyachev
Collaborator Author

Doesn't this run into a problem if the creating the DataFrame itself takes a significant amount of time (such as the python list example from my last message)?

This is definitely not a problem, since that is the time of the pandas constructor itself, and users already have expectations about how long it takes in different cases. My view is that we should only warn when the additional step (putting the data into distributed storage) takes up a significant share of the total time.

@Jayson729
Contributor

Jayson729 commented Jun 6, 2024

That makes complete sense, thanks for explaining.

I tested with both df.shape and df.size. df.size > 100_000_000 seems to be a good threshold on my computer (about 1 second regardless of data type). The growth also seems to be linear: 100_000_000 elements take about 1 second and 400_000_000 take about 4.

I also did some testing in series.py, but distributing a Series rarely took a significant amount of time (a size of 500_000_000 took about 1 second).

I figure we could use df.size < 25_000_000 for dataframe.py and df.size < 100_000_000 in series.py to take into account slower computers. Do these bounds seem reasonable?
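Put together, the check being proposed could look something like the sketch below. This is an illustration of the thread's proposal, not the merged Modin code: maybe_warn_before_distributing is a hypothetical name, the thresholds are the values discussed here, and the threshold parameter exists only to make the sketch testable.

```python
import warnings

import pandas

# Thresholds proposed in this thread; empirical values, may need tuning.
DATAFRAME_SIZE_THRESHOLD = 25_000_000
SERIES_SIZE_THRESHOLD = 100_000_000


def maybe_warn_before_distributing(pandas_obj, threshold=None):
    # Warn only when the object is large enough that putting it into
    # distributed storage is expected to take noticeable time.
    if threshold is None:
        threshold = (
            SERIES_SIZE_THRESHOLD
            if isinstance(pandas_obj, pandas.Series)
            else DATAFRAME_SIZE_THRESHOLD
        )
    if pandas_obj.size >= threshold:
        warnings.warn(
            f"Distributing {type(pandas_obj)} object. This may take some time."
        )
```

Small frames then pass through silently, while anything at or above the threshold triggers the existing warning message.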

@anmyachev
Collaborator Author

I figure we could use df.size < 25_000_000 for dataframe.py and df.size < 100_000_000 in series.py to take into account slower computers. Do these bounds seem reasonable?

At first glance, it looks good!
Just to be sure: how long does the from_pandas function take for these cases?

pd.Series(np.random.randint(1_000_000, size=(100_000_000)), dtype=str)
and
pd.DataFrame(np.random.randint(1_000_000, size=(1_000_000, 25)), dtype=str)

@Jayson729
Contributor

Good catch, these took significantly longer than their integer counterparts. The Series initialization took 17.2 seconds and the DataFrame took 6.2 seconds. A size of 10_000_000 for Series and a size of 5_000_000 for DataFrames both took about a second on my system. So for bounds when dtype=str we could probably use df.size > 2_500_000 for Series and df.size > 1_000_000 for DataFrames. Does this also seem reasonable, and are there any other data types you think we should specifically check for?

Also, sorry for the delay on responding, I was away from my computer so I couldn't test.

@anmyachev
Collaborator Author

So for bounds when dtype=str we could probably use df.size > 2_500_000 for Series and df.size > 1_000_000 for DataFrames.

Based on the timing numbers you shared (thanks!), I'd like to start with these thresholds for all types and see how users like it, and then relax them for some types (higher for numeric types, since they support faster serialization) if the feedback is positive.

@Jayson729
Contributor

Sounds good, I'll try to get the pull request in today. Thanks for all your help!

anmyachev pushed a commit that referenced this issue Jun 19, 2024
…e small (#7323)

Signed-off-by: Jayson Willey <91502167+Jayson729@users.noreply.github.com>
arunjose696 pushed a commit to arunjose696/modin that referenced this issue Jul 11, 2024
…DataFrames are small (modin-project#7323)

Signed-off-by: Jayson Willey <91502167+Jayson729@users.noreply.github.com>