UserWarning: Distributing <class 'NoneType'> object. This may take some time. #6574
Hello, I'm interested in doing this as my first issue. Would that be okay? I'm assuming I would just need to add a size check in dataframe.py and series.py where the warning is. Is that correct? Also, what would be a good size to check for?
Hi @Jayson729! That would be great!
Correct.
We can assume that if constructing a pandas dataframe took half a second, then placing it into distributed storage will not take so long that we need to inform the user about it. In that case, we will not need to also check the correlation of the execution times of the two steps. That is, the warning should be created between the construction of the pandas dataframe and the call of `from_pandas`.
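The suggested placement could be sketched roughly like this. This is only an illustration, not Modin's actual code: the function name and the threshold values are invented for the example (the thresholds echo numbers discussed later in this thread).

```python
# Minimal sketch (invented names and thresholds, not Modin's implementation):
# decide about the warning only after the pandas object exists, so its shape
# is known, and right before handing it to distributed storage.
ROW_THRESHOLD, COL_THRESHOLD = 10_000, 16  # hypothetical starting point

def should_warn(shape):
    """shape is pandas_df.shape, known once the pandas constructor has run."""
    rows = shape[0]
    cols = shape[1] if len(shape) > 1 else 1  # a Series has a 1-tuple shape
    return rows >= ROW_THRESHOLD or cols >= COL_THRESHOLD

print(should_warn((100, 4)))      # False: small frame, stay silent
print(should_warn((50_000, 32)))  # True: warn, then distribute
```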
Hey @anmyachev, I just want to make sure I understand everything correctly. Your idea for timing is to put the check at this location (dataframe.py#L255), correct? If yes, then from some testing it seems like the length of time for creating the dataframe varies pretty wildly depending on the datatype of the data passed. NumPy ndarrays seem to take longer on converting a pandas DataFrame to a Modin DataFrame, while Python lists take longer actually creating the pandas DataFrame. This is what I used to test this (at dataframe.py#L255):

```python
from time import time

start = time()
pandas_df = pandas.DataFrame(
    data=data, index=index, columns=columns, dtype=dtype, copy=copy
)
print(f'Time creating DataFrame: {time() - start}')
start = time()
self._query_compiler = from_pandas(pandas_df)._query_compiler
print(f'Time converting Pandas DataFrame to Modin: {time() - start}')
```

and then in another file this is how the DataFrames were created, with some other tests not included here:

```python
from random import randint

import numpy as np
import modin.pandas as pd

print('30000 x 30000 ndarray')
pd.DataFrame(np.random.rand(30000, 30000))
print('10000 x 10000 python list')
pd.DataFrame([[randint(-10000, 10000) for _ in range(10000)] for _ in range(10000)])
```

and then here's what was printed:
Due to this, I think timing might not be a good metric to use for size. Is my thought process correct? If so, what would be a better metric for checking size? If not, I'd appreciate you letting me know what my misunderstanding is. Thanks a lot for your help!
Thanks for the numbers! It's clear that there is no reliable connection this way right now. Let's instead try to use the shape of the created pandas DataFrame without paying attention to the types (even those that do not support zero-copy). Maybe `df.shape < (10_000, 16)` as a starting point. We can check cases where almost all data is numeric, and the edge case where all columns are of string type (strings of reasonable length, no more than 250 characters, for example).
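One subtlety worth noting about a condition like `df.shape < (10_000, 16)`: Python compares tuples lexicographically, so a very wide frame can still pass the check as long as its row count is below the first threshold. A quick illustration:

```python
# Tuple comparison is lexicographic: (9_999, 1_000) < (10_000, 16) is True
# because 9_999 < 10_000 decides the result before columns are even examined.
shape = (9_999, 1_000)  # fewer than 10_000 rows, but 1_000 columns
print(shape < (10_000, 16))                             # True (lexicographic)
print(all(a < b for a, b in zip(shape, (10_000, 16))))  # False (elementwise)
```

An elementwise comparison (or two separate checks) avoids warning-skipping on wide frames.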
Doesn't this run into a problem if creating the DataFrame itself takes a significant amount of time (such as the Python list example from my last message)? I also did some tests just using sys.getsizeof() on the data before it's passed to the DataFrame, but that also seemed fairly unreliable. Currently I'm thinking we either need to do different checks for size based on the datatype, or maybe somehow detect the running time of creating the pandas DataFrame, maybe with another thread? Am I mistaken in some way, or do you have any other ideas?
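The unreliability of sys.getsizeof() mentioned above is easy to demonstrate: it is shallow, counting only the container object itself (and its array of element pointers for a list), not the objects it references, so a nested list reports almost none of its real footprint:

```python
import sys

flat = [0] * 1_000                            # 1_000 elements
nested = [[0] * 1_000 for _ in range(1_000)]  # ~1_000_000 elements in total
# Both report only the outer list (object header + ~1_000 pointers), so the
# nested structure's real memory footprint is invisible to getsizeof().
print(sys.getsizeof(flat))
print(sys.getsizeof(nested))
```

Both numbers come out in the same few-kilobyte range even though `nested` holds a thousand times more data.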
This is definitely not a problem, since that is the time of the pandas constructor, and users already have some expectations regarding how long it takes in different cases. So my view on this problem is that we should only give a message when the additional part (putting the data into distributed storage) takes up a significant share of the total time.
That makes complete sense, thanks for explaining. I tested with both df.shape and df.size. I also did some testing in series.py, but distributing here rarely took a significant amount of time (a size of 500_000_000 took about 1 second). I figure we could use
At first glance, it looks good!
Good catch, these took significantly longer than their integer counterparts. The Series initialization took 17.2 seconds and the DataFrame took 6.2 seconds. A size of 10_000_000 for Series and a size of 5_000_000 for DataFrames both took about a second on my system. So for bounds when dtype=str we could probably use

Also, sorry for the delay in responding. I was away from my computer, so I couldn't test.
Based on the time numbers I received from you (thanks!), I would like to start with these dimensions for all types and see how users like it, and then expand them for some types (more for numeric types, as they support faster serialization) if there is positive feedback.
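Putting the thread's conclusion together, the initial rule might look roughly like this. This is only a sketch: the constant names are invented, and the thresholds are the element counts from the timing tests above (roughly one second of distribution time), applied to all dtypes to start:

```python
# Sketch of the initial rule discussed (invented names; thresholds taken from
# the timing tests in this thread, applied to all dtypes as a starting point).
DATAFRAME_SIZE_THRESHOLD = 5_000_000   # elements, i.e. df.size
SERIES_SIZE_THRESHOLD = 10_000_000     # elements, i.e. series.size

def distribution_warning_needed(n_elements, is_series):
    """True when distributing the data is likely slow enough to warn about."""
    threshold = SERIES_SIZE_THRESHOLD if is_series else DATAFRAME_SIZE_THRESHOLD
    return n_elements >= threshold

print(distribution_warning_needed(1_000, is_series=False))      # False
print(distribution_warning_needed(6_000_000, is_series=False))  # True
print(distribution_warning_needed(6_000_000, is_series=True))   # False
```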
Sounds good, I'll try to get the pull request in today. Thanks for all your help! |
…e small (#7323) Signed-off-by: Jayson Willey <91502167+Jayson729@users.noreply.github.com>
I think that we should not give this warning if the data that will be placed in the distributed storage is small in size.