read_csv skiprows parameter bug (possible) #4543
Comments
Hello @VasilijKolomiets! Thank you for submitting the issue! I trimmed your reproducer down a bit.
# import pandas as pd
import modin.pandas as pd
import modin.config as cfg

cfg.Engine.put("dask")

if __name__ == '__main__':
    us_columns_dtype = {
        'product sales': str,  # read as str to see what is actually being read
        # 'product sales': "float64",
    }
    skiprows = 0
    file_path = "DR_US_2021Jun6-2022Jun5CustomUnifiedTransaction.csv"
    separator = ','
    en_cod_ = 'utf-8'
    columns_dtype = us_columns_dtype.copy()
    readed_into_df = pd.read_csv(
        str(file_path),
        sep=separator,
        encoding=en_cod_,
        thousands=",",
        skiprows=skiprows,
        on_bad_lines="skip",
        usecols=columns_dtype.keys(),
        dtype=columns_dtype,
    )
    print(readed_into_df)
Outputs:
Output for
(modin) prutskov@prutskovPC:~/projects/modin$ python test2.py
UserWarning: Dask execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
from distributed import Client
client = Client()
product sales
0 Jun 6, 2021 1:29:07 AM PDT
1 Jun 6, 2021 1:29:07 AM PDT
2 Jun 6, 2021 1:29:43 AM PDT
3 Jun 6, 2021 1:43:17 AM PDT
4 Jun 6, 2021 2:14:52 AM PDT
... ...
31753 Jun 5, 2022 10:38:01 PM PDT
31754 Jun 5, 2022 11:04:58 PM PDT
31755 Jun 5, 2022 11:42:56 PM PDT
31756 Jun 5, 2022 11:44:37 PM PDT
31757 Jun 5, 2022 11:54:52 PM PDT
[31758 rows x 1 columns]
Output for
(modin) prutskov@prutskovPC:~/projects/modin$ python test2.py
product sales
0 29.49
1 29.49
2 149.99
3 39.99
4 58.98
... ...
31753 33.99
31754 0
31755 29.99
31756 0
31757 59.99
[31758 rows x 1 columns]
Commenting of
One additional problem was discovered in case
Outputs 2:
Output for
(modin) prutskov@prutskovPC:~/projects/modin$ python test2.py
UserWarning: Dask execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
from distributed import Client
client = Client()
Traceback (most recent call last):
File "test2.py", line 20, in <module>
readed_into_df = pd.read_csv(
File "/home/prutskov/projects/modin/modin/logging/logger_function.py", line 65, in run_and_log
return f(*args, **kwargs)
File "/home/prutskov/projects/modin/modin/pandas/io.py", line 140, in read_csv
return _read(**kwargs)
File "/home/prutskov/projects/modin/modin/pandas/io.py", line 61, in _read
pd_obj = FactoryDispatcher.read_csv(**kwargs)
File "/home/prutskov/projects/modin/modin/core/execution/dispatching/factories/dispatcher.py", line 185, in read_csv
return cls.__factory._read_csv(**kwargs)
File "/home/prutskov/projects/modin/modin/core/execution/dispatching/factories/factories.py", line 217, in _read_csv
return cls.io_cls.read_csv(**kwargs)
File "/home/prutskov/projects/modin/modin/logging/logger_function.py", line 65, in run_and_log
return f(*args, **kwargs)
File "/home/prutskov/projects/modin/modin/core/io/file_dispatcher.py", line 153, in read
query_compiler = cls._read(*args, **kwargs)
File "/home/prutskov/projects/modin/modin/logging/logger_function.py", line 65, in run_and_log
return f(*args, **kwargs)
File "/home/prutskov/projects/modin/modin/core/io/text/text_file_dispatcher.py", line 1004, in _read
) = cls._manage_skiprows_parameter(skiprows, header_size)
File "/home/prutskov/projects/modin/modin/logging/logger_function.py", line 65, in run_and_log
return f(*args, **kwargs)
File "/home/prutskov/projects/modin/modin/core/io/text/text_file_dispatcher.py", line 800, in _manage_skiprows_parameter
skiprows_md[0] - header_size if skiprows_md[0] > header_size else 0
IndexError: index 0 is out of bounds for axis 0 with size 0
Output for
(modin) prutskov@prutskovPC:~/projects/modin$ python test2.py
product sales
0 29.49
1 29.49
2 149.99
3 39.99
4 58.98
... ...
31753 33.99
31754 0
31755 29.99
31756 0
31757 59.99
[31758 rows x 1 columns]
The root cause of the second issue:
modin/modin/core/io/text/text_file_dispatcher.py
Lines 795 to 803 in dcee13d
Both issues reproduce on both the Ray and Dask engines. We will start working on this!
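The IndexError in the traceback is raised when element 0 of an empty skiprows array is read. Below is a minimal sketch of a guard for that case. It assumes a hypothetical helper name, first_skip_offset, and uses plain Python sequences in place of the NumPy array that appears in the traceback. It is not the actual change made in #4544.

```python
from typing import Sequence


def first_skip_offset(skiprows_md: Sequence[int], header_size: int) -> int:
    """Hypothetical helper: offset of the first skipped row past the header, or 0."""
    if len(skiprows_md) == 0:
        # Nothing to skip; reading element 0 here is what raised the IndexError.
        return 0
    first = skiprows_md[0]
    # Same expression as in the traceback, now only reached for non-empty input.
    return first - header_size if first > header_size else 0


print(first_skip_offset([], 1))      # empty input no longer raises
print(first_skip_offset([5, 6], 1))  # first skipped row is past the header
print(first_skip_offset([1], 1))     # first skipped row is not past the header
```

The point of the sketch is only that the emptiness check must come before the index access; how Modin itself normalizes skiprows may differ.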
So now I have to reinstall modin in my conda environment and everything will be fine?
The fix is in progress right now in #4544. We will notify you when it is merged into the master branch. After that, you will just need to install modin from the master branch.
Signed-off-by: Alexey Prutskov <lehaprutskov@gmail.com>
Hi @VasilijKolomiets! The fix has been merged into the master branch. You can install Modin from master using the command pip install ***@***.***#egg=modin[all] or wait for the next Modin release.
😃 great!
Thanks a lot!
Here is the code
Using the skiprows parameter causes the following error listing
If I only comment out the row with skiprows=skiprows, everything works. The same code with pure pandas works properly.
Environment:
It seems the parameter skiprows=skiprows is processed the wrong way, because in pure pandas everything works.
But some files work properly...
I have tried reading my file without the parameter dtype=columns_dtype and the code finished without any messages... I was just trying to move my working pandas project to modin[dask], so please help me avoid this issue.
So, in short: read_csv() works with the parameter skiprows or with the parameter dtype. Using these parameters together produces a crash.
This issue is a copy of my Dask issue. They recommended that I repost my question here because they think:
And really, I used import dask.dataframe as pd and the error disappeared. So it looks like a real Modin argument-passing error. Can anybody help? Or do I have to switch from modin to dask?
DR_US_2021Jun6-2022Jun5CustomUnifiedTransaction.csv
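When a reader returns unexpected values for a column, it can help to check independently of pandas, Modin and Dask what that column should contain. The sketch below uses only the standard csv module. The helper name read_column, the sample rows and the date/time column name are made up for illustration and are not taken from the attached file.

```python
import csv
import io
from typing import Iterable, List


def read_column(lines: Iterable[str], column: str, skiprows: int = 0) -> List[str]:
    """Return the raw string values of one column, skipping leading lines first."""
    it = iter(lines)
    for _ in range(skiprows):
        next(it, None)  # drop lines that precede the header row
    reader = csv.DictReader(it)  # the first remaining line is treated as the header
    return [row[column] for row in reader]


# Made-up sample standing in for the real transaction file.
sample = io.StringIO(
    'preamble line\n'
    'date/time,product sales\n'
    '"Jun 6, 2021 1:29:07 AM PDT",29.49\n'
    '"Jun 6, 2021 1:29:43 AM PDT",149.99\n'
)
values = read_column(sample, "product sales", skiprows=1)
print(values)
```

For the real file, an open file object can be passed instead of the StringIO, for example one opened with open(path, newline="", encoding="utf-8"), and the first few values compared with what read_csv returns for the same column.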