BUG: Calling df._repartition(axis=1) on updated df will raise IndexError #7170

Closed
2 of 3 tasks
Taurus-Le opened this issue Apr 11, 2024 · 4 comments · Fixed by #7177
Labels
bug 🦗 Something isn't working External Pull requests and issues from people who do not regularly contribute to modin

Comments

@Taurus-Le

Taurus-Le commented Apr 11, 2024

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import time

import modin.pandas as pd
import modin.config as cfg
import numpy as np
import ray
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

ray.init()

# Configure Modin to split the dataframe into 5 row partitions and not to split it along columns
cfg.MinPartitionSize.put(102)
cfg.NPartitions.put(5)

# Generate samples
data = np.random.rand(10000, 100)
label = [i for i in range(1, 9)] * 1250
features = ['feature' + str(i) for i in range(1, 101)]
df = pd.DataFrame(data=data, columns=features)
df['label'] = label

# Scale samples
scaler = RobustScaler()
res = scaler.fit_transform(df[[column for column in df.columns if column != 'label']].to_numpy())
frame = pd.DataFrame(res, columns=[column for column in df.columns if column != 'label'])

# Update dataframe
df.update(frame)

# Repartition so that the dataframe has only 1 partition along the column axis
# This works
partitions = unwrap_partitions(df, axis=0)
df = from_partitions(partitions, axis=0)
# This variant leads to an IndexError
# df = df._repartition(axis=1)

# Fit a DTC model of sklearn
clf = DecisionTreeClassifier()
features = df[df.columns.drop(['label'])].to_numpy()
clf.fit(features, label)

Issue Description

I created a dataframe of shape (10000, 101).
To make the df contain only one partition along the column axis, I followed @YarShev's advice and set MinPartitionSize.
Then I scaled the df with RobustScaler from sklearn and tried to fit a DTC model.
However, the updated df had been split along columns again, which made fitting take about twice as long.
So I tried to repartition the df along columns only by calling df = df._repartition(axis=1), but that led to an IndexError.
I was able to work around the problem by calling unwrap_partitions and from_partitions instead.
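
To make the partitioning arithmetic in this report concrete, here is a rough sketch of why MinPartitionSize=102 keeps the 101 columns in a single partition. It assumes the chunk size is the even split rounded up and then floored at the minimum partition size; estimate_num_partitions is a hypothetical helper written for this illustration, and Modin's actual logic may differ.

```python
import math


def estimate_num_partitions(axis_len, n_partitions, min_partition_size):
    """Simplified model (an assumption, not Modin's code) of how many
    partitions an axis of length axis_len ends up with."""
    # Even split across the requested number of partitions, rounded up.
    chunk = math.ceil(axis_len / n_partitions)
    # A chunk is never smaller than the configured minimum size.
    chunk = max(chunk, min_partition_size)
    return math.ceil(axis_len / chunk)


# Values from the report: shape (10000, 101), NPartitions=5, MinPartitionSize=102.
row_parts = estimate_num_partitions(10000, 5, 102)
col_parts = estimate_num_partitions(101, 5, 102)

# With a smaller, purely illustrative minimum size the columns would be split.
col_parts_small_min = estimate_num_partitions(101, 5, 32)
```

Under this model the rows are split into 5 partitions and the columns stay in 1, whereas a minimum size below the column count would split the columns again.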

Expected Behavior

df._repartition(axis=1) will make the updated df contain only 1 partition against columns. And the repartitioned df could be feed into DTC.

Error Logs

Traceback (most recent call last):
  File "D:\Work\Python\RayDemo3.8\aaaa.py", line 41, in <module>
    features = df[df.columns.drop(['label'])].to_numpy()
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3138, in to_numpy
    return self._to_bare_numpy(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3119, in _to_bare_numpy
    return self._query_compiler.to_numpy(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\storage_formats\pandas\query_compiler.py", line 376, in to_numpy
    arr = self._modin_frame.to_numpy(**kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 3882, in to_numpy
    return self._partition_mgr_cls.to_numpy(self._partitions, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\generic\partitioning\partition_manager.py", line 43, in to_numpy
    parts = RayWrapper.materialize(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\common\engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 2667, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::_apply_list_of_funcs() (pid=10084, ip=127.0.0.1)
  File "python\ray\_raylet.pyx", line 1889, in ray._raylet.execute_task
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\implementations\pandas_on_ray\partitioning\partition.py", line 440, in _apply_list_of_funcs
    partition = func(partition, *args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\partitioning\partition.py", line 217, in _iloc
    return df.iloc[row_labels, col_labels]
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1097, in __getitem__
    return self._getitem_tuple(key)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1594, in _getitem_tuple
    tup = self._validate_tuple_indexer(tup)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 904, in _validate_tuple_indexer
    self._validate_key(k, i)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1516, in _validate_key
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
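
The last frames of the traceback are plain pandas: the partition's _iloc function calls df.iloc[row_labels, col_labels], and pandas rejects a position that lies outside the frame. The sketch below reproduces only that pandas-level failure on a small frame with made-up positions. It does not show why Modin passed out-of-range positions, which this thread does not explain.

```python
import numpy as np
import pandas as pd

# A small frame with 3 rows and 4 columns (valid column positions are 0..3).
frame = pd.DataFrame(np.zeros((3, 4)))

try:
    # Column position 7 does not exist, so iloc refuses the request.
    frame.iloc[[0, 1], [2, 7]]
    message = None
except IndexError as exc:
    message = str(exc)
```

One reading, stated here as an assumption, is that column positions computed for one partition layout were applied to a partition with fewer columns.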

Installed Versions

INSTALLED VERSIONS

commit : 0c3746b
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

Modin dependencies

modin : 0.23.1.post0
ray : 2.10.0
dask : 2023.5.0
distributed : None
hdk : None

pandas dependencies

pandas : 2.0.3
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : 3.7.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
None

@Taurus-Le Taurus-Le added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Apr 11, 2024
@YarShev YarShev removed the Triage 🩹 Issues that need triage label Apr 11, 2024
@YarShev
Collaborator

YarShev commented Apr 11, 2024

@Taurus-Le, thank you for the detailed description of the issue. I was able to reproduce the problem on the latest master.

@anmyachev anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 11, 2024
@YarShev
Collaborator

YarShev commented Apr 11, 2024

@Taurus-Le, I looked into the problem and am not sure we will have a quick fix for it. As a workaround, could you set NPartitions to 1 before repartitioning? It works for me.

...
from modin.config import NPartitions

old_val = NPartitions.get()
new_val = 1
NPartitions.put(new_val)
# _repartition returns a new dataframe, so keep the result
df = df._repartition(axis=1)
NPartitions.put(old_val)
...
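
The save, set, and restore pattern in this workaround can be wrapped in a context manager so that the old value is restored even if repartitioning raises. This is a minimal sketch: temporary_value and FakeParam are hypothetical names introduced for illustration, and the only assumption about the real parameter is that it exposes get() and put(), as NPartitions does in the snippet above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_value(param, new_value):
    """Set param to new_value for the duration of the block, then restore it."""
    old_value = param.get()
    param.put(new_value)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        param.put(old_value)


class FakeParam:
    """Hypothetical stand-in for a config parameter such as NPartitions."""

    _value = 5

    @classmethod
    def get(cls):
        return cls._value

    @classmethod
    def put(cls, value):
        cls._value = value


with temporary_value(FakeParam, 1):
    inside = FakeParam.get()  # value visible inside the block
after = FakeParam.get()  # value after the block has exited
```

With Modin installed, the same helper would presumably be used as `with temporary_value(NPartitions, 1): df = df._repartition(axis=1)`.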

@YarShev
Collaborator

YarShev commented Apr 11, 2024

@anmyachev, do you think we should add an additional parameter to repartition like num_splits along the axis?

anmyachev added a commit to anmyachev/modin that referenced this issue Apr 11, 2024
…riable in remote context

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@Taurus-Le
Author

Hi @YarShev, it worked for me as well, and it ran far faster. Thanks for your help.

anmyachev added a commit to anmyachev/modin that referenced this issue Apr 12, 2024
…riable in remote context

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
YarShev pushed a commit that referenced this issue Apr 15, 2024
…ote context (#7177)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>