BUG: Calling df._repartition(axis=1) on updated df will raise IndexError #7170

Taurus-Le · 2024-04-11T01:28:34Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import time

import modin.pandas as pd
import modin.config as cfg
import numpy as np
import ray
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

ray.init()

# Configure Modin to use 5 partitions and to avoid splitting along the column axis
cfg.MinPartitionSize.put(102)
cfg.NPartitions.put(5)

# Generate samples
data = np.random.rand(10000, 100)
label = [i for i in range(1, 9)] * 1250
features = ['feature' + str(i) for i in range(1, 101)]
df = pd.DataFrame(data=data, columns=features)
df['label'] = label

# Scale samples
scaler = RobustScaler()
res = scaler.fit_transform(df[[column for column in df.columns if column != 'label']].to_numpy())
frame = pd.DataFrame(res, columns=[column for column in df.columns if column != 'label'])

# Update dataframe
df.update(frame)

# Repartition so that the dataframe has only 1 partition along the column axis
# This works
partitions = unwrap_partitions(df, axis=0)
df = from_partitions(partitions, axis=0)
# This raises an IndexError later, when to_numpy() is called
# df = df._repartition(axis=1)

# Fit a DTC model of sklearn
clf = DecisionTreeClassifier()
features = df[df.columns.drop(['label'])].to_numpy()
clf.fit(features, label)

Issue Description

I created a dataframe of shape (10000, 101).
To keep the df in a single partition along the column axis, I followed @YarShev's advice and set MinPartitionSize (to 102 here).
I then scaled the features with RobustScaler from sklearn, wrote them back with df.update, and tried to fit a DecisionTreeClassifier (DTC) model.
However, the updated df was split along the column axis again, which made the fitting take about twice as long.
So I tried to repartition the df along the column axis only by calling df = df._repartition(axis=1), but the subsequent to_numpy() call raised an IndexError.
Calling unwrap_partitions and from_partitions instead works around the problem.
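
To illustrate why a MinPartitionSize of 102 should keep 101 columns in one partition, here is a rough sketch of a chunk-size rule: ceiling division of the axis length by the number of splits, with a lower bound. This is an assumption about how the split is computed, not Modin's actual code, and the names chunk_size and split_lengths are hypothetical. The value 32 in the second call is only an illustrative smaller bound.

```python
def chunk_size(axis_len, num_splits, min_block_size):
    """Size of one chunk: ceil(axis_len / num_splits), but never below min_block_size."""
    size = -(-axis_len // num_splits)  # ceiling division
    return max(size, min_block_size)


def split_lengths(axis_len, num_splits, min_block_size):
    """Lengths of the chunks that cover an axis of axis_len elements."""
    size = chunk_size(axis_len, num_splits, min_block_size)
    return [min(size, axis_len - start) for start in range(0, axis_len, size)]


# 101 columns, 5 requested splits, lower bound of 102: a single column partition.
print(split_lengths(101, 5, 102))    # [101]
# Same axis with a smaller lower bound: the columns are split into four chunks.
print(split_lengths(101, 5, 32))     # [32, 32, 32, 5]
# 10000 rows, 5 requested splits: five row partitions of 2000 rows each.
print(split_lengths(10000, 5, 102))  # [2000, 2000, 2000, 2000, 2000]
```

If two parts of the system applied such a rule with different lower bounds, they would disagree about how many column chunks exist, which would be consistent with an out-of-bounds positional index like the one in the error log below.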

Expected Behavior

df._repartition(axis=1) should leave the updated df with only 1 partition along the column axis, and the repartitioned df should then be usable to fit the DTC model.

Error Logs

Traceback (most recent call last):
  File "D:\Work\Python\RayDemo3.8\aaaa.py", line 41, in <module>
    features = df[df.columns.drop(['label'])].to_numpy()
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3138, in to_numpy
    return self._to_bare_numpy(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\pandas\base.py", line 3119, in _to_bare_numpy
    return self._query_compiler.to_numpy(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\storage_formats\pandas\query_compiler.py", line 376, in to_numpy
    arr = self._modin_frame.to_numpy(**kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 3882, in to_numpy
    return self._partition_mgr_cls.to_numpy(self._partitions, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\generic\partitioning\partition_manager.py", line 43, in to_numpy
    parts = RayWrapper.materialize(
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\common\engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 2667, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\ray\_private\worker.py", line 864, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(IndexError): ray::_apply_list_of_funcs() (pid=10084, ip=127.0.0.1)
  File "python\ray\_raylet.pyx", line 1889, in ray._raylet.execute_task
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\execution\ray\implementations\pandas_on_ray\partitioning\partition.py", line 440, in _apply_list_of_funcs
    partition = func(partition, *args, **kwargs)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\modin\core\dataframe\pandas\partitioning\partition.py", line 217, in _iloc
    return df.iloc[row_labels, col_labels]
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1097, in __getitem__
    return self._getitem_tuple(key)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1594, in _getitem_tuple
    tup = self._validate_tuple_indexer(tup)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 904, in _validate_tuple_indexer
    self._validate_key(k, i)
  File "D:\Work\Python\RayDemo3.8\venv\lib\site-packages\pandas\core\indexing.py", line 1516, in _validate_key
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds

Installed Versions

INSTALLED VERSIONS

commit : 0c3746b
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 151 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

Modin dependencies

modin : 0.23.1.post0
ray : 2.10.0
dask : 2023.5.0
distributed : None
hdk : None

pandas dependencies

pandas : 2.0.3
numpy : 1.24.4
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.10.0
gcsfs : None
matplotlib : 3.7.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
None

YarShev · 2024-04-11T08:56:24Z

@Taurus-Le, thank you for the detailed description of the issue. I was able to reproduce the problem on latest master.

YarShev · 2024-04-11T12:41:25Z

@Taurus-Le, I looked into the problem and am not sure we will have a quick fix for it. As a workaround, could you set NPartitions to 1 before repartitioning? It works for me.

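...
from modin.config import NPartitions

old_val = NPartitions.get()
new_val = 1
# Temporarily request a single partition, then restore the previous value
NPartitions.put(new_val)
df = df._repartition(axis=1)
NPartitions.put(old_val)
...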
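
The save, set, restore pattern above can also be wrapped in a context manager so the old value is restored even if the repartition call raises. The following is only a sketch: temporary_setting is a hypothetical helper, not part of Modin, and FakeNPartitions is a stand-in that exposes the same get()/put() methods so the example runs without Modin installed.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(param, value):
    """Set a config parameter exposing get()/put(), restoring the old value on exit."""
    old_val = param.get()
    param.put(value)
    try:
        yield
    finally:
        # Runs even if the body raises, so the setting is never left changed.
        param.put(old_val)


class FakeNPartitions:
    """Hypothetical stand-in for a config parameter with get()/put() class methods."""

    _value = 5

    @classmethod
    def get(cls):
        return cls._value

    @classmethod
    def put(cls, value):
        cls._value = value


with temporary_setting(FakeNPartitions, 1):
    print(FakeNPartitions.get())  # 1 inside the block
print(FakeNPartitions.get())      # 5 again after the block

# With Modin, the same helper would presumably be used like this:
#     from modin.config import NPartitions
#     with temporary_setting(NPartitions, 1):
#         df = df._repartition(axis=1)
```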

YarShev · 2024-04-11T12:43:38Z

@anmyachev, do you think we should add an additional parameter to repartition like num_splits along the axis?

Taurus-Le · 2024-04-12T00:27:34Z

Hi @YarShev, it worked for me as well, and it is far faster. Thanks for your help.

Taurus-Le added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Apr 11, 2024

YarShev removed the Triage 🩹 Issues that need triage label Apr 11, 2024

anmyachev added the External Pull requests and issues from people who do not regularly contribute to modin label Apr 11, 2024

anmyachev added a commit to anmyachev/modin that referenced this issue Apr 11, 2024

FIX-modin-project#7170: Don't use 'MinPartitionSize' configuration va…

795cc6d

…riable in remote context Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

anmyachev mentioned this issue Apr 11, 2024

FIX-#7170: Don't use MinPartitionSize configuration variable in remote context #7177

Merged

7 tasks

anmyachev added a commit to anmyachev/modin that referenced this issue Apr 12, 2024

FIX-modin-project#7170: Don't use 'MinPartitionSize' configuration va…

b9b6a83

…riable in remote context Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

YarShev pushed a commit that referenced this issue Apr 15, 2024

FIX-#7170: Don't use MinPartitionSize configuration variable in rem…

7b233e4

…ote context (#7177) Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

YarShev closed this as completed in #7177 Apr 15, 2024
