
[QST] NVTabular function is not supported for this dtype: size #1880

Open
LoMarrujo opened this issue Jul 8, 2024 · 9 comments
Labels
question Further information is requested

Comments

@LoMarrujo

I tried running the NVTabular code related to this and this, but I could not get past the line of code that builds the Workflow.

The error is:

File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size

which occurs after calling Categorify.

Is there something I need to check to get NVTabular working?
Is there any additional information I can provide to help solve this issue?

Thanks!

@LoMarrujo LoMarrujo added the question Further information is requested label Jul 8, 2024
@ohorban

ohorban commented Jul 9, 2024

I have been struggling with the exact same issue for the last few days.
Example code I wrote while trying to debug:

import pandas as pd
import nvtabular as nvt
from nvtabular import ops
import cudf

# Sample Data
data = {
    'user_id': [16908, 16908, 16908, 16908, 16908],
    'item_id': [174, 78, 94, 174, 78],
    'timestamp': [
        '2024-01-03 14:49:27',
        '2024-01-03 15:33:31',
        '2024-01-03 16:01:57',
        '2024-01-04 18:57:33',
        '2024-01-04 18:59:41'
    ],
    'event_type': [
        'example1',
        'example2',
        'example13',
        'example4',
        'example5'
    ]
}
df = pd.DataFrame(data)
df['user_id'] = df['user_id'].astype('int64')
df['item_id'] = df['item_id'].astype('int64')
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')
print(df.head())
print(df.dtypes)

cdf = cudf.DataFrame.from_pandas(df)

cat_features = ['item_id'] >> ops.Categorify()

cat_workflow = nvt.Workflow(cat_features)
cat_dataset = nvt.Dataset(cdf)

try:
    cat_transformed = cat_workflow.fit_transform(cat_dataset).to_ddf().compute()
    print("After Categorify:")
    print(cat_transformed.head())
except Exception as e:
    print(f"Error during Categorify: {e}")

print("Unique values in item_id:")
print(cdf['item_id'].unique())

Output:

Failed to fit operator <nvtabular.ops.categorify.Categorify object at 0x7fa86ddac1f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py", line 532, in fit_phase
    stats.append(node.op.fit(node.input_columns, transformed_ddf))
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 400, in fit
    dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1551, in _category_stats
    return _groupby_to_disk(ddf, _write_uniques, options)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1406, in _groupby_to_disk
    _grouped_meta = _top_level_groupby(ddf._meta, options=options)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1017, in _top_level_groupby
    gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py", line 631, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size


  user_id  item_id           timestamp  \
0    16908      174 2024-01-03 14:49:27   
1    16908       78 2024-01-03 15:33:31   
2    16908       94 2024-01-03 16:01:57   
3    16908      174 2024-01-04 18:57:33   
4    16908       78 2024-01-04 18:59:41   


                                          event_type  
0  example1
1  example2
2  example3
3  example4
4  example5
user_id                int64
item_id                int64
timestamp     datetime64[ns]
event_type            object
dtype: object
Error during Categorify: function is not supported for this dtype: size
Unique values in item_id:
0    174
1     78
2     94
Name: item_id, dtype: int64

@Chevolier

same issue

@rnyak
Contributor

rnyak commented Aug 15, 2024

@ohorban can you please try your pipeline without this line (please remove it):
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')

In our examples, we feed a df to NVT pipelines with an integer-dtype timestamp column, like here.
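To illustrate the suggestion above, here is a minimal sketch of converting the string timestamps from the repro into an integer-dtype (epoch seconds) column before building the `nvt.Dataset`; the exact conversion is my own choice, not a prescribed NVTabular API:

```python
import pandas as pd

# Same timestamp strings as in the repro above.
df = pd.DataFrame({
    "timestamp": ["2024-01-03 14:49:27", "2024-01-03 15:33:31"],
})

# Parse to datetime64[ns], view as int64 nanoseconds since the epoch,
# then integer-divide down to seconds. The result is a plain int64 column,
# which cudf groupby aggregations handle without the dtype error.
df["timestamp"] = pd.to_datetime(df["timestamp"]).astype("int64") // 10**9

print(df.dtypes)  # timestamp    int64
```

The resulting frame can then be wrapped in `nvt.Dataset` as in the repro, with no `datetime64[s]` column in play.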

@anuragreddygv323

Error during Categorify: function is not supported for this dtype: size

@rnyak
Contributor

rnyak commented Sep 24, 2024

@anuragreddygv323 can you please provide more details?

  • which image are you using?
  • is this error coming from one of our examples or from your custom code?
  • what are the dtypes of your data?

We also need a reproducible example to reproduce your error. Thanks.

@anuragreddygv323

CUDA 12.1
Python 3.11

I installed cudf with:

pip install \
--extra-index-url=https://pypi.nvidia.com \
cudf-cu12==24.8.* dask-cudf-cu12==24.8.* cuml-cu12==24.8.* \
cugraph-cu12==24.8.* cuspatial-cu12==24.8.* cuproj-cu12==24.8.* \
cuxfilter-cu12==24.8.* cucim-cu12==24.8.* pylibraft-cu12==24.8.* \
raft-dask-cu12==24.8.* cuvs-cu12==24.8.* nx-cugraph-cu12==24.8.*

I am trying to run the Transformers4Rec tutorial, and when I try to Categorify it throws the above error.

I ran the example from the documentation and it gives me the same error:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())

@rnyak
Contributor

rnyak commented Sep 24, 2024

@anuragreddygv323 we don't support cudf 24.8 (yet). You can use one of our Docker images:

this one : nvcr.io/nvidia/merlin/merlin-tensorflow:23.08
or this one: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly

@anuragreddygv323

anuragreddygv323 commented Sep 24, 2024

pip install \
--extra-index-url=https://pypi.nvidia.com \
cudf-cu11==23.08

is throwing an error @rnyak

@rnyak
Contributor

rnyak commented Sep 24, 2024

Installing cudf is not enough; you need dask-cudf as well.
The cudf and dask-cudf versions in the 23.08 image are as follows:

cudf 23.4.0
dask 2023.1.1
dask-cuda 23.4.0
dask-cudf 23.4.0

I recommend using the Docker images.

Please refer to this page to install cudf: https://docs.rapids.ai/install/#pip

Your driver version should be compatible with the CUDA version, and therefore with the cudf version.

You can ask cudf-related questions (like installation issues) in the rapids/cudf GH repo.
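A quick way to sanity-check the version list above is to confirm that cudf and dask-cudf come from the same RAPIDS release (same major.minor, e.g. 23.04). This sketch uses a hypothetical helper of my own, `same_release`, not any Merlin or RAPIDS API:

```python
def same_release(a: str, b: str) -> bool:
    """Return True when two RAPIDS-style version strings share major.minor."""
    return a.split(".")[:2] == b.split(".")[:2]

# Versions from the 23.08 image listed above: a matching pair.
print(same_release("23.4.0", "23.4.0"))   # → True
# A cudf from a different release would be flagged.
print(same_release("23.4.0", "24.8.0"))   # → False
```

In practice you would feed it the strings from `cudf.__version__` and `dask_cudf.__version__` inside the container.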
