
[QST] NVTabular function is not supported for this dtype: size #1880

Open
LoMarrujo opened this issue Jul 8, 2024 · 9 comments
Labels
question Further information is requested

Comments

@LoMarrujo

I tried running the NVTabular code related to this and this, but I could not get past the line of code that builds the Workflow.

The error is:

File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size

which occurs after calling Categorify.

Is there something I need to check to get NVTabular working?
Is there any additional information I can provide to help solve this issue?

Thanks!

@LoMarrujo LoMarrujo added the question Further information is requested label Jul 8, 2024
@ohorban

ohorban commented Jul 9, 2024

I have been struggling with the exact same issue for the last few days.
Example code I wrote while trying to debug:

import pandas as pd
import nvtabular as nvt
from nvtabular import ops
import cudf

# Sample Data
data = {
    'user_id': [16908, 16908, 16908, 16908, 16908],
    'item_id': [174, 78, 94, 174, 78],
    'timestamp': [
        '2024-01-03 14:49:27',
        '2024-01-03 15:33:31',
        '2024-01-03 16:01:57',
        '2024-01-04 18:57:33',
        '2024-01-04 18:59:41'
    ],
    'event_type': [
        'example1',
        'example2',
        'example13',
        'example4',
        'example5'
    ]
}
df = pd.DataFrame(data)
df['user_id'] = df['user_id'].astype('int64')
df['item_id'] = df['item_id'].astype('int64')
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')
print(df.head())
print(df.dtypes)

cdf = cudf.DataFrame.from_pandas(df)

cat_features = ['item_id'] >> ops.Categorify()

cat_workflow = nvt.Workflow(cat_features)
cat_dataset = nvt.Dataset(cdf)

try:
    cat_transformed = cat_workflow.fit_transform(cat_dataset).to_ddf().compute()
    print("After Categorify:")
    print(cat_transformed.head())
except Exception as e:
    print(f"Error during Categorify: {e}")

print("Unique values in item_id:")
print(cdf['item_id'].unique())

Output:

Failed to fit operator <nvtabular.ops.categorify.Categorify object at 0x7fa86ddac1f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py", line 532, in fit_phase
    stats.append(node.op.fit(node.input_columns, transformed_ddf))
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 400, in fit
    dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1551, in _category_stats
    return _groupby_to_disk(ddf, _write_uniques, options)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1406, in _groupby_to_disk
    _grouped_meta = _top_level_groupby(ddf._meta, options=options)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1017, in _top_level_groupby
    gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py", line 631, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size


  user_id  item_id           timestamp  \
0    16908      174 2024-01-03 14:49:27   
1    16908       78 2024-01-03 15:33:31   
2    16908       94 2024-01-03 16:01:57   
3    16908      174 2024-01-04 18:57:33   
4    16908       78 2024-01-04 18:59:41   


                                          event_type  
0  example1
1  example2
2  example3
3  example4
4  example5
user_id                int64
item_id                int64
timestamp     datetime64[ns]
event_type            object
dtype: object
Error during Categorify: function is not supported for this dtype: size
Unique values in item_id:
0    174
1     78
2     94
Name: item_id, dtype: int64

@Chevolier

same issue

@rnyak
Contributor

rnyak commented Aug 15, 2024

@ohorban can you please try your pipeline without this line (please remove it):
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')

In our examples, we feed a df to NVT pipelines with an integer-dtype timestamp column, like here.
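To illustrate the suggestion above, here is a minimal sketch of converting the string timestamps from the repro into an integer-dtype (epoch seconds) column before building the `nvt.Dataset`; the exact conversion is my own choice, not a prescribed NVTabular API:

```python
import pandas as pd

# Same timestamp strings as in the repro above.
df = pd.DataFrame({
    "timestamp": ["2024-01-03 14:49:27", "2024-01-03 15:33:31"],
})

# Parse to datetime64[ns], view as int64 nanoseconds since the epoch,
# then integer-divide down to seconds. The result is a plain int64 column,
# which cudf groupby aggregations handle without the dtype error.
df["timestamp"] = pd.to_datetime(df["timestamp"]).astype("int64") // 10**9

print(df.dtypes)  # timestamp    int64
```

The resulting frame can then be wrapped in `nvt.Dataset` as in the repro, with no `datetime64[s]` column in play.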

@anuragreddygv323

Error during Categorify: function is not supported for this dtype: size

@rnyak
Contributor

rnyak commented Sep 24, 2024

@anuragreddygv323 can you please provide more details?

  • which image are you using?
  • is this error coming from one of our examples or from your custom code?
  • what are the dtypes of your data?

We also need a reproducible example to reproduce your error. Thanks.

@anuragreddygv323

CUDA 12.1
Python 3.11

I installed cudf with:

pip install \
--extra-index-url=https://pypi.nvidia.com \
cudf-cu12==24.8.* dask-cudf-cu12==24.8.* cuml-cu12==24.8.* \
cugraph-cu12==24.8.* cuspatial-cu12==24.8.* cuproj-cu12==24.8.* \
cuxfilter-cu12==24.8.* cucim-cu12==24.8.* pylibraft-cu12==24.8.* \
raft-dask-cu12==24.8.* cuvs-cu12==24.8.* nx-cugraph-cu12==24.8.*

I am trying to run the Transformers4Rec tutorial, and when I try to Categorify it throws the above error.

I ran the example from the documentation and it gives me the same error:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())

@rnyak
Contributor

rnyak commented Sep 24, 2024

@anuragreddygv323 we don't support cudf 24.8 (yet). You can use one of our Docker images:

this one : nvcr.io/nvidia/merlin/merlin-tensorflow:23.08
or this one: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly

@anuragreddygv323

anuragreddygv323 commented Sep 24, 2024

pip install \
--extra-index-url=https://pypi.nvidia.com \
cudf-cu11==23.08

is throwing an error @rnyak

@rnyak
Contributor

rnyak commented Sep 24, 2024

Installing cudf is not enough; you need dask-cudf as well.
The cudf and dask-cudf versions in the 23.08 image are as follows:

cudf 23.4.0
dask 2023.1.1
dask-cuda 23.4.0
dask-cudf 23.4.0

I recommend using the Docker images.

Please refer to this page to install cudf: https://docs.rapids.ai/install/#pip

Your driver version should be compatible with the CUDA version, and therefore with the cudf version.

You can ask cudf-related questions (like installation issues) in the rapids/cudf GH repo.
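A quick way to sanity-check the version list above is to confirm that cudf and dask-cudf come from the same RAPIDS release (same major.minor, e.g. 23.04). This sketch uses a hypothetical helper of my own, `same_release`, not any Merlin or RAPIDS API:

```python
def same_release(a: str, b: str) -> bool:
    """Return True when two RAPIDS-style version strings share major.minor."""
    return a.split(".")[:2] == b.split(".")[:2]

# Versions from the 23.08 image listed above: a matching pair.
print(same_release("23.4.0", "23.4.0"))   # → True
# A cudf from a different release would be flagged.
print(same_release("23.4.0", "24.8.0"))   # → False
```

In practice you would feed it the strings from `cudf.__version__` and `dask_cudf.__version__` inside the container.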
