[BUG] After setting dask.config.set({"dataframe.backend": "cudf"}), ddf.explode("col1") and applying a customized function no longer work correctly #16458
Comments
Hi @Huilin-Li! There is actually a lot going on in your example. So, let's break things down a bit: You are trying to use both
The [...]
The error you see for [...]
Overall, I appreciate that you are raising these issues and sharing the details of your workflow! I'm only suggesting that you share simpler stand-alone reproducers so that it will be much easier for someone to jump in and help.
@rjzamora Hi, please check this much simpler example. I tested it, and it can reproduce the same error. I also have some findings I want to share with you. I am thinking there might be some problems in
example
error
Sorry for the delay @Huilin-Li (I have been away). Thank you for sharing a simpler example. There may be Dask-related issues in explode, but the primary problem you are reporting here has nothing to do with Dask. You are just finding that UDF support (i.e. [...]
You will find a similar error if you remove the Dask and Parquet-related code altogether. Even if you simplify the logic to avoid using any [...]:
import cudf

ser = cudf.Series(["AAAAA", "BB", "CCC", "DD", "EEEEEE"] * 20)

def myfunc(s):
    arg = 2
    to_int = {"A": 10, "B": 12, "C": 13, "D": 14, "E": 15}
    res = []
    res_tmp = 0
    # Pack the first `arg` characters, 4 bits each
    for i in range(arg):
        res_tmp = (res_tmp << 4) + to_int[s[i]]
    res.append(res_tmp)
    # Update the running value for each remaining character
    for j in range(len(s) - arg):
        res_tmp = (res_tmp >> 4) * arg + to_int[s[j + arg]]
        res.append(res_tmp)
    return res

ser.apply(myfunc)
@brandon-b-miller - Perhaps you have some relevant advice on this subject?
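To make the discussion concrete, here is a CPU-only sketch of what the UDF above computes, written in plain Python so it runs without cuDF. It assumes the first res.append sits after the first loop (the flattened snippet does not preserve indentation), and the `arg` keyword parameter and the module-level `TO_INT` name are illustrative changes, not part of the original.

```python
# CPU-only reference for the UDF from the comment above (no cuDF required).
# Assumption: the first res.append runs once, after the first loop.
TO_INT = {"A": 10, "B": 12, "C": 13, "D": 14, "E": 15}


def myfunc(s, arg=2):
    res = []
    res_tmp = 0
    # Pack the first `arg` characters into one integer, 4 bits per character.
    for i in range(arg):
        res_tmp = (res_tmp << 4) + TO_INT[s[i]]
    res.append(res_tmp)
    # Update the running value once per remaining character.
    for j in range(len(s) - arg):
        res_tmp = (res_tmp >> 4) * arg + TO_INT[s[j + arg]]
        res.append(res_tmp)
    return res


values = ["AAAAA", "BB", "CCC", "DD", "EEEEEE"]
results = [myfunc(s) for s in values]
for s, r in zip(values, results):
    print(s, r)
```

Under that assumption, each input string yields a list of len(s) - arg + 1 integers, so the result is a list-valued column with rows of different lengths. The function also indexes into a string character by character and looks values up in a dict; these are the kinds of Python features a compiled GPU UDF would have to support for Series.apply to accept it.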
Hi @rjzamora, @brandon-b-miller, may I ask for any suggestions about this problem? Thanks in advance!
Hi @Huilin-Li, string support within UDFs is somewhat limited for now. However, looking over your UDF, it seems to consist of features that are mostly on the roadmap. For now, there are a couple of things missing:
I wish I had a better answer for getting this UDF to run with today's cuDF, so you might have to fall back on higher level
Describe the bug
I am hoping cuDF can substantially speed up my calculation (my dataset is pretty large, e.g. 5 billion rows). However, ddf.explode("col1") doesn't work correctly after setting dask.config.set({"dataframe.backend": "cudf"}), although the same workflow works well without that setting.
Steps/Code to reproduce bug
The dataset is a test.fa file, and it looks like this:
STEP1: read into pandas and save as parquet file
STEP2: apply a customized function
FIRST ERROR
SECOND ERROR
The ddf.explode('mykmers') call does not work correctly.
Expected behavior
If I don't set dask.config.set({"dataframe.backend": "cudf"}), the calculation works well. exp_mykmers.compute() will look like this:
Environment overview (please complete the following information)
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
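For readers unfamiliar with the explode step described under "Describe the bug", the sketch below illustrates its row-wise semantics in plain Python: every element of a list-valued column becomes its own row, and the other columns are repeated. The `explode` helper, the `name` column, and the toy values are hypothetical; only the `mykmers` column name comes from the report. `None` stands in for the missing value that pandas emits for an empty list.

```python
def explode(rows, column):
    """Return (original_index, new_row) pairs, one per list element."""
    out = []
    for idx, row in enumerate(rows):
        # An empty list still produces one row, holding a missing value.
        values = row[column] or [None]
        for v in values:
            new_row = dict(row)  # copy so the other columns are repeated
            new_row[column] = v
            out.append((idx, new_row))
    return out


# Hypothetical toy data, not taken from the report.
rows = [
    {"name": "seq1", "mykmers": [1, 2, 3]},
    {"name": "seq2", "mykmers": [4]},
    {"name": "seq3", "mykmers": []},
]
exploded = explode(rows, "mykmers")
for idx, row in exploded:
    print(idx, row)
```

The original row index is kept alongside each new row, which mirrors how the exploded frame repeats the index of the row each element came from.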