[BUG] After setting dask.config.set({"dataframe.backend": "cudf"}), ddf.explode("col1") and applying a customized function no longer work correctly #16458

Open
Huilin-Li opened this issue Aug 1, 2024 · 5 comments
Labels
bug Something isn't working

Huilin-Li commented Aug 1, 2024

Describe the bug
I am hoping cuDF can speed up the computation substantially (my dataset is quite large, e.g. 5 billion rows). However, ddf.explode("col1") stops working correctly once dask.config.set({"dataframe.backend": "cudf"}) is set, although the same workflow runs fine without that setting.

Steps/Code to reproduce bug
The dataset is a FASTA file, test.fa, which looks like this:

>UniRef90_UPI0004F0D1C6
MMWLFLTIACLMCFTAKSYANPEVAMDVGEIVRYHGYPYEEHEVVTDDGYYLTVQRIPHSKDNPESISPSHEAEAQGSSMFCPPPKAAVLLQHGLVLEGSNWVTNLPNNSLGFILADAGYDVWIGNSRGNSWSRKHKELEFHQKFAACSFHEMAMYDLPATINYILQKTGQEQLYYVAYSQGTTTGFIAFSSIPELDRKIKMFFALAPITVNSNMKSPLVRVFDLPEVLVKLILGHSVVFDNTEVLKKVISSMCTYSIFRSLCSLVLYLPGGFTSSLNVSRIDVYLSRYPDSTSLQNMLHWRQLYQTGEFKHYDYGSENMLHYNQSTPPFYELENMKAPLAAWYGGKDWISAPEDVNLTLPRITNIAYRKYIPDFVHFDFLWGKQVYDQVYKEMLQLMEKST
>UniRef90_A0A2A4Z8K5
MSKKVQKVQNKALDAACLFAFDESAKHDGSPDVGALGLKAASLARIXSLGMPVPAGFVLTTDFSRQFNLENKLPEGADALIKAGIAELEAKLKRQLGGTEXPLLIAIRAGAPVHLAGLMPAILNLGLNDQTCAALAEETGDLRFALDCYRRFIESYSIAVLGVGEDLFEDIFEEVRSEGGLTSISEFESNDYQVIIDRYKACILDNTDKEFPQCVFEQLRGGIGAAFKSWNGYRARSQRRINEISDDIGLAVTVQSMVFGNRNQQSATGVIQSRNPNTGQAVVSGTYLTYAQGPEFFGQYRTPKPLTLGDKNADSHVESLEERMPEMFDQLVETARQLELACGDMLDIEFTIENNELYILEAVSPKRSDRAELVVAVDLAKAGVISMEDALMRVDPKSIEQLLHPSLDPEAPKTVLARGLPASPGAASGEIVFNSEEAEERRALGKNVILVKVETSPEDVYGIHAAQGILTIRGGTTSHAAVAARIMARPCVTGANTVSIDAENETLSASGFTLQKGDMITIDGTSGQIYTGQVPTIEASFSDEFYTLMKWADKVRKLKIRVNTETPELAIKAQGMGAEGIGLCRTEHMFFDKKRIVSVREMILAEDEVGRKRALDKLLPMQRKDFVDLFKAMSGYPVTIRLLDPPLHEFLPKSDQDIMDTANAIGIDHKTILNRLESMSETNPMLGHRGCRLAITHPEIYDMQVRAIFEATILAEQETGDPTMPEIMIPFVSTKAELVFLKDRIEKIADEVANIHGARPAFKFGTMIELPRACLRAEDLAELSDFFSFGSNDLTQTTYGISRDDSARFLNSYTRRAIIPHDPFVSIDKDGVGELVKIAVQRGLKGNKNLSIGVCGEHGGDPYSIHFYGDVGLDYVSCAPFRVPVAKLAAAQNAIITKAKSS
>UniRef90_UPI0004644EFB
MKTINCSPLSLKVRLISIVILIFVFSLWALTFAITQSLKQDIKELLIEQQNSAASYIAADIDSEVAQRITLLNQNAKLVSQYVGSLGQTREFLKGRIGLQALFQDGIVAIDKEGLGIAEFPSGIGREGAHFNTREYFQEAMTTGKTVIGKPRKSSFTNHPVVAIAAPILNASGQQVGVLAGFTSLSETSLFGQVDRSGVEKAATIIISDPLHQLIVFSSKTADILRPLINHDASVGNNANDTKVLSEGKTIPTTGWVVQIVMPAEEAFMPIRHMETVIYEIALFLTLLSSGGVWFLVKHALRPLDKVTHTIRLMAEDAAGNMHALPRKGDNEIRELTDNFNLLVKQRLRSEAALRQSEARLARAELASKSGNWEFHLREQKVIASIGAKSIYGLHKEEYEFTEIKKAALSEYRTMLDAAMKALIEDDIPYNVEFRIRTLDTGELRDIHSIAYFDKEKQIIFGVVQDVTERLNIQRTLEQEELRRRIFLEQSQEGVAVLRQDGSLAEWNPAFAQMLGYSEQEMGHLNVKDWDSKLKHEEIDDITHTLGLGHLSIETQHRRKDGRYYDVEVNISGVEWADQYYLFCLHHDITDRKQSELALRESEARFRAIIEASPIPYALNDEHFNITYLNPAFVRTFGYTLQDIPTIADWWPKAYPDPAYQQQIMTDWMAHMAKAEREKQTFEPIEANIRCKDGETRTVLVAAEPLNGSFHELHVVSFFDITSIKKAEASQRLAATVFSHAREGILITGADGTILDVNGMFSEITGYSRDDVIGKSAQMFNPAKHAKTSYAHMWRALKRNGYWAGEMWNCRKSGALFPEMVTISAVRDQQGNTQQYVVLFSDISEAKAHEHRLETMAHYDPLTGLPNRSLLSDRLQQAMAQSSRYKKSIAVCYLDLDGFKQVNDTYGHEVGDQLLIALAAQMQQTLRKSDTLARIGGDEFVVVLDGLVDRESSLASAERLVQAAAHPVMVGELQLQVSASLGITFYPQEVAIDADQLLRQADHAMYLAKQSGKNRYCLFKTYYTEVV
>UniRef90_A0A5Q4EG38
MASLRKARRLIKATVAEWQEQEVSLLASALAYSTVFSLAPLMILVIMLLGMFFGETTAREQIVSQLDDLVGDDGADLLATAITNLRDQANEGPLQLILNLGFFLFGASSVFAGIQNSLDRIWDVKPEPGRHVFHFLRKRLLSAAMILAIAFLLLVSSVANTLLAAATASLNEWLPAMGSLWQILSWVISFVVIAAVFAAIYTVLPDADIHWQDTLIGAMLTAGLFMIGQWLFGIFLDLVDIGSGYGVAGSFLVIITWIFYAAVVLFTGAVFTKVYARRYGLPIIPSDFAVSTVEDRPERCPED
>UniRef90_UPI001CE095B8
MAVLQHNAVSSVVLQGVTVWEEEGDTDQEEVRSSPPWSEERCEELWDRVEGVRHKLTRILHPAKLTPYLRQCKVIDEQDEDEVLNSTQYPLRISKAGRLLDILRGQGQRGLQAFMESLEFYHPEQYTQLTGEQPTQRCSLILDEEGPEGLTQFLLLEVRKLREQLRNSRLCERRLSQRCRMAEEERGRAERKAQELRHDRLQLERLRQDWESASRELGKLKDRHLEQAVKYSRALEEQGKASSRERELLRQVEELKSRLTEEEKQTIDTPGYNTPAKSTSLFSNEVNGSAPALPEKPLHCTDVQKAENKGTQMRDSVPATGVIALMDILQQDRRESAEQRQELCDIITRVQGELQSTEEHRDKLESQCKQLQLKVRTLQLDWETEQKRSVSYFNQIMELEKERDQALHSRDSLQLEYTDCLLDKNRLRKSIAELQANMEQQQRELERERERSREQMEQSSPCPHCSHLSLCSEDQCYGPCCSLGLDMRPPANSTRLLLRKMPSRGQANENSEDSRSTSEENLFSSTEDNEKEINRLSTFPFPPCMNSINRRFNTEFDLESGGSDENDNITGEQSEPSLWDSWNSLHSHLFPPDLVNLPAVSSHQPNPSVPRIPPRSPSSSPPTSPKYRRASLADDITIVGGNVTGIFVSHVRPGSAAEQCGLKEGSELLELDRVLFGGGSVLLAQCTAEVAHFSLQWWTEPSTLKHQSNPEAYSKLCSQISSPTFVGADSFYVRVNLNMEPHGDPPSLGVSCDDIIHVTDTRYNGKYHWHCSLVDPRTAKPLQAGTMPNYNRAQQLLLVRLRKMALEQKDLKKKVFLKKAPGRVRLVKAVDPGCRGIGSTQQVLYTLSKRHEEHLIPYSVVQPARVQTKRPVIFSPSLLSRGLIERLLQPAESGLKFNTCPPEPIQASERRDKRVFLLDSCSPEQPLGIRLQSIQDVISQDKHCLLELGLPSVEGLLRQGIYPIVIHIHPKNKKHKKLRKFFPRCGEESIMEEVCHAEELQLETLPLLYYTLEPNTWSSTDELLAAIRNAIHSQQSAVAWVELDRLQ

STEP1: read into pandas and save as parquet file

import pandas as pd
from Bio import SeqIO
import dask.dataframe as dd
# 1. read fasta data and save as parquet for faster access in the following steps
def read_to_parquet(fastapath, parquestpath, npa):
    rep = []  # integer ID assigned to each protein name, for convenience
    rep_ = 0  
    identifiers = []
    seq = []
    with open(fastapath) as fasta_file:
        for seq_record in SeqIO.parse(fasta_file, 'fasta'): 
            identifiers.append(seq_record.id)
            seq.append("".join(seq_record.seq))
            rep_ = rep_ + 1
            rep.append(rep_)

    pdf = pd.DataFrame({"ID": identifiers, "seq": seq, "rep": rep}).drop_duplicates(subset='seq')
    ddf = dd.from_pandas(pdf, npartitions=npa)
    ddf.to_parquet(parquestpath, engine="pyarrow" )  
    return pdf
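# read and save
# (fasta_path and parquest_path are not defined in the original post; the values below are placeholders)
fasta_path = "test.fa"
parquest_path = "test.parquet"
pdf = read_to_parquet(fastapath=fasta_path, parquestpath=parquest_path, npa=5)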
# read parquet into dask.dataframe
ddf = dd.read_parquet(parquest_path)
# output is like, for example

STEP2: apply a customized function

def kmer_bitint_trans(df, k):
    return apply_series(df.seq, k)
def apply_series(series, k):
    return process_row(s=series, k = k)
def process_row(s, k):  
    aa_to_int={'Q': 4, 'W': 7, 'E': 4, 'R': 3, 'T': 12, 'Y': 6, 'U': 13, 'I': 2, 'O': 13, 'P': 11, 'A': 12, 'S': 12, 'D': 5, 'F': 6,
            'G': 9, 'H': 10, 'J': 13, 'K': 3, 'L': 1, 'Z': 13, 'X': 13, 'C': 8, 'V': 2, 'B': 13, 'N': 5, 'M': 1, '*': 13}
    N = len(s)
    kmers = []
    if N <= k:
        kmer_former = 0
        for N_idx in range(N):
            kmer_former = (kmer_former << 4) + aa_to_int[s[N_idx]]
        kmers.append(kmer_former)
        return kmers
    else:
        kmer_former = 0
        for k_idx in range(k):
            kmer_former = (kmer_former << 4) + aa_to_int[s[k_idx]] 
        kmers.append(kmer_former)
        for i in range(N-k):
            kmer_former = (kmer_former << 4)^(aa_to_int[s[i]]<<4*k) + aa_to_int[s[i+k]]
            kmers.append(kmer_former)
        return kmers

res = ddf.apply(kmer_bitint_trans, axis=1, k = 12, meta=("mykmers", 'int')) 
# Note: although kmer_bitint_trans returns a list, I have to pass "int" in meta. Otherwise the result is not displayed as a list. I am not sure why.
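
For readers following the encoding in process_row, here is a minimal standalone sketch of the same idea: each residue is packed into 4 bits, and each new k-mer is derived from the previous one. The names `encode_direct` and `encode_rolling` are hypothetical, the table is shortened to the residues used, and the update uses a bit mask rather than the XOR in the code above (the two should give the same values).

```python
# Subset of the aa_to_int table from the issue (only the residues used here).
AA_TO_INT = {"M": 1, "K": 3, "T": 12, "I": 2, "N": 5}

def encode_direct(kmer):
    """Pack one k-mer into an integer, 4 bits per residue."""
    value = 0
    for aa in kmer:
        value = (value << 4) + AA_TO_INT[aa]
    return value

def encode_rolling(seq, k):
    """Encode every k-mer of seq by updating the previous k-mer's value."""
    if len(seq) <= k:
        # Short sequences yield a single value, as in the issue's code.
        return [encode_direct(seq)]
    value = encode_direct(seq[:k])
    kmers = [value]
    mask = (1 << (4 * k)) - 1  # keeps only the lowest k nibbles
    for i in range(len(seq) - k):
        # Shift in the next residue, then drop the residue that left the window.
        value = ((value << 4) | AA_TO_INT[seq[i + k]]) & mask
        kmers.append(value)
    return kmers

seq = "MKTIN"  # first five residues of UniRef90_UPI0004644EFB
rolling = encode_rolling(seq, k=3)
direct = [encode_direct(seq[i:i + 3]) for i in range(len(seq) - 2)]
print(rolling, rolling == direct)
```

Comparing the rolling result with a direct re-encoding of each window is a cheap way to check the bit arithmetic on the CPU before involving the GPU backend.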

FIRST ERROR

# I wanted to print `res` with `res.compute()` to see what goes wrong.
# However, `res.compute()` fails once `dask.config.set({"dataframe.backend": "cudf"})` is set, with this error:
Traceback (most recent call last):
  File "/storage/lihuilin/MYANAWORK/myflsh.py", line 50, in <module>
    print(res.compute())
          ^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_expr.py", line 3758, in _execute_task
    return dask.core.get(graph, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/dataframe.py", line 4683, in apply
    return self._apply(func, _get_row_kernel, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/indexed_frame.py", line 3429, in _apply
    raise ValueError("UDFs using **kwargs are not yet supported.")
ValueError: UDFs using **kwargs are not yet supported.
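
The message points at the keyword argument `k = 12` being forwarded to the UDF. One thing that may be worth trying is passing `k` positionally through `args`. The sketch below only shows, on plain pandas, that the two call styles are equivalent. It has not been verified on the cuDF backend, and whether cuDF can compile the body of this particular UDF is a separate question. `kmer_count` is a hypothetical toy function, not the one from the issue.

```python
import pandas as pd

def kmer_count(row, k):
    """Toy row-wise UDF: number of k-mers in the row's sequence (at least 1)."""
    return max(len(row["seq"]) - k + 1, 1)

pdf = pd.DataFrame({"ID": ["a", "b"], "seq": ["MKTIN", "MK"]})

# Keyword form, as in the issue: k travels through **kwargs.
by_keyword = pdf.apply(kmer_count, axis=1, k=3)

# Positional form: k travels through args, so no **kwargs are forwarded.
by_position = pdf.apply(kmer_count, axis=1, args=(3,))

print(by_keyword.tolist(), by_position.tolist())
```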

SECOND ERROR

# Even though `res.compute()` fails, `res` can still be assigned to ddf:
ddf["mykmers"] = res
# STEP3: explode the mykmers column
exp_mykmers = ddf.explode('mykmers')
# this raises the second error:
Traceback (most recent call last):
  File "/storage/lihuilin/MYANAWORK/myflsh.py", line 56, in <module>
    exp_mykmers = ddf.explode('mykmers')
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_collection.py", line 3261, in explode
    return new_collection(expr.ExplodeFrame(self, column=column))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_collection.py", line 4779, in new_collection
    meta = expr._meta
           ^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_expr.py", line 496, in _meta
    return self.operation(*args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask/utils.py", line 1241, in __call__
    return getattr(__obj, self.method)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/dataframe.py", line 7531, in explode
    return super()._explode(column, ignore_index)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/indexed_frame.py", line 5188, in _explode
    if not isinstance(self._data[explode_column].dtype, ListDtype):
                      ~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/column_accessor.py", line 148, in __getitem__
    return self._data[key]
           ~~~~~~~~~~^^^^^
TypeError: unhashable type: 'list'

ddf.explode('mykmers') does not work correctly.
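
The last frame of the traceback is a dictionary-style lookup, `self._data[key]`, so the message suggests that the column name reached cuDF as a list (for example `['mykmers']`) rather than as a string. This is an inference from the traceback, not something confirmed in the cuDF source. The same error can be reproduced in plain Python. `columns` and `lookup` are hypothetical names used only for this sketch.

```python
# A plain dict standing in for a column container keyed by column name.
columns = {"ID": ["target"], "mykmers": [[316, 962, 3109]]}

def lookup(key):
    """Return the column for key, or the TypeError text if key is unhashable."""
    try:
        return columns[key]
    except TypeError as exc:
        return f"TypeError: {exc}"

print(lookup("mykmers"))    # a string key is hashable and is found
print(lookup(["mykmers"]))  # a list key cannot be hashed, so the lookup fails
```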

Expected behavior
If I do not set dask.config.set({"dataframe.backend": "cudf"}), the calculation works well, and exp_mykmers.compute() looks like this:

                       ID                                                seq   rep          mykmers
0                  target  HITNVGEMKHYLCGCCAAFNNVAITFPIQKVLFRQQLYGIKTRDAI...     0  178967684397665
0                  target  HITNVGEMKHYLCGCCAAFNNVAITFPIQKVLFRQQLYGIKTRDAI...     0   48733183256088
0                  target  HITNVGEMKHYLCGCCAAFNNVAITFPIQKVLFRQQLYGIKTRDAI...     0  216780978676105
0                  target  HITNVGEMKHYLCGCCAAFNNVAITFPIQKVLFRQQLYGIKTRDAI...     0   90795938289816
0                  target  HITNVGEMKHYLCGCCAAFNNVAITFPIQKVLFRQQLYGIKTRDAI...     0   45360129083784
...                   ...                                                ...   ...              ...
1000  UniRef90_A0A5C4VNQ7  MQLRYVFTELRTGLRRNLSMHLAVILTLFVSLSLAGIGILVQREAT...  1000   30866790226995
1000  UniRef90_A0A5C4VNQ7  MQLRYVFTELRTGLRRNLSMHLAVILTLFVSLSLAGIGILVQREAT...  1000  212393666921270
1000  UniRef90_A0A5C4VNQ7  MQLRYVFTELRTGLRRNLSMHLAVILTLFVSLSLAGIGILVQREAT...  1000   20598950212450
1000  UniRef90_A0A5C4VNQ7  MQLRYVFTELRTGLRRNLSMHLAVILTLFVSLSLAGIGILVQREAT...  1000   48108226688547
1000  UniRef90_A0A5C4VNQ7  MQLRYVFTELRTGLRRNLSMHLAVILTLFVSLSLAGIGILVQREAT...  1000  206781673595442

[550295 rows x 4 columns]
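
For reference, the row expansion shown above can be sketched in plain Python: every row is repeated once per element of its list column. The function name `explode_rows` and the toy values are made up for illustration. Note that rows with an empty list are simply dropped here, whereas pandas keeps such a row with a missing value.

```python
def explode_rows(rows, column):
    """Return one copy of each row per element of row[column]."""
    out = []
    for row in rows:
        for item in row[column]:
            # Copy the row and replace the list with a single element.
            out.append({**row, column: item})
    return out

rows = [
    {"ID": "target", "rep": 0, "mykmers": [316, 962]},
    {"ID": "other", "rep": 1, "mykmers": [3109]},
]

exploded = explode_rows(rows, "mykmers")
for row in exploded:
    print(row)
```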

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of cuDF install: [from source]
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:18:00.0 Off |                  N/A |
| 36%   31C    P0               51W / 250W|      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:5E:00.0 Off |                  N/A |
| 37%   32C    P0               49W / 250W|      0MiB / 11264MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:AF:00.0 Off |                  N/A |
| 34%   28C    P0               50W / 250W|      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:D8:00.0 Off |                  N/A |
| 36%   31C    P0               25W / 250W|      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Click here to see environment details
 ***git***
 commit e6537de7474c91b4153542e6611c8a4e33a58caa (HEAD -> branch-24.08, origin/branch-24.08, origin/HEAD)
 Author: Vyas Ramasubramani <vyasr@nvidia.com>
 Date:   Fri Jul 19 20:10:40 2024 -0700
 
 Experimental support for configurable prefetching (#16020)
 
 This PR adds experimental support for prefetching managed memory at a select few points in libcudf. A new configuration object is introduced for handling whether prefetching is enabled or disabled, and whether to print debug information about pointers being prefetched. Prefetching control is managed on a per API basis to enable profiling of the effects of prefetching different classes of data in different contexts. Prefetching in this PR always occurs on the default stream, so it will trigger synchronization with any blocking streams that the user has created. Turning on prefetching and then passing non-blocking to any libcudf APIs will trigger undefined behavior.
 
 Authors:
 - Vyas Ramasubramani (https://github.com/vyasr)
 
 Approvers:
 - David Wendt (https://github.com/davidwendt)
 - Kyle Edwards (https://github.com/KyleFromNVIDIA)
 - Thomas Li (https://github.com/lithomas1)
 - Muhammad Haseeb (https://github.com/mhaseeb123)
 
 URL: https://github.com/rapidsai/cudf/pull/16020
 ***git submodules***
 
 ***OS Information***
 CentOS Linux release 8.2.2004 (Core)
 NAME="CentOS Linux"
 VERSION="8 (Core)"
 ID="centos"
 ID_LIKE="rhel fedora"
 VERSION_ID="8"
 PLATFORM_ID="platform:el8"
 PRETTY_NAME="CentOS Linux 8 (Core)"
 ANSI_COLOR="0;31"
 CPE_NAME="cpe:/o:centos:centos:8"
 HOME_URL="https://www.centos.org/"
 BUG_REPORT_URL="https://bugs.centos.org/"
 
 CENTOS_MANTISBT_PROJECT="CentOS-8"
 CENTOS_MANTISBT_PROJECT_VERSION="8"
 REDHAT_SUPPORT_PRODUCT="centos"
 REDHAT_SUPPORT_PRODUCT_VERSION="8"
 
 CentOS Linux release 8.2.2004 (Core)
 CentOS Linux release 8.2.2004 (Core)
 Linux grtq14.cluster.com 4.18.0-193.el8.x86_64 #1 SMP Fri May 8 10:59:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Thu Aug  1 15:21:27 2024
 +---------------------------------------------------------------------------------------+
 | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
 |-----------------------------------------+----------------------+----------------------+
 | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                                         |                      |               MIG M. |
 |=========================================+======================+======================|
 |   0  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:18:00.0 Off |                  N/A |
 | 36%   31C    P0               51W / 250W|      0MiB / 11264MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   1  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:5E:00.0 Off |                  N/A |
 | 37%   32C    P0               49W / 250W|      0MiB / 11264MiB |      1%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   2  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:AF:00.0 Off |                  N/A |
 | 34%   28C    P0               49W / 250W|      0MiB / 11264MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 |   3  NVIDIA GeForce RTX 2080 Ti      Off| 00000000:D8:00.0 Off |                  N/A |
 | 36%   31C    P0               37W / 250W|      0MiB / 11264MiB |      0%      Default |
 |                                         |                      |                  N/A |
 +-----------------------------------------+----------------------+----------------------+
 
 +---------------------------------------------------------------------------------------+
 | Processes:                                                                            |
 |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
 |        ID   ID                                                             Usage      |
 |=======================================================================================|
 |  No running processes found                                                           |
 +---------------------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              40
 On-line CPU(s) list: 0-39
 Thread(s) per core:  1
 Core(s) per socket:  20
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               85
 Model name:          Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
 Stepping:            7
 CPU MHz:             2799.903
 CPU max MHz:         3900.0000
 CPU min MHz:         800.0000
 BogoMIPS:            4200.00
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            1024K
 L3 cache:            28160K
 NUMA node0 CPU(s):   0-19
 NUMA node1 CPU(s):   20-39
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
 
 ***CMake***
 /home/lihuilin/miniconda3/envs/cudf_dev/bin/cmake
 cmake version 3.30.1
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /home/lihuilin/miniconda3/envs/cudf_dev/bin/g++
 g++ (conda-forge gcc 11.4.0-13) 11.4.0
 Copyright (C) 2021 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 /home/lihuilin/miniconda3/envs/cudf_dev/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2024 NVIDIA Corporation
 Built on Thu_Jun__6_02:18:23_PDT_2024
 Cuda compilation tools, release 12.5, V12.5.82
 Build cuda_12.5.r12.5/compiler.34385749_0
 
 ***Python***
 /home/lihuilin/miniconda3/envs/cudf_dev/bin/python
 Python 3.11.9
 
 ***Environment Variables***
 PATH                            : /home/lihuilin/miniconda3/envs/cudf_dev/bin:/home/lihuilin/.local/bin:/home/lihuilin/bin:/opt/slurm/sbin:/opt/slurm/bin:/home/lihuilin/miniconda3/envs/cudf_dev/bin:/home/lihuilin/.vscode-server/cli/servers/Stable-f1e16e1e6214d7c44d078b1f0607b2388f29d729/server/bin/remote-cli:/home/lihuilin/.local/bin:/home/lihuilin/bin:/opt/slurm/sbin:/opt/slurm/bin:/home/lihuilin/miniconda3/condabin:/home/lihuilin/.local/bin:/home/lihuilin/bin:/opt/slurm/sbin:/opt/slurm/bin:/soft/modules/modules-4.7.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/homelihuilin/software/silent_tools:/home/lihuilin/software/silent_tools:/home/lihuilin/software/silent_tools
 LD_LIBRARY_PATH                 : /opt/slurm/lib:/opt/slurm/lib/slurm:/opt/slurm/lib:/opt/slurm/lib/slurm:/opt/slurm/lib:/opt/slurm/lib/slurm:
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /home/lihuilin/miniconda3/envs/cudf_dev
 PYTHON_PATH                     :
 
 ***conda packages***
 /home/lihuilin/miniconda3/condabin/conda
 # packages in environment at /home/lihuilin/miniconda3/envs/cudf_dev:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                 conda_forge    conda-forge
 _openmp_mutex             4.5                  2_kmp_llvm    conda-forge
 _sysroot_linux-64_curr_repodata_hack 3                   h69a702a_16    conda-forge
 accessible-pygments       0.0.5              pyhd8ed1ab_0    conda-forge
 aiobotocore               2.13.1             pyhd8ed1ab_0    conda-forge
 aiohttp                   3.9.5           py311h459d7ec_0    conda-forge
 aioitertools              0.11.0             pyhd8ed1ab_0    conda-forge
 aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
 alabaster                 0.7.16             pyhd8ed1ab_0    conda-forge
 annotated-types           0.7.0              pyhd8ed1ab_0    conda-forge
 anyio                     4.4.0              pyhd8ed1ab_0    conda-forge
 argon2-cffi               23.1.0             pyhd8ed1ab_0    conda-forge
 argon2-cffi-bindings      21.2.0          py311h459d7ec_4    conda-forge
 arrow                     1.3.0              pyhd8ed1ab_0    conda-forge
 asttokens                 2.4.1              pyhd8ed1ab_0    conda-forge
 async-lru                 2.0.4              pyhd8ed1ab_0    conda-forge
 attrs                     23.2.0             pyh71513ae_0    conda-forge
 aws-c-auth                0.7.22              hbd3ac97_10    conda-forge
 aws-c-cal                 0.7.1                h87b94db_1    conda-forge
 aws-c-common              0.9.23               h4ab18f5_0    conda-forge
 aws-c-compression         0.2.18               he027950_7    conda-forge
 aws-c-event-stream        0.4.2               h7671281_15    conda-forge
 aws-c-http                0.8.2                he17ee6b_6    conda-forge
 aws-c-io                  0.14.10              h826b7d6_1    conda-forge
 aws-c-mqtt                0.10.4               hcd6a914_8    conda-forge
 aws-c-s3                  0.6.0                h365ddd8_2    conda-forge
 aws-c-sdkutils            0.1.16               he027950_3    conda-forge
 aws-checksums             0.1.18               he027950_7    conda-forge
 aws-crt-cpp               0.27.3               hda66527_2    conda-forge
 aws-sdk-cpp               1.11.329             h46c3b66_9    conda-forge
 aws-xray-sdk              2.14.0             pyhd8ed1ab_0    conda-forge
 azure-core-cpp            1.12.0               h830ed8b_0    conda-forge
 azure-identity-cpp        1.8.0                hdb0d106_1    conda-forge
 azure-storage-blobs-cpp   12.11.0              ha67cba7_1    conda-forge
 azure-storage-common-cpp  12.6.0               he3f277c_1    conda-forge
 azure-storage-files-datalake-cpp 12.10.0              h29b5301_1    conda-forge
 babel                     2.14.0             pyhd8ed1ab_0    conda-forge
 backports.zoneinfo        0.2.1           py311h38be061_8    conda-forge
 beautifulsoup4            4.12.3             pyha770c72_0    conda-forge
 binutils                  2.40                 h4852527_7    conda-forge
 binutils_impl_linux-64    2.40                 ha1999f0_7    conda-forge
 binutils_linux-64         2.40                 hb3c18ed_4    conda-forge
 biopython                 1.84                     pypi_0    pypi
 bleach                    6.1.0              pyhd8ed1ab_0    conda-forge
 blinker                   1.8.2              pyhd8ed1ab_0    conda-forge
 bokeh                     3.5.0              pyhd8ed1ab_0    conda-forge
 boto3                     1.34.131           pyhd8ed1ab_0    conda-forge
 botocore                  1.34.131        pyge310_1234567_0    conda-forge
 breathe                   4.35.0             pyhd8ed1ab_1    conda-forge
 brotli-python             1.1.0           py311hb755f60_1    conda-forge
 bzip2                     1.0.8                h4bc722e_7    conda-forge
 c-ares                    1.32.2               h4bc722e_0    conda-forge
 c-compiler                1.5.2                h0b41bf4_0    conda-forge
 ca-certificates           2024.7.4             hbcca054_0    conda-forge
 cached-property           1.5.2                hd8ed1ab_1    conda-forge
 cached_property           1.5.2              pyha770c72_1    conda-forge
 cachetools                5.4.0              pyhd8ed1ab_0    conda-forge
 certifi                   2024.7.4           pyhd8ed1ab_0    conda-forge
 cffi                      1.16.0          py311hb3a22ac_0    conda-forge
 cfgv                      3.3.1              pyhd8ed1ab_0    conda-forge
 charset-normalizer        3.3.2              pyhd8ed1ab_0    conda-forge
 clang                     16.0.6          default_h9e3a008_11    conda-forge
 clang-16                  16.0.6          default_hf981a13_11    conda-forge
 clang-format              16.0.6          default_hf981a13_11    conda-forge
 clang-format-16           16.0.6          default_hf981a13_11    conda-forge
 clang-tools               16.0.6          default_hf981a13_11    conda-forge
 click                     8.1.7           unix_pyh707e725_0    conda-forge
 cloudpickle               3.0.0              pyhd8ed1ab_0    conda-forge
 cmake                     3.30.1               hf8c4bd3_0    conda-forge
 colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
 comm                      0.2.2              pyhd8ed1ab_0    conda-forge
 commonmark                0.9.1                      py_0    conda-forge
 contourpy                 1.2.1           py311h9547e67_0    conda-forge
 coverage                  7.6.0           py311h61187de_0    conda-forge
 cramjam                   2.8.3           py311h46250e7_0    conda-forge
 cryptography              43.0.0          py311hc6616f6_0    conda-forge
 cuda-cccl_linux-64        12.5.39              ha770c72_0    conda-forge
 cuda-crt-dev_linux-64     12.5.82              ha770c72_0    conda-forge
 cuda-crt-tools            12.5.82              ha770c72_0    conda-forge
 cuda-cudart               12.5.82              he02047a_0    conda-forge
 cuda-cudart-dev           12.5.82              he02047a_0    conda-forge
 cuda-cudart-dev_linux-64  12.5.82              h85509e4_0    conda-forge
 cuda-cudart-static        12.5.82              he02047a_0    conda-forge
 cuda-cudart-static_linux-64 12.5.82              h85509e4_0    conda-forge
 cuda-cudart_linux-64      12.5.82              h85509e4_0    conda-forge
 cuda-driver-dev_linux-64  12.5.82              h85509e4_0    conda-forge
 cuda-nvcc                 12.5.82              hcdd1206_0    conda-forge
 cuda-nvcc-dev_linux-64    12.5.82              ha770c72_0    conda-forge
 cuda-nvcc-impl            12.5.82              hd3aeb46_0    conda-forge
 cuda-nvcc-tools           12.5.82              hd3aeb46_0    conda-forge
 cuda-nvcc_linux-64        12.5.82              h8a487aa_0    conda-forge
 cuda-nvrtc                12.5.82              he02047a_0    conda-forge
 cuda-nvrtc-dev            12.5.82              he02047a_0    conda-forge
 cuda-nvtx                 12.5.82              he02047a_0    conda-forge
 cuda-nvtx-dev             12.5.82              ha770c72_0    conda-forge
 cuda-nvvm-dev_linux-64    12.5.82              ha770c72_0    conda-forge
 cuda-nvvm-impl            12.5.82              h59595ed_0    conda-forge
 cuda-nvvm-tools           12.5.82              h59595ed_0    conda-forge
 cuda-python               12.5.0          py311h817de4b_1    conda-forge
 cuda-sanitizer-api        12.5.81              he02047a_0    conda-forge
 cuda-version              12.5                 hd4f0392_3    conda-forge
 cudf                      24.08.00a361    cuda12_py311_240722_ge6537de747_361    rapidsai-nightly
 cudnn                     8.9.7.29             h092f7fd_3    conda-forge
 cupy                      13.2.0          py311he5a987b_0    conda-forge
 cupy-core                 13.2.0          py311h3bdf873_0    conda-forge
 cxx-compiler              1.5.2                hf52228f_0    conda-forge
 cyrus-sasl                2.1.27               h54b06d7_7    conda-forge
 cython                    3.0.10          py311hb755f60_0    conda-forge
 cytoolz                   0.12.3          py311h459d7ec_0    conda-forge
 dask                      2024.7.1a240719 py_g70ae414b_20    dask/label/dev
 dask-core                 2024.7.1a240719 py_gc9f3e39af_7    dask/label/dev
 dask-cuda                 24.08.00a14     py311_240722_gfa226b1_14    rapidsai-nightly
 dask-cudf                 24.08.00a361    cuda12_py311_240722_ge6537de747_361    rapidsai-nightly
 dask-expr                 1.1.8a240719      py_gaebe6eb_5    dask/label/dev
 dask-sql                  2024.5.0        py311h4799004_0    conda-forge
 datasets                  2.14.4             pyhd8ed1ab_0    conda-forge
 debugpy                   1.8.2           py311h4332511_0    conda-forge
 decopatch                 1.4.10             pyhd8ed1ab_0    conda-forge
 decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
 defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
 dill                      0.3.7              pyhd8ed1ab_0    conda-forge
 distlib                   0.3.8              pyhd8ed1ab_0    conda-forge
 distributed               2024.7.1a240719 py_g70ae414b_20    dask/label/dev
 dlpack                    0.8                  h59595ed_3    conda-forge
 dnspython                 2.6.1              pyhd8ed1ab_1    conda-forge
 docutils                  0.19            py311h38be061_1    conda-forge
 doxygen                   1.9.1                hb166930_1    conda-forge
 email-validator           2.2.0              pyhd8ed1ab_0    conda-forge
 email_validator           2.2.0                hd8ed1ab_0    conda-forge
 entrypoints               0.4                pyhd8ed1ab_0    conda-forge
 exceptiongroup            1.2.2              pyhd8ed1ab_0    conda-forge
 execnet                   2.1.1              pyhd8ed1ab_0    conda-forge
 executing                 2.0.1              pyhd8ed1ab_0    conda-forge
 fastapi                   0.111.1            pyhd8ed1ab_0    conda-forge
 fastapi-cli               0.0.4              pyhd8ed1ab_0    conda-forge
 fastavro                  1.9.5           py311h61187de_0    conda-forge
 fastparquet               2024.5.0        py311h18e1886_0    conda-forge
 fastrlock                 0.8.2           py311hb755f60_2    conda-forge
 filelock                  3.15.4             pyhd8ed1ab_0    conda-forge
 flask                     3.0.3              pyhd8ed1ab_0    conda-forge
 flask-cors                4.0.0              pyhd8ed1ab_0    conda-forge
 fmt                       10.2.1               h00ab1b0_0    conda-forge
 fqdn                      1.5.1              pyhd8ed1ab_0    conda-forge
 freetype                  2.12.1               h267a509_2    conda-forge
 frozenlist                1.4.1           py311h459d7ec_0    conda-forge
 fsspec                    2024.6.1           pyhff2d567_0    conda-forge
 future                    1.0.0              pyhd8ed1ab_0    conda-forge
 gcc                       11.4.0              h602e360_13    conda-forge
 gcc_impl_linux-64         11.4.0              h00c12a0_13    conda-forge
 gcc_linux-64              11.4.0               ha077dfb_4    conda-forge
 gflags                    2.2.2             he1b5a44_1004    conda-forge
 glog                      0.7.1                hbabe93e_0    conda-forge
 gmp                       6.3.0                hac33072_2    conda-forge
 gmpy2                     2.1.5           py311hc4f1f91_1    conda-forge
 greenlet                  3.0.3           py311hb755f60_0    conda-forge
 gxx                       11.4.0              h602e360_13    conda-forge
 gxx_impl_linux-64         11.4.0              h634f3ee_13    conda-forge
 gxx_linux-64              11.4.0               h35bfe5d_4    conda-forge
 h11                       0.14.0             pyhd8ed1ab_0    conda-forge
 h2                        4.1.0              pyhd8ed1ab_0    conda-forge
 hpack                     4.0.0              pyh9f0ad1d_0    conda-forge
 httpcore                  1.0.5              pyhd8ed1ab_0    conda-forge
 httpx                     0.27.0             pyhd8ed1ab_0    conda-forge
 huggingface_hub           0.23.5             pyhd8ed1ab_0    conda-forge
 hyperframe                6.0.1              pyhd8ed1ab_0    conda-forge
 hypothesis                6.108.2            pyha770c72_0    conda-forge
 icu                       75.1                 he02047a_0    conda-forge
 identify                  2.6.0              pyhd8ed1ab_0    conda-forge
 idna                      3.7                pyhd8ed1ab_0    conda-forge
 imagesize                 1.4.1              pyhd8ed1ab_0    conda-forge
 importlib-metadata        8.0.0              pyha770c72_0    conda-forge
 importlib-resources       6.4.0              pyhd8ed1ab_0    conda-forge
 importlib_metadata        8.0.0                hd8ed1ab_0    conda-forge
 importlib_resources       6.4.0              pyhd8ed1ab_0    conda-forge
 iniconfig                 2.0.0              pyhd8ed1ab_0    conda-forge
 ipykernel                 6.29.5             pyh3099207_0    conda-forge
 ipython                   8.26.0             pyh707e725_0    conda-forge
 ipywidgets                8.1.3              pyhd8ed1ab_0    conda-forge
 isoduration               20.11.0            pyhd8ed1ab_0    conda-forge
 itsdangerous              2.2.0              pyhd8ed1ab_0    conda-forge
 jedi                      0.19.1             pyhd8ed1ab_0    conda-forge
 jinja2                    3.1.4              pyhd8ed1ab_0    conda-forge
 jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
 joserfc                   1.0.0              pyhd8ed1ab_0    conda-forge
 json5                     0.9.25             pyhd8ed1ab_0    conda-forge
 jsondiff                  2.0.0              pyhd8ed1ab_0    conda-forge
 jsonpointer               3.0.0           py311h38be061_0    conda-forge
 jsonschema                4.23.0             pyhd8ed1ab_0    conda-forge
 jsonschema-path           0.3.3              pyhd8ed1ab_0    conda-forge
 jsonschema-specifications 2023.12.1          pyhd8ed1ab_0    conda-forge
 jsonschema-with-format-nongpl 4.23.0               hd8ed1ab_0    conda-forge
 jupyter                   1.0.0             pyhd8ed1ab_10    conda-forge
 jupyter-cache             1.0.0              pyhd8ed1ab_0    conda-forge
 jupyter-lsp               2.2.5              pyhd8ed1ab_0    conda-forge
 jupyter_client            8.6.2              pyhd8ed1ab_0    conda-forge
 jupyter_console           6.6.3              pyhd8ed1ab_0    conda-forge
 jupyter_core              5.7.2           py311h38be061_0    conda-forge
 jupyter_events            0.10.0             pyhd8ed1ab_0    conda-forge
 jupyter_server            2.14.2             pyhd8ed1ab_0    conda-forge
 jupyter_server_terminals  0.5.3              pyhd8ed1ab_0    conda-forge
 jupyterlab                4.2.4              pyhd8ed1ab_0    conda-forge
 jupyterlab_pygments       0.3.0              pyhd8ed1ab_1    conda-forge
 jupyterlab_server         2.27.3             pyhd8ed1ab_0    conda-forge
 jupyterlab_widgets        3.0.11             pyhd8ed1ab_0    conda-forge
 kernel-headers_linux-64   3.10.0              h4a8ded7_16    conda-forge
 keyutils                  1.6.1                h166bdaf_0    conda-forge
 krb5                      1.21.3               h659f571_0    conda-forge
 lazy-object-proxy         1.10.0          py311h459d7ec_0    conda-forge
 lcms2                     2.16                 hb7c19ff_0    conda-forge
 ld_impl_linux-64          2.40                 hf3520f5_7    conda-forge
 lerc                      4.0.0                h27087fc_0    conda-forge
 libabseil                 20240116.2      cxx17_he02047a_1    conda-forge
 libarrow                  16.1.0          h34456a7_14_cpu    conda-forge
 libarrow-acero            16.1.0          he02047a_14_cpu    conda-forge
 libarrow-dataset          16.1.0          he02047a_14_cpu    conda-forge
 libarrow-substrait        16.1.0          hc9a23c6_14_cpu    conda-forge
 libblas                   3.9.0           22_linux64_openblas    conda-forge
 libbrotlicommon           1.1.0                hd590300_1    conda-forge
 libbrotlidec              1.1.0                hd590300_1    conda-forge
 libbrotlienc              1.1.0                hd590300_1    conda-forge
 libcblas                  3.9.0           22_linux64_openblas    conda-forge
 libclang-cpp16            16.0.6          default_hf981a13_11    conda-forge
 libclang13                18.1.8          default_h9def88c_1    conda-forge
 libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
 libcublas                 12.5.3.2             he02047a_0    conda-forge
 libcudf                   24.08.00a361    cuda12_240722_ge6537de747_361    rapidsai-nightly
 libcufft                  11.2.3.61            he02047a_0    conda-forge
 libcufile                 1.10.1.7             he02047a_0    conda-forge
 libcufile-dev             1.10.1.7             he02047a_0    conda-forge
 libcurand                 10.3.6.82            he02047a_0    conda-forge
 libcurand-dev             10.3.6.82            he02047a_0    conda-forge
 libcurl                   8.8.0                hca28451_1    conda-forge
 libcusolver               11.6.3.83            he02047a_0    conda-forge
 libcusparse               12.5.1.3             he02047a_0    conda-forge
 libdeflate                1.20                 hd590300_0    conda-forge
 libedit                   3.1.20191231         he28a2e2_2    conda-forge
 libev                     4.33                 hd590300_2    conda-forge
 libevent                  2.1.12               hf998b51_1    conda-forge
 libexpat                  2.6.2                h59595ed_0    conda-forge
 libffi                    3.4.2                h7f98852_5    conda-forge
 libgcc-devel_linux-64     11.4.0             h8f596e0_113    conda-forge
 libgcc-ng                 14.1.0               h77fa898_0    conda-forge
 libgfortran-ng            14.1.0               h69a702a_0    conda-forge
 libgfortran5              14.1.0               hc5f4f2c_0    conda-forge
 libgomp                   14.1.0               h77fa898_0    conda-forge
 libgoogle-cloud           2.26.0               h26d7fe4_0    conda-forge
 libgoogle-cloud-storage   2.26.0               ha262f82_0    conda-forge
 libgrpc                   1.62.2               h15f2491_0    conda-forge
 libhwloc                  2.11.1          default_hecaa2ac_1000    conda-forge
 libiconv                  1.17                 hd590300_2    conda-forge
 libjpeg-turbo             3.0.0                hd590300_1    conda-forge
 libkvikio                 24.08.00a       cuda12_240722_ge7bc8b2_19    rapidsai-nightly
 liblapack                 3.9.0           22_linux64_openblas    conda-forge
 libllvm14                 14.0.6               hcd5def8_4    conda-forge
 libllvm16                 16.0.6               hb3ce162_3    conda-forge
 libllvm18                 18.1.8               h8b73ec9_1    conda-forge
 libmagma                  2.7.2                h173bb3b_2    conda-forge
 libmagma_sparse           2.7.2                h173bb3b_3    conda-forge
 libnghttp2                1.58.0               h47da74e_1    conda-forge
 libnsl                    2.0.1                hd590300_0    conda-forge
 libntlm                   1.4               h7f98852_1002    conda-forge
 libnvjitlink              12.5.82              he02047a_0    conda-forge
 libopenblas               0.3.27          pthreads_hac2b453_1    conda-forge
 libparquet                16.1.0          h9e5060d_14_cpu    conda-forge
 libpng                    1.6.43               h2797004_0    conda-forge
 libprotobuf               4.25.3               h08a7969_0    conda-forge
 librdkafka                1.9.2                ha5a0de0_2    conda-forge
 libre2-11                 2023.09.01           h5a48ba9_2    conda-forge
 librmm                    24.08.00a31     cuda12_240722_g5f786ba3_31    rapidsai-nightly
 libsanitizer              11.4.0              h5763a12_13    conda-forge
 libsodium                 1.0.18               h36c2ea0_1    conda-forge
 libsqlite                 3.46.0               hde9e2c9_0    conda-forge
 libssh2                   1.11.0               h0841786_0    conda-forge
 libstdcxx-devel_linux-64  11.4.0             h8f596e0_113    conda-forge
 libstdcxx-ng              14.1.0               hc0a3c3a_0    conda-forge
 libthrift                 0.19.0               hb90f79a_1    conda-forge
 libtiff                   4.6.0                h1dd3fc0_3    conda-forge
 libtorch                  2.3.1           cuda120_h2b0da52_300    conda-forge
 libutf8proc               2.8.0                h166bdaf_0    conda-forge
 libuuid                   2.38.1               h0b41bf4_0    conda-forge
 libuv                     1.48.0               hd590300_0    conda-forge
 libwebp-base              1.4.0                hd590300_0    conda-forge
 libxcb                    1.16                 hd590300_0    conda-forge
 libxcrypt                 4.4.36               hd590300_1    conda-forge
 libxml2                   2.12.7               he7c6b58_4    conda-forge
 libzlib                   1.3.1                h4ab18f5_1    conda-forge
 livereload                2.7.0              pyhd8ed1ab_0    conda-forge
 llvm-openmp               18.1.8               hf5423f3_0    conda-forge
 llvmlite                  0.43.0          py311hbde99c3_0    conda-forge
 locket                    1.0.0              pyhd8ed1ab_0    conda-forge
 lz4                       4.3.3           py311h38e4bf4_0    conda-forge
 lz4-c                     1.9.4                hcb278e6_0    conda-forge
 make                      4.3                  hd18ef5c_1    conda-forge
 makefun                   1.15.4             pyhd8ed1ab_0    conda-forge
 markdown                  3.6                pyhd8ed1ab_0    conda-forge
 markdown-it-py            3.0.0              pyhd8ed1ab_0    conda-forge
 markupsafe                2.1.5           py311h459d7ec_0    conda-forge
 matplotlib-inline         0.1.7              pyhd8ed1ab_0    conda-forge
 mdit-py-plugins           0.4.1              pyhd8ed1ab_0    conda-forge
 mdurl                     0.1.2              pyhd8ed1ab_0    conda-forge
 mistune                   3.0.2              pyhd8ed1ab_0    conda-forge
 mkl                       2023.2.0         h84fe81f_50496    conda-forge
 moto                      5.0.11             pyhd8ed1ab_0    conda-forge
 mpc                       1.3.1                hfe3b2da_0    conda-forge
 mpfr                      4.2.1                h9458935_1    conda-forge
 mpmath                    1.3.0              pyhd8ed1ab_0    conda-forge
 msgpack-python            1.0.8           py311h52f7536_0    conda-forge
 multidict                 6.0.5           py311h459d7ec_0    conda-forge
 multiprocess              0.70.15         py311h459d7ec_1    conda-forge
 myst-nb                   1.1.1              pyhd8ed1ab_0    conda-forge
 myst-parser               3.0.1              pyhd8ed1ab_0    conda-forge
 nbclient                  0.10.0             pyhd8ed1ab_0    conda-forge
 nbconvert                 7.16.4               hd8ed1ab_1    conda-forge
 nbconvert-core            7.16.4             pyhd8ed1ab_1    conda-forge
 nbconvert-pandoc          7.16.4               hd8ed1ab_1    conda-forge
 nbformat                  5.10.4             pyhd8ed1ab_0    conda-forge
 nbsphinx                  0.9.4              pyhd8ed1ab_0    conda-forge
 nccl                      2.22.3.1             hbc370b7_0    conda-forge
 ncurses                   6.5                  h59595ed_0    conda-forge
 nest-asyncio              1.6.0              pyhd8ed1ab_0    conda-forge
 networkx                  3.3                pyhd8ed1ab_1    conda-forge
 ninja                     1.12.1               h297d8ca_0    conda-forge
 nodeenv                   1.9.1              pyhd8ed1ab_0    conda-forge
 notebook                  7.2.1              pyhd8ed1ab_0    conda-forge
 notebook-shim             0.2.4              pyhd8ed1ab_0    conda-forge
 numba                     0.60.0          py311h4bc866e_0    conda-forge
 numpy                     1.26.4          py311h64a7726_0    conda-forge
 numpydoc                  1.7.0              pyhd8ed1ab_1    conda-forge
 nvcomp                    3.0.6                h10b603f_0    conda-forge
 nvtx                      0.2.10          py311h459d7ec_0    conda-forge
 openapi-schema-validator  0.6.2              pyhd8ed1ab_0    conda-forge
 openapi-spec-validator    0.7.1              pyhd8ed1ab_0    conda-forge
 openjpeg                  2.5.2                h488ebb8_0    conda-forge
 openssl                   3.3.1                h4bc722e_2    conda-forge
 orc                       2.0.1                h17fec99_1    conda-forge
 overrides                 7.7.0              pyhd8ed1ab_0    conda-forge
 packaging                 24.1               pyhd8ed1ab_0    conda-forge
 pandas                    2.2.2           py311h14de704_1    conda-forge
 pandoc                    3.2.1                ha770c72_0    conda-forge
 pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
 parso                     0.8.4              pyhd8ed1ab_0    conda-forge
 partd                     1.4.2              pyhd8ed1ab_0    conda-forge
 pathable                  0.4.3              pyhd8ed1ab_0    conda-forge
 pathspec                  0.12.1             pyhd8ed1ab_0    conda-forge
 pexpect                   4.9.0              pyhd8ed1ab_0    conda-forge
 pickleshare               0.7.5                   py_1003    conda-forge
 pillow                    10.4.0          py311h82a398c_0    conda-forge
 pip                       24.0               pyhd8ed1ab_0    conda-forge
 pkgutil-resolve-name      1.3.10             pyhd8ed1ab_1    conda-forge
 platformdirs              4.2.2              pyhd8ed1ab_0    conda-forge
 pluggy                    1.5.0              pyhd8ed1ab_0    conda-forge
 pre-commit                3.7.1              pyha770c72_0    conda-forge
 prometheus_client         0.20.0             pyhd8ed1ab_0    conda-forge
 prompt-toolkit            3.0.47             pyha770c72_0    conda-forge
 prompt_toolkit            3.0.47               hd8ed1ab_0    conda-forge
 psutil                    6.0.0           py311h331c9d8_0    conda-forge
 pthread-stubs             0.4               h36c2ea0_1001    conda-forge
 ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
 pure_eval                 0.2.3              pyhd8ed1ab_0    conda-forge
 py-cpuinfo                9.0.0              pyhd8ed1ab_0    conda-forge
 pyarrow                   16.1.0          py311hbd00459_4    conda-forge
 pyarrow-core              16.1.0          py311h8c3dac4_4_cpu    conda-forge
 pycparser                 2.22               pyhd8ed1ab_0    conda-forge
 pydantic                  2.8.2              pyhd8ed1ab_0    conda-forge
 pydantic-core             2.20.1          py311hb3a8bbb_0    conda-forge
 pydata-sphinx-theme       0.15.4             pyhd8ed1ab_0    conda-forge
 pygments                  2.18.0             pyhd8ed1ab_0    conda-forge
 pynvjitlink               0.3.0           py311hd269673_0    rapidsai
 pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
 pyparsing                 3.1.2              pyhd8ed1ab_0    conda-forge
 pysocks                   1.7.1              pyha2e5f31_6    conda-forge
 pytest                    7.4.4              pyhd8ed1ab_0    conda-forge
 pytest-benchmark          4.0.0              pyhd8ed1ab_0    conda-forge
 pytest-cases              3.8.5              pyhd8ed1ab_0    conda-forge
 pytest-cov                5.0.0              pyhd8ed1ab_0    conda-forge
 pytest-xdist              3.6.1              pyhd8ed1ab_0    conda-forge
 python                    3.11.9          hb806964_0_cpython    conda-forge
 python-confluent-kafka    1.9.2           py311hd4cff14_2    conda-forge
 python-dateutil           2.9.0              pyhd8ed1ab_0    conda-forge
 python-fastjsonschema     2.20.0             pyhd8ed1ab_0    conda-forge
 python-json-logger        2.0.7              pyhd8ed1ab_0    conda-forge
 python-multipart          0.0.9              pyhd8ed1ab_0    conda-forge
 python-tzdata             2024.1             pyhd8ed1ab_0    conda-forge
 python-xxhash             3.4.1           py311h459d7ec_0    conda-forge
 python_abi                3.11                    4_cp311    conda-forge
 pytorch                   2.3.1           cuda120_py311hf6aebf0_300    conda-forge
 pytz                      2024.1             pyhd8ed1ab_0    conda-forge
 pyyaml                    6.0.1           py311h459d7ec_1    conda-forge
 pyzmq                     26.0.3          py311h08a0b41_0    conda-forge
 qtconsole-base            5.5.2              pyha770c72_0    conda-forge
 qtpy                      2.4.1              pyhd8ed1ab_0    conda-forge
 rapids-build-backend      0.3.2                      py_0    rapidsai
 rapids-dask-dependency    24.08.00a5                 py_0    rapidsai-nightly
 rapids-dependency-file-generator 1.14.0                     py_0    rapidsai
 re2                       2023.09.01           h7f4b329_2    conda-forge
 readline                  8.2                  h8228510_1    conda-forge
 recommonmark              0.7.1              pyhd8ed1ab_0    conda-forge
 referencing               0.35.1             pyhd8ed1ab_0    conda-forge
 regex                     2024.5.15       py311h331c9d8_0    conda-forge
 requests                  2.32.3             pyhd8ed1ab_0    conda-forge
 responses                 0.25.3             pyhd8ed1ab_0    conda-forge
 rfc3339-validator         0.1.4              pyhd8ed1ab_0    conda-forge
 rfc3986-validator         0.1.1              pyh9f0ad1d_0    conda-forge
 rhash                     1.4.4                hd590300_0    conda-forge
 rich                      13.7.1             pyhd8ed1ab_0    conda-forge
 rmm                       24.08.00a31     cuda12_py311_240722_g5f786ba3_31    rapidsai-nightly
 rpds-py                   0.19.0          py311hb3a8bbb_0    conda-forge
 s2n                       1.4.17               he19d79f_0    conda-forge
 s3fs                      2024.6.1           pyhd8ed1ab_0    conda-forge
 s3transfer                0.10.2             pyhd8ed1ab_0    conda-forge
 safetensors               0.4.3           py311h46250e7_0    conda-forge
 scikit-build-core         0.9.8              pyh4af843d_0    conda-forge
 scipy                     1.14.0          py311h517d4fd_1    conda-forge
 send2trash                1.8.3              pyh0d859eb_0    conda-forge
 setuptools                71.0.4             pyhd8ed1ab_0    conda-forge
 shellingham               1.5.4              pyhd8ed1ab_0    conda-forge
 six                       1.16.0             pyh6c4a22f_0    conda-forge
 sleef                     3.6.1                h3400bea_1    conda-forge
 snappy                    1.2.1                ha2e4443_0    conda-forge
 sniffio                   1.3.1              pyhd8ed1ab_0    conda-forge
 snowballstemmer           2.2.0              pyhd8ed1ab_0    conda-forge
 sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
 soupsieve                 2.5                pyhd8ed1ab_1    conda-forge
 spdlog                    1.12.0               hd2e6256_2    conda-forge
 sphinx                    6.2.1              pyhd8ed1ab_0    conda-forge
 sphinx-autobuild          2024.4.16          pyhd8ed1ab_0    conda-forge
 sphinx-copybutton         0.5.2              pyhd8ed1ab_0    conda-forge
 sphinx-markdown-tables    0.0.17             pyh6c4a22f_0    conda-forge
 sphinx-remove-toctrees    1.0.0.post1        pyhd8ed1ab_0    conda-forge
 sphinxcontrib-applehelp   1.0.8              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-devhelp     1.0.6              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-htmlhelp    2.0.6              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-jsmath      1.0.1              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-qthelp      1.0.8              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-serializinghtml 1.1.10             pyhd8ed1ab_0    conda-forge
 sphinxcontrib-websupport  1.2.7              pyhd8ed1ab_0    conda-forge
 sqlalchemy                2.0.31          py311h331c9d8_0    conda-forge
 stack_data                0.6.2              pyhd8ed1ab_0    conda-forge
 starlette                 0.37.2             pyhd8ed1ab_0    conda-forge
 streamz                   0.6.4              pyh6c4a22f_0    conda-forge
 sympy                     1.13.0          pypyh2585a3b_103    conda-forge
 sysroot_linux-64          2.17                h4a8ded7_16    conda-forge
 tabulate                  0.9.0              pyhd8ed1ab_1    conda-forge
 tbb                       2021.12.0            h434a139_3    conda-forge
 tblib                     3.0.0              pyhd8ed1ab_0    conda-forge
 terminado                 0.18.1             pyh0d859eb_0    conda-forge
 tinycss2                  1.3.0              pyhd8ed1ab_0    conda-forge
 tk                        8.6.13          noxft_h4845f30_101    conda-forge
 tokenizers                0.15.2          py311h6640629_0    conda-forge
 toml                      0.10.2             pyhd8ed1ab_0    conda-forge
 tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
 tomlkit                   0.13.0             pyha770c72_0    conda-forge
 toolz                     0.12.1             pyhd8ed1ab_0    conda-forge
 tornado                   6.4.1           py311h331c9d8_0    conda-forge
 tqdm                      4.66.4             pyhd8ed1ab_0    conda-forge
 traitlets                 5.14.3             pyhd8ed1ab_0    conda-forge
 transformers              4.39.3             pyhd8ed1ab_0    conda-forge
 typer                     0.12.3             pyhd8ed1ab_0    conda-forge
 typer-slim                0.12.3             pyhd8ed1ab_0    conda-forge
 typer-slim-standard       0.12.3               hd8ed1ab_0    conda-forge
 types-python-dateutil     2.9.0.20240316     pyhd8ed1ab_0    conda-forge
 types-pyyaml              6.0.12.20240311    pyhd8ed1ab_0    conda-forge
 typing-extensions         4.12.2               hd8ed1ab_0    conda-forge
 typing_extensions         4.12.2             pyha770c72_0    conda-forge
 typing_utils              0.1.0              pyhd8ed1ab_0    conda-forge
 tzdata                    2024a                h0c530f3_0    conda-forge
 tzlocal                   5.2             py311h38be061_0    conda-forge
 ukkonen                   1.0.1           py311h9547e67_4    conda-forge
 uri-template              1.3.0              pyhd8ed1ab_0    conda-forge
 urllib3                   2.2.2              pyhd8ed1ab_1    conda-forge
 uvicorn                   0.30.3          py311h38be061_0    conda-forge
 virtualenv                20.26.3            pyhd8ed1ab_0    conda-forge
 watchfiles                0.22.0          py311h5ecf98a_0    conda-forge
 wcwidth                   0.2.13             pyhd8ed1ab_0    conda-forge
 webcolors                 24.6.0             pyhd8ed1ab_0    conda-forge
 webencodings              0.5.1              pyhd8ed1ab_2    conda-forge
 websocket-client          1.8.0              pyhd8ed1ab_0    conda-forge
 websockets                12.0            py311h459d7ec_0    conda-forge
 werkzeug                  3.0.3              pyhd8ed1ab_0    conda-forge
 wheel                     0.43.0             pyhd8ed1ab_1    conda-forge
 widgetsnbextension        4.0.11             pyhd8ed1ab_0    conda-forge
 wrapt                     1.16.0          py311h459d7ec_0    conda-forge
 xmltodict                 0.13.0             pyhd8ed1ab_0    conda-forge
 xorg-libxau               1.0.11               hd590300_0    conda-forge
 xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
 xxhash                    0.8.2                hd590300_0    conda-forge
 xyzservices               2024.6.0           pyhd8ed1ab_0    conda-forge
 xz                        5.2.6                h166bdaf_0    conda-forge
 yaml                      0.2.5                h7f98852_2    conda-forge
 yarl                      1.9.4           py311h459d7ec_0    conda-forge
 zeromq                    4.3.5                h75354e8_4    conda-forge
 zict                      3.0.0              pyhd8ed1ab_0    conda-forge
 zipp                      3.19.2             pyhd8ed1ab_0    conda-forge
 zlib                      1.3.1                h4ab18f5_1    conda-forge
 zstandard                 0.23.0          py311h5cd10c7_0    conda-forge
 zstd                      1.5.6                ha6fb4c9_0    conda-forge
@rjzamora
Member

rjzamora commented Aug 2, 2024

Hi @Huilin-Li!

There is actually a lot going on in your example. So, let's break things down a bit:

You are trying to use both dask.dataframe and cudf (via the "cudf" backend configuration in dask.dataframe) to accelerate your data-processing workflow. When you use dask.dataframe with the "pandas" backend (the default), everything works fine for you. However, when you try using the "cudf" backend, you see errors for the following operations:

  • ddf.apply
  • ddf.explode

The ddf.apply error probably means that you are trying to apply a user-defined function (UDF) that cudf is unable to support. In order to get to the bottom of this error, I highly recommend that you provide a simple cudf-only reproducer that someone can run without having access to your data (or to the Bio library).

The error you see for ddf.explode is definitely a bug in the cudf backend of dask.dataframe. However, it doesn't seem like ddf.explode works with the pandas backend either when I attempt a local reproducer. Therefore, it would be nice to have a standalone reproducer that works for you with the pandas backend.

Overall, I appreciate that you are raising these issues and sharing the details of your workflow! I'm only suggesting that you share simpler stand-alone reproducers so that it will be much easier for someone to jump in and help.
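
A standalone reproducer of the kind suggested above might look like the following minimal sketch. It uses plain pandas so that it runs without a GPU, the Bio library, or any input file; the column names (`ID`, `col1`) and the toy data are assumptions for illustration and are not taken from the original workflow.

```python
import pandas as pd

# Hypothetical toy data: "col1" holds one list per row.
pdf = pd.DataFrame({"ID": [0, 1], "col1": [[1, 2], [3]]})

# explode emits one output row per list element and repeats the other columns.
out = pdf.explode("col1")
print(out)
```

Once this behaves as expected with pandas, the same frame could be wrapped with dd.from_pandas and run under each "dataframe.backend" setting to see where the behavior diverges.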

@Huilin-Li
Author

Huilin-Li commented Aug 4, 2024

@rjzamora Hi, please check this much simpler example. I tested it, and it reproduces the same error. I also have a finding to share: I suspect the problem is in dd.read_parquet("./test.parquet"), because the calculation works fine when the Dask DataFrame is created directly from a pandas.DataFrame.

pdf = pd.DataFrame({"ID":  list(range(100)),  "seq": ["AAAAA","BB","CCC","DD","EEEEEE"]*20})

# This works correctly.
ddf = dd.from_pandas(pdf, npartitions=3)

# This fails.
ddf_tmp = dd.from_pandas(pdf, npartitions=3)
ddf_tmp.to_parquet("./test.parquet", engine="pyarrow")  # fastparquet is installed, but only engine="pyarrow" works here
ddf = dd.read_parquet("./test.parquet")

example

import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask
dask.config.set({"dataframe.backend": "cudf"})

pdf = pd.DataFrame({"ID":  list(range(100)),
                    "seq": ["AAAAA","BB","CCC","DD","EEEEEE"]*20})

ddf_tmp = dd.from_pandas(pdf, npartitions=3)
ddf_tmp.to_parquet("./test.parquet", engine="pyarrow" )  

ddf = dd.read_parquet("./test.parquet")
print(ddf)

def myfunc(df, arg):
    return apply_series(df.seq, arg)
def apply_series(series, arg):
    return process_row(s=series, arg=arg)
def process_row(s, arg):
    to_int = {"A":10, "B": 12, "C":13, "D":14, "E":15}
    res = []
    res_tmp = 0
    for i in range(arg):
        res_tmp = (res_tmp << 4) + to_int[s[i]] 
    res.append(res_tmp)
    for j in range(len(s)-arg):
        res_tmp = (res_tmp >> 4)*arg + to_int[s[j+arg]]
        res.append(res_tmp)
    return res

test = ddf.apply(myfunc, axis=1, arg=2, meta=("test", 'int')) # arg must be 2 in this test case
print(test.compute())

ddf["test"] = test
print(ddf.compute())

explode_test = ddf.explode('test')
print(explode_test.compute())

error

(cudf_dev) [lihuilin@gvno02 MYANAWORK]$ python test.py
Dask DataFrame Structure:
                  ID     seq
npartitions=3               
               int64  object
                 ...     ...
                 ...     ...
                 ...     ...
Dask Name: read_parquet, 1 expression
Expr=ReadParquetFSSpec(3de3aeb)
Traceback (most recent call last):
  File "/storage/lihuilin/MYANAWORK/test.py", line 33, in <module>
    print(test.compute())
          ^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/dask_expr/_expr.py", line 3758, in _execute_task
    return dask.core.get(graph, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/dataframe.py", line 4683, in apply
    return self._apply(func, _get_row_kernel, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lihuilin/miniconda3/envs/cudf_dev/lib/python3.11/site-packages/cudf/core/indexed_frame.py", line 3429, in _apply
    raise ValueError("UDFs using **kwargs are not yet supported.")
ValueError: UDFs using **kwargs are not yet supported.

@rjzamora
Member

Sorry for the delay @Huilin-Li (I have been away).

Thank you for sharing a simpler example. There may be Dask-related issues in explode, but the primary problem you are reporting here has nothing to do with Dask. You are just finding that UDF support (i.e. apply support) is less flexible in cudf than it is in pandas.

You will find a similar error if you remove the Dask and Parquet-related code altogether. Even if you simplify the logic to avoid using any kwargs, cudf will still complain that your UDF (myfunc) cannot be compiled (ValueError: user defined function compilation failed.):

import cudf

ser = cudf.Series(["AAAAA","BB","CCC","DD","EEEEEE"]*20)

def myfunc(s):
    arg = 2
    to_int = {"A":10, "B": 12, "C":13, "D":14, "E":15}
    res = []
    res_tmp = 0
    for i in range(arg):
        res_tmp = (res_tmp << 4) + to_int[s[i]] 
    res.append(res_tmp)
    for j in range(len(s)-arg):
        res_tmp = (res_tmp >> 4)*arg + to_int[s[j+arg]]
        res.append(res_tmp)
    return res

ser.apply(myfunc)

@brandon-b-miller - Perhaps you have some relevant advice on this subject?

@Huilin-Li
Author

Hi @rjzamora @brandon-b-miller, do you have any suggestions for this problem? Thanks in advance!

@brandon-b-miller
Contributor

Hi @Huilin-Li, string support within UDFs is somewhat limited for now. However, looking over your UDF, it seems to consist of features that are mostly on the roadmap. For now there are a couple of things missing:

  1. Right now, we don't have __getitem__ for strings within UDFs. So operations like 'abc'[1] will fail to compile. This is the source of the first error where numba reports a missing overload for getitem(Masked(string_view), int64), which actually comes from the s[i] part of the UDF. Nothing really prevents us from implementing this feature over the next few releases, if desired.
  2. We are not yet able to return list types from UDFs. This is a harder problem to solve in the short term, but a workaround for your UDF might be to return a string delimited by some character and then follow up splitting the result with a cudf split operation to get a list. Unfortunately this does not solve problem (1) for you.

I wish I had a better answer for getting this UDF to run with today's cuDF, so you might have to fall back on higher-level .str APIs within cuDF Python to do the same thing if possible. All that said, I'm hoping we're able to get to a point where this will work in the short to medium term.
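For reference, the "return a delimited string, then split" idea from point (2) can be sketched in plain Python. This is only an illustration of the data shape, not something that compiles as a cuDF UDF today: it still indexes into the string, so problem (1) remains. The names process_row_delimited and split_delimited are hypothetical helpers introduced here; on the GPU, the split step would presumably be a cuDF string split (something like Series.str.split) followed by a cast back to integers.

```python
# Lookup table copied from the reproducer earlier in this thread.
TO_INT = {"A": 10, "B": 12, "C": 13, "D": 14, "E": 15}


def process_row(s, arg):
    """Original per-row logic from the reproducer: returns a list of ints."""
    res = []
    res_tmp = 0
    for i in range(arg):
        res_tmp = (res_tmp << 4) + TO_INT[s[i]]
    res.append(res_tmp)
    for j in range(len(s) - arg):
        res_tmp = (res_tmp >> 4) * arg + TO_INT[s[j + arg]]
        res.append(res_tmp)
    return res


def process_row_delimited(s, arg, sep=","):
    """Hypothetical variant: same values, joined into a single string
    so that the function returns a scalar instead of a list."""
    return sep.join(str(v) for v in process_row(s, arg))


def split_delimited(text, sep=","):
    """Plain-Python stand-in for the follow-up split step that recovers
    the list of integers from the delimited string."""
    return [int(v) for v in text.split(sep)]


samples = ["AAAAA", "BB", "CCC", "DD", "EEEEEE"]
encoded = [process_row_delimited(s, 2) for s in samples]
decoded = [split_delimited(e) for e in encoded]

print(encoded[0])  # 170,30,12,10
print(decoded[0])  # [170, 30, 12, 10]
```

The round trip gives back exactly what the list-returning version produces, so a list column (and a later explode) can still be recovered after the split.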

Projects
Status: In Progress
Development

No branches or pull requests

3 participants