ENH: ListDtype / ListArray #35176

Closed
TomAugspurger opened this issue Jul 8, 2020 · 13 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).

Comments

@TomAugspurger
Contributor

This issue is for adding a ListDtype. This might be useful on its own, and will be useful for #35169 when we have string operations that return a List of values per scalar element.

I think the primary points to discuss are around

  1. How the value_type of the List, the T in List[T], should be specified by the user
  2. How, if at all, to switch between the list_ and large_list types.

xref rapidsai/cudf#5610, where cudf is implementing a ListDtype. Let's chime in over there if we have any thoughts.

@TomAugspurger TomAugspurger added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). labels Jul 8, 2020
@jbrockmendel
Member

IIRC numba has a TypedList, could do something like that for the str.split-like ops

@xhochy
Contributor

xhochy commented Jul 10, 2020

One of the biggest challenges with a ListType is probably how to differentiate between scalars and list-likes in the pandas API. There are quite a few places where you can pass in either, and the API behaviour will be slightly different. In the case of a ListDtype, the scalar is also a list-like, which makes it harder to decide which code path should be taken. We can probably use the type information to decide whether we actually have a scalar or not, but at the moment this information is sadly not yet present in the dispatching interfaces. For fletcher, I have so far been a bit lazy about this and xfail a lot of these cases: https://github.com/xhochy/fletcher/blob/18ac1a348fdd6ccfb096ec5e27c9dedc1e7fc837/tests/test_pandas_extension.py#L74-L86

@jorisvandenbossche
Member

Yes, this is indeed a general problem we need to solve in pandas. We have also been running into this with GeoPandas (e.g. #26333), and you already run into corner cases when using iterable elements in object dtype. Other related issues: #27911, #35131

We will probably need some mechanism to let the dtype decide if some value can be a scalar or not.
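A rough sketch of what such a mechanism could look like, in plain Python. Everything here is hypothetical: `HypotheticalDtype`, `HypotheticalListDtype`, `is_scalar`, and `classify` are illustrative names and not part of pandas.

```python
from collections.abc import Iterable


def is_list_like(value):
    """Generic check: any non-string iterable counts as list-like."""
    return isinstance(value, Iterable) and not isinstance(value, (str, bytes))


class HypotheticalDtype:
    """Default rule: a scalar is anything that is not list-like."""

    def is_scalar(self, value):
        return not is_list_like(value)


class HypotheticalListDtype(HypotheticalDtype):
    """Assumes a flat List[T]: one list of non-list items is a single element."""

    def is_scalar(self, value):
        if value is None:
            return True
        return isinstance(value, list) and not any(
            isinstance(item, list) for item in value
        )


def classify(value, dtype):
    """Ask the dtype first instead of guessing from the value alone."""
    if dtype.is_scalar(value):
        return "scalar"
    if is_list_like(value):
        return "list-like"
    return "invalid"
```

With this sketch, `[1, 2]` is list-like for the default dtype but a scalar for the list dtype, while `[[1], [2]]` is treated by the list dtype as a sequence of two scalars.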

@jorisvandenbossche
Member

For storing list-like data, I think that will be relatively straightforward (either just with pyarrow, or even directly, since the raw memory layout in Arrow is "just" two arrays with values and offsets).

But right now there are not yet many operations or kernels included in Arrow to work on nested data, I think. In the meantime, awkward-array might be an interesting option to explore to perform more operations on such data (https://github.com/scikit-hep/awkward-array/)
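The values-and-offsets layout mentioned above can be sketched in a few lines of plain Python. This is a simplified illustration with hypothetical helper names; it leaves out the validity bitmap that Arrow uses to mark null entries.

```python
def to_values_and_offsets(lists):
    """Flatten a list of lists into one values list plus an offsets list."""
    values = []
    offsets = [0]
    for item in lists:
        values.extend(item)
        # Each offset marks where the next element starts in `values`.
        offsets.append(len(values))
    return values, offsets


def get_item(values, offsets, i):
    """Recover the i-th list as the slice between two consecutive offsets."""
    return values[offsets[i]:offsets[i + 1]]


values, offsets = to_values_and_offsets([[1, 2], [], [3, 4, 5]])
third = get_item(values, offsets, 2)
```

An empty list shows up as two equal consecutive offsets, so no space is spent in the values array for it.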

@ananis25

ananis25 commented Dec 7, 2020

Could I request that a pandas extension type for n-dim numpy arrays also be considered? Though it probably strays from the pandas semantics of treating a Series as an array-like of scalars.

For a lot of data analysis work, the features are generally aligned along an axis like time and thus are suited to pandas. However, with >1D features, pandas coerces them to a numpy array of subarray objects, which causes memory usage to explode. A native type for numpy arrays of arbitrary dimensions would be very helpful (and easily compatible with arrow), even if aggregation ops, etc. are not allowed.

There is a ragtag implementation here, mostly copied from other available examples of extension arrays. The failing extension array tests generally have to do with:

  1. Failed calls to the is_scalar routine in pandas internals, which seems to support only numpy/pandas scalars.
  2. Construction of an empty Series with the extension dtype. I can't quite pin down what would be a good NA value.

@JulianWgs

For reference: cuDF (a GPU implementation of pandas) now has support for ListDtype (Link).

@jreback
Contributor

jreback commented Oct 24, 2021

looks great @JulianWgs, would be great to implement this in pandas proper

@jbrockmendel
Member

@mroeschke can we put this into the "use ArrowDtype" pile?

@mroeschke
Member

Yes definitely

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Apr 19, 2023
@mroeschke
Member

Since this functionality is available via ArrowDtype, additional functionality can be built upon that, so closing

@gwerbin

gwerbin commented Feb 10, 2024

@mroeschke is that work planned, or is it only in "hypothetically possible to implement" status?

@mroeschke
Member

This functionality is implemented using pandas.ArrowDtype https://pandas.pydata.org/docs/user_guide/pyarrow.html#data-structure-integration

@gwerbin

gwerbin commented Feb 15, 2024

> This functionality is implemented using pandas.ArrowDtype https://pandas.pydata.org/docs/user_guide/pyarrow.html#data-structure-integration

Thanks for clarifying!

For anyone else coming across this thread, it looks like pd.ArrowDtype(pa.list_(...)) is what I am looking for.

9 participants