Count by list columns #1597

Closed
MichaelChirico opened this issue Mar 17, 2016 · 6 comments
Labels: feature request · Low · non-atomic column (e.g. list columns, S4 vector columns)

Comments

@MichaelChirico
Member

Perhaps this is intentional, but it seems to me a more natural solution to this question on SO would be:

shoppinglists[ , if (.N>=3) 
  .(triplet = combn(items, 3, simplify=FALSE)), 
  by=consumer][ , .N, by=triplet][order(-N)]

But if we try and use the list column in by we get an error:

Error in `[.data.table`(shoppinglists[, if (.N >= 3) list(triplet = combn(items, :
  The items in the 'by' or 'keyby' list are length (3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3). Each must be same length as rows in x or number of rows returned by i (18).

As cryptic as this error is, I think by is having trouble identifying unique lists (perhaps this is by design).

Is there no way to use a list column in by?

@franknarf1
Contributor

Yeah, pretty sure it's not supported. You cannot join on list columns either.

@MichaelChirico
Member Author

I understand support may be difficult / not worthwhile (comparing and indexing lists must be much more time-consuming than comparing vectors)... worth having out there as a FR though!

@MichaelChirico MichaelChirico changed the title Bug: Unable to count by list list columns Bug: Unable to count by list columns Mar 18, 2016
@MichaelChirico
Member Author

MichaelChirico commented May 5, 2019

I'm not sure count-by-list is quite properly defined since list comparison is undefined in R:

list(1) == list(1)

Error in list(1) == list(1) :
comparison of these types is not implemented

Although all.equal(list(1), list(1)) is TRUE and this can even capture some more complicated lists (I tried with a simple lm object), all.equal(list(1:2), list(2:1)) is FALSE & it's ambiguous what .N should produce in this case. Probably what is intended is to unnest and then count. To close this, we can add a more helpful error message.
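To make the distinction above concrete, here is a minimal base-R sketch (the sample list `x` is made up for illustration) showing that `==` fails on two lists while `identical()`, `unique()` and `duplicated()` still give a usable, order-sensitive notion of distinctness:

```r
# Made-up list column contents: the first and third elements are the same
x <- list(1:2, 2:1, 1:2)

# `==` between two lists raises an error; capture its message instead of stopping
eq_err <- tryCatch(list(1) == list(1), error = function(e) conditionMessage(e))

# identical() is order-sensitive: 1:2 and 2:1 are different elements
same_12 <- identical(x[[1]], x[[3]])
same_21 <- identical(x[[1]], x[[2]])

# unique() and duplicated() both accept lists
distinct <- unique(x)
is_dup <- duplicated(x)
```

Under this definition `1:2` and `2:1` would land in separate groups; whether that is what a user wants from `.N` is exactly the ambiguity raised in the comment.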

@jangorecki jangorecki added the non-atomic column e.g. list columns, S4 vector columns label Apr 6, 2020
@jangorecki
Member

I think we can safely close this issue, as ordering a list column is an undefined operation.

@MichaelChirico
Member Author

for the record, it's possible to GROUP BY non-atomic columns (e.g. Presto SQL does this).

ordering is not defined but distinctness can be defined (e.g. unique.list) -- IINM they use a hash operation to group.

That said, it would require a pretty big upheaval of our backend (?i think? from sort-then-group to sometimes-sort/sometimes-hash) to accommodate this use case & there doesn't appear to be much demand. So happy to leave closed
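As a rough illustration of grouping by distinctness rather than by order, the base-R sketch below counts list elements through a per-element string key. It is not how data.table or Presto work internally; the sample data and the `key_of` helper are hypothetical, and the string key is only a stand-in for a real hash that assumes each element is a character vector not containing `|`.

```r
# Hypothetical list column: each element is a character vector
triplets <- list(c("a", "b", "c"), c("a", "b", "d"), c("a", "b", "c"))

# Hypothetical helper: one string key per element (stand-in for a hash)
key_of <- function(x) paste(x, collapse = "|")
keys <- vapply(triplets, key_of, character(1))

# Assign group ids in order of first appearance; the list itself is never sorted
group_id <- match(keys, unique(keys))

# Count per group and keep one representative element for each group
N <- tabulate(group_id)
distinct <- triplets[!duplicated(keys)]
```

Here `N[i]` is the count for `distinct[[i]]`, so the representative elements can be carried alongside the counts much like `triplet[1]` in a grouped `j` expression.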

@mattdowle mattdowle changed the title Bug: Unable to count by list columns Unable to count by list columns Jun 14, 2021
@mattdowle mattdowle changed the title Unable to count by list columns Count by list columns Jun 14, 2021
@jan-glx
Contributor

jan-glx commented Feb 11, 2024

As a (potentially slow) workaround one can hash the elements of the list column, use the hash in by/keyby and listcolumn[1] in j:

as.data.table(shoppinglists)[ , if (.N>=3) 
    .(triplet = combn(items, 3, simplify=FALSE)), 
    by=consumer][ , .(.N, triplet=triplet[1]),
    by=.(triplet_hash=sapply(triplet, digest::digest))][order(-N)]


5 participants