
unique can be optimized on keyed data.tables #2947

Open · MichaelChirico opened this issue on Jun 21, 2018 · 12 comments

@MichaelChirico (Member) commented on Jun 21, 2018

Keyed tables are already known to be sorted, so finding their unique values is much easier than in the general case.

Compare:

library(data.table)

NN = 1e8

set.seed(13013)
# about 400 MB, if you're RAM-conscious
DT = data.table(sample(1e5, NN, TRUE), key = 'V1')

system.time(unique(DT$V1))
#    user  system elapsed 
#   1.354   0.415   1.798 

system.time(DT[ , unique(V1)])
#    user  system elapsed 
#   1.266   0.414   1.681 

system.time(DT[ , TRUE, keyby = V1])
#    user  system elapsed 
#   0.375   0.000   0.375 

It seems to me that the second call to unique (i.e., the one within []) should be able to match (or beat) the final timing.
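To illustrate why sorted input helps, here is the core trick in base R (an illustration only, not a proposed implementation; it assumes the key column has no NAs):

# on sorted input, a value is unique iff it differs from its predecessor,
# so one vectorized comparison suffices -- no hashing needed
x = DT$V1  # already sorted, since V1 is the key
ux = x[c(TRUE, x[-1L] != x[-length(x)])]
identical(ux, unique(x))
# [1] TRUE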

If we were willing to do something like adding a dt_primary_key class to the primary key column, we could also achieve this speed in the first approach by writing a unique.dt_primary_key method, but I'm not sure how well that extends to multi-column keys (S4?).

@jangorecki (Member) commented:

Good idea, but it should be implemented without creating a new class; checking attributes should be enough to detect the key and apply the optimization.
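For example, a rough sketch of such an attribute check (unique_if_keyed is a hypothetical helper, not data.table's implementation; assumes a single-column key with no NAs):

library(data.table)
unique_if_keyed = function(DT) {
  k = attr(DT, "sorted")                  # key columns; NULL if unkeyed
  if (length(k) == 1L && nrow(DT) > 1L) {
    v = DT[[k]]
    DT[c(TRUE, v[-1L] != v[-length(v)])]  # keep rows where the key changes
  } else {
    unique(DT)                            # fall back to the general path
  }
}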

@jangorecki self-assigned this on Apr 10, 2020

@HughParsonage (Member) commented:

That said, I can't reproduce the numbers I got back in 2018. Further, here is a direct approach in C++ that I wrote last year, which seems to produce faster results (for integer vectors at least):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector do_unique_sorted(IntegerVector x) {
  int n = x.length();
  if (n == 0) return IntegerVector(0); // guard: avoid reading x[0] on empty input
  std::vector<int> o;
  o.reserve(n);      // worst case: all values distinct
  o.push_back(x[0]);
  // x is assumed sorted, so a value is new iff it differs from its predecessor
  for (int i = 1; i < n; ++i) {
    if (x[i] != x[i - 1]) {
      o.push_back(x[i]);
    }
  }
  return wrap(o);
}
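Compiled with Rcpp::sourceCpp() (the standard route for a // [[Rcpp::export]] snippet) and benchmarked against the existing approaches, with DT built as in the original post: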
bench::mark(DT[, TRUE, keyby = "V1"][["V1"]], 
            unique(DT$V1),
            unique(DT, by = "V1")[["V1"]],
            do_unique_sorted(DT$V1))
# A tibble: 4 x 14
  expression        min       mean     median        max `itr/sec`     mem_alloc  n_gc n_itr
  <chr>        <bch:tm>   <bch:tm>   <bch:tm>   <bch:tm>     <dbl>     <bch:byt> <dbl> <int>
1 "DT[, TRU~  132.675ms  135.174ms  133.808ms  140.405ms     7.40     1600.242KB     0     4
2 "unique(D~ 1238.871ms 1238.871ms 1238.871ms 1238.871ms     0.807 1439591.766KB     1     1
3 "unique(D~ 1138.289ms 1138.289ms 1138.289ms 1138.289ms     0.879  391813.172KB     1     1
4 "do_uniqu~   79.655ms   80.354ms   79.918ms   82.283ms    12.4       393.164KB     0     7
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
#   gc <list>

@jangorecki (Member) commented:

As proposed in #4346, if we also collected group information as an attribute of the index, so that instead of

o = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)

we used

o = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)

then we could quite easily optimize cases like this one using "lazy forder" (#4386).
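For reference, retGrp=TRUE attaches group metadata to the ordering vector. A small example of what becomes available (forderv is internal, so the details are subject to change):

library(data.table)
DT = data.table(V1 = c(3L, 1L, 2L, 1L, 3L))
o = data.table:::forderv(DT, by = "V1", sort = TRUE, retGrp = TRUE)
o
# [1] 2 4 3 1 5
attr(o, "starts")   # first position of each group in the sorted order
# [1] 1 3 4
attr(o, "maxgrpn")  # size of the largest group
# [1] 2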

@jangorecki (Member) commented May 31, 2020

DT[ , unique(V1)]

is tricky to optimize because the unique called there is base R's unique, not ours, and it has no access to attributes like the key or index. It is different for DT[, keyby=] or unique(DT, by=), where we can access DT and therefore its key or indices.
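The dispatch problem is easy to see: inside j, V1 is just the bare column vector, with nothing for base::unique to exploit.

library(data.table)
DT = data.table(V1 = c(2L, 1L, 2L), key = "V1")
DT[, attributes(V1)]  # the key lives on DT, not on the column
# NULL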

@jangorecki (Member) commented May 31, 2020

I think we all agree that optimizing with an index rather than a key is much more useful.

Here are the timings I am getting now using a merge of #4370 and #4386. There could be some further speed-up after #4467 as well. The function fdistinct is from #4370, but was merged into #4386 so that it can re-use an existing index.

Verbose output is included to show whenever an index optimization was attempted. Where two verbose lines appear, two calls to forder were made; the second, AFAIK, reverts to the original order, which is very cheap because it operates on an integer vector whose length is the number of unique values.

library(data.table)
fdistinct = data.table:::fdistinct
NN = 1e8
set.seed(13013)
DT = data.table(sample(1e5, NN, TRUE)) # 400MB
setindexv(DT, "V1")

system.time(a1<-unique(DT$V1))
#   user  system elapsed 
#  2.952   0.592   3.544  
system.time(a2<-DT[ , unique(V1)])
#   user  system elapsed 
#  3.062   0.984   4.047 
system.time(a3<-DT[ , TRUE, by = V1][["V1"]])
#   user  system elapsed 
# 10.642   2.513   1.053 
##Finding groups using forderv ... forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
system.time(a4<-DT[ , TRUE, keyby = V1][["V1"]]) ## keyby sorts, so the result order differs; included just for comparison
#   user  system elapsed 
#  1.903   0.371   2.275 
##Finding groups using uniqlist on index 'V1' ... 1.914s elapsed (1.586s cpu) 
system.time(a5<-unique(DT, by="V1")[["V1"]])
#   user  system elapsed 
# 10.153   1.184   0.769 
##forderLazy: opt not possible: is.data.table(DT)=1, sortGroups=0, all1(ascArg)=1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
system.time(a6<-fdistinct(DT, "V1")[["V1"]])
#   user  system elapsed 
#  0.040   0.044   0.075 
##forderLazy: using existing index: __V1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1

all.equal(a1, a2) && all.equal(a1, a3) && all.equal(sort(a1), a4) && all.equal(a1, a5) && all.equal(a1, a6)
#[1] TRUE

The first two are slowest because they dispatch to base::unique.

It is nice to see how fdistinct outperforms the other approaches here. This is mainly because of

data.table/R/mergelist.R, lines 53 to 56 at 2884b29:

## do not compute sort=F for mult="first" if index (sort=T) already available, sort=T is needed only for mult="last"
## this short circuit will work after #4386 because it requires retGrp=T
sort = mult!="first" || hasindex(x, by=on, retGrp=TRUE)
o = forderv(x, by=on, sort=sort, retGrp=TRUE)

Normally, when computing unique, we would call forderv(sort=FALSE), which cannot re-use an index because indices are built with sort=TRUE. For unique we don't need the order itself, only the starts attribute. These few lines check whether an index is available and, if so, use it. The result then has to be re-ordered at the end, which is handled by the following line:

if (sort && length(o <- forderv(f))) f = f[o] ## this rolls back to original order

The same optimization could be applied to unique(DT, by="V1").
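Putting the pieces together, a rough sketch of the index-based distinct, mirroring the quoted lines (forderv and the starts attribute are internal, so this is illustrative only, not the fdistinct source):

library(data.table)
DT = data.table(V1 = sample(1e3L, 1e6L, TRUE))
setindexv(DT, "V1")

# sort=TRUE lets forderv re-use the existing (sorted) index; we only
# need the starts attribute, not the ordering itself
o = data.table:::forderv(DT, by = "V1", sort = TRUE, retGrp = TRUE)
starts = attr(o, "starts")
f = if (length(o)) o[starts] else starts  # rows of first occurrence, in sorted order
oo = data.table:::forderv(f)
if (length(oo)) f = f[oo]                 # roll back to first-appearance order
identical(DT[f, V1], unique(DT$V1))
# [1] TRUE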

@jangorecki (Member) commented:

To actually optimize calls like DT[, unique(V1)], it would probably be best to follow the same strategy as the optimization of DT[order(...)] into DT[forderv(DT, ...)] from #4458.
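A speculative sketch of what such a rewrite could look like from the outside (unique_v1 is a hypothetical helper; the real optimization would live inside [.data.table):

library(data.table)
unique_v1 = function(DT) {
  # hypothetical rewrite: if V1 is the key or has an index, group on it;
  # note keyby returns sorted values, so a final re-order (as in fdistinct)
  # would be needed to match base unique()'s first-appearance order
  if (isTRUE(key(DT)[1L] == "V1") || "V1" %in% indices(DT)) {
    DT[, TRUE, keyby = V1][["V1"]]
  } else {
    DT[, unique(V1)]  # unoptimized path
  }
}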

@MichaelChirico (Member, Author) commented:

For one-key tables, users will soon benefit from fast unique/duplicated on ALTREP objects:

https://twitter.com/groundwalkergmb/status/1389685109909975040

which changes the calculus for this FR a bit. Related: #4697

@statquant commented:

For what it is worth, I think fdistinct would be among the functions we would use the most. From what I understand there are two orders of magnitude to be gained, which would be amazing.

@jangorecki (Member) commented Nov 3, 2022

fdistinct is not exported in the mergelist PR; I agree it makes sense to export it as well. The two orders of magnitude assume an index (retGrp=TRUE) has been set beforehand.

Once this is resolved, this question can be nicely addressed:
https://stackoverflow.com/questions/74286104/what-is-causing-this-difference-in-performance-in-indexing-between-data-table-an
