
unique can be optimized on keyed data.tables #2947

Open · MichaelChirico opened this issue on Jun 21, 2018 · 12 comments

@MichaelChirico (Member) commented on Jun 21, 2018

Keyed tables are already known to be sorted, so finding their unique values is much easier than in the general case.

Compare:

library(data.table)

NN = 1e8

set.seed(13013)
# about 400 MB, if you're RAM-conscious
DT = data.table(sample(1e5, NN, TRUE), key = 'V1')

system.time(unique(DT$V1))
#    user  system elapsed 
#   1.354   0.415   1.798 

system.time(DT[ , unique(V1)])
#    user  system elapsed 
#   1.266   0.414   1.681 

system.time(DT[ , TRUE, keyby = V1])
#    user  system elapsed 
#   0.375   0.000   0.375 

It seems to me that the second call to unique (i.e., the one within []) should be able to match (or beat) the final timing.
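To illustrate why sorted input helps, here is the core trick in base R (an illustration only, not a proposed implementation; it assumes the key column has no NAs):

# on sorted input, a value is unique iff it differs from its predecessor,
# so one vectorized comparison suffices -- no hashing needed
x = DT$V1  # already sorted, since V1 is the key
ux = x[c(TRUE, x[-1L] != x[-length(x)])]
identical(ux, unique(x))
# [1] TRUE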

If we were willing to do something like adding a dt_primary_key class to the primary key column, we could also achieve this speed in the first approach by writing a unique.dt_primary_key method, but I'm not sure how well that extends to multi-column keys (S4?).

@jangorecki (Member) commented:

Good idea, but it should be implemented without creating a new class; checking attributes should be enough to detect the key and apply the optimization.
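For example, a rough sketch of such an attribute check (unique_if_keyed is a hypothetical helper, not data.table's implementation; assumes a single-column key with no NAs):

library(data.table)
unique_if_keyed = function(DT) {
  k = attr(DT, "sorted")                  # key columns; NULL if unkeyed
  if (length(k) == 1L && nrow(DT) > 1L) {
    v = DT[[k]]
    DT[c(TRUE, v[-1L] != v[-length(v)])]  # keep rows where the key changes
  } else {
    unique(DT)                            # fall back to the general path
  }
}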

@jangorecki self-assigned this on Apr 10, 2020

@HughParsonage (Member) commented:

That said, I can't reproduce the numbers I got back in 2018. Further, here is a direct approach in C++ that I wrote last year, which seems to produce faster results (for integer vectors at least):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector do_unique_sorted(IntegerVector x) {
  int n = x.length();
  if (n == 0) return IntegerVector(0); // guard: avoid reading x[0] on empty input
  std::vector<int> o;
  o.reserve(n);      // worst case: all values distinct
  o.push_back(x[0]);
  // x is assumed sorted, so a value is new iff it differs from its predecessor
  for (int i = 1; i < n; ++i) {
    if (x[i] != x[i - 1]) {
      o.push_back(x[i]);
    }
  }
  return wrap(o);
}
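Compiled with Rcpp::sourceCpp() (the standard route for a // [[Rcpp::export]] snippet) and benchmarked against the existing approaches, with DT built as in the original post: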
bench::mark(DT[, TRUE, keyby = "V1"][["V1"]], 
            unique(DT$V1),
            unique(DT, by = "V1")[["V1"]],
            do_unique_sorted(DT$V1))
# A tibble: 4 x 14
  expression        min       mean     median        max `itr/sec`     mem_alloc  n_gc n_itr
  <chr>        <bch:tm>   <bch:tm>   <bch:tm>   <bch:tm>     <dbl>     <bch:byt> <dbl> <int>
1 "DT[, TRU~  132.675ms  135.174ms  133.808ms  140.405ms     7.40     1600.242KB     0     4
2 "unique(D~ 1238.871ms 1238.871ms 1238.871ms 1238.871ms     0.807 1439591.766KB     1     1
3 "unique(D~ 1138.289ms 1138.289ms 1138.289ms 1138.289ms     0.879  391813.172KB     1     1
4 "do_uniqu~   79.655ms   80.354ms   79.918ms   82.283ms    12.4       393.164KB     0     7
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
#   gc <list>

@jangorecki (Member) commented:

As proposed in #4346, if we also collected group information as an attribute of the index, so that instead of

o = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)

we used

o = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)

then we could quite easily optimize cases like this one using "lazy forder" (#4386).
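For reference, retGrp=TRUE attaches group metadata to the ordering vector. A small example of what becomes available (forderv is internal, so the details are subject to change):

library(data.table)
DT = data.table(V1 = c(3L, 1L, 2L, 1L, 3L))
o = data.table:::forderv(DT, by = "V1", sort = TRUE, retGrp = TRUE)
o
# [1] 2 4 3 1 5
attr(o, "starts")   # first position of each group in the sorted order
# [1] 1 3 4
attr(o, "maxgrpn")  # size of the largest group
# [1] 2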

@jangorecki (Member) commented May 31, 2020

DT[ , unique(V1)]

is tricky to optimize because the unique called there is base R's unique, not ours, and it has no access to attributes like the key or index. It is different for DT[, keyby=] or unique(DT, by=), where we can access DT and therefore its key or indices.
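The dispatch problem is easy to see: inside j, V1 is just the bare column vector, with nothing for base::unique to exploit.

library(data.table)
DT = data.table(V1 = c(2L, 1L, 2L), key = "V1")
DT[, attributes(V1)]  # the key lives on DT, not on the column
# NULL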

@jangorecki (Member) commented May 31, 2020

I think we all agree that optimizing with an index rather than a key is much more useful.

Here are the timings I am getting now using a merge of #4370 and #4386. There could be some further speed-up after #4467 as well. The function fdistinct is from #4370, but was merged into #4386 so that it can re-use an existing index.

Verbose output is included to show whenever an index optimization was attempted. Where two verbose lines appear, two calls to forder were made; the second, AFAIK, reverts to the original order, which is very cheap because it operates on an integer vector whose length is the number of unique values.

library(data.table)
fdistinct = data.table:::fdistinct
NN = 1e8
set.seed(13013)
DT = data.table(sample(1e5, NN, TRUE)) # 400MB
setindexv(DT, "V1")

system.time(a1<-unique(DT$V1))
#   user  system elapsed 
#  2.952   0.592   3.544  
system.time(a2<-DT[ , unique(V1)])
#   user  system elapsed 
#  3.062   0.984   4.047 
system.time(a3<-DT[ , TRUE, by = V1][["V1"]])
#   user  system elapsed 
# 10.642   2.513   1.053 
##Finding groups using forderv ... forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
system.time(a4<-DT[ , TRUE, keyby = V1][["V1"]]) ## keyby sorts, so the result order differs; included just for comparison
#   user  system elapsed 
#  1.903   0.371   2.275 
##Finding groups using uniqlist on index 'V1' ... 1.914s elapsed (1.586s cpu) 
system.time(a5<-unique(DT, by="V1")[["V1"]])
#   user  system elapsed 
# 10.153   1.184   0.769 
##forderLazy: opt not possible: is.data.table(DT)=1, sortGroups=0, all1(ascArg)=1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
system.time(a6<-fdistinct(DT, "V1")[["V1"]])
#   user  system elapsed 
#  0.040   0.044   0.075 
##forderLazy: using existing index: __V1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1

all.equal(a1, a2) && all.equal(a1, a3) && all.equal(sort(a1), a4) && all.equal(a1, a5) && all.equal(a1, a6)
#[1] TRUE

The first two are slowest because they dispatch to base::unique.

It is nice to see how fdistinct outperforms the other approaches here. This is mainly because of

data.table/R/mergelist.R, lines 53 to 56 at 2884b29:

## do not compute sort=F for mult="first" if index (sort=T) already available, sort=T is needed only for mult="last"
## this short circuit will work after #4386 because it requires retGrp=T
sort = mult!="first" || hasindex(x, by=on, retGrp=TRUE)
o = forderv(x, by=on, sort=sort, retGrp=TRUE)

Normally, when computing unique, we would call forderv(sort=FALSE), which cannot re-use an index because indices are built with sort=TRUE. For unique we don't need the order itself, only the starts attribute. These few lines check whether an index is available and, if so, use it. The result then has to be re-ordered at the end, which is handled by the following line:

if (sort && length(o <- forderv(f))) f = f[o] ## this rolls back to original order

The same optimization could be applied to unique(DT, by="V1").
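Putting the pieces together, a rough sketch of the index-based distinct, mirroring the quoted lines (forderv and the starts attribute are internal, so this is illustrative only, not the fdistinct source):

library(data.table)
DT = data.table(V1 = sample(1e3L, 1e6L, TRUE))
setindexv(DT, "V1")

# sort=TRUE lets forderv re-use the existing (sorted) index; we only
# need the starts attribute, not the ordering itself
o = data.table:::forderv(DT, by = "V1", sort = TRUE, retGrp = TRUE)
starts = attr(o, "starts")
f = if (length(o)) o[starts] else starts  # rows of first occurrence, in sorted order
oo = data.table:::forderv(f)
if (length(oo)) f = f[oo]                 # roll back to first-appearance order
identical(DT[f, V1], unique(DT$V1))
# [1] TRUE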

@jangorecki (Member) commented:

To actually optimize calls like DT[, unique(V1)], it would probably be best to follow the same strategy as the optimization of DT[order(...)] into DT[forderv(DT, ...)] from #4458.
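A speculative sketch of what such a rewrite could look like from the outside (unique_v1 is a hypothetical helper; the real optimization would live inside [.data.table):

library(data.table)
unique_v1 = function(DT) {
  # hypothetical rewrite: if V1 is the key or has an index, group on it;
  # note keyby returns sorted values, so a final re-order (as in fdistinct)
  # would be needed to match base unique()'s first-appearance order
  if (isTRUE(key(DT)[1L] == "V1") || "V1" %in% indices(DT)) {
    DT[, TRUE, keyby = V1][["V1"]]
  } else {
    DT[, unique(V1)]  # unoptimized path
  }
}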

@MichaelChirico (Member, Author) commented:

For one-key tables, users will soon benefit from fast unique/duplicated on ALTREP objects:

https://twitter.com/groundwalkergmb/status/1389685109909975040

which changes the calculus for this FR a bit. Related: #4697

@statquant commented:

For what it is worth, I think fdistinct would be among the functions we would use the most. From what I understand there are two orders of magnitude to be gained, which would be amazing.

@jangorecki (Member) commented Nov 3, 2022

fdistinct is not exported in the mergelist PR; I agree it makes sense to export it as well. The two orders of magnitude assume an index (retGrp=TRUE) has been set beforehand.

Once this is resolved, this question can be nicely addressed:
https://stackoverflow.com/questions/74286104/what-is-causing-this-difference-in-performance-in-indexing-between-data-table-an
