in result of non-equi join, table is not actually correctly sorted even though it's keyed #4603

myoung3 · 2020-07-16T03:29:21Z

library(data.table)

x <- setDT(
  structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
start_date = c(10590L, 10597L, 10601L, 10604L, 
10608L), end_date = c(10596L, 
10600L, 10603L, 10607L, 10610L), 
value1 = c(NA, NA, 2.63896640369205, 2.03415674303523, 1.42934708237841
), value2 = c(NA, NA, -0.577368046118137, -0.211425473706347, 
0.154517098705444)), row.names = c(NA, -5L), class = c( 
"data.frame"))
)

y <- setDT(structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
V1 = c(TRUE, TRUE, TRUE, TRUE, TRUE), start_date = c(16585L, 
14541L, 13429L, 16139L, 13915L), 
end_date = c(16589L, 14544L, 13436L, 16144L, 13919L
)), row.names = c(NA, -5L), class = c("data.frame")))



setkey(x,id1,start_date,end_date)
setkey(y,id1)
z <- x[y, list(V1),
       on=c("id1","end_date>=start_date","start_date<=end_date"),by=.EACHI]


##note that id is all the same value so the second key (end_date) should be sorted
stopifnot(identical(z$end_date,sort(z$end_date)))

key(z)
setkeyv(z,key(z)) #doesn't fix it

#setting key null then resetting key fixes it
setkey(z,NULL)
setkey(z,id1,end_date,start_date) 

stopifnot(identical(z$end_date,sort(z$end_date)))

Reproducible in both release and development data.table

sessionInfo

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2

The text was updated successfully, but these errors were encountered:

myoung3 · 2020-07-16T03:41:41Z

This bug is not due to the non-equi join having the same name in both x and y.

The following code (where interval columns in x and y have different names) produces the same bug:

library(data.table)

x <- setDT(
  structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
                 start_datex = c(10590L, 10597L, 10601L, 10604L, 
                                10608L), end_datex = c(10596L, 
                                                      10600L, 10603L, 10607L, 10610L), 
                 value1 = c(NA, NA, 2.63896640369205, 2.03415674303523, 1.42934708237841
                 ), value2 = c(NA, NA, -0.577368046118137, -0.211425473706347, 
                               0.154517098705444)), row.names = c(NA, -5L), class = c( 
                                 "data.frame"))
)

y <- setDT(structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
                          V1 = c(TRUE, TRUE, TRUE, TRUE, TRUE), start_datey = c(16585L, 
                                                                               14541L, 13429L, 16139L, 13915L), 
                          end_datey = c(16589L, 14544L, 13436L, 16144L, 13919L
                          )), row.names = c(NA, -5L), class = c("data.frame")))



setkey(x,id1,start_datex,end_datex)
setkey(y,id1)
z <- x[y, list(V1),
       on=c("id1","end_datex>=start_datey","start_datex<=end_datey"),by=.EACHI]


##note that id is all the same value so the second key (end_date) should be sorted
stopifnot(identical(z$end_datex,sort(z$end_datex)))

key(z)
setkeyv(z,key(z)) #doesn't fix it

#setting key null then resetting key fixes it
setkey(z,NULL)
setkey(z,id1,end_datex,start_datex) 

stopifnot(identical(z$end_datex,sort(z$end_datex)))

ColeMiller1 · 2020-07-17T23:55:00Z

Thank you for the report - I can reproduce. Note, the bug is slightly different than reported- the table has the key incorrectly labeled as it should just be id:

library(data.table) #1.12.8

x = data.table(id = rep(1L, 5L),
               start = 1:5,
               end = 2:6)
setkey(x, id, start, end)

y = data.table(id = rep(1L, 5L),
               start = c(15L, 13L, 14L, 12L, 11L),
               end = c(26L, 20L, 23L, 24L, 22),
               v1 = 5:1)
setkey(y, id)

z = x[y,
  on = .(id,
         start <= end,
         end >= start),
  .(v1),
  by = .EACHI]

key(z) ## should only be id!
#> [1] "id"    "start" "end"
z ##not actually sorted!
#>       id start   end    v1
#>    <int> <int> <int> <int>
#> 1:     1    26    15     5
#> 2:     1    20    13     4
#> 3:     1    23    14     3
#> 4:     1    24    12     2
#> 5:     1    22    11     1

If instead we had set the key on y to the three columns, we would have gotten your expected results (although note the key is still in the wrong order - it should be id, end, start):

setkey(y, id, start, end) 

z = x[y,
      on = .(id,
             start <= end,
             end >= start),
      .(v1),
      by = .EACHI]

key(z)
## [1] "id"    "start" "end" 

identical(z$end,sort(z$end))

# [1] TRUE

z
##      id start   end    v1
##   <int> <int> <int> <int>
##1:     1    22    11     1
##2:     1    24    12     2
##3:     1    20    13     4
##4:     1    23    14     3
##5:     1    26    15     5

I will work on a fix

Finally, in the future it is sometimes helpful to see some output. I normally use reprex::reprex() after copying the code to my clipboard.

myoung3 · 2020-07-18T00:16:23Z

If something is keyed on a variable but not sorted by that variable I would call that "not correctly sorted" conditional on that key existing as an attribute. But yes you're right the result should just be keyed and sorted on the variables specified in the key of y.

myoung3 · 2020-07-18T00:22:46Z

I guess the point that I'm making is that if the return table didn't have the expected key but the return table was sorted according to that unexpected key, it wouldn't be as big as a bug (unless you're relying on knowing that the result of a join takes the key of y, this would be not likely to cause issues especially considering that it's reasonable to explicitly set the key of z prior to doing merging with it).

The actual bug that is present is a bigger deal since explicitly setting the key to what you want isn't necessarily going to undo the bug as demonstrated in my code.

myoung3 · 2020-07-18T00:25:08Z

I'm also happy to take a look at this and submit a PR. I've been busy with other things the last few days so I haven't tried to debug it yet.

ColeMiller1 · 2020-07-18T01:51:49Z

@myoung3 I wanted to point out to the data.table maintainers what the root issue is. The original example was unclear as it appeared that the expected output was for it to be sorted by end_date based on this:

stopifnot(identical(z$end_date,sort(z$end_date)))

I agree that the unexpected key should in fact be accurate. I am still working on a fix and should have something soon.

myoung3 mentioned this issue Jul 16, 2020

intervalaverage extra setkey as hotfix kaufman-lab/intervalaverage#24

Open

ColeMiller1 added bug non-equi joins rolling, overlapping, non-equi joins labels Jul 18, 2020

ColeMiller1 mentioned this issue Feb 20, 2021

Unexpected missing matches with non-equi join with grouping by .EACHI #4911

Closed

ColeMiller1 self-assigned this Feb 20, 2021

ColeMiller1 mentioned this issue Feb 22, 2021

by = .EACHI key fix 4603 #4917

Merged

mattdowle added this to the 1.14.1 milestone Mar 17, 2021

mattdowle closed this as completed in #4917 Apr 13, 2021

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

in result of non-equi join, table is not actually correctly sorted even though it's keyed #4603

in result of non-equi join, table is not actually correctly sorted even though it's keyed #4603

myoung3 commented Jul 16, 2020 •

edited

Loading

myoung3 commented Jul 16, 2020 •

edited

Loading

ColeMiller1 commented Jul 17, 2020

myoung3 commented Jul 18, 2020

myoung3 commented Jul 18, 2020 •

edited

Loading

myoung3 commented Jul 18, 2020

ColeMiller1 commented Jul 18, 2020

in result of non-equi join, table is not actually correctly sorted even though it's keyed #4603

in result of non-equi join, table is not actually correctly sorted even though it's keyed #4603

Comments

myoung3 commented Jul 16, 2020 • edited Loading

myoung3 commented Jul 16, 2020 • edited Loading

ColeMiller1 commented Jul 17, 2020

myoung3 commented Jul 18, 2020

myoung3 commented Jul 18, 2020 • edited Loading

myoung3 commented Jul 18, 2020

ColeMiller1 commented Jul 18, 2020

myoung3 commented Jul 16, 2020 •

edited

Loading

myoung3 commented Jul 16, 2020 •

edited

Loading

myoung3 commented Jul 18, 2020 •

edited

Loading