Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in result of non-equi join, table is not actually correctly sorted even though it's keyed #4603

Closed
myoung3 opened this issue Jul 16, 2020 · 6 comments · Fixed by #4917
Closed
Assignees
Labels
bug non-equi joins rolling, overlapping, non-equi joins
Milestone

Comments

@myoung3
Copy link
Contributor

myoung3 commented Jul 16, 2020

library(data.table)

x <- setDT(
  structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
start_date = c(10590L, 10597L, 10601L, 10604L, 
10608L), end_date = c(10596L, 
10600L, 10603L, 10607L, 10610L), 
value1 = c(NA, NA, 2.63896640369205, 2.03415674303523, 1.42934708237841
), value2 = c(NA, NA, -0.577368046118137, -0.211425473706347, 
0.154517098705444)), row.names = c(NA, -5L), class = c( 
"data.frame"))
)

y <- setDT(structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
V1 = c(TRUE, TRUE, TRUE, TRUE, TRUE), start_date = c(16585L, 
14541L, 13429L, 16139L, 13915L), 
end_date = c(16589L, 14544L, 13436L, 16144L, 13919L
)), row.names = c(NA, -5L), class = c("data.frame")))



setkey(x,id1,start_date,end_date)
setkey(y,id1)
z <- x[y, list(V1),
       on=c("id1","end_date>=start_date","start_date<=end_date"),by=.EACHI]


##note that id is all the same value so the second key (end_date) should be sorted
stopifnot(identical(z$end_date,sort(z$end_date)))

key(z)
setkeyv(z,key(z)) #doesn't fix it

#setting key null then resetting key fixes it
setkey(z,NULL)
setkey(z,id1,end_date,start_date) 

stopifnot(identical(z$end_date,sort(z$end_date)))

Reproducible in both release and development data.table

sessionInfo

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8

loaded via a namespace (and not attached):
[1] compiler_3.6.2 tools_3.6.2  

@myoung3
Copy link
Contributor Author

myoung3 commented Jul 16, 2020

This bug is not due to the non-equi join having the same name in both x and y.

The following code (where interval columns in x and y have different names) produces the same bug:

library(data.table)

x <- setDT(
  structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
                 start_datex = c(10590L, 10597L, 10601L, 10604L, 
                                10608L), end_datex = c(10596L, 
                                                      10600L, 10603L, 10607L, 10610L), 
                 value1 = c(NA, NA, 2.63896640369205, 2.03415674303523, 1.42934708237841
                 ), value2 = c(NA, NA, -0.577368046118137, -0.211425473706347, 
                               0.154517098705444)), row.names = c(NA, -5L), class = c( 
                                 "data.frame"))
)

y <- setDT(structure(list(id1 = c(1L, 1L, 1L, 1L, 1L),
                          V1 = c(TRUE, TRUE, TRUE, TRUE, TRUE), start_datey = c(16585L, 
                                                                               14541L, 13429L, 16139L, 13915L), 
                          end_datey = c(16589L, 14544L, 13436L, 16144L, 13919L
                          )), row.names = c(NA, -5L), class = c("data.frame")))



setkey(x,id1,start_datex,end_datex)
setkey(y,id1)
z <- x[y, list(V1),
       on=c("id1","end_datex>=start_datey","start_datex<=end_datey"),by=.EACHI]


##note that id is all the same value so the second key (end_date) should be sorted
stopifnot(identical(z$end_datex,sort(z$end_datex)))

key(z)
setkeyv(z,key(z)) #doesn't fix it

#setting key null then resetting key fixes it
setkey(z,NULL)
setkey(z,id1,end_datex,start_datex) 

stopifnot(identical(z$end_datex,sort(z$end_datex)))

@ColeMiller1
Copy link
Contributor

Thank you for the report - I can reproduce. Note, the bug is slightly different than reported- the table has the key incorrectly labeled as it should just be id:

library(data.table) #1.12.8

x = data.table(id = rep(1L, 5L),
               start = 1:5,
               end = 2:6)
setkey(x, id, start, end)

y = data.table(id = rep(1L, 5L),
               start = c(15L, 13L, 14L, 12L, 11L),
               end = c(26L, 20L, 23L, 24L, 22),
               v1 = 5:1)
setkey(y, id)

z = x[y,
  on = .(id,
         start <= end,
         end >= start),
  .(v1),
  by = .EACHI]

key(z) ## should only be id!
#> [1] "id"    "start" "end"
z ##not actually sorted!
#>       id start   end    v1
#>    <int> <int> <int> <int>
#> 1:     1    26    15     5
#> 2:     1    20    13     4
#> 3:     1    23    14     3
#> 4:     1    24    12     2
#> 5:     1    22    11     1

If instead we had set the key on y to the three columns, we would have gotten your expected results (although note the key is still in the wrong order - it should be id, end, start):

setkey(y, id, start, end) 

z = x[y,
      on = .(id,
             start <= end,
             end >= start),
      .(v1),
      by = .EACHI]

key(z)
## [1] "id"    "start" "end" 

identical(z$end,sort(z$end))

# [1] TRUE

z
##      id start   end    v1
##   <int> <int> <int> <int>
##1:     1    22    11     1
##2:     1    24    12     2
##3:     1    20    13     4
##4:     1    23    14     3
##5:     1    26    15     5

I will work on a fix

Finally, in the future it is sometimes helpful to see some output. I normally use reprex::reprex() after copying the code to my clipboard.

@myoung3
Copy link
Contributor Author

myoung3 commented Jul 18, 2020

If something is keyed on a variable but not sorted by that variable I would call that "not correctly sorted" conditional on that key existing as an attribute. But yes you're right the result should just be keyed and sorted on the variables specified in the key of y.

@myoung3
Copy link
Contributor Author

myoung3 commented Jul 18, 2020

I guess the point that I'm making is that if the return table didn't have the expected key but the return table was sorted according to that unexpected key, it wouldn't be as big as a bug (unless you're relying on knowing that the result of a join takes the key of y, this would be not likely to cause issues especially considering that it's reasonable to explicitly set the key of z prior to doing merging with it).

The actual bug that is present is a bigger deal since explicitly setting the key to what you want isn't necessarily going to undo the bug as demonstrated in my code.

@myoung3
Copy link
Contributor Author

myoung3 commented Jul 18, 2020

I'm also happy to take a look at this and submit a PR. I've been busy with other things the last few days so I haven't tried to debug it yet.

@ColeMiller1
Copy link
Contributor

@myoung3 I wanted to point out to the data.table maintainers what the root issue is. The original example was unclear as it appeared that the expected output was for it to be sorted by end_date based on this:

stopifnot(identical(z$end_date,sort(z$end_date)))

I agree that the unexpected key should in fact be accurate. I am still working on a fix and should have something soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug non-equi joins rolling, overlapping, non-equi joins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants