Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spatsoc group_times vs. group_pts #44

Closed
kaijagahm opened this issue Jan 6, 2023 · 3 comments
Closed

spatsoc group_times vs. group_pts #44

kaijagahm opened this issue Jan 6, 2023 · 3 comments
Labels
help wanted Extra attention is needed question Further information is requested wontfix This will not be worked on

Comments

@kaijagahm
Copy link
Owner

kaijagahm commented Jan 6, 2023

There is a lot of weirdness going on with how data.table (through spatsoc) handles pass-by-reference.

I have now created a reprex, which can be found here. In case the repo is private, I'm putting the reprex here:

# A reprex for trying to understand how data.table works in the context of spatsoc
library(spatsoc)
library(tidyverse)

# A sample dataset
load("repr.Rda") # some sample data, just 25 rows of my GPS observations
class(repr) # this is a data.frame. It's not a data.table.

# convert to a data.table for spatsoc
data.table::setDT(repr) # apparently this is what you're supposed to do in order to turn a data.frame into a data.table so that data.table functions can be used on it.

# Group times
spatsoc::group_times(repr, datetime = "timestamp", threshold = "10 minutes") # expected output: a "timegroup" column gets added by reference. I did not need to reassign, e.g. repr <- spatsoc::group_times(repr...)

# group spatial
spatsoc::group_pts(repr, threshold = 1000, id = "trackId", coords = c("utmE", "utmN"), timegroup = "timegroup") # expected output: a "group" column gets added by reference. Once again, I did not need to reassign.

names(repr) # this works! We have modified `repr` by reference, adding both `timegroup` and `group` columns.

# Okay, so even though it's quite poorly documented in spatsoc, both of these functions seem to have the ability to modify by reference. Note that it *is* documented in spatsoc, but only here (https://docs.ropensci.org/spatsoc/articles/faq.html), not in the actual help files for group_times and group_pts. Grrr.

# HOWEVER! There is a problem.
# If we make changes to the data between doing group_times and group_pts, group_pts will not successfully modify by reference.

# Here's an example (same code as above except for removing a column)
# A sample dataset
load("repr.Rda") # same as above
data.table::setDT(repr) # same as above
spatsoc::group_times(repr, datetime = "timestamp", threshold = "10 minutes") # same as above
# remove the ground_speed column, using dplyr/tidyverse.
repr2 <- repr %>%
  dplyr::select(-ground_speed)
# now try to group_pts
spatsoc::group_pts(repr2, threshold = 1000, id = "trackId", coords = c("utmE", "utmN"), timegroup = "timegroup") # expected output: a group column gets added.
names(repr2)
# In this case, it doesn't work. The `group` column doesn't get added.

# So, this leads me to ask, what does the dplyr::select() step do to the data that prevents pass-by-reference from working?
class(repr)
class(repr2) # the classes are the same! i.e. it is still a data.table.

str(repr)
str(repr2) # except for the removed ground_speed column (expected), there seems to be no difference in structure of these two objects!
# aaaaargh what the heck is going on????

# Maybe if we run setDT again on repr2, and *then* try running group_pts, it will work again?
data.table::setDT(repr2)
spatsoc::group_pts(repr2, threshold = 1000, id = "trackId", coords = c("utmE", "utmN"), timegroup = "timegroup")
names(repr2)
# Wait WHAT??? Doing that did not fix the issue. I guess that's because setDT modifies the class to data.table, and since repr2 was already a data.table, nothing changed.
# So it seems like the dplyr operation has broken this irreversibly and invisibly. I cannot figure out what was changed that is making this not work.
# Goals:
# 1. Figure out what criteria determine whether a data.table object can, or cannot, be modified by reference.
# 2. Figure out how to see those criteria, because clearly class() and str() are not cutting it.
# 3. Figure out how to transform a data.table object from one that can be modified by reference to one that cannot be modified by reference, and vice versa.
# 4. Implement a check in `vultureUtils` to make sure the `group` column is present before any other computations are performed. (Pretty sure this check already exists, but I need to verify.)
# 5. Send a PR to the spatsoc developers for increased documentation of the above, so this doesn't trip up other people.
kaijagahm pushed a commit to kaijagahm/dataExploration that referenced this issue Jan 6, 2023
@kaijagahm
Copy link
Owner Author

Asked Kirby about this and here's what he said

Kirby:
this disc might help? https://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another

at minimum there should be some way of figuring out the address of your table so you can see if it's the original

Me:
yeah i have seen this
i think i dug into it and it didn't tell me anything but i'll try again

Kirby:
my hunch is that when you remove the column, you're creating a view of the original data frame, and then when it comes time to modify it the new package is like "dude this is a view; I'd better make a copy"
but you really need a way to figure out what's real

Me:
lol so i tried the approach on that SO post and it gives me like 100 lines of characters
really hard to compare
but okay, before I go down this rabbit hole
i think maybe it's just a bad idea to use this feature at all
i'm basically just yelling at the devs, though, because this is a really really easy pitfall and i don't understand why it's not better documented
they should warn you way the hell away from using pass by reference if this is going to happen
i didn't think i was doing anything off-label, I thought my code was perfectly normal

Kirby:
yeah the key is here "Notice how even the a vector was copied (different hex value indicates new copy of vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table."

Kaija:
yeah
I saw that

Kirby:
there should really be a warning whenever you copy something in that package
if the point is to avoid that

Me:
so the takeaway from their argument is that if you're going to use data.table, you're better off staying entirely within the package universe

Kirby:
well

Me:
use := and set() instead of using any other package
(in my case, I used some functions from the dplyr package)

Kirby:
if you're going to do pass by reference
you need to be very careful about modifying the underlying data
because the whole point is that you have a big chunk of memory
if you start picking and choosing sections of it to pass around to different functions it's hard to say how that will be handled

kaijagahm added a commit that referenced this issue Jan 6, 2023
@kaijagahm kaijagahm added help wanted Extra attention is needed question Further information is requested wontfix This will not be worked on labels Jan 26, 2023
@kaijagahm
Copy link
Owner Author

Closing so it's not distracting. Can always look back for reference.

@robitalec
Copy link

Most of this is documented here in the FAQ here (https://docs.ropensci.org/spatsoc/articles/faq.html#package-design), but the dplyr + data.table clash that you have identified adds an additional wrinkle...

I made a reprex with spatsoc's example package data (I can't see your repr.Rda above) to try and help clarify. I will expand this section in the FAQ (and link to it in the manual) because it is a source of confusion.

First, for usage:

The two things to check for a data.table to be used with spatsoc functions, or with any of the :=, set() functions in data.table are:

  • is the input a data.table?
  • does the data.table have columns over allocated so it can modify by reference?

data.table class can be checked with data.table::is.data.table(), and converted by reference with data.table::setDT.

Over allocated columns is a more abstract concept. It is required for adding columns by reference and you can read more about it here: https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html#reference-semantics. If you want to avoid modifying by reference, see here: https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html#b-the-copy-function. The manual page for over-allocation functions is here: https://rdatatable.gitlab.io/data.table/reference/truelength.html. Essentially, we want to check that there are extra spaces available for our new columns.

To check if columns are over allocated, use data.table::truelength().

An example:

library(data.table)
library(spatsoc)
#> Note: spatsoc has been updated to follow the R-spatial evolution. 
#> Package dependencies and functions have been modified. 
#> Please see the NEWS for details: 
#> https://docs.ropensci.org/spatsoc/index.html#news

DT <- fread(system.file('extdata', 'DT.csv', package = 'spatsoc'))

is.data.table(DT)
#> [1] TRUE
truelength(DT)
#> [1] 1029
getOption('datatable.alloccol')
#> [1] 1024

Created on 2023-09-17 with reprex v2.0.2

The option 'datatable.alloccol' returns the number of columns that are by default over allocated, so 1024 + ncol(DT) = 1029 truelength.

In your example above - while it may have the data.table class, objects saved to .Rda and .Rds lose their over allocated columns when you save them. See the data.table FAQ here: https://rdatatable.gitlab.io/data.table/articles/datatable-faq.html#reading-data-table-from-rds-or-rdata-file

library(data.table)
library(spatsoc)
#> Note: spatsoc has been updated to follow the R-spatial evolution. 
#> Package dependencies and functions have been modified. 
#> Please see the NEWS for details: 
#> https://docs.ropensci.org/spatsoc/index.html#news

# Write out the example data as an Rda
DT <- fread(system.file('extdata', 'DT.csv', package = 'spatsoc'))
save(DT, file = 'DT.Rda')
rm(DT)

# Load it as an Rda
load('DT.Rda')

is.data.table(DT)
#> [1] TRUE
truelength(DT)
#> [1] 0

group_times(DT, datetime = 'datetime', threshold = '10 minutes')
#>        ID        X       Y            datetime population minutes timegroup
#>     1:  A 715851.4 5505340 2016-11-01 00:00:54          1       0         1
#>     2:  A 715822.8 5505289 2016-11-01 02:01:22          1       0         2
#>     3:  A 715872.9 5505252 2016-11-01 04:01:24          1       0         3
#>     4:  A 715820.5 5505231 2016-11-01 06:01:05          1       0         4
#>     5:  A 715830.6 5505227 2016-11-01 08:01:11          1       0         5
#>    ---                                                                     
#> 14293:  J 700616.5 5509069 2017-02-28 14:00:54          1       0      1393
#> 14294:  J 700622.6 5509065 2017-02-28 16:00:11          1       0      1394
#> 14295:  J 700657.5 5509277 2017-02-28 18:00:55          1       0      1440
#> 14296:  J 700610.3 5509269 2017-02-28 20:00:48          1       0      1395
#> 14297:  J 700744.0 5508782 2017-02-28 22:00:39          1       0      1396
'timegroup' %in% colnames(DT)
#> [1] FALSE

setalloccol(DT)
#>        ID        X       Y            datetime population
#>     1:  A 715851.4 5505340 2016-11-01 00:00:54          1
#>     2:  A 715822.8 5505289 2016-11-01 02:01:22          1
#>     3:  A 715872.9 5505252 2016-11-01 04:01:24          1
#>     4:  A 715820.5 5505231 2016-11-01 06:01:05          1
#>     5:  A 715830.6 5505227 2016-11-01 08:01:11          1
#>    ---                                                   
#> 14293:  J 700616.5 5509069 2017-02-28 14:00:54          1
#> 14294:  J 700622.6 5509065 2017-02-28 16:00:11          1
#> 14295:  J 700657.5 5509277 2017-02-28 18:00:55          1
#> 14296:  J 700610.3 5509269 2017-02-28 20:00:48          1
#> 14297:  J 700744.0 5508782 2017-02-28 22:00:39          1
group_times(DT, datetime = 'datetime', threshold = '10 minutes')
#>        ID        X       Y            datetime population minutes timegroup
#>     1:  A 715851.4 5505340 2016-11-01 00:00:54          1       0         1
#>     2:  A 715822.8 5505289 2016-11-01 02:01:22          1       0         2
#>     3:  A 715872.9 5505252 2016-11-01 04:01:24          1       0         3
#>     4:  A 715820.5 5505231 2016-11-01 06:01:05          1       0         4
#>     5:  A 715830.6 5505227 2016-11-01 08:01:11          1       0         5
#>    ---                                                                     
#> 14293:  J 700616.5 5509069 2017-02-28 14:00:54          1       0      1393
#> 14294:  J 700622.6 5509065 2017-02-28 16:00:11          1       0      1394
#> 14295:  J 700657.5 5509277 2017-02-28 18:00:55          1       0      1440
#> 14296:  J 700610.3 5509269 2017-02-28 20:00:48          1       0      1395
#> 14297:  J 700744.0 5508782 2017-02-28 22:00:39          1       0      1396
'timegroup' %in% colnames(DT)
#> [1] TRUE

Created on 2023-09-17 with reprex v2.0.2

This is the source of your initial issue where columns are not added to your example data. Documented in spatsoc's FAQ here: https://docs.ropensci.org/spatsoc/articles/faq.html#why-does-a-function-print-the-result-but-columns-arent-added-to-my-dt. The solution is to over allocate columns after you read in an Rda or Rds with data.table::setalloccol. I will clarify when setalloccol should be used in the FAQ.

The second issue is related to dplyr recreating the input data through the dplyr::select function. As I mention above, this is the first time I've seen this issue. I will add a section about this in the FAQ. If dplyr detects the input is a data.table, it passes it to dplyr:::dplyr_col_select and then to dplyr:::dplyr_reconstruct. After this PR (tidyverse/dplyr#6171), the data.table's attributes are retained but the over allocated columns are lost. This is the first time I've noticed this issue and I can reproduce using the reprex in the linked PR with a truelength check added.

library(data.table)
library(dplyr, warn.conflicts = FALSE)

d1 <- data.table(x = 1)
attr(d1, "foo") <- "bar"
truelength(d1)
#> [1] 1025
d2 <- dplyr:::dplyr_col_select(d1, "x")
attr(d2, "foo")
#> [1] "bar"
truelength(d2)
#> [1] 0

Created on 2023-09-17 with reprex v2.0.2

This means that while the data.table class is retained, the data.table's over allocated columns are lost when it is passed to dplyr::select. I would not expect there is a simple fix in dplyr given the use of attributes() <- here https://github.com/tidyverse/dplyr/blob/main/R/generics.R#L192-L223, highlighted in the data.table warning next time we try to modify by reference. For the time being, if you use dplyr functions between spatsoc or data.table functions, check that you have columns allocated for the next step.

options(datatable.verbose=TRUE)
d2[, foo2 := 'bar2']
#> Detected that j uses these columns: <none>
#> Warning in `[.data.table`(d2, , `:=`(foo2, "bar2")): Invalid .internal.selfref
#> detected and fixed by taking a (shallow) copy of the data.table so that := can
#> add this new column by reference. At an earlier point, this data.table has been
#> copied by R (or was created manually using structure() or similar). Avoid
#> names<- and attr<- which in R currently (and oddly) may copy the whole
#> data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and
#> ?setattr. If this message doesn't help, please report your use case to the
#> data.table issue tracker so the root cause can be fixed or this message
#> improved.
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==4 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
colnames(d2)
#> [1] "x"    "foo2"

I'll link to an issue in spatsoc where I'll improve the documentation so it'll ping here when it's completed.
Please feel free to open an issue there if there is something I missed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants