spatsoc group_times vs. group_pts #44
Asked Kirby about this and here's what he said
Closing so it's not distracting. Can always look back for reference.
Most of this is documented in the FAQ (https://docs.ropensci.org/spatsoc/articles/faq.html#package-design), but the dplyr + data.table clash that you have identified adds an additional wrinkle. I made a reprex with spatsoc's example package data (I can't see your repr.Rda above) to try to clarify. I will expand this section in the FAQ (and link to it in the manual) because it is a source of confusion.

First, for usage: there are two things to check before a data.table is used with spatsoc functions, or with any of data.table's := or set() functions: (1) it has the data.table class, and (2) it has over allocated columns.

The data.table class can be checked with is.data.table(DT).

Over allocated columns are a more abstract concept. Over-allocation is required for adding columns by reference, and you can read more about it here: https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html#reference-semantics. If you want to avoid modifying by reference, see here: https://rdatatable.gitlab.io/data.table/articles/datatable-reference-semantics.html#b-the-copy-function. The manual page for over-allocation functions is here: https://rdatatable.gitlab.io/data.table/reference/truelength.html. Essentially, we want to check that there are extra slots available for our new columns. To check whether columns are over allocated, use truelength(DT).

An example:

library(data.table)
library(spatsoc)
#> Note: spatsoc has been updated to follow the R-spatial evolution.
#> Package dependencies and functions have been modified.
#> Please see the NEWS for details:
#> https://docs.ropensci.org/spatsoc/index.html#news
DT <- fread(system.file('extdata', 'DT.csv', package = 'spatsoc'))
is.data.table(DT)
#> [1] TRUE
truelength(DT)
#> [1] 1029
getOption('datatable.alloccol')
#> [1] 1024

Created on 2023-09-17 with reprex v2.0.2

The option 'datatable.alloccol' returns the number of columns that are over allocated by default, so the truelength is 1024 + ncol(DT) = 1029. In your example above, the object may have the data.table class, but objects saved to .Rda and .Rds lose their over allocated columns when you save them. See the data.table FAQ here: https://rdatatable.gitlab.io/data.table/articles/datatable-faq.html#reading-data-table-from-rds-or-rdata-file

library(data.table)
library(spatsoc)
#> Note: spatsoc has been updated to follow the R-spatial evolution.
#> Package dependencies and functions have been modified.
#> Please see the NEWS for details:
#> https://docs.ropensci.org/spatsoc/index.html#news
# Write out the example data as an Rda
DT <- fread(system.file('extdata', 'DT.csv', package = 'spatsoc'))
save(DT, file = 'DT.Rda')
rm(DT)
# Load it as an Rda
load('DT.Rda')
is.data.table(DT)
#> [1] TRUE
truelength(DT)
#> [1] 0
group_times(DT, datetime = 'datetime', threshold = '10 minutes')
#> ID X Y datetime population minutes timegroup
#> 1: A 715851.4 5505340 2016-11-01 00:00:54 1 0 1
#> 2: A 715822.8 5505289 2016-11-01 02:01:22 1 0 2
#> 3: A 715872.9 5505252 2016-11-01 04:01:24 1 0 3
#> 4: A 715820.5 5505231 2016-11-01 06:01:05 1 0 4
#> 5: A 715830.6 5505227 2016-11-01 08:01:11 1 0 5
#> ---
#> 14293: J 700616.5 5509069 2017-02-28 14:00:54 1 0 1393
#> 14294: J 700622.6 5509065 2017-02-28 16:00:11 1 0 1394
#> 14295: J 700657.5 5509277 2017-02-28 18:00:55 1 0 1440
#> 14296: J 700610.3 5509269 2017-02-28 20:00:48 1 0 1395
#> 14297: J 700744.0 5508782 2017-02-28 22:00:39 1 0 1396
'timegroup' %in% colnames(DT)
#> [1] FALSE
setalloccol(DT)
#> ID X Y datetime population
#> 1: A 715851.4 5505340 2016-11-01 00:00:54 1
#> 2: A 715822.8 5505289 2016-11-01 02:01:22 1
#> 3: A 715872.9 5505252 2016-11-01 04:01:24 1
#> 4: A 715820.5 5505231 2016-11-01 06:01:05 1
#> 5: A 715830.6 5505227 2016-11-01 08:01:11 1
#> ---
#> 14293: J 700616.5 5509069 2017-02-28 14:00:54 1
#> 14294: J 700622.6 5509065 2017-02-28 16:00:11 1
#> 14295: J 700657.5 5509277 2017-02-28 18:00:55 1
#> 14296: J 700610.3 5509269 2017-02-28 20:00:48 1
#> 14297: J 700744.0 5508782 2017-02-28 22:00:39 1
group_times(DT, datetime = 'datetime', threshold = '10 minutes')
#> ID X Y datetime population minutes timegroup
#> 1: A 715851.4 5505340 2016-11-01 00:00:54 1 0 1
#> 2: A 715822.8 5505289 2016-11-01 02:01:22 1 0 2
#> 3: A 715872.9 5505252 2016-11-01 04:01:24 1 0 3
#> 4: A 715820.5 5505231 2016-11-01 06:01:05 1 0 4
#> 5: A 715830.6 5505227 2016-11-01 08:01:11 1 0 5
#> ---
#> 14293: J 700616.5 5509069 2017-02-28 14:00:54 1 0 1393
#> 14294: J 700622.6 5509065 2017-02-28 16:00:11 1 0 1394
#> 14295: J 700657.5 5509277 2017-02-28 18:00:55 1 0 1440
#> 14296: J 700610.3 5509269 2017-02-28 20:00:48 1 0 1395
#> 14297: J 700744.0 5508782 2017-02-28 22:00:39 1 0 1396
'timegroup' %in% colnames(DT)
#> [1] TRUE

Created on 2023-09-17 with reprex v2.0.2

This is the source of your initial issue, where columns are not added to your example data. It is documented in spatsoc's FAQ here: https://docs.ropensci.org/spatsoc/articles/faq.html#why-does-a-function-print-the-result-but-columns-arent-added-to-my-dt. The solution is to over allocate columns with setalloccol(DT) after you read in an Rda or Rds, as in the example above.

The second issue is related to dplyr recreating the input data internally, shown here with dplyr:::dplyr_col_select:

library(data.table)
library(dplyr, warn.conflicts = FALSE)
d1 <- data.table(x = 1)
attr(d1, "foo") <- "bar"
truelength(d1)
#> [1] 1025
d2 <- dplyr:::dplyr_col_select(d1, "x")
attr(d2, "foo")
#> [1] "bar"
truelength(d2)
#> [1] 0

Created on 2023-09-17 with reprex v2.0.2

This means that while the data.table class is retained, the data.table's over allocated columns are lost when it is passed through dplyr, as truelength(d2) shows.

options(datatable.verbose=TRUE)
d2[, foo2 := 'bar2']
#> Detected that j uses these columns: <none>
#> Warning in `[.data.table`(d2, , `:=`(foo2, "bar2")): Invalid .internal.selfref
#> detected and fixed by taking a (shallow) copy of the data.table so that := can
#> add this new column by reference. At an earlier point, this data.table has been
#> copied by R (or was created manually using structure() or similar). Avoid
#> names<- and attr<- which in R currently (and oddly) may copy the whole
#> data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and
#> ?setattr. If this message doesn't help, please report your use case to the
#> data.table issue tracker so the root cause can be fixed or this message
#> improved.
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==4 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
colnames(d2)
#> [1] "x" "foo2"

I'll link to an issue in spatsoc where I'll improve the documentation, so it'll ping here when it's completed.
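The fix described above for Rda/Rds files can be sketched roughly as follows. This is only a minimal sketch, not code from spatsoc or data.table: ensure_overallocated() is a hypothetical helper name, and the small table and temporary file are made up for illustration. It checks the two conditions discussed above (data.table class and spare column slots) and calls setalloccol() only when the slots are missing.

```r
library(data.table)

# Hypothetical helper (not part of data.table or spatsoc): returns a
# data.table with spare column slots so := can add columns by reference.
ensure_overallocated <- function(DT) {
  # Condition 1: the object must have the data.table class
  stopifnot(is.data.table(DT))
  # Condition 2: truelength() must exceed ncol(); otherwise re-allocate
  if (truelength(DT) <= ncol(DT)) {
    DT <- setalloccol(DT)
  }
  DT
}

# Illustrative data: round trip a small data.table through an Rds file
DT <- data.table(x = 1:3)
f <- tempfile(fileext = '.Rds')
saveRDS(DT, f)
DT <- readRDS(f)

# Restore the over allocated columns before modifying by reference
DT <- ensure_overallocated(DT)
truelength(DT) > ncol(DT)

# Adding a column by reference now keeps it in DT
DT[, y := x * 2]
'y' %in% colnames(DT)
```

Assigning the helper's result back to DT is deliberate: setalloccol() is called inside the function, so the caller's object is only updated through the return value.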
There is a lot of weirdness going on with how data.table (through spatsoc) handles pass-by-reference.
I have now created a reprex, which can be found here. In case the repo is private, I'm putting the reprex here: