Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data.table re-write of slice() #115

Closed
15 of 19 tasks
dylanbeaudette opened this issue Jan 21, 2020 · 12 comments
Closed
15 of 19 tasks

data.table re-write of slice() #115

dylanbeaudette opened this issue Jan 21, 2020 · 12 comments

Comments

@dylanbeaudette
Copy link
Member

dylanbeaudette commented Jan 21, 2020

A Plan

  1. finish dice
  2. test / document dice
  3. deprecate slice
  4. fix many affected functions / documentation (note slice max + 1 convention)
  5. enable slicing shortcut through the SPC k-index (maybe)

TODO

  • finish tests
  • side/side comparisons with slice in all aqp functions (this will take time, and aqp 2.0 is not backwards compatible)
  • sometimes fm = 0:z ~ . results in z+1 slices (bug related to filling to bottom of z-index, z +1)
  • NA-padding to depth via fillHzGaps (default = FALSE), important for slab and profile_compare
    • fully-integrate newly optimized fillHzGaps, consider separate arguments to control filling / padding
  • formula interface (backwards compatibility, less data to push around):
  • .pctMissing eval (new function)
    • optimize with data.table (see warnings...)
    • DT warnings related := (see below)
  • ERRORS on NA bottom depth (consider repairMissingHzDepths())
  • horizon depth logic errors: flag (@metadata) / subset:
  • integrate into slab() (problems as of 2022-01-04), or integrate into slab-rewrite (better). (different PR / issue)

Long-Term

Updates

  • 2021-02-11: just about ready for use. This is not a direct replacement for slice, maybe introduce now and phase-in by aqp 2.0.
  • 2022-01-04: much more robust, direct replacement for slice in most cases. will attempt to change all slicedice in the SPC introduction and see what happens. a messages is now printed when using slice, will be deprecated in aqp 2.0.

See draft code for dice(). This is about 6x faster than the current implementation of slice, with a lower memory footprint as well.

@brownag
Copy link
Member

brownag commented Feb 19, 2021

Reprex for slab() on data.table-SPC slots

library(aqp, warn=FALSE)
#> This is aqp 1.27

data(sp4)
sp4$name <- factor(sp4$name)

depths(sp4) <- id ~ top + bottom

# works
a <- slab(sp4, ~ name)
head(a)
#>   variable all.profiles   A  A1  A2  AB ABt  Bt Bt1 Bt2 Bt3 top bottom
#> 1     name            1 0.9 0.1 0.0 0.0 0.0 0.0 0.0   0   0   0      1
#> 2     name            1 0.9 0.1 0.0 0.0 0.0 0.0 0.0   0   0   1      2
#> 3     name            1 0.8 0.0 0.1 0.0 0.0 0.0 0.1   0   0   2      3
#> 4     name            1 0.4 0.0 0.1 0.0 0.1 0.1 0.3   0   0   3      4
#> 5     name            1 0.3 0.0 0.1 0.0 0.1 0.1 0.4   0   0   4      5
#> 6     name            1 0.3 0.0 0.0 0.1 0.1 0.1 0.4   0   0   5      6
#>   contributing_fraction
#> 1                     1
#> 2                     1
#> 3                     1
#> 4                     1
#> 5                     1
#> 6                     1

# does not work
data(sp4)
sp4$name <- factor(sp4$name)
sp4 <- data.table::as.data.table(sp4)
depths(sp4) <- id ~ top + bottom

b <- slab(sp4, ~ name)
#> Error in h[, vars]: incorrect number of dimensions

@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Feb 19, 2021

Yup, slab and many other functions in {aqp} are not data.table aware. There are going to be a lot of x[, vars] cases that need to be generalized. I've got a number of small todos before dice can (almost entirely) replace slice. I'm on the fence as to:

  • deprecate slice since it now conflicts with {dplyr} in favor of dice
  • merge dice code into slice and forget about namespace conflicts

This last round of S4SS has demonstrated the many ways in which {dplyr} can interfere with {aqp}.

@brownag
Copy link
Member

brownag commented Feb 19, 2021

I think I would lean towards your option 1, based on recent experiences we all have had

@dylanbeaudette
Copy link
Member Author

Me too. Hegemony for the tidyverse and all.

Here is the evolving plan: finish dice → deprecate slice → fix many affected functions / documentation → enable slicing shortcut through the SPC k-index (maybe).

dylanbeaudette added a commit that referenced this issue Feb 19, 2021
functions are not yet exported, see R/dice.R for possibly subset of
profiles or horizons when depths are invalid, and keeping track in
metadata @brownag
@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Feb 20, 2021

Making progress, plenty of room to optimize:

  • mapply is a bottleneck
  • strict = TRUE
  • byhz = TRUE
  • z-slice specification up front vs. at end
  • gap-filling and padding with empty hz up/down according to z
  • pctMissing = TRUE -> 6-8x slowdown
  • data.table keys (?) -> causes sort / re-sort

10,000 profiles, no DT key or index:

  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory          time    gc       
1 slice        15.8s   15.8s    0.0634     1.6GB     1.65     1    26      15.8s <NULL> <Rprofmem[,3] … <bch:t… <tibble …
2 dice_fm      10.3s   10.3s    0.0972   530.6MB     2.91     1    30      10.3s <NULL> <Rprofmem[,3] … <bch:t… <tibble …
3 dice         10.1s   10.1s    0.0988   459.1MB     2.08     1    21      10.1s <NULL> <Rprofmem[,3] … <bch:t… <tibble …

10,000 profiles, DT index (no sorting):

  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory          time    gc       
1 slice       15.99s  15.99s    0.0626     1.6GB     1.81     1    29     15.99s <NULL> <Rprofmem[,3] … <bch:t… <tibble …
2 dice_fm      9.83s   9.83s    0.102    535.1MB     3.15     1    31      9.83s <NULL> <Rprofmem[,3] … <bch:t… <tibble …
3 dice         9.12s   9.12s    0.110    462.3MB     2.52     1    23      9.12s <NULL> <Rprofmem[,3] … <bch:t… <tibble …

10,000 profiles, DT key (sorting):

 expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory          time    gc       
1 slice        15.8s   15.8s    0.0633    1.58GB     1.83     1    29      15.8s <NULL> <Rprofmem[,3] … <bch:t… <tibble …
2 dice_fm      10.3s   10.3s    0.0972  518.32MB     3.40     1    35      10.3s <NULL> <Rprofmem[,3] … <bch:t… <tibble …
3 dice           10s     10s    0.100   447.62MB     3.10     1    31        10s <NULL> <Rprofmem[,3] … <bch:t… <tibble …

10,000 profiles, various arguments, latest improvements to min, max, fillHzGaps:

  expression                min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
1 slice_strict            20.6s  20.6s    0.0486     1.6GB    1.46      1    30      20.6s <NULL> <Rpro~
2 slice                  21.05s 21.05s    0.0475    1.59GB    0.855     1    18     21.05s <NULL> <Rpro~
3 dice                    6.06s  6.06s    0.165   540.95MB    1.98      1    12      6.06s <NULL> <Rpro~
4 dice_byhz              11.39s 11.39s    0.0878  546.16MB    2.72      1    31     11.39s <NULL> <Rpro~
5 dice_fm                 5.07s  5.07s    0.197   680.91MB    2.37      1    12      5.07s <NULL> <Rpro~
6 dice_fm_nofill          5.42s  5.42s    0.184   680.91MB    1.66      1     9      5.42s <NULL> <Rpro~
7 dice_no_chk             4.29s  4.29s    0.233   535.34MB    2.57      1    11      4.29s <NULL> <Rpro~
8 dice_no_chk_pctmissing  4.25s  4.25s    0.236    675.5MB    2.36      1    10      4.25s <NULL> <Rpro~

@dylanbeaudette
Copy link
Member Author

slice pads all profiles with NA to the maximum depth of slicing (specified in fm), dice does not.

image

I'm going to implement this as an option in dice, but there are only a couple of cases where the extra work / bloat are worth it:

  • functions expecting it (slab, profile_compare at the moment)
  • expectation about the SPC j index post-slicing/dicing

dylanbeaudette added a commit that referenced this issue Feb 24, 2021
This is in support of dice (#115) and backwards compatibility with slice. Conceptually, this function converts a ragged array to matrix by padding with NA.
@dylanbeaudette
Copy link
Member Author

Crap, when asking for a slice or range of slices that fall entirely within a horizon gap or horizon with bogus depths there is an error. This happens because the re-packed horizons are missing a profile ID that exists in site.

data(sp4)
depths(sp4) <- id ~ top + bottom

sp4$top[1:2] <- NA

## this throws an error
d <- dice(sp4, fm = 5 ~ ., byhz = TRUE)

## this works, but a profile is dropped
d <- dice(sp4, fm = 5 ~ ., byhz = FALSE)

## this breaks at mapply step
d <- dice(sp4, fm = 5 ~ ., strict = FALSE)

# old slice, NA returned
s <- slice(sp4, fm = 5 ~ .)

profile_id(d) <- sprintf("%s-dice", profile_id(d))
z <- combine(s, d)

par(mar = c(0, 0, 0, 0))
plotSPC(z[1:8, ], color = 'Ca')

@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Feb 24, 2021

With the addition of growEmptyHz, dice can now pad with NA. Note that this is still not quite the same as the output from slice which always gave one extra slice. This convention will not be carried forward.

image

Also, growEmptyHz now benefits from the addition of .LAST and .HZID keywords (#201).

dylanbeaudette added a commit that referenced this issue Feb 24, 2021
@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Feb 24, 2021

Encountering DT warnings when .pctMissing = TRUE:

In [.data.table(res, , :=(.pctMissing, rowSums(is.na(res[, .SD, :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

dylanbeaudette added a commit that referenced this issue Feb 24, 2021
dylanbeaudette added a commit that referenced this issue Mar 1, 2021
fillHzGaps() is not yet optimized, so this option is disabled by
default.
@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Mar 1, 2021

Crap, when asking for a slice or range of slices that fall entirely within a horizon gap or horizon with bogus depths there is an error. This happens because the re-packed horizons are missing a profile ID that exists in site.

No longer a problem, fillHzGaps can help with gaps in the horizon depth sequence but still needs optimization.

image

data(sp4)
depths(sp4) <- id ~ top + bottom

# corrupt 2nd horizon of colusa
sp4$top[2] <- NA

# remove invalid horizons
x <- HzDepthLogicSubset(sp4, byhz = TRUE)

# fill hz gaps
x <- fillHzGaps(x)

# works after gap-filling
d <- dice(x, fm = 0:15 ~ ., byhz = TRUE)

# old slice, NA returned
s <- slice(sp4, fm = 0:15 ~ .)

profile_id(d) <- sprintf("%s-dice", profile_id(d))
z <- combine(s, d)

par(mar = c(0, 0, 0, 0))
plotSPC(z[1:8, ], color = 'Ca')

@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Jan 4, 2022

Encountering DT warnings when .pctMissing = TRUE:

In [.data.table(res, , :=(.pctMissing, rowSums(is.na(res[, .SD, :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

This is now addressed / tested by using set() vs. := notation. Still not sure why this was happening, but no more warnings + same results.

@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented Jan 4, 2022

Latest benchmarks using 10,000 random profiles.

  expression                  min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
1 slice_strict             19.81s   19.81s    0.0505    1.58GB    0.656     1    13     19.81s
2 slice                    20.76s   20.76s    0.0482    1.57GB    0.482     1    10     20.76s
3 dice                      4.92s    4.92s    0.203   540.95MB    1.22      1     6      4.92s
4 dice_byhz                   13s      13s    0.0769  547.78MB    1.38      1    18        13s
5 dice_fm                   4.72s    4.72s    0.212   680.66MB    1.27      1     6      4.72s
6 dice_fm_nofill            6.06s    6.06s    0.165   680.66MB    1.16      1     7      6.06s
7 dice_no_chk               4.26s    4.26s    0.235   535.34MB    1.41      1     6      4.26s
8 dice_no_chk_pctmissing    4.53s    4.53s    0.221   792.12MB    1.32      1     6      4.53s

Update 2022-08-23, using DT vs. mapply() slice vector expansion.

  expression                  min   median `itr/sec` mem_al…¹ gc/se…² n_itr  n_gc total…³ result memory     time       gc   
1 slice_strict             14.99s   14.99s    0.0667   1.59GB    1.67     1    25  14.99s <NULL> <Rprofmem> <bench_tm> <tibble>
2 slice                    14.75s   14.75s    0.0678   1.57GB    1.83     1    27  14.75s <NULL> <Rprofmem> <bench_tm> <tibble>
3 dice                      2.95s    2.95s    0.339  660.81MB    4.07     1    12   2.95s <NULL> <Rprofmem> <bench_tm> <tibble>
4 dice_byhz                  6.7s     6.7s    0.149  667.64MB    3.43     1    23    6.7s <NULL> <Rprofmem> <bench_tm> <tibble>
5 dice_fm                   7.37s    7.37s    0.136  790.09MB    4.34     1    32   7.37s <NULL> <Rprofmem> <bench_tm> <tibble>
6 dice_fm_nofill            6.52s    6.52s    0.153  790.09MB    3.53     1    23   6.52s <NULL> <Rprofmem> <bench_tm> <tibble>
7 dice_no_chk               2.65s    2.65s    0.377   655.2MB    4.53     1    12   2.65s <NULL> <Rprofmem> <bench_tm> <tibble>
8 dice_no_chk_pctmissing     3.1s     3.1s    0.323  921.32MB    5.17     1    16    3.1s <NULL> <Rprofmem> <bench_tm> <tibble>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants