Phase-in of data.table support in SPC objects and functions returning SPC objects #157

Open
20 of 21 tasks
dylanbeaudette opened this issue Jul 15, 2020 · 5 comments

@dylanbeaudette
Member

dylanbeaudette commented Jul 15, 2020

Keeping track of changes / testing related to data.table replacement of plyr / reshape functionality. Each change should be checked and a new test added if there isn't sufficient coverage. See related issue in soilDB and sharpshootR.

plyr

join → base::merge(..., incomparables = NA, sort = FALSE) or a data.table join (see the sketch after this list)

  • SoilProfileCollection-coercion
  • SoilProfileCollection-slice-methods (deprecated, in favor of dice)
  • munsell2rgb
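
A minimal sketch of the replacement pattern (toy data, not code from aqp): a plyr::join-style left join becomes base::merge() with incomparables and sort set, or a keyed data.table join.

library(data.table)

x <- data.frame(id = c(1, 2, 3), top = c(0, 10, 25))
y <- data.frame(id = c(1, 2, 4), name = c('A', 'Bt', 'C'))

# base::merge replacement: all.x = TRUE gives a left join,
# incomparables = NA keeps NA keys from matching each other,
# sort = FALSE avoids re-sorting on the by columns
z1 <- merge(x, y, by = 'id', all.x = TRUE, incomparables = NA, sort = FALSE)

# equivalent data.table left join (all rows of x retained)
z2 <- as.data.table(y)[as.data.table(x), on = 'id']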

ddply

  • SoilProfileCollection-setters
  • aggregateColor
  • aggregateSoilDepth
  • getmlhz
  • missingDataGrid

llply

  • profile_compare

ldply

  • SoilProfileCollection-slice-methods (deprecated, in favor of dice)
  • aggregateColor
  • soilColorSignature

dlply

  • aggregateColor
  • profile_compare (deprecated, in favor of NCSP)

reshape

cast → data.table::dcast

  • evalGenHz
  • soilColorSignature

Note that value argument becomes value.var.
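
A hedged sketch of that mapping (toy data, hypothetical column names):

library(data.table)

d <- data.table(genhz = c('A', 'A', 'Bt', 'Bt'),
                variable = c('clay', 'sand', 'clay', 'sand'),
                value = c(20, 40, 35, 30))

# reshape::cast(d, genhz ~ variable, value = 'value') becomes:
w <- dcast(d, genhz ~ variable, value.var = 'value')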

melt → data.table::melt

  • plotColorQuantiles
  • evalGenHz
  • slab
  • slab examples will need alternatives to the reshape package
  • soilColorSignature

If the object is not a data.table then perform on-the-fly conversion to/from. The following works in plotColorQuantiles:

as.data.frame(
  data.table::melt(
    as.data.table(res$marginal),
    id.var = c('p', 'L_colors', 'A_colors', 'B_colors', 'L_chip', 'A_chip', 'B_chip')
  )
)

However, the following warning appears with a similar conversion in evalGenHz:

In melt.data.table(as.data.table(h), id.vars = genhz, measure.vars = c(vars,  :
  'measure.vars' [prop, sil.width, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'double'. All measure variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion.

This is not a problem and can be suppressed for now. I've added a TODO item for converting all affected columns to numeric.
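
A hedged sketch of that TODO (toy data; 'prop' and 'sil.width' are the columns named in the warning above): coercing all measure variables to a common numeric type up front avoids the mixed-type warning.

library(data.table)

h <- data.table(genhz = c('A', 'Bt'), prop = c(1L, 2L), sil.width = c(0.2, 0.4))
measure.cols <- c('prop', 'sil.width')

# coerce measure variables to numeric in place, then melt without the warning
h[, (measure.cols) := lapply(.SD, as.numeric), .SDcols = measure.cols]
m <- melt(h, id.vars = 'genhz', measure.vars = measure.cols)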

Caveats / Notes

  • It appears that data.table::melt will not preserve factor type / levels unless asked to do so explicitly via the value.factor argument. A reasonable solution is to set the levels after the reshape operation when appropriate (see the sketch after this list).
  • id.vars should not contain duplicated column names
  • use eval to dereference variables inside of data.table[ , ... ]
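
Small sketches of the first and last caveats (toy data, hypothetical column names):

library(data.table)

d <- data.table(id = 1:2,
                a = factor(c('x', 'y'), levels = c('x', 'y', 'z')),
                b = factor(c('z', 'x'), levels = c('x', 'y', 'z')))

# factor type is preserved only when requested explicitly
m <- melt(d, id.vars = 'id', measure.vars = c('a', 'b'), value.factor = TRUE)

# alternative: set the levels after the reshape operation
m2 <- melt(d, id.vars = 'id', measure.vars = c('a', 'b'))
m2$value <- factor(m2$value, levels = levels(d$a))

# dereferencing a variable inside data.table[ , ... ] via eval()
v <- quote(a)
d[, eval(v)]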

I'm fine with the current trajectory as long as we don't (yet) guarantee class-in == class-out for all functions in aqp.

A Plan

  • continue with data.frame, tbl_df, data.table end-to-end support for functions using / returning SoilProfileCollection objects
  • improve test coverage all around
  • add on-the-fly data.table usage (results are still data.frame) to remaining functions where it could give significant performance gain (slice, slab, munsell2rgb, etc.)
  • test, test, test
  • push to CRAN and document that functions using / returning SoilProfileCollection objects will respect input table class
  • (time passes)
  • attend to soilDB
  • document class-safe function template (j-indexing, use of keys, ???)
  • finish DT-ification progressively---if we remain happy with that approach

Related issues

@brownag
Copy link
Member

brownag commented Jul 16, 2020

I will dump out a bunch of ideas, work that has been done so far, and some of the mysteries I am yet to figure out.

Extending and testing the SoilProfileCollection object

Most S4 methods that are fundamental to the SPC class and its basic operations (including subsetting, slot access, joins etc.) now respect the three different classes. It is a bit difficult to draw the conceptual line where "basic operations" stop and where "complicated" operations start... but here is an attempt:

  • Any method that takes SoilProfileCollection objects as input should be "aware" that the slots could be data.frame subclasses

  • Any method that builds an SPC (from depths<-) should be aware that the input class will set the metadata (and all other df classes) for the object

  • Any method that uses the SoilProfileCollection() constructor needs to make sure that the relevant values are set (I would not encourage anyone to use this. The less prevalent its use the better.)

To my knowledge there are three that use the constructor: [,SoilProfileCollection-method, .initSPCfromMF/depths<-, and permute_profile. permute_profile is a relatively new one of mine, and should be converted to use depths<- or possibly further refactored. I don't think there is anything in there that warrants using the constructor rather than depths + other setter methods/joins.

Methods have the ability to (a skeleton follows this list):

  1. optionally check the class of the SPC with aqp_df_class(object)
  2. optionally error (e.g. data.table required but not available)
  3. optionally convert intermediates to an internal class (essentially "pick one to convert to/from")
  4. use unambiguous code or a switch statement on classes
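
A rough skeleton of that pattern (not an actual aqp method; aqp_df_class() and .as.data.frame.aqp() are the helpers described in this thread, everything else is illustrative):

someMethod <- function(object, ...) {
  # 1. check which data.frame subclass this SPC is built on
  use_class <- aqp_df_class(object)

  # 2. error if an optional backend is required but not installed
  if (use_class == "data.table" && !requireNamespace("data.table", quietly = TRUE))
    stop("package 'data.table' is required", call. = FALSE)

  # 3. convert intermediates to one internal class so the code is unambiguous
  h <- data.table::as.data.table(horizons(object))

  # ... computations on h ...

  # 4. convert back so class-in matches class-out
  .as.data.frame.aqp(h, use_class)
}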

There are several methods out there that do not have a method signature ensuring the SPC input relationship. Also, there are several methods with an option to return SPCs that do not currently respect these features (they return data.frame-based SPCs no matter what the input is). The latter methods only have a mechanism to "respect" the aqp_df_class if they receive an SPC as input (i.e. they need the metadata in addition to the data) OR if the class is given as an argument / calculated internally from something else.

data.table and tbl_df awareness

Most of the work of data.table and tbl_df inheritance is done automatically by defining their "old classes" using setOldClass(). This allows things like merge(x, y) to automatically trigger merge.data.table, for instance, if x is a data.table. This allows much of the important code for e.g. joins inside the object to be written in an unambiguous way with little effort / conceptual overhead.

There are two special internal aqp methods for working with slots that may contain data.frame-like classes:

  • object is the data.frame-like object to operate on (e.g. the contents of an SPC slot)
  • target_class is a length-one character vector, e.g. "data.frame" [default/failsafe], "data.table", or "tbl_df"
  • column_names is a vector of column names (as would be used in the data.frame column or "j"-index)

.as.data.frame.aqp -- convert a data.frame-like object to target class

  • .as.data.frame.aqp(object, target_class)

.data.frame.j -- subset a data.frame-like object using a vector of column names in the j-index

  • .data.frame.j(object, column_names, target_class)

object <- fetchSPCSomehow()
cn <- c("column1", "column2")

# dat could contain data.frame, data.table or tbl_df class at this point
dat <- horizons(object)

# inspect
print(aqp_df_class(object))

Working with data.table object j index subsetting in an unambiguous way requires that one rewrite:

# 1. allowed for data.frame and tbl_df but NOT data.table
dat[,cn]

as

# 2. allowed for all three classes
.data.frame.j(dat, cn, aqp_df_class(object))

Anywhere a NUMERIC j-index is used should be refactored to use column names (which are much less ambiguous / error-prone). This is generally good practice even for pure data.frames, IMO, and the conversion is easy (index + names attribute).
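
For example, continuing the dat / object example above (the numeric index here is hypothetical):

j <- c(3, 5)                  # existing numeric j-index
cn <- names(dat)[j]           # index + names attribute
.data.frame.j(dat, cn, aqp_df_class(object))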

Where appropriate we should consider defining functions that take an SPC object as input as S4 methods with SPC method signature. I also find that methods that return SPCs are really handy for %>%-based workflows.

I don't think we should overdo it with wrapper methods, but I also think that the point of S4 is at least in part to dispatch methods for our special class and make sure things are handled safely (and ideally easily for the user). So if there is a major benefit to getting a particular aqp function's output into an SPC, let's define that method in aqp rather than leaving the user to figure it out.

Wrappers for SPC object

The gist below shows a basic wrapper around munsell2rgb. I want to highlight how handling of data.frames for slots within the SPC works in series with on-the-fly data.table implementations for specific functions.

https://gist.github.com/brownag/e57cb66e32ea0842b7e507a18e228906

The wrapper method could be dispatched via S4 (e.g. munsell2rgb,SoilProfileCollection-method) so one could use the 3-vector or the SPC interface to munsell2rgb (using the same function name, different method signature).
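
A minimal sketch in that spirit (this is not the gist code; the wrapper name, argument defaults, and column names are assumptions, and sp3 is an aqp example dataset that happens to include hue / value / chroma columns):

library(aqp)

munsell2rgbSPC <- function(spc, hue = 'hue', value = 'value', chroma = 'chroma',
                           col = 'soil_color') {
  h <- horizons(spc)
  h[[col]] <- munsell2rgb(h[[hue]], h[[value]], h[[chroma]])
  # horizons<- merges/replaces horizon data, keeping the class of the SPC slots
  horizons(spc) <- h
  spc
}

# usage
data(sp3)
depths(sp3) <- id ~ top + bottom
sp3 <- munsell2rgbSPC(sp3)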

On the fly

However, there is a ton of heavy lifting that goes on inside munsell2rgb(hue, value, chroma, ...) that could probably be optimized using data.table. In the current system, where data.table is in Suggests, one would add a requireNamespace() call in the relevant scope to trigger (or error on) the data.table code.
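
A minimal sketch of that guard (rbindlist() is only a stand-in for whatever data.table code the function actually needs):

pieces <- list(data.frame(a = 1, b = 2), data.frame(a = 3, b = 4))

if (requireNamespace("data.table", quietly = TRUE)) {
  # fast path: data.table, converted back so the result is still a data.frame
  res <- as.data.frame(data.table::rbindlist(pieces))
} else {
  # base R fallback
  res <- do.call(rbind, pieces)
}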

It's not clear how much more effort should be put into retaining historic data.frame implementations where data.table is obviously better or dramatically simplifies the code. I have no good suggestions at this point, but munsell2rgb may make a really great case study for how to convert a pretty complex internal method that relies heavily on the data.frame object. While I have said we should retain base R functionality (I am mostly referring to the basic methods / within the SPC class), that of course has limits, as you have pointed out. I will not lose any sleep if e.g. we omit the plyr::rbind.fill from union and make no attempt to write that mess in base R.

In the examples I have done so far it has been relatively simple to retain the old data.frame functionality -- but munsell2rgb is considerably more complicated. A wholesale rewrite using data.table (the slice -> dice approach) may be most appropriate -- I need to study how it works more and think on the options that are available.

Even without internal enhancements, there is value in having the wrapper method from the gist above around it, to handle merging the result back into the SPC for the user and adding horizon/site IDs.

Keys

For some operations it is plainly evident that it is worth applying keys -- even on the fly. I think there is nothing wrong with setting a key on the fly for operations like dice -- where there is tremendous benefit to be gained through the course of that function call -- even if the key is "thrown out" at the end.
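
A sketch of an on-the-fly key (toy horizon data; the column names mirror typical SPC horizon slots):

library(data.table)

h <- data.table(id = rep(c('A', 'B'), each = 2),
                top = c(0, 10, 0, 15),
                bottom = c(10, 30, 15, 40))

# key set only for the duration of the expensive operation
setkeyv(h, c('id', 'top'))

h['A']                                     # keyed (binary search) subset on id
h[, .(max.depth = max(bottom)), by = id]   # grouped aggregation

# returning a plain data.frame at the end effectively throws the key away
res <- as.data.frame(h)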

I am cautious about more broad handling of keys ... i.e. within SPC methods like [ ... mostly because of my limited understanding.

E.g. in [ ... I explicitly constrain using by=. I think it may be better to set the key in addition, but I've had mixed results testing this... the results led me to believe that the two approaches were performing essentially the same. I wasn't sure whether the key was being ignored or what.

In contrast, I was very surprised to see how badly the data.table code in min/max performed, and an 8x difference between keyed and unkeyed. I wondered whether I messed something up in the DT syntax, making it way more costly -- since tapply weighed in at only 3x slower, it didn't seem possible that data.table could be that bad (or that tapply could be that good, rather). It is important to remember that a lot of the base R stuff is highly optimized!!

Clearly me "doing it wrong" is the most likely explanation.
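
For reference, how such a comparison might be set up (this is not the benchmark referred to above; microbenchmark is just one way to time it):

library(data.table)
library(microbenchmark)

n <- 1e5
d <- data.frame(id = sample(1e4, n, replace = TRUE), top = runif(n))
dt <- as.data.table(d)
dtk <- copy(dt)
setkey(dtk, id)

microbenchmark(
  tapply   = tapply(d$top, d$id, min),
  dt       = dt[, min(top), by = id],
  dt_keyed = dtk[, min(top), by = id],
  times = 10
)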

  • I suspect the cumulative effect of setting keys, or doing joins where a key is expected but not set (or otherwise "defined" e.g. using by=), could be significant across a series of function calls for a typical workflow run on a large SPC. Though this is just a suspicion and it probably is [at least in part] wrong. This applies to a broader idea of allowing keys to be "re-used" and making sure they persist at some level within the SPC object itself -- not to on the fly keys.

  • I notice that keys sometimes aren't set when I "thought" they would be. I need to read up more on the by-reference and key-setting stuff for data.table, as I think it is the major conceptual gap I have relative to my intuition for base R. While I'm not a stranger to by-reference semantics, it's not how my R brain thinks.

  • My understanding is that the DT is generally not copied -- essentially just passing around references -- but some of the error messages I have gotten in weird cases running interactively have made me question my understanding. I am not clear on what non-data.table operations may create shallow copies versus true copies of data.tables, and how this interacts with setting keys, operations, etc. (see the small sketch after this list). Apparently unintentional copying is something the data.table creators struggle with.
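
A small sketch of the by-reference versus copy behaviour (my current understanding, hedged accordingly):

library(data.table)

dt <- data.table(a = 1:3)

# := and set*() modify dt in place, no copy
dt[, b := a * 2]
setkey(dt, a)

# base-style $<- still works on a data.table but is not by-reference
dt$c <- dt$a + 1

# an explicit deep copy when an independent object is needed
dt2 <- copy(dt)
dt2[, d := 0]      # dt is unaffected

# tracemem() helps spot when R silently makes a copy
tracemem(dt)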

I still need to experiment a lot more with this -- at this point I was elated to have something that worked, let alone be several-fold faster across the board. Figuring out the right "data.table" way to do things will be needed to unlock the final gains in performance and solve these mysteries. Before I started this I knew next to nothing about data.table. Now I feel like I have a pretty good handle on how it is different -- but I am a total noob when it comes to effectively rewriting common base R workflows. I try many things and get many verbose errors from the data.table class... thankfully the error messages are pretty informative and there are many, many guides and examples out there. It is clearly a very well thought out and powerful platform -- though IMO not for the uninitiated.

@brownag
Member

brownag commented Jan 20, 2021

Forgot to tag this issue in recent commits adding the data.table import:

  1. f56c8af
  2. e1438b0

https://cran.r-project.org/web/packages/data.table/vignettes/datatable-importing.html

@dylanbeaudette
Member Author

dylanbeaudette commented Jan 20, 2021

EDIT: @brownag moved checkboxes to OP so that the progress tracking shows up in the issue

@brownag
Member

brownag commented Jan 21, 2021

Probably related to merge.data.table downstream side-effects: ncss-tech/soilDB#158
The statement above and the mention below were a false alarm.

brownag added a commit that referenced this issue Jan 21, 2021
* checkHzDepthLogic: use data.table instead of profileApply #157 #174

* data.table: R CMD check-safe .() -> list() and define "undefined globals" as null locally #157 #174

* #190; looks like this got uncommented when I rebased onto latest changes
@brownag
Member

brownag commented Jan 21, 2021

moved checkboxes to OP so that the progress tracking shows up in the issue

brownag added a commit that referenced this issue Jan 21, 2021
brownag added a commit that referenced this issue Jan 21, 2021
brownag added a commit that referenced this issue Jan 22, 2021
* sim: update docs; remove reshape example #157

* sp3 example data.table patch #157

* update aqp demo to use data.table #157

* docs

* fix missing comment example
brownag added a commit that referenced this issue Jan 22, 2021
* Tests for "round trip site normalization/denormalization"; fix target for #171

* Add warnings for bad site<-~ normalization; closes #171; remove ddply #157
brownag added a commit that referenced this issue Feb 24, 2021
* Use data.table for all j-index subsetting #157

* Significant optimization for j-index subsetting of SPC

* Add more competitive version of base function for j-index

* spc_j_base_2: 100k benchmark ~0.1s faster than original, nearly 3x slower than new DT

* spc_j_base_3: more competitive for small SPCs

* add demo of SPC non-standard eval "k-index" .LAST keyword #199 #201

* add experimental support for .FIRST, .HZID and multiple special keywords

* better logic for i+j+k

* Convert demo to tests

* Docs