Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test horizon-less members of an SPC #134

Closed
dylanbeaudette opened this issue May 19, 2020 · 10 comments
Closed

test horizon-less members of an SPC #134

dylanbeaudette opened this issue May 19, 2020 · 10 comments

Comments

@dylanbeaudette
Copy link
Member

Does this make sense and is it supported by all methods?

@brownag
Copy link
Member

brownag commented May 25, 2020

Here are some stubs of things I've tried to break so far

library(aqp)

# 10 10cm thick profiles with 1cm thick slices
df <- data.frame(id = do.call('c',lapply(10:1, rep, 10)),
                 top = do.call('c', lapply(1:10, function(i) 0:9)),
                 bottom = do.call('c', lapply(1:10, function(i) 1:10)))

# add an 11th profile, with 1 "slice" that has NA depths
df <- rbind(df, data.frame(id=11, top=NA, bottom=NA))

# promotion to SPC works
depths(df) <- id ~ top + bottom

# plot works
plot(df)

# union, profileApply, glom work
plot(union(profileApply(df, glom, z1=2, z2=7)))

# filter works
plot(filter(df, !is.na(top) & !is.na(bottom)))
plot(filter(df, is.na(top) & is.na(bottom)))

# i-subset works
df[11,]

# j-subset works
df[11, 1]
length(df[11,1])

# j-subset fails gracefully
df[11, 2]
length(df[11,2])

@brownag
Copy link
Member

brownag commented May 25, 2020

I think a distinction from "horizonless" versus "fill" (as it is used in fetchNASIS) is that by "filling" with a single horizon with NA depths, we don't break the SPC, allow plots to "work" and all functions that use horizon data and are properly NA handling should be "OK"

@dylanbeaudette
Copy link
Member Author

dylanbeaudette commented May 25, 2020

Thanks for testing. First off, I'm not a fan of word fill for such behavior: if we do this, I'd prefer a more explicit argument name such as allowMissingHorizons or something like that.

I'm not sure what is stranger: missing horizons or NA depths, both of which are a conceptual problem for a data model that is supposed to be describing physical measurements. How about those functions that assume / require that horizons have depths? I'd rather allow horizon-less sites vs. attempt to check / keep track of NA in all depth handling. In addition, there will need to be extra checking for cases where there are missing horizon data for all sites.

This is a small example, and likely easily fixed, but slice currently creates some confusing output:

  1. 'k' is given a depth slice (is this a bug in slice)
  2. 'k' loses its hzID
library(aqp)

# 10 profiles
# reminder to allow for n_prop=0
x <- lapply(letters[1:10], random_profile, n_prop=1)
x <- do.call('rbind', x)
x <- x[, c('id', 'top', 'bottom')]

# add an 11th profile, with 1 "slice" that has NA depths
# this is profile 'k'
x <- rbind(x, data.frame(id=letters[11], top=NA, bottom=NA))

# promotion to SPC
depths(x) <- id ~ top + bottom

# visual check
par(mar=c(0,0,0,1))
plot(x)

# output is wrong:
s <- slice(x, fm = 10 ~ .)
horizons(s)

plot(s)

# seems to work
slab(x, id ~ p1, slab.structure = c(0,10))

I know that we can figure out a work-around, but I'm not yet convinced that the additional code complexity is worth the effort. Other options might include allowing for IDs to exist in @site and not in @horizons. This seems to work, but is not well-tested. Also, this makes SPC init from data without horizons more complex.

# 10 profiles
# reminder to allow for n_prop=0
x <- lapply(letters[1:10], random_profile, n_prop=1)
x <- do.call('rbind', x)
x <- x[, c('id', 'top', 'bottom', 'p1')]

# promotion to SPC
depths(x) <- id ~ top + bottom

# this would require an alternative init method...
# add an 11th profile to @site, probably dangerous
# this is profile 'k'
x@site <- rbind(site(x), data.frame(id=letters[11]))

# visual check: 'k' is missing
par(mar=c(0,0,0,1))
plot(x)

# works fine
s <- slice(x, fm = 10 ~ .)
horizons(s)

plot(s)

@brownag
Copy link
Member

brownag commented May 27, 2020

I think it is a false dichotomy to suggest that we either allow horizonless sites or handle NA in depths. I didn't really care about this matter one way or another until you said that. I think we could do just about anything to how these edge case profiles are handled and still have to deal with the possibilities of NA in depth. So lets just agree on that for now.

Are you suggesting that we eliminate that permissive behavior for NA depths in the SPC and replace with that with e.g. another constructor and/or lots of additional logic?

slice does exactly what it is supposed to -- and the only confusion is because we never really defined what is supposed to happen in that case. Specifically: how should NA depths be considered when calculating "missingPct" / contributing fractions.

We don't need the severely degenerate case of a profile with a single horizon with both depths NA to have this undefined behavior happen... just set any top or bottom depth in a SPC to NA.

Fix is simple: calculate idx in get.slice so that it identifies and handles NA an appropriate way --before calling which.

I don't think it is wrong per se to assign a slice ID to a depth interval from a non-existent horizon as long at the missingPct is 100% (or NA? which slab treats equivalently). In theory all profiles have the same number of slices defined by the LHS of the formula but some are empty. In practice, for actual calculation, you only get slices from a layer if it has a top and bottom depth that includes the slice range. That slice only "contributes" to e.g. a slab if it has values for the attribute(s) of interest.

It is a little odd that slice depths are returned from a depthless horizon -- but that actually makes sense as long the data in that slice are 100% missing (which they are, because a fill-ed profile has no top or bottom and no data). Slice does not return portions of profiles -- it scans through profiles and finds the data that occurs in 1cm intervals.

You can 'slice' thin air (er... a vacuum), and the result of your analysis (in terms of soil properties) is nothing. 100% missing. This applies to summarizing horizons you observed, but properties you didn't. And likewise to slicing a horizon depth you never observed but could have observed.

slice.fast dives right into manipulating depths with inequalities (wrapped in which -- which silently omits NA). For comparison, glom/clod.hz.ids performs a variety of topologic checks before trying to manipulate horizon depths. Under certain conditions the function cannot evaluate. At least to a limited degree slice needs to behave the same -- which, really won't take much additional effort. Only the most pathological cases should prevent slice from slicing (i.e. multiple sites but no horizons/depths at all? you could probably still return a result in that case -- it wouldn't be too helpful though!)

Since slice is vectorized, we don't see the degenerate edge cases as often -- even though they are in there. We would probably have seen effects from slice-ing on missing depths more often if it was used on individual profiles. I don't think I have ever sliced a NASIS component, either -- and those tend to have gone through a little more QC before we pull them into R. Regardless, some profiles will not have the data needed to complete computations across all possible intervals -- and aqp needs to be robust to that. I would further suggest that, by and large, it already is.

library(aqp)
x <- lapply(letters[1:10], random_profile, n_prop=1)
x <- do.call('rbind', x)
x <- rbind(x, data.frame(id=letters[11], top=NA, bottom=NA, name=NA, p1=NA))

# take 3rd horizon in first profile, set bottom depth to NA
x[3,3] <- NA

# promotion to SPC
depths(x) <- id ~ top + bottom

# visual check (horizon with NA bottom depth missing just like `k`)
par(mar=c(0,0,0,1))
plot(x)

# output is the same for the NA depth horizon as the bottom-depth-missing horizon
# the missingPct should possibly be modified to be 100 for slices from these NA-depth horizons -- just for clarity in a slice-only result
s <- slice(x, fm = 40 ~ .)
horizons(s)

# slab result is "correct" -- because NA and 100% missing are essentially the same in terms of contributing fraction -- there is no value in that slice to take the quantile of so all Q are NA.
slab(x, id ~ p1, slab.structure = c(20,40))

So far, slice is the only function that needs to be fortified against the NA depths case. It works surprisingly well -- one might even suggest that it works as expected -- considering that it was never designed for this. I doubt that fixes to implement a site-focused SPC constructor will be simpler than just fixing slice to slightly less confusingly handle NA depths.


Here is a philosophical question: Should we be treating a horizon that was not observed the same as one that doesn't exist (e.g. because a root-limiting layer was encountered and the profile stopped above that depth)? This is the discussion where I think the argument around horizonless SPCs becomes relevant to slice and slab. I would argue that an unobserved horizon should not be counted towards the denominator of the contributing fraction, whereas a depth range that was observed and definitively did not contain a soil horizon should be counted.

In your second example -- profile k is left out of the plot -- so it does not replicate one of the key things that we need for a horizonless SPC: the ability to plot and see that the collection contains one or more "profiles" that are horizonless. I assume we would alter the plot method or its dependencies to accommodate sites without horizons and retain current fill-ed SPC behavior. This is probably the number 2 reason we have fill for fetchNASIS -- after from ensuring our DMU comppct_r sum to 100.

I am respectful of the fact that creating the NA depth condition "artificially" is not ideal "for a data model that is supposed to be describing physical measurements" -- but ultimately our physical measurements have to live in some sort of data structure -- which is invariably an approximation of reality. Presence/absence of property data needs to be portrayed consistently. The structure is designed / maintained by humans and is prone to human errors, so we need the ability to portray the gaps in the data just as well as the data themselves.

The SoilProfileCollection and plotSPC is perhaps most often used for quick/basic visual inspection of data -- not for slicing/slabbing/gloming/anything rigorously quantitative.

There could be meaningful physical measurements at the site level as well -- especially when considering ecological sites or interpretations -- so there is value to making the inclusion of those data in the SPC seamless for analyses that might need to use a mixture of sites with and without horizons as supporting data. That doesn't have to be by adding horizons with NA depths -- but that would by far be the best way to achieve it in the current implementation. There are much bigger fish to fry than this one for aqp.

@brownag
Copy link
Member

brownag commented May 27, 2020

One more point -- even if you put some fake data into sites with NA depth horizons, slice will not lead you astray.

library(aqp)
x <- data.frame(id=1:10, top=rep(NA, 10), bottom=rep(NA, 10), none=rep(NA, 10), one=rep(1, 10))
depths(x) <- id ~ top + bottom
slice(x, 1 ~ none)
slice(x, 1 ~ one)
Horizon attributes:
   id top bottom none .pctMissing sliceID
1   1   1      2   NA          NA       1
10 10   1      2   NA          NA       2
2   2   1      2   NA          NA       3
3   3   1      2   NA          NA       4
4   4   1      2   NA          NA       5
5   5   1      2   NA          NA       6
Horizon attributes:
   id top bottom one .pctMissing sliceID
1   1   1      2  NA          NA       1
10 10   1      2  NA          NA       2
2   2   1      2  NA          NA       3
3   3   1      2  NA          NA       4
4   4   1      2  NA          NA       5
5   5   1      2  NA          NA       6

@dylanbeaudette
Copy link
Member Author

There is no dichotomy: I presented two possible options, neither of which I prefer, and outlined some pitfalls with our current codebase. There are a lot of moving parts and clearly room to adapt to horizon-less profiles, something that is way out of the original concept of the SPC. It is true that we are the keepers of this package and thus free to bend it to the specialized requirements of our daily jobs, however, the SPC is supposed to be generic.

I'd like to investigate some other packages to see how NULL components are handled. Spatial* objects don't work without coordinates, rasterStacks require coherent extents / resolution / CRS. These aren't perfect examples but worth considering. I'm curious about how sf accommodates NULL geometry.

I propose the following:

  • changes to fetchNASIS so that pedon / component data without horizons are integrated into the SPC in the same way

Lets have these discussions in person, in front of some examples. This is too complex for GH comments.

@brownag
Copy link
Member

brownag commented May 29, 2020

It is true that we are the keepers of this package and thus free to bend it to the specialized requirements of our daily jobs, however, the SPC is supposed to be generic.

And the changes proposed in your post would be breaking historic behavior (even if not the earliest behavior) of allowing NA..., which I think you all decided on long before I came around -- because of the value of being able to promote to SPC before doing diagnostics, essentially.

I'd like to investigate some other packages to see how NULL components are handled. Spatial* objects don't work without coordinates, rasterStacks require coherent extents / resolution / CRS. These aren't perfect examples but worth considering. I'm curious about how sf accommodates NULL geometry.

I see your point on Spatial.. but probably not the best example, as you admit...

> sp::SpatialPoints(data.frame(x=NA, y=NA)[0,])
Error in .local(obj, ...) : 
  cannot derive coordinates from non-numeric matrix

This is a "valid" generic SPC by all accounts. But the sp slot is broken if you access it (i.e. will generate process-stopping error!)

> df <- data.frame(id=1, top=0:9, bottom=1:10)
> depths(df) <- id ~ top + bottom
> df@sp
class       : SpatialPoints 
features    : 1 
Error in e[1, 2] : subscript out of bounds

raster seems to have it worked out though...

> raster::stack(raster::raster())
class      : RasterStack 
nlayers    : 0
> table(raster::values(raster::raster()), useNA = "ifany")
 <NA> 
64800 

I suggest we add a data.table import ASAP. Just to start incorporating it where we can, and getting dependencies straighened out so its not a big schock. I look forward to this development and am sympathetic to the effects that NA depths might have on the new JOIN based implementation.

Yes, clod.hz.ids is an example that requires "coherent" data -- and I think it is fair to handle certain irresolvable cases with either NA or NULL or length(x) == 0 result as appropriate, with a warning issued about topologic problems and profile_id(x). It is important that only the worst offenders generate errors -- i.e. things that are so bad that you either need to set up special error checking code to work with the data -- or should rightly shut things down.

I propose this as the default case...

df <- data.frame(id=1, top=NA, bottom=NA)
depths(df) <- id ~ top + bottom

in lieu of being able to do this...

df <- data.frame(id=1, top=NA, bottom=NA)[0,]
depths(df) <- id ~ top + bottom

Which currently generates:

Error: replacement horizon IDs must have same length as original

I would prefer to have 0-row site and horizon slots for a truly empty SPC. This is achievable (not sure if it should be?) if you start with a valid SPC and then subset.

df <- data.frame(id=1, top=0, bottom=1)
depths(df) <- id ~ top + bottom
df[0,]
SoilProfileCollection: 0 profiles | 0 horizons
profile ID: id
horizon ID: hzID

Horizon Attributes (first 4 of 4 columns):
------------------------------------------------
 id hzID top bottom
 NA   NA  NA     NA
[... more rows ...]

Site Attributes (first 1 of 1 columns):
---------------------------------------------
   id
 <NA>
[... more rows ...]
  • discussion of horizon depth pointer table, as discussed in 2D morph. issue, with NULL pointers for horizon-less SPC

Interesting -- but pretty experimental and theoretical. I guess I am most concerned about my upcoming presentations for SSSA -- we probably will not be implementing these more advanced things before then. So perhaps we should separate the more lofty goals from the immediate ones that we need to ensure my new functions are conforming.

Yeah -- this one is super important. You might be too close to the plotSPC function and its origins, but as you say, I think it is very useful for QC and general inspection. I would support anything we could do to segregate profile_id() from @horizons ... but I think that it currently is pretty important for logic. At least, in current codebase, it breaks slice tests if you try to use x@site[[idname(x)]] instead

  • changes to fetchNASIS so that pedon / component data without horizons are integrated into the SPC in the same way

Yeah, soilDB will adapt to whatever we decide here.

Lets have these discussions in person, in front of some examples. This is too complex for GH comments.

That is probably true / a good idea -- but just wanted to respond to your carefully laid out points so we're on same page.

@brownag
Copy link
Member

brownag commented May 29, 2020

sf

EDIT: whoops I mixed up examples

library(sf)
df <- data.frame(NA)
sfc <- st_sfc(st_point())
st_as_sf(df, geom=sfc)
> st_as_sf(df, geom=sfc)
Simple feature collection with 1 feature and 1 field (with 1 geometry empty)
geometry type:  POINT
dimension:      XY
bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
epsg (SRID):    NA
proj4string:    NA
  NA.        geom
1  NA POINT EMPTY

@brownag
Copy link
Member

brownag commented May 29, 2020

This also works

library(sf)
df <- data.frame()
sfc <- st_sfc(st_point())
st_as_sf(df, geom=sfc[0])

returning completely empty

Simple feature collection with 0 features and 0 fields
bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
epsg (SRID):    NA
proj4string:    NA
[1] geom
<0 rows> (or 0-length row.names)

@dylanbeaudette
Copy link
Member Author

I think that most of these sub-issues have either been resolved or sufficiently tested. Note that dice() will drop invalid (empty) profiles from a collection.

library(aqp)

x <- lapply(letters[1:10], random_profile, SPC = FALSE, n_prop = 1)
x <- do.call('rbind', x)

# add an 11th profile, with 1 "slice" that has NA depths
x <- rbind(x, data.frame(id = letters[11], top = NA, bottom = NA, name =NA, p1 = NA))

# promotion to SPC works
depths(x) <- id ~ top + bottom

# plot works
par(mar = c(1, 1, 3, 2))
plotSPC(x, color = 'p1', name = 'name')

# errors as expected
d <- dice(x)

# works
d <- dice(x, fm = 0:50 ~ ., byhz = FALSE)

# note that profile 'k' has been dropped
plotSPC(d, color = 'p1', lwd = 0.5)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants