Skip to content

Commit

Permalink
[r] Implement missing domain argument to DataFrame create (#3032)
Browse files Browse the repository at this point in the history
* [r] Implement missing `domain` argument to `DataFrame` `create`

* devtools::document()

* add a helpful comment [skip ci]

* code-review feedback

Co-authored-by: Paul Hoffman <mojaveazure@users.noreply.github.com>

* typofix

* update comment [skip ci]

* DESCRIPTION and NEWS.md [skip ci]

* devtools::document()

* code-review feedback

Co-authored-by: Paul Hoffman <mojaveazure@users.noreply.github.com>

---------

Co-authored-by: Paul Hoffman <mojaveazure@users.noreply.github.com>
  • Loading branch information
johnkerl and mojaveazure authored Sep 30, 2024
1 parent a162dad commit 8885e31
Show file tree
Hide file tree
Showing 9 changed files with 431 additions and 109 deletions.
2 changes: 1 addition & 1 deletion apis/r/DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ Description: Interface for working with 'TileDB'-based Stack of Matrices,
like those commonly used for single cell data analysis. It is documented at
<https://github.com/single-cell-data>; a formal specification available is at
<https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md>.
Version: 1.14.99.1
Version: 1.14.99.2
Authors@R: c(
person(given = "Aaron", family = "Wolen",
role = c("cre", "aut"), email = "aaron@tiledb.com",
Expand Down
1 change: 1 addition & 0 deletions apis/r/NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

## Changes

* Implement missing `domain` argument to `SOMADataFrame` `create` [#3032](https://github.com/single-cell-data/TileDB-SOMA/pull/3032)
* Remove unused `fragment_count` accessor [#3054](https://github.com/single-cell-data/TileDB-SOMA/pull/3054)

# tiledbsoma 1.14.1
Expand Down
10 changes: 10 additions & 0 deletions apis/r/R/Factory.R
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,14 @@
#' @param index_column_names A vector of column names to use as user-defined
#' index columns; all named columns must exist in the schema, and at least
#' one index column name is required
#' @param domain An optional list of 2-element vectors specifying the domain of each index
#' column. Each vector should be a pair consisting of the minimum and maximum values storable in
#' the index column. For example, if there is a single int64-valued index column, then `domain`
#' might be `c(100, 200)` to indicate that values between 100 and 200, inclusive, can be stored
#' in that column. If provided, this list must have the same length as `index_column_names`,
#' and the index-column domain will be as specified. If omitted entirely, or if `NULL` in a given
#' dimension, the corresponding index-column domain will use the minimum and maximum possible
#' values for the column's datatype. This makes a `DataFrame` growable.
#' @param ingest_mode Ingestion mode when creating the TileDB object; choose from:
#' \itemize{
#' \item \dQuote{\code{write}}: create a new TileDB object and error if it already exists
Expand All @@ -24,6 +32,7 @@ SOMADataFrameCreate <- function(
uri,
schema,
index_column_names = c("soma_joinid"),
domain = NULL,
ingest_mode = c("write", "resume"),
platform_config = NULL,
tiledbsoma_ctx = NULL,
Expand All @@ -50,6 +59,7 @@ SOMADataFrameCreate <- function(
sdf$create(
schema,
index_column_names = index_column_names,
domain = domain,
platform_config = platform_config,
internal_use_only = "allowed_use"
)
Expand Down
31 changes: 28 additions & 3 deletions apis/r/R/SOMADataFrame.R
Original file line number Diff line number Diff line change
Expand Up @@ -20,12 +20,21 @@ SOMADataFrame <- R6::R6Class(
#' @param index_column_names A vector of column names to use as user-defined
#' index columns. All named columns must exist in the schema, and at least
#' one index column name is required.
#' @param domain An optional list of 2-element vectors specifying the domain of each index
#' column. Each vector should be a pair consisting of the minimum and maximum values storable in
#' the index column. For example, if there is a single int64-valued index column, then `domain`
#' might be `c(100, 200)` to indicate that values between 100 and 200, inclusive, can be stored
#' in that column. If provided, this list must have the same length as `index_column_names`,
#' and the index-column domain will be as specified. If omitted entirely, or if `NULL` in a given
#' dimension, the corresponding index-column domain will use the minimum and maximum possible
#' values for the column's datatype. This makes a `DataFrame` growable.
#' @template param-platform-config
#' @param internal_use_only Character value to signal this is a 'permitted' call,
#' as `create()` is considered internal and should not be called directly.
create = function(
schema,
index_column_names = c("soma_joinid"),
domain = NULL,
platform_config = NULL,
internal_use_only = NULL
) {
Expand All @@ -35,6 +44,20 @@ SOMADataFrame <- R6::R6Class(
}
schema <- private$validate_schema(schema, index_column_names)

stopifnot(
"domain must be NULL or a named list, with values being 2-element vectors or NULL" = is.null(domain) ||
( # Check that `domain` is a list of length `length(index_column_names)`
# where all values are named after `index_column_names`
# and all values are `NULL` or a two-length atomic non-factor vector
rlang::is_list(domain, n = length(index_column_names)) &&
identical(sort(names(domain)), sort(index_column_names)) &&
all(vapply_lgl(
domain,
function(x) is.null(x) || (is.atomic(x) && !is.factor(x) && length(x) == 2L)
))
)
)

attr_column_names <- setdiff(schema$names, index_column_names)
stopifnot("At least one non-index column must be defined in the schema" =
length(attr_column_names) > 0)
Expand All @@ -43,12 +66,14 @@ SOMADataFrame <- R6::R6Class(
# typed, queryable data structure.
tiledb_create_options <- TileDBCreateOptions$new(platform_config)

## we (currently pass domain and extent values in an arrow table (i.e. data.frame alike)
## where each dimension is one column (of the same type as in the schema) followed by three
## values for the domain pair and the extent
# We currently pass domain and extent values in an arrow table (i.e. data.frame alike)
# where each dimension is one column (of the same type as in the schema followed by:
# * Before the new shape feature: three values for the domain pair and the extent;
# * After the new shape feature: five values for the maxdomain pair, extent, and domain.
dom_ext_tbl <- get_domain_and_extent_dataframe(
schema,
ind_col_names = index_column_names,
domain = domain,
tdco = tiledb_create_options
)

Expand Down
62 changes: 56 additions & 6 deletions apis/r/R/utils-arrow.R
Original file line number Diff line number Diff line change
Expand Up @@ -348,21 +348,33 @@ extract_levels <- function(arrtbl, exclude_cols=c("soma_joinid")) {
#' Domain and extent table creation helper for data.frame writes returning a Table with
#' a column per dimension for the given (incoming) arrow schema of a Table
#' @noRd
get_domain_and_extent_dataframe <- function(tbl_schema, ind_col_names,
get_domain_and_extent_dataframe <- function(tbl_schema, ind_col_names, domain = NULL,
tdco = TileDBCreateOptions$new(PlatformConfig$new())) {
stopifnot("First argument must be an arrow schema" = inherits(tbl_schema, "Schema"),
"Second argument must be character" = is.character(ind_col_names),
"Second argument cannot be empty vector" = length(ind_col_names) > 0,
"Second argument index names must be columns in first argument" =
all(is.finite(match(ind_col_names, names(tbl_schema)))),
"Third argument must be options wrapper" = inherits(tdco, "TileDBCreateOptions"))
stopifnot(
"domain must be NULL or a named list, with values being 2-element vectors or NULL" = is.null(domain) ||
( # Check that `domain` is a list of length `length(ind_col_names)`
# where all values are named after `ind_col_names`
# and all values are `NULL` or a two-length atomic non-factor vector
rlang::is_list(domain, n = length(ind_col_names)) &&
identical(sort(names(domain)), sort(ind_col_names)) &&
all(vapply_lgl(
domain,
function(x) is.null(x) || (is.atomic(x) && !is.factor(x) && length(x) == 2L)
))
)
)

rl <- sapply(ind_col_names, \(ind_col_name) {
ind_col <- tbl_schema$GetFieldByName(ind_col_name)
ind_col_type <- ind_col$type
ind_col_type_name <- ind_col$type$name

# TODO: tiledbsoma-r does not accept the domain argument to SOMADataFrame::create, but should
# https://github.com/single-cell-data/TileDB-SOMA/issues/2967
ind_ext <- tdco$dim_tile(ind_col_name)

# Default 2048 mods to 0 for 8-bit types and 0 is an invalid extent
Expand All @@ -384,21 +396,59 @@ get_domain_and_extent_dataframe <- function(tbl_schema, ind_col_names,
ind_max_dom <- arrow_type_range(ind_col_type) - c(0, ind_ext)
}

ind_cur_dom <- ind_max_dom
requested_slot <- domain[[ind_col_name]]
ind_cur_dom <- if (is.null(requested_slot)) {
ind_max_dom
} else {
requested_slot
}
# Core supports no domain specification for variable-length dims, which
# includes string/binary dims.
if (ind_col_type_name %in% c("string", "large_utf8", "utf8")) ind_ext <- NA

# https://github.com/single-cell-data/TileDB-SOMA/issues/2407
if (.new_shape_feature_flag_is_enabled()) {
if (ind_col_type_name %in% c("string", "utf8", "large_utf8")) {
aa <- arrow::arrow_array(c("", "", "", "", ""), ind_col_type)
aa <- if (is.null(requested_slot)) {
arrow::arrow_array(c("", "", "", "", ""), ind_col_type)
} else {
arrow::arrow_array(c("", "", "", requested_slot[[1]], requested_slot[[2]]), ind_col_type)
}
} else {
# If they wanted (0, 99) then extent must be at most 100.
# This is tricky though. Some cases:
# * lo = 0, hi = 99, extent = 1000
# We look at hi - lo + 1; resize extent down to 100
# * lo = 1000, hi = 1099, extent = 1000
# We look at hi - lo + 1; resize extent down to 100
# * lo = min for datatype, hi = max for datatype
# We get integer overflow trying to compute hi - lo + 1
# So if lo <= 0 and hi >= ind_ext, this is fine without
# computing hi - lo + 1.
lo <- ind_max_dom[[1]]
hi <- ind_max_dom[[2]]
if (lo > 0 || hi < ind_ext) {
dom_span <- hi - lo + 1
if (ind_ext > dom_span) {
ind_ext <- dom_span
}
}
aa <- arrow::arrow_array(c(ind_max_dom, ind_ext, ind_cur_dom), ind_col_type)
}
} else {
if (ind_col_type_name %in% c("string", "utf8", "large_utf8")) {
aa <- arrow::arrow_array(c("", "", ""), ind_col_type)
} else {
aa <- arrow::arrow_array(c(ind_max_dom, ind_ext), ind_col_type)
# Same comments as above
lo <- ind_cur_dom[[1]]
hi <- ind_cur_dom[[2]]
if (lo > 0 || hi < ind_ext) {
dom_span <- hi - lo + 1
if (ind_ext > dom_span) {
ind_ext <- dom_span
}
}
aa <- arrow::arrow_array(c(ind_cur_dom, ind_ext), ind_col_type)
}
}

Expand Down
10 changes: 10 additions & 0 deletions apis/r/man/SOMADataFrame.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 10 additions & 0 deletions apis/r/man/SOMADataFrameCreate.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

35 changes: 26 additions & 9 deletions apis/r/src/rutilities.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,9 @@
#define TILEDB_NO_API_DEPRECATION_WARNINGS
#endif

#include <Rcpp.h> // for R interface to C++
#include <nanoarrow/r.h> // for C interface to Arrow (via R package)
#include <nanoarrow/nanoarrow.h> // for C interface to Arrow
#include <RcppInt64> // for fromInteger64
#include <tiledbsoma/tiledbsoma>
#include <Rcpp.h> // for R interface to C++
#include <nanoarrow/nanoarrow.h> // for C interface to Arrow
#include <nanoarrow/r.h> // for C interface to Arrow (via R package)
#include <tiledbsoma/reindexer/reindexer.h>
#include <RcppInt64> // for fromInteger64
#include <tiledbsoma/tiledbsoma>
Expand Down Expand Up @@ -471,10 +469,29 @@ SEXP convert_domainish(const tdbs::ArrowTable& arrow_table) {
"Bad array children alloc");

for (size_t i = 0; i < ncol; i++) {
spdl::info(
"[domainish] name {} length {}",
std::string(arrow_schema->children[i]->name),
arrow_array->children[i]->length);
if (arrow_array->children[i]->n_buffers == 3) {
// Arrow semantics: variable-length: buffers 0,1,2 are validity,
// offsets, data
std::vector<std::string>
lohi = tiledbsoma::ArrowAdapter::get_array_string_column(
arrow_array->children[i], arrow_schema->children[i]);
spdl::info(
"[domainish] name {} format {} length {} lo {} hi {}",
std::string(arrow_schema->children[i]->name),
std::string(arrow_schema->children[i]->format),
arrow_array->children[i]->length,
lohi[0],
lohi[1]);
} else {
// Arrow semantics: non-variable-length: buffers 0,1 are validity &
// data
spdl::info(
"[domainish] name {} format {} length {}",
std::string(arrow_schema->children[i]->name),
std::string(arrow_schema->children[i]->format),
arrow_array->children[i]->length);
}

ArrowArrayMove(arrow_array->children[i], arr->children[i]);
ArrowSchemaMove(arrow_schema->children[i], sch->children[i]);
}
Expand Down
Loading

0 comments on commit 8885e31

Please sign in to comment.