-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[c++] Geometry Dataframe #3212
base: main
Are you sure you want to change the base?
[c++] Geometry Dataframe #3212
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3212 +/- ##
==========================================
+ Coverage 83.86% 83.97% +0.10%
==========================================
Files 51 51
Lines 5505 5505
==========================================
+ Hits 4617 4623 +6
+ Misses 888 882 -6
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That parts of this I understand look good to me, but I'm not familiar enough with the arrow adapter code to have a good sense of what impact the change might have there.
if (type_metadata.compare("WKB") == 0) { | ||
type = TILEDB_GEOM_WKB; | ||
} else { | ||
throw std::runtime_error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use fmt::format
with {}
as elsewhere in this file
return {attr, enmr}; | ||
} | ||
|
||
std::pair<Dimension, bool> ArrowAdapter::tiledb_dimension_from_arrow_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd prefer you move the body of this dimension function above the body of the corresponding attribute funtion
* @return std::pair<Dimension, bool> The TileDB dimension with a boolean | ||
* flag indicating whether or not the dimension uses `current domain`. | ||
*/ | ||
static std::pair<Dimension, bool> tiledb_dimension_from_arrow_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd prefer you move the declaration of this dimension function above the corresponding attribute function
static void create( | ||
std::string_view uri, | ||
std::unique_ptr<ArrowSchema> schema, | ||
ArrowTable index_columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see no reason not to use const references here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agrre on const reference here but will require refactoring all other classes which tiledb_schema_from_arrow_schema
. Should be done to a follow-up PR maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't need to be done everywhere -- it can be done in places you add
we already have const references in some places not others -- you can have this PR be part of the work that doesn't need to be redone later
it can also be done all at once in a follow-on, of course
* @param mode read or write | ||
* @param ctx SOMAContext | ||
* @param column_names A list of column names to use as user-defined index | ||
* columns (e.g., ``['cell_type', 'tissue_type']``). All named columns must |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these example column names legit for geometry dataframes?
ctx->tiledb_ctx(), | ||
std::move(schema), | ||
ArrowTable( | ||
std::move(index_columns.first), std::move(index_columns.second)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe you can avoid this by having this constructor take the ArrowTable
parameters as const references
SOMAGeometryDataFrame::create( | ||
uri, | ||
std::move(schema), | ||
ArrowTable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use const references to avoid the moves
REQUIRE(soma_object->type() == "SOMAGeometryDataFrame"); | ||
soma_object->close(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like we could use some tests which write & read data -- is this planned for a future PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is #3222 which adds tests for read/write.
if (type_metadata.compare("WKB") != 0) { | ||
throw std::runtime_error( | ||
std::string( | ||
"Unkwown type metadata for `soma_geometry`. " | ||
"Expected 'WKB', found ") + | ||
std::string(type_metadata.data())); | ||
} | ||
|
||
FilterList filter_list = ArrowAdapter::_create_dim_filter_list( | ||
child->name, platform_config, soma_type, ctx); | ||
for (int64_t j = 0; | ||
j < spatial_column_info.second->n_children; | ||
++j) { | ||
auto min_dim = tiledb_dimension_from_arrow_schema( | ||
ctx, | ||
spatial_column_info.second->children[j], | ||
spatial_column_info.first->children[j], | ||
soma_type, | ||
type_metadata, | ||
"tiledb__internal__", | ||
"__min", | ||
platform_config); | ||
|
||
auto max_dim = tiledb_dimension_from_arrow_schema( | ||
ctx, | ||
spatial_column_info.second->children[j], | ||
spatial_column_info.first->children[j], | ||
soma_type, | ||
type_metadata, | ||
"tiledb__internal__", | ||
"__max", | ||
platform_config); | ||
|
||
dims.insert({min_dim.first.name(), min_dim.first}); | ||
dims.insert({max_dim.first.name(), max_dim.first}); | ||
|
||
use_current_domain &= min_dim.second; | ||
use_current_domain &= max_dim.second; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, I think this can be pulled out into a separate _set_spatial_dimension
(or better name) helper function.
} | ||
if (strcmp(child->name, index_column_schema->children[i]->name) == | ||
0) { | ||
if (strcmp(child->name, "soma_geometry") == 0 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should const std::string
some of these in utils/common.h
like "soma_geometry", "tiledb__internal__", "__min", and "__max"
02bfa69
to
ead7104
Compare
This PR adds the new
SOMAGeometryDataFrame
object. This new object requires spatial axes to be determined and uses them internally to index geometry objects stored as a WKB encoded attribute. To differentiate between binary attributes and WKB binary attributes the Arrow schema requires adtype
metadata key with the valueWKB
.