Column API and DataType Containers #180

rlizzo · 2020-02-19T17:10:34Z

Motivation and Context

Why is this change required? What problem does it solve?:

This PR implements the new columns API, replacing the old arraysets and metadata terminology. A column acts as a highly modular container for potentially arbitrary data types. As a simple proof of concept, a string type column has been introduced, replacing the functionality of metadata in a much more useful and expandable way.

Description

Describe your changes in detail:

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Documentation update
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

Ready for review
Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

Current tests cover modifications made
New tests have been added to the test suite
Modifications were made to existing tests to support these changes
Tests may be needed, but they are not included when the PR was proposed
I don't know. Help!

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have signed (or will sign when prompted) the tensorwerk CLA.
I have added tests to cover my changes.
All new and existing tests passed.

Combined the reader and writer arrayset column methods so that the schema type (and specification), along with argument validators, can be dynamically mixed in. This supports the new "columns" API, where different column data types (previously 'arraysets' and 'metadata') are combined into a unified API, each operating on backends independently. Renamed columns/arrayset_nested.py -> columns/nested.py and columns/arrayset_flat.py -> columns/flat.py to mark this change in how we are thinking about the accessor methods. Tests are observed to pass.
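The idea of mixing a schema validator into a layout accessor can be sketched as follows. This is a minimal illustration under stated assumptions, not hangar's implementation: `FlatLayout`, `NestedLayout`, `StringSchema`, and `make_column_class` are hypothetical names, and plain dicts stand in for the storage backends.

```python
class FlatLayout:
    """Hypothetical accessor: a single level of sample name -> data."""

    def __init__(self):
        self._samples = {}

    def __setitem__(self, key, value):
        # validate() is supplied by whichever schema mixin is combined in.
        self._samples[key] = self.validate(value)

    def __getitem__(self, key):
        return self._samples[key]


class NestedLayout:
    """Hypothetical accessor: sample name -> {subsample name: data}."""

    def __init__(self):
        self._samples = {}

    def __setitem__(self, key, subsamples):
        self._samples[key] = {k: self.validate(v) for k, v in subsamples.items()}

    def __getitem__(self, key):
        return self._samples[key]


class StringSchema:
    """Hypothetical validator mixin that only accepts str values."""

    def validate(self, value):
        if not isinstance(value, str):
            raise TypeError(f"expected str, got {type(value).__name__}")
        return value


def make_column_class(layout_cls, schema_cls):
    """Build a column class at runtime from a layout and a schema mixin."""
    name = f"{layout_cls.__name__}{schema_cls.__name__}Column"
    return type(name, (schema_cls, layout_cls), {})


FlatStringColumn = make_column_class(FlatLayout, StringSchema)
NestedStringColumn = make_column_class(NestedLayout, StringSchema)

flat = FlatStringColumn()
flat["greeting"] = "hello"

nested = NestedStringColumn()
nested["sample_1"] = {"a": "foo", "b": "bar"}
```

Because the layout only calls `validate()`, a new data type would need only a new schema mixin rather than a new accessor class per layout.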

hhsecond

Looks great so far. I have added a few minor comments and will spend more time on it soon.

src/hangar/backends/__init__.py

src/hangar/typesystem/descriptors.py

rlizzo · 2020-02-20T17:29:28Z

@hhsecond any idea why this test is failing? I'm not sure what tensorflow wants here...

https://travis-ci.com/tensorwerk/hangar-py/jobs/289376731#L423-L506

codecov · 2020-02-20T17:32:16Z

Codecov Report

Merging #180 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #180   +/-   ##
=======================================
  Coverage   95.25%   95.25%           
=======================================
  Files          97       97           
  Lines       16175    16175           
  Branches     1547     1547           
=======================================
  Hits        15407    15407           
  Misses        525      525           
  Partials      243      243

src/hangar/dataloaders/common.py

hhsecond · 2020-02-24T10:37:46Z

> @hhsecond any idea why this test is failing? I'm not sure what tensorflow wants here...
>
> https://travis-ci.com/tensorwerk/hangar-py/jobs/289376731#L423-L506

@rlizzo Here is the culprit

hhsecond · 2020-02-24T11:50:03Z

src/hangar/columns/column.py

+        return column
+
+    @writer_checkout_only
+    def create_str_column(self,


@rlizzo @lantiga
IMO these names are verbose, and maybe we could try something else. A few ideas off the top of my head (which need polishing):

- co.columns.create(type='ndarray' / 'str', layout='flat' / 'nested'), where the argument names are arbitrary and could be anything else

- Constructors being nouns, like np.array -> co.column.flat, co.column.nested, co.column.str

elistevens · 2020-02-24

I think that verbosity is a feature here; customers need that verbosity when they're trying to learn the API.

I don't think that co.columns.create is particularly Pythonic, since the intermediate columns object is tied to and operating on the co, but presumably is also a dict-like thing. Calling create on a dict to manipulate the dict's parent object? That's weird. Also, flat has no meaning to me, though perhaps that will be fixed once the v0.5.0 docs are live.

hhsecond · 2020-02-26

When I said verbose, I meant that we are repeating information here. The API in this PR is co.columns.create_xxx_column, which repeats the word column; that would be obvious even if we used co.columns.create(). I am not particularly sentimental about the name create. It could be anything else, like add or update (like dict.update, thinking aloud here).

elistevens · 2020-02-26

Ah, I misunderstood, sorry.

I think the function should go on the checkout object; putting it on the columns pseudo-container seems odd.

rlizzo · 2020-03-04

Hey guys, as a follow-up for completeness here...

The methods have been renamed to add_ndarray_column() and add_str_column() to reflect what they actually do. @lantiga and I considered define_foo_column() as well (as suggested by @elistevens), but decided against it because, while it acts as a definition, the operation also creates a column ready for use. We wanted to make it clear that nothing else needs to be done in order to use the column that was "defined/initialized".

The method now lives on the checkout class (instead of checkout.columns). This is just better API design. For example:

    # assumes `repo` is an initialized hangar Repository and `import numpy as np`
    co = repo.checkout(write=True)
    col = co.add_ndarray_column('foo', shape=(10, 10), dtype=np.uint8)
    col[1] = np.zeros((10, 10), dtype=np.uint8)  # sample matches the column's shape and dtype
    co.commit('added column')
    co.close()

I'm going to be working on improving docstrings and tutorials shortly, but with the new "generalized" column forms, the options essentially break down into the following:

"layout": the method by which data in columns is named, indexed, and accessed. This can currently take on the values "flat" (a single-level mapping of sample name to data) OR "nested" (a sample name which contains subsample name/data pairs)

"type": indicates the format of data stored within the column container. Valid values are "ndarray" or "string". These are not accepted as keyword arguments, but are determined by the method name which is actually called to create the column (add_ndarray_column() for "ndarray" and add_str_column() for "string")

"variable_shape": bool indicating whether data length / size can vary (defaults and availability differ depending on the column "type"). For example, ndarray columns accept both True and False values, while string column types only accept True.

"backend" and "backend_options": specify specific backend and any filter / compression options to apply to the column. These are advanced arguments which are type & value checked / enforced based on the definition of column values described above. I won't get into the full details here...

Making this all clear is a high priority for me, and I'm going to need input on the explanations we eventually get around to writing. Let me know if you have any questions!
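The option rules described in this comment can be sketched as a small validator. This is an illustrative sketch only and not hangar's code: `ColumnOptions`, `ndarray_options`, and `str_options` are hypothetical names, and the "backend" / "backend_options" arguments are left out because their rules are not spelled out here.

```python
from dataclasses import dataclass

VALID_LAYOUTS = ("flat", "nested")
VALID_TYPES = ("ndarray", "string")


@dataclass(frozen=True)
class ColumnOptions:
    """Hypothetical container for the column options listed above."""

    column_type: str
    layout: str = "flat"
    variable_shape: bool = False

    def __post_init__(self):
        if self.column_type not in VALID_TYPES:
            raise ValueError(f"unknown column type: {self.column_type!r}")
        if self.layout not in VALID_LAYOUTS:
            raise ValueError(f"unknown layout: {self.layout!r}")
        # Per the description, string columns only accept variable_shape=True.
        if self.column_type == "string" and not self.variable_shape:
            raise ValueError("string columns only accept variable_shape=True")


def ndarray_options(layout="flat", variable_shape=False):
    """The type is fixed by which helper is called, not passed as a keyword."""
    return ColumnOptions("ndarray", layout, variable_shape)


def str_options(layout="flat"):
    """String columns are always variable sized."""
    return ColumnOptions("string", layout, True)


fixed_flat = ndarray_options()
ragged_nested = ndarray_options(layout="nested", variable_shape=True)
text_flat = str_options()
```

Tying the type to the helper that is called mirrors the way add_ndarray_column() and add_str_column() determine the column "type" by method name.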

docs/Tutorial-Dataloader.ipynb

docs/concepts.rst

src/hangar/mixins/datasetget.py

rlizzo added 20 commits February 4, 2020 06:50

initial addition of lmdb_30 backend. not integrated into cython or re…

b513764

…st of hangar backend module

integration of lmdb_30 backend into cython

0f3b79a

added metaclass constructors to flat and nested column classes. lots …

4086293

…of moving in preparation for pulling out column types

more updates and simplifications, moving on to column type spec

3a7fb12

progress on defining options for each backend

7bbfa29

starting to make progress on dynamic schema definition

ce68ff3

wip branch, unsure if changes will stick

7af721e

first initial signs of life that this might eventually work, able to …

dafdabc

…set up special (highly isolated) mock of hangar repo dirs / dbs and initialize the column class for writing to lmdb30 backend (and reading back out)

minor changes

1f2b4e0

major updates to new descriptor type system, this will work

a065021

this version actually works

3381c36

many changes

e774959

cythonized column parsers

8b2ba0d

more movement of parsing functions and cythonization

8faf3bb

cython updates

c66b110

initial work on tests

1cc45f7

more tests

5c3f2d5

many more tests

1a8ee69

many more tests fixed. need to address issue with diffing and merging…

344d081

… schema keys

rlizzo requested a review from hhsecond February 19, 2020 17:10

rlizzo self-assigned this Feb 19, 2020

rlizzo requested a review from lantiga February 19, 2020 20:15

rlizzo added the enhancement New feature or request label Feb 19, 2020

rlizzo added this to the v0.5.0 milestone Feb 19, 2020

rlizzo force-pushed the metadata-columns branch from 28bb2e3 to 9838855 Compare February 19, 2020 21:42

new test workflow

051fea3

rlizzo force-pushed the metadata-columns branch from 5b962e3 to 051fea3 Compare February 20, 2020 00:26

fixed issue with hdf5 backend blosc data buffer size

cf91f64

hhsecond reviewed Feb 20, 2020

src/hangar/backends/__init__.py

src/hangar/typesystem/descriptors.py

hhsecond reviewed Feb 24, 2020

src/hangar/dataloaders/common.py (outdated)

hhsecond reviewed Feb 24, 2020

rlizzo force-pushed the metadata-columns branch from 8ba7619 to ca3bc16 Compare February 25, 2020 11:03

hhsecond reviewed Feb 26, 2020

docs/Tutorial-Dataloader.ipynb (outdated)

hhsecond reviewed Feb 26, 2020

docs/concepts.rst

hhsecond reviewed Feb 26, 2020

docs/concepts.rst

hhsecond mentioned this pull request Feb 28, 2020

Singletonless tensorwerk/stockroom#3

Merged

hhsecond reviewed Feb 29, 2020

src/hangar/mixins/datasetget.py

rlizzo mentioned this pull request Mar 4, 2020

Commit Level Metadata #181

Open

rlizzo added 11 commits March 4, 2020 03:49

testing and usability updates

6b4bd4b

more tests fixed

b8cb39d

renamed create_xxx_column to define_xxx_column

82caea4

new tests for column permutations

a4be1c8

asv updates

fbd4b53

skipping push nbytes tests

0817f9e

rename co.column.define to co.add_xxx and fix up docs and tests

bd0825c

updated tutorial and added writer-lock (with force release) CLI option

31cc844

fixes for tests

cefdf1e

updates to tutorail and hangar utils functions

929febe

review fixes and tests added

320c759

rlizzo force-pushed the metadata-columns branch from f81bfed to 320c759 Compare March 4, 2020 08:49

updated changelog

d126d12

rlizzo changed the title ~~Column API and DataType Containers - WIP~~ Column API and DataType Containers Mar 4, 2020

rlizzo merged commit fc68f58 into tensorwerk:master Mar 4, 2020
