Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column API and DataType Containers #180

Merged
merged 34 commits into from
Mar 4, 2020
Merged

Conversation

rlizzo
Copy link
Member

@rlizzo rlizzo commented Feb 19, 2020

Motivation and Context

Why is this change required? What problem does it solve?:

This PR implements the new columns API, replacing the old arraysets and metadata terminology. The concept of a column is to act as a highly modular container for potentially arbitrary data types. As a simple proof of concept, a string type column has been introduced - replacing the functionality of metadata in a much more useful and expandable way.

Description

Describe your changes in detail:

More details will be posted soon.

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they are not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwork CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Combined reader and writer arrayset column methods in order to allow
schema type (and specification), along with argument validators, to be
dynamically mixed in. This is to support the new "columns" API where
different column data types (previously 'arraysets' and 'metadata') are
combined into a unified API, each operating on backends independently.

Renamed columns/arrayset_nested.py -> columns/nested.py and
columns/arrayset_flat.py to columns/flat.py to mark this change in how
we are thinking about the accessor methods.

Tests are observed to pass.
…of moving in preparation for pulling out column types
…set up special (highly isolated) mock of hangar repo dirs / dbs and initialize the column class for writing to lmdb30 backend (and reading back out)
@rlizzo rlizzo requested a review from hhsecond February 19, 2020 17:10
@rlizzo rlizzo self-assigned this Feb 19, 2020
@rlizzo rlizzo requested a review from lantiga February 19, 2020 20:15
@rlizzo rlizzo added the enhancement New feature or request label Feb 19, 2020
@rlizzo rlizzo added this to the v0.5.0 milestone Feb 19, 2020
Copy link
Member

@hhsecond hhsecond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far. I have added a few minor comments. I'll spend more time soon

src/hangar/backends/__init__.py Show resolved Hide resolved
src/hangar/typesystem/descriptors.py Show resolved Hide resolved
@rlizzo
Copy link
Member Author

rlizzo commented Feb 20, 2020

@hhsecond any idea why this test is failing? I'm not sure what tensorflow wants here...

https://travis-ci.com/tensorwerk/hangar-py/jobs/289376731#L423-L506

@codecov
Copy link

codecov bot commented Feb 20, 2020

Codecov Report

Merging #180 into master will not change coverage by %.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #180   +/-   ##
=======================================
  Coverage   95.25%   95.25%           
=======================================
  Files          97       97           
  Lines       16175    16175           
  Branches     1547     1547           
=======================================
  Hits        15407    15407           
  Misses        525      525           
  Partials      243      243           

@hhsecond
Copy link
Member

@hhsecond any idea why this test is failing? I'm not sure what tensorflow wants here...

https://travis-ci.com/tensorwerk/hangar-py/jobs/289376731#L423-L506

@rlizzo Here is the culprit

return column

@writer_checkout_only
def create_str_column(self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rlizzo @lantiga
IMO; these names are verbose and maybe we could try something else. Few ideas on top of my mind (which needs polishing) are

  • co.columns.create(type='ndarray' / 'str', layout='flat' / 'nested') where the argument names are arbitrary and could be anything else
  • Constructors being nouns like np.array -> co.column.flat, co.column.nested, col.column.str

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that verbosity is a feature here; customers need that verbosity when they're trying to learn the API.

I don't think that co.columns.create is particularly Pythonic, since the intermediate columns object is tied to and operating on the co, but presumably is also a dict-like thing. Calling create on a dict to manipulate the dict's parent object? That's weird. Also, flat has no meaning to me, though perhaps that will be fixed once the v0.5.0 docs are live.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I said verbose, I meant, we are repeating information here. The API in this PR is co.columns.create_xxx_column and by this, we are repeating the word column which would be obvious even if we use co.columns.create(). I am not particularly sentimental about the name create. It could be anything else, like add or update (like dict.update, thinking aloud here).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I misunderstood, sorry.

I think the function should go on the checkout object; putting it on the columns pseudo-container seems odd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey guys as a follow up for completeness here...

  1. The method has been renamed to add_ndarray_column() and add_string_column(). This is to reflect the actual workings of the method. Me and @lantiga considered define_foo_column() as well (as suggested by @elistevens) but decided against it because while acting as a definition, the operation also creates a column ready for use. We wanted to make it clear that nothing else needs to be done in order to use the column that was "defined/initialized"

  2. The method does now live on the checkout class (instead of checkout.columns). This is just better API design. ex:

co = repo.checkout(write=True)
col = co.add_ndarray_column('foo', shape=(10, 10), dtype=np.uint8)
col[1] = np.arange(10, dtype=np.uint8)
co.commit('added column')
co.close()
  1. I'm going to be working on improving docstrings and tutorials shortely, but with the new "generalized" column forms, the options essentially break down into the following:
  • "layout": the method by which data in columns are named/indexed/accessed by. This can currently take on the values of "flat" (for single-level story type mapping) OR "nested" (for a sample name which contains subsample name/data pairs)

  • "type": indicates the format of data stored within the column container. Valid values are "ndarray" or "string". These are not accepted as keyword arguments, but are determined by the method name which is actually called to create the column (add_ndarray_column() for "ndarray" and add_str_column() for "string")

  • "variable_shape": bool indicating if data length / size can be variable sized (will have different defaults and avalability depending on the column "type"). For example, ndarraycolumns accept bothTrueandFalsevalues, whilestringcolumn types only acceptTrue`.

  • "backend" and "backend_options": specify specific backend and any filter / compression options to apply to the column. These are advanced arguments which are type & value checked / enforced based on the definition of column values described above. I won't get into the full details here...

Making this all clear is a high priority for me, and i'm going to need input on the explanations we eventually get around to writing. Let me know if you have any questions!

@rlizzo rlizzo mentioned this pull request Mar 4, 2020
@rlizzo rlizzo changed the title Column API and DataType Containers - WIP Column API and DataType Containers Mar 4, 2020
@rlizzo rlizzo merged commit fc68f58 into tensorwerk:master Mar 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants