Convenience Methods #183

rlizzo · 2020-03-05T21:42:30Z

Motivation and Context

Why is this change required? What problem does it solve?:

Adds several convenience methods for general usage, and makes major changes to the checkout column data reader API so that it supports arbitrary column layouts.

Description

Describe your changes in detail:

Added log method to checkout instances
Added diff method to Repository class
Added CLI diff command

API Changes to Checkout `__getitem__()` and `get()` methods.

A checkout object can be thought of as a "dataset" ("dset") that maps a
view of samples across columns.

>>> dset = repo.checkout(branch='master')
>>>
# Get a column contained in the checkout.
>>> dset['foo']
ColumnDataReader
>>>
# Get a specific sample from ``'foo'`` (returns a single array)
>>> dset['foo', '1']
np.array([1])
>>>
# Get multiple samples from ``'foo'`` (returns a list of arrays, in order
# of input keys)
>>> dset[['foo', '1'], ['foo', '2'], ['foo', '324']]
[np.array([1]), np.array([2]), np.array([324])]
>>>
# Get sample from multiple columns, column/data returned is ordered
# in same manner as input of func.
>>> dset[['foo', '1'], ['bar', '1'], ['baz', '1']]
[np.array([1]), np.array([1, 1]), np.array([1, 1, 1])]
>>>
# Get multiple samples from multiple columns
>>> keys = [(col, str(samp)) for samp in range(2) for col in ['foo', 'bar']]
>>> keys
[('foo', '0'), ('bar', '0'), ('foo', '1'), ('bar', '1')]
>>> dset[keys]
[np.array([1]), np.array([1, 1]), np.array([2]), np.array([2, 2])]
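
The key dispatch shown above can be approximated in a few lines. This is a toy sketch, not Hangar's implementation: `ToyCheckout` and its dict-backed storage are hypothetical, and plain lists stand in for arrays. It only illustrates how a single `__getitem__` might tell apart a column name, one `(column, sample)` key, and a sequence of keys.

```python
class ToyCheckout:
    """Hypothetical stand-in for a checkout; not Hangar's implementation."""

    def __init__(self, columns):
        # columns: {column_name: {sample_name: data}}
        self._columns = columns

    def _get_one(self, key):
        column_name, sample_name = key
        return self._columns[column_name][sample_name]

    def __getitem__(self, index):
        # dset['foo'] -> a bare string selects a whole column
        if isinstance(index, str):
            return self._columns[index]
        # dset['foo', '1'] -> every member is a string, so this is one key
        if all(isinstance(member, str) for member in index):
            return self._get_one(index)
        # dset[['foo', '1'], ['foo', '2']] or dset[list_of_keys]:
        # results come back in the same order as the input keys
        return [self._get_one(key) for key in index]


dset = ToyCheckout({
    'foo': {'0': [1], '1': [2]},
    'bar': {'0': [1, 1], '1': [2, 2]},
})
keys = [(col, str(samp)) for samp in range(2) for col in ['foo', 'bar']]
single = dset['foo', '1']
many = dset[keys]
```

In this sketch a lone list such as `dset[['foo', '1']]` is treated as a single key, which is one possible way to resolve the ambiguity; the PR text does not say how the real method handles that case.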

Arbitrary column layouts are supported by simply adding additional members
to the keys for each piece of data. For example, getting data from a column
with a nested layout:

>>> dset['nested_col', 'sample_1', 'subsample_0']
np.array([1, 0])
>>>
# a sample accessor object can be retrieved at will...
>>> dset['nested_col', 'sample_1']
<class 'FlatSubsampleReader'>(column_name='nested_col', sample_name='sample_1')
>>>
# to get all subsamples in a nested sample use the Ellipsis operator
>>> dset['nested_col', 'sample_1', ...]
{'subsample_0': np.array([1, 0]),
 'subsample_1': np.array([1, 1]),
 ...
 'subsample_n': np.array([1, 255])}

Retrieval of data from different column types can be mixed and combined
as desired. For example, retrieving data from both flat and nested columns
simultaneously:

>>> dset[('nested_col', 'sample_1', '0'), ('foo', '0')]
[np.array([1, 0]), np.array([0])]
>>> dset[('nested_col', 'sample_1', ...), ('foo', '0')]
[{'subsample_0': np.array([1, 0]), 'subsample_1': np.array([1, 1])},
 np.array([0])]
>>> dset[('foo', '0'), ('nested_col', 'sample_1')]
[np.array([0]),
 <class 'FlatSubsampleReader'>(column_name='nested_col', sample_name='sample_1')]
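
One way to picture keys of varying length is as a walk through nested mappings, one key member per level. The sketch below assumes a plain nested `dict` as storage and a hypothetical `lookup` helper; neither is part of Hangar, and lists stand in for arrays. A key that stops early returns the intermediate level (here a dict, where the real API returns a reader object), and `...` returns everything below the current level.

```python
def lookup(columns, key):
    """Hypothetical helper: walk a nested dict one key member at a time."""
    node = columns
    for member in key:
        if member is Ellipsis:
            # `...` means "everything below this point"; return a shallow copy
            return dict(node)
        node = node[member]
    return node


columns = {
    'foo': {'0': [0]},
    'nested_col': {
        'sample_1': {'subsample_0': [1, 0], 'subsample_1': [1, 1]},
    },
}

one_subsample = lookup(columns, ('nested_col', 'sample_1', 'subsample_0'))
all_subsamples = lookup(columns, ('nested_col', 'sample_1', ...))
# Flat and nested keys can be mixed in one request; order follows the input.
mixed = [lookup(columns, key)
         for key in [('nested_col', 'sample_1', ...), ('foo', '0')]]
```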

If a column or data key does not exist, this method raises a KeyError.
Alternatively, missing keys can be handled gracefully by calling :meth:`get`
instead. By default, that method does not raise an error when a key is missing;
a (configurable) default value is inserted in its place.

>>> dset['foo', 'DOES_NOT_EXIST']
-------------------------------------------------------------------
KeyError                           Traceback (most recent call last)
<ipython-input-40-731e6ea62fb8> in <module>
----> 1 res = co['foo', 'DOES_NOT_EXIST']
KeyError: 'DOES_NOT_EXIST'
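
The difference between the two access styles can be sketched as follows. The PR text does not show the signature of `get()`, so the `get_many` helper and its `default` parameter are assumptions made purely for illustration, with a nested `dict` standing in for the checkout and lists standing in for arrays.

```python
def get_many(columns, keys, default=None):
    """Hypothetical helper: return data for each key, or `default` if missing."""
    results = []
    for column_name, sample_name in keys:
        try:
            results.append(columns[column_name][sample_name])
        except KeyError:
            # A missing column or sample does not abort the whole request;
            # the default takes its place so positions still line up.
            results.append(default)
    return results


columns = {'foo': {'0': [0], '1': [1]}}
keys = [('foo', '0'), ('foo', 'DOES_NOT_EXIST'), ('foo', '1')]

with_none = get_many(columns, keys)
with_marker = get_many(columns, keys, default='MISSING')
```

Keeping the placeholder in the same position as the missing key means the result list can still be zipped against the input keys.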

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Documentation update
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

Ready for review
Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

Current tests cover modifications made
New tests have been added to the test suite
Modifications were made to existing tests to support these changes
Tests may be needed, but they are not included when the PR was proposed
I don't know. Help!

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have signed (or will sign when prompted) the tensorwerk CLA.
I have added tests to cover my changes.
All new and existing tests passed.

codecov · 2020-03-05T21:53:57Z

Codecov Report

Merging #183 into master will decrease coverage by 0.19%.
The diff coverage is 91.72%.

@@            Coverage Diff             @@
##           master     #183      +/-   ##
==========================================
- Coverage   95.25%   95.06%   -0.19%     
==========================================
  Files          97       98       +1     
  Lines       16175    15954     -221     
  Branches     1547     1539       -8     
==========================================
- Hits        15407    15166     -241     
- Misses        525      537      +12     
- Partials      243      251       +8

Impacted Files	Coverage Δ
src/hangar/typesystem/__init__.py	`100% <ø> (ø)`	⬆️
src/hangar/columns/constructors.py	`90.59% <ø> (ø)`	⬆️
src/hangar/diff.py	`96.07% <0%> (-0.85%)`	⬇️
src/hangar/constants.py	`100% <100%> (ø)`	⬆️
src/hangar/columns/__init__.py	`100% <100%> (ø)`	⬆️
src/hangar/records/summarize.py	`93.94% <100%> (+0.61%)`	⬆️
src/hangar/columns/column.py	`100% <100%> (ø)`	⬆️
src/hangar/columns/common.py	`95.08% <100%> (ø)`	⬆️
tests/test_diff.py	`99.74% <100%> (+0.05%)`	⬆️
src/hangar/utils.py	`95.83% <100%> (+0.09%)`	⬆️
... and 13 more

rlizzo · 2020-03-06T09:08:24Z

@hhsecond please review.

rlizzo added 4 commits March 5, 2020 03:06

added log method to checkout objects, and diff method to repo class and CLI

aef49c6

using sane getitem method from columns with different layouts

9983d7d

updating tests

258a8a5

finished documentation and optimization of datasetget class mixin for now

3d3fd2d

rlizzo added the enhancement New feature or request label Mar 5, 2020

rlizzo added this to the v0.5.0 milestone Mar 5, 2020

rlizzo self-assigned this Mar 5, 2020

updated tests

e85678f

rlizzo added 2 commits March 6, 2020 02:43

fleshed out cython SizedDict class, fixed failing tests, and updated tutorails / changelog

8a0e148

added tests for sizedict

eda0230

rlizzo requested a review from hhsecond March 6, 2020 08:36

updated test cases

ac21972

rlizzo force-pushed the convenience-methods branch from 8823268 to ac21972 Compare March 7, 2020 08:16

rlizzo merged commit 27c66f7 into tensorwerk:master Mar 7, 2020
