
Dynamic Arrayset Backend Update and Manual Configuration of Filter Options #133

Merged: 8 commits into tensorwerk:master on Oct 16, 2019

Conversation

rlizzo (Member) commented on Oct 10, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

  • Ability for backend filter opts to be set at arrayset init time
  • Added user-facing calls which allow the backend and opts to be changed at any time in the future, without rewriting all the data contained in an arrayset

Description

Describe your changes in detail:

print('set backend opts at init time')

opts = {'shuffle': True,
        'complib': 'blosc:lz4',
        'complevel': 3,
        'fletcher32': True}

dset_trimgs = co.arraysets.init_arrayset('train_images', prototype=sample_trimg, backend='00', backend_opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[0] = trimgs[0]

print('change to numpy backend with default opts')
dset_trimgs.change_default_backend('10')

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[1] = trimgs[1]

print('change to hdf5 again with different opts')

opts = {'shuffle': False,
        'complib': 'blosc:zstd',
        'complevel': 7,
        'fletcher32': True}
dset_trimgs.change_default_backend('00', opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[2] = trimgs[2]
print('records are independent files as expected')
pp(dset_trimgs._sspecs)
set backend opts at init time
00
{'shuffle': True, 'complib': 'blosc:lz4', 'complevel': 3, 'fletcher32': True}
change to numpy backend with default opts
10
{}
change to hdf5 again with different opts
00
{'shuffle': False, 'complib': 'blosc:zstd', 'complevel': 7, 'fletcher32': True}
records are independent files as expected
{0: HDF5_00_DataHashSpec(backend='00', uid='n0nyjH', dataset='0', dataset_idx=0, shape=(784,)),
 1: NUMPY_10_DataHashSpec(backend='10', uid='2kUaI3', checksum='1257994616', collection_idx=0, shape=(784,)),
 2: HDF5_00_DataHashSpec(backend='00', uid='vuA0bJ', dataset='0', dataset_idx=0, shape=(784,))}

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they were not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

rlizzo added the enhancement (New feature or request) and WIP (Don't merge; Work in Progress) labels on Oct 10, 2019
rlizzo self-assigned this on Oct 10, 2019
codecov bot commented on Oct 10, 2019

Codecov Report

Merging #133 into master will increase coverage by 0.45%.
The diff coverage is 90.98%.

@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   92.81%   93.26%   +0.45%     
==========================================
  Files          61       62       +1     
  Lines       10608    10923     +315     
  Branches     1041     1059      +18     
==========================================
+ Hits         9845    10187     +342     
+ Misses        555      518      -37     
- Partials      208      218      +10
Impacted Files Coverage Δ
src/hangar/records/parsing.py 98.51% <ø> (ø) ⬆️
tests/test_cli.py 100% <ø> (ø) ⬆️
src/hangar/dataloaders/tfloader.py 75% <ø> (+6.82%) ⬆️
src/hangar/__init__.py 100% <ø> (+33.33%) ⬆️
...gar/remote/request_header_validator_interceptor.py 100% <ø> (+37.78%) ⬆️
src/hangar/dataloaders/torchloader.py 100% <ø> (+5.41%) ⬆️
tests/test_initiate.py 100% <ø> (ø) ⬆️
tests/test_cli_io.py 100% <ø> (ø) ⬆️
tests/test_arrayset_backends.py 100% <100%> (ø)
tests/test_remotes.py 98.51% <100%> (ø) ⬆️
... and 22 more

rlizzo (Member, Author) commented on Oct 11, 2019

@elistevens, care to take a look at this and let me know what you think? I've got a local branch with tests, but before finalizing that I want to make sure the API and utility are what we would expect.

@elistevens

The API is functional, but it doesn't strike me as friendly. A couple of things that would make it more user-friendly:

  • Allow human-readable backend names instead of keys. 'hdf5' instead of '00' etc.
  • Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.
  • Consolidate the backend= and backend_opts= params into something like this, such that the cross product of the below backend values and the call sites work:
backend = '00'
backend = 'hdf5'
backend = {
    'name': 'hdf5',
    'complib': 'blosc:lz4',
    'complevel': 3,
    'fletcher32': True,
}

co.arraysets.init_arrayset(
    'train_images', 
    prototype=sample_trimg, 
    backend=backend)

dset_trimgs.change_default_backend(backend)
dset_trimgs.change_default_backend(**backend)

I'm not usually a fan of arguments that can be different types, but I think it makes sense here.

rlizzo (Member, Author) commented on Oct 15, 2019

OK, this makes sense. I'll look at it from the ground up before attempting to merge.

rlizzo (Member, Author) commented on Oct 16, 2019

So I've taken another pass at this.

  • Consolidate the backend= and backend_opts= params into something like this, such that the cross product of the below backend values and the call sites work:

Done! The format I've gone with is:

# for backend specification with default options
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='00')

# for backend specification with manually set options
opts = {
    'backend': '00',
    'shuffle': 'byte',
    'complib': 'blosc:zstd',
    'complevel': 5,
    'fletcher32': True,
}
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts=opts)

# required opts depend on the backend; the `numpy` backend requires none, so
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='10')
# is equivalent to
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts={'backend': '10'})
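
As a rough sketch only (not Hangar's actual implementation), the consolidated argument could be split back into a format code plus filter options like this, assuming just the two accepted forms shown above:

# Hypothetical helper, for illustration only: split a consolidated
# `backend_opts` value into (format code, filter options).
def split_backend_opts(backend_opts):
    if isinstance(backend_opts, str):
        # format code only -> fall back to that backend's default options
        return backend_opts, {}
    if isinstance(backend_opts, dict):
        opts = dict(backend_opts)      # copy so the caller's dict is untouched
        code = opts.pop('backend')     # remaining keys are backend filter options
        return code, opts
    raise TypeError(f'unexpected backend_opts value: {backend_opts!r}')

split_backend_opts('10')                               # -> ('10', {})
split_backend_opts({'backend': '00', 'complevel': 5})  # -> ('00', {'complevel': 5})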
  • Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.

I'd like to hold off on this one for now because:

  1. This would have to go on the client side, and we don't have a good method to handle configuration for repo parameters yet (literally all we have is a single .ini file to read user_name and user_email)
  2. In my view, it's actually not that necessary. The opts used to set up the storage backend are saved as part of the arrayset's schema (version controlled in the repository). To Hangar, these options are as much a part of the definition of an arrayset as its shape or dtype. Their record is permanently persisted and is available on either a read or write-enabled checkout. To create a different arrayset with the same configuration, just access the arrayset's .backend_opts property and feed it into the init_arrayset() function (a short sketch follows below). This persists across a network as well: if you push to a Hangar server and I pull that arrayset from it, I'll automatically put the data in the same backend you did (with the same options).

I think it's a nice quality of life addition, but we need a real configuration management solution on the client before going down this road. How much of a pain will this be for you? (I'm probably being quite myopic here...)
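
For concreteness, a minimal sketch of that reuse pattern, following the naming used in the demo in the PR description (the 'val_images' name and sample_valimg prototype are illustrative placeholders):

# Hypothetical example: reuse the persisted backend configuration of an
# existing arrayset when creating a new one. `co` and `dset_trimgs` follow
# the PR description above; `sample_valimg` is a placeholder prototype array.
opts = dset_trimgs.backend_opts   # recorded in the arrayset's schema
dset_vaimgs = co.arraysets.init_arrayset(
    'val_images', prototype=sample_valimg, backend_opts=opts)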

  • Allow human-readable backend names instead of keys. 'hdf5' instead of '00' etc.

TBD. The format codes are the way everything is organized in the backend; the codes are needed to remain unambiguous about which backend a sample record spec belongs to (there's a 1000% chance that there will be more than the hdf5-based, and probably numpy/tdb-based, backends in the future...). We chose not to go with names to avoid "human" problems in the first place.

I know it's a trivial mapping to make, but it's not necessary right now, and it can be added in the future when there are more than 2-3 backends.
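
For reference, the trivial mapping mentioned above would look roughly like the sketch below. The codes come from the record specs shown earlier in this PR ('00' is the HDF5_00 backend, '10' is NUMPY_10); the name strings and the helper itself are purely illustrative:

# Hypothetical sketch only: map human-readable names to format codes,
# while still accepting raw codes unchanged.
BACKEND_NAME_TO_CODE = {
    'hdf5': '00',
    'numpy': '10',
}

def resolve_backend(backend):
    # Accept either a human-readable name or a raw format code.
    return BACKEND_NAME_TO_CODE.get(backend, backend)

With something like this in place, a call such as change_default_backend('hdf5') could resolve to '00' internally, while the stored record specs keep using the unambiguous codes.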

rlizzo added the Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review) label and removed the WIP (Don't merge; Work in Progress) label on Oct 16, 2019
rlizzo requested a review from hhsecond on October 16, 2019 at 11:30
rlizzo merged commit e7d2e12 into tensorwerk:master on Oct 16, 2019