
Dynamic Arrayset Backend Update and Manual Configuration of Filter Options #133

Merged: 8 commits into tensorwerk:master on Oct 16, 2019

Conversation

rlizzo (Member) commented on Oct 10, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

  • Ability for backend filter opts to be set at arrayset init time
  • Added user-facing calls which allow the backend and opts to be changed at any time in the future, without rewriting all the data contained in an arrayset

Description

Describe your changes in detail:

print('set backend opts at init time')

opts = {'shuffle': True,
        'complib': 'blosc:lz4',
        'complevel': 3,
        'fletcher32': True}

dset_trimgs = co.arraysets.init_arrayset('train_images', prototype=sample_trimg, backend='00', backend_opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[0] = trimgs[0]

print('change to numpy backend with default opts')
dset_trimgs.change_default_backend('10')

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[1] = trimgs[1]

print('change to hdf5 again with different opts')

opts = {'shuffle': False,
        'complib': 'blosc:zstd',
        'complevel': 7,
        'fletcher32': True}
dset_trimgs.change_default_backend('00', opts=opts)

print(dset_trimgs.backend)
print(dset_trimgs.backend_opts)
dset_trimgs[2] = trimgs[2]
print('records are independent files as expected')
pp(dset_trimgs._sspecs)
set backend opts at init time
00
{'shuffle': True, 'complib': 'blosc:lz4', 'complevel': 3, 'fletcher32': True}
change to numpy backend with default opts
10
{}
change to hdf5 again with different opts
00
{'shuffle': False, 'complib': 'blosc:zstd', 'complevel': 7, 'fletcher32': True}
records are independent files as expected
{0: HDF5_00_DataHashSpec(backend='00', uid='n0nyjH', dataset='0', dataset_idx=0, shape=(784,)),
 1: NUMPY_10_DataHashSpec(backend='10', uid='2kUaI3', checksum='1257994616', collection_idx=0, shape=(784,)),
 2: HDF5_00_DataHashSpec(backend='00', uid='vuA0bJ', dataset='0', dataset_idx=0, shape=(784,))}

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Documentation update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they were not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

rlizzo added the enhancement (New feature or request) and WIP (Don't merge; Work in Progress) labels on Oct 10, 2019
rlizzo self-assigned this on Oct 10, 2019
codecov bot commented on Oct 10, 2019

Codecov Report

Merging #133 into master will increase coverage by 0.45%.
The diff coverage is 90.98%.

@@            Coverage Diff             @@
##           master     #133      +/-   ##
==========================================
+ Coverage   92.81%   93.26%   +0.45%     
==========================================
  Files          61       62       +1     
  Lines       10608    10923     +315     
  Branches     1041     1059      +18     
==========================================
+ Hits         9845    10187     +342     
+ Misses        555      518      -37     
- Partials      208      218      +10
Impacted Files Coverage Δ
src/hangar/records/parsing.py 98.51% <ø> (ø) ⬆️
tests/test_cli.py 100% <ø> (ø) ⬆️
src/hangar/dataloaders/tfloader.py 75% <ø> (+6.82%) ⬆️
src/hangar/__init__.py 100% <ø> (+33.33%) ⬆️
...gar/remote/request_header_validator_interceptor.py 100% <ø> (+37.78%) ⬆️
src/hangar/dataloaders/torchloader.py 100% <ø> (+5.41%) ⬆️
tests/test_initiate.py 100% <ø> (ø) ⬆️
tests/test_cli_io.py 100% <ø> (ø) ⬆️
tests/test_arrayset_backends.py 100% <100%> (ø)
tests/test_remotes.py 98.51% <100%> (ø) ⬆️
... and 22 more

rlizzo (Member, Author) commented on Oct 11, 2019

@elistevens, care to take a look at this and let me know what you think? I've got a local branch with tests, but before finalizing that I want to make sure the API and utility are what we would expect.

@elistevens

The API is functional, but it doesn't strike me as friendly. A couple of things that would make it more user-friendly:

  • Allow human-readable backend names instead of keys. 'hdf5' instead of '00' etc.
  • Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.
  • Consolidate the backend= and backend_opts= params into something like this, such that the cross product of the below backend values and the call sites work:
backend = '00'
backend = 'hdf5'
backend = {
    'name': 'hdf5',
    'complib': 'blosc:lz4',
    'complevel': 3,
    'fletcher32': True,
}

co.arraysets.init_arrayset(
    'train_images', 
    prototype=sample_trimg, 
    backend=backend)

dset_trimgs.change_default_backend(backend)
dset_trimgs.change_default_backend(**backend)

I'm not usually a fan of arguments that can be different types, but I think it makes sense here.

rlizzo (Member, Author) commented on Oct 15, 2019

OK, this makes sense. I'll look at it from the ground up before attempting to merge.

rlizzo (Member, Author) commented on Oct 16, 2019

So I've taken another pass at this.

  • Consolidate the backend= and backend_opts= params into something like this, such that the cross product of the below backend values and the call sites work:

Done! The format I've gone with is:

# for backend specification with default options
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='00')

# for backend specification with manually set options
opts = {
    'backend': '00',
    'shuffle': 'byte',
    'complib': 'blosc:zstd',
    'complevel': 5,
    'fletcher32': True,
}
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts=opts)

# required opts depend on the backend; the `numpy` backend requires none, so
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts='10')
# is equivalent to
co.arraysets.init_arrayset('foo', prototype=np.arange(10), backend_opts={'backend': '10'})
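
As a rough sketch only (not Hangar's actual implementation), the consolidated argument could be split back into a format code plus filter options like this, assuming just the two accepted forms shown above:

# Hypothetical helper, for illustration only: split a consolidated
# `backend_opts` value into (format code, filter options).
def split_backend_opts(backend_opts):
    if isinstance(backend_opts, str):
        # format code only -> fall back to that backend's default options
        return backend_opts, {}
    if isinstance(backend_opts, dict):
        opts = dict(backend_opts)      # copy so the caller's dict is untouched
        code = opts.pop('backend')     # remaining keys are backend filter options
        return code, opts
    raise TypeError(f'unexpected backend_opts value: {backend_opts!r}')

split_backend_opts('10')                               # -> ('10', {})
split_backend_opts({'backend': '00', 'complevel': 5})  # -> ('00', {'complevel': 5})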
  • Allow setting global default backend+opts, so I can do that once on a project-wide module import and then everything is using the same setup, unless that particular callsite needs to customize.

I'd like to hold off on this one for now because:

  1. This would have to go on the client side, and we don't have a good method to handle configuration for repo parameters yet (literally all we have is a single .ini file to read user_name and user_email)
  2. In my view, it's actually not that necessary. The opts used to set up the storage backend are saved as part of the arrayset's schema (version controlled in the repository). To Hangar, these options are as much a part of the definition of an arrayset as its shape or dtype. Their record is permanently persisted and is available on either a read or write-enabled checkout. To create a different arrayset with the same configuration, just access the arrayset's .backend_opts property and feed it into the init_arrayset() function (a short sketch follows below). This persists across a network as well: if you push to a Hangar server and I pull that arrayset from it, I'll automatically put the data in the same backend you did (with the same options).

I think it's a nice quality of life addition, but we need a real configuration management solution on the client before going down this road. How much of a pain will this be for you? (I'm probably being quite myopic here...)
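
For concreteness, a minimal sketch of that reuse pattern, following the naming used in the demo in the PR description (the 'val_images' name and sample_valimg prototype are illustrative placeholders):

# Hypothetical example: reuse the persisted backend configuration of an
# existing arrayset when creating a new one. `co` and `dset_trimgs` follow
# the PR description above; `sample_valimg` is a placeholder prototype array.
opts = dset_trimgs.backend_opts   # recorded in the arrayset's schema
dset_vaimgs = co.arraysets.init_arrayset(
    'val_images', prototype=sample_valimg, backend_opts=opts)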

  • Allow human-readable backend names instead of keys. 'hdf5' instead of '00' etc.

TBD. The format codes are the way everything is organized in the backend; the codes are needed to remain unambiguous about which backend a sample record spec belongs to (there's a 1000% chance that there will be more than the hdf5-based, and probably numpy/tdb-based, backends in the future...). We chose not to go with names to avoid "human" problems in the first place.

I know it's a trivial mapping to make, but it's not necessary right now, and it can be added in the future when there are more than 2-3 backends.
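
For reference, the trivial mapping mentioned above would look roughly like the sketch below. The codes come from the record specs shown earlier in this PR ('00' is the HDF5_00 backend, '10' is NUMPY_10); the name strings and the helper itself are purely illustrative:

# Hypothetical sketch only: map human-readable names to format codes,
# while still accepting raw codes unchanged.
BACKEND_NAME_TO_CODE = {
    'hdf5': '00',
    'numpy': '10',
}

def resolve_backend(backend):
    # Accept either a human-readable name or a raw format code.
    return BACKEND_NAME_TO_CODE.get(backend, backend)

With something like this in place, a call such as change_default_backend('hdf5') could resolve to '00' internally, while the stored record specs keep using the unambiguous codes.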

rlizzo added the Awaiting Review (Author has determined PR changes are nearly complete and ready for formal review) label and removed the WIP (Don't merge; Work in Progress) label on Oct 16, 2019
rlizzo requested a review from hhsecond on October 16, 2019 at 11:30
rlizzo merged commit e7d2e12 into tensorwerk:master on Oct 16, 2019