
[BACKEND] Record Format Update #70

Merged
merged 19 commits into from
Jun 13, 2019

Conversation

rlizzo
Member

@rlizzo rlizzo commented May 30, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

This PR is a rework of the backend record structure in which hangar references are stored. Prior to this, using different backends to store data on disk in the same repository was not supported. While it was always possible for a single repository to use different backends on different machines, a method to dynamically parse, assign, and switch backends was strongly needed (without having to rely on a server).

Though this is a large change, a dynamic record parser is very much required if we are to fulfill some of our core goals: intelligent backend selection, threaded data loaders, and partial data clones.

This is an active WIP, and will likely be a large change which will break backwards compatibility with the 0.1 release.

Description

Describe your changes in detail:

A new backends module is introduced which contains all functions related to:

  • parsing data location specifications
  • dynamic dispatch logic
  • backend methods which deal with disk IO

A new prefix has been added to the hashenv (Data Hash db) values:

  • a two-digit code corresponds to the choice of backend.
    • Codes in the range [00 : 49] point to some backend specification for handling data on the local disk.
    • Codes in the range [50 : 99] point to some remote backend specification.
    • Once a code is assigned to a backend specification, I intend for it to never be changed in the future. Any updates or improvements which require changing the record structure in non-backwards-compatible ways will require a new code assignment and a new parser/dispatch implementation.
  • The data following the backend code prefix (XX:) in the Data Hash db values is formatted and parsed in whatever way the backend implementation finds convenient; it does not need to be understood by any outside observer. That said, it should not be considered private in any way: the values WILL be viewed by other parts of the application.
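The prefix scheme above can be sketched in a few lines. This is an illustrative mock-up, not Hangar's actual parser; the function name and return shape are my own invention, but the "XX:payload" layout and the [00 : 49] local / [50 : 99] remote split follow the description above.

```python
def parse_hash_value(raw: str):
    """Split a Data Hash db value into (backend code, opaque payload, is_local).

    Hypothetical sketch of the 'XX:payload' layout described above.
    """
    code, sep, payload = raw.partition(':')
    if not sep or len(code) != 2 or not code.isdigit():
        raise ValueError(f'malformed backend code prefix: {raw!r}')
    is_local = int(code) <= 49  # codes 00-49 are local, 50-99 are remote
    return code, payload, is_local
```

For example, `parse_hash_value('00:some/opaque/spec')` would classify the record as local and hand the opaque payload back for the `hdf5_00` backend to interpret however it likes.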

I'm working on a common set of interfaces which should be defined for each backend. More to follow both here and in the docs as I move towards a cleaner implementation.

For this PR, I have reserved five codes for the following backends:

    CODE_BACKEND_MAP = {
        # LOCALS -> [0:50]
        '00': 'hdf5_00',
        '01': 'numpy_00',
        '02': 'tiledb_00',
        # REMOTES -> [50:100]
        '50': 'remote_00',
        '51': 'tiledb_01',
    }
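To make the dispatch idea concrete, here is a minimal sketch of how a registry keyed by those codes might route a read. The reader class and `fetch` function are hypothetical stand-ins; only `CODE_BACKEND_MAP` comes from the PR itself.

```python
CODE_BACKEND_MAP = {
    # LOCALS -> [0:50]
    '00': 'hdf5_00',
    '01': 'numpy_00',
    '02': 'tiledb_00',
    # REMOTES -> [50:100]
    '50': 'remote_00',
    '51': 'tiledb_01',
}

class DummyReader:
    """Stand-in for a real backend accessor; the interface is illustrative."""
    def __init__(self, name: str):
        self.name = name

    def read(self, spec: str) -> str:
        return f'{self.name} read {spec}'

# One reader instance per assigned backend code.
READERS = {code: DummyReader(name) for code, name in CODE_BACKEND_MAP.items()}

def fetch(raw_value: str) -> str:
    """Dispatch a Data Hash db value to the backend its prefix names."""
    code, spec = raw_value.split(':', 1)
    return READERS[code].read(spec)
```

The point of the indirection is that callers never interpret the payload; they only look at the two-digit code and hand the rest to whichever backend claimed it.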

Right now the local hdf5 and local numpy backends work as intended; it's a bit rough around the edges, but this is already showing promise.

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they are not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@rlizzo rlizzo added enhancement New feature or request WIP Don't merge; Work in Progress labels May 30, 2019
@rlizzo rlizzo added this to the V0.2.0 Release milestone May 30, 2019
@rlizzo rlizzo self-assigned this May 30, 2019
@rlizzo
Copy link
Member Author

rlizzo commented May 30, 2019

@hhsecond and @lantiga, I'd appreciate any time you have to look over the changes at a thousand-foot view and provide any comments or feedback before this progresses much further.

@rlizzo rlizzo force-pushed the record-update branch 2 times, most recently from 57c624d to 6a02838 Compare June 1, 2019 12:13
@rlizzo rlizzo added the Awaiting Review Author has determined PR changes are nearly complete and ready for formal review. label Jun 2, 2019
@hhsecond hhsecond mentioned this pull request Jun 3, 2019
17 tasks
@rlizzo
Copy link
Member Author

rlizzo commented Jun 3, 2019

Hi @hhsecond, before I leave for holiday for a week I wanted to say that I would appreciate any work and review you could provide. Please check the module-level docstrings in `src/backends/selection.py` for a brief overview of the theory. All current tests pass with the PR as it sits right now, but the remote client/server implementations are horribly broken because I threw them together quickly without using standard user-facing API calls.

Please don't merge this into master before I get back, but if you have the time to do some work, you can create a dev branch, merge changes into that, and submit PRs to that branch.

If you can make a contribution, please focus on the remote client/server calls. The only calls that will need to be updated are the "PushData" and "FetchData" calls on both the client and server side. Those may need to be fully rewritten, and I'm sorry it's such a mess of logic, so don't worry if it's not feasible for you to reimplement.

To get started, read the proto definition (though you shouldn't need to modify that or the generated code). Then read the top-level (user-facing) fetch and push methods to understand the flow of commands and how the negotiation occurs. Then dig into the methods themselves.

The big change will switch calls into Hdf5_FileHandles over to the backend selection module, specifying remote_operation=True (which places file symlinks into a different directory on disk so we don't modify the staging area during a fetch, and so we can resume incomplete downloads). Just use the 00 or 01 backend right now; don't worry about the 50 one, since it is a dummy method to say "we know some data exists with this hash, but it's not on the local disk and it's not at a URL we know about, so ask the server for how to get it". That will be for partial clones, which I need to mock up separately.

@rlizzo
Copy link
Member Author

rlizzo commented Jun 3, 2019

If you have questions, please coordinate here with @lantiga or tag me directly so I can reply when I check emails once a day.

Thanks!

Member

@hhsecond hhsecond left a comment


It took some time but it was fun :). Apart from a few things I wasn't certain about in the way HDF5 works and very minor tweaks required in the code, it looks good to me.

src/hangar/backends/hdf5.py
if not remote_operation:
    if not os.path.isdir(self.STOREDIR):
        return
    store_uids = [psplitext(x)[0] for x in os.listdir(self.STOREDIR) if x.endswith('.hdf5')]
Member


Can't we just loop over the files and do the if check and the key-value assignment in that same loop? Basically I was wondering whether the number of non-hdf5 files in this folder would make a huge difference in the performance. And the same applies everywhere we have similar loop.

Member Author


Technically yes, but doing so would require that we persist the .hdf5 file extension in every data lmdb record (to avoid the following os.path.splitext function call). However, this would waste disk space (compounded for every array record stored in the database via that backend).

It's not an issue, since the only files which should ever be placed in this directory are the .hdf5 files created by hangar. We just need the checks in order to exclude OS-created "." (dot) files/directories such as MacOSX resource forks. If the user modifies the contents of this directory, the local copy of that repo would just be crazily screwed up (think what would happen if you just went about deleting or adding files in a .git directory).
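The filtering discussed above is easy to see in isolation. This sketch mirrors the quoted list comprehension with a hypothetical helper name, showing how the `.endswith('.hdf5')` check drops OS-created dot files while stripping the extension from real store files:

```python
import os.path

def store_uids(filenames):
    """Keep only hangar-created .hdf5 files and strip their extensions.

    Illustrative stand-in for the list comprehension quoted above;
    dot files like .DS_Store simply fail the endswith check.
    """
    return [os.path.splitext(f)[0] for f in filenames if f.endswith('.hdf5')]
```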

src/hangar/backends/selection.py
@rlizzo rlizzo changed the base branch from master to dev June 13, 2019 07:00
@rlizzo
Copy link
Member Author

rlizzo commented Jun 13, 2019

Ok, thanks for the review @hhsecond. I'm going to merge this into the dev branch until I can fix the remote implementation and get it working.

@rlizzo rlizzo merged commit cd6b568 into tensorwerk:dev Jun 13, 2019
@rlizzo rlizzo deleted the record-update branch July 11, 2019 19:15
Labels
Awaiting Review Author has determined PR changes are nearly complete and ready for formal review. enhancement New feature or request WIP Don't merge; Work in Progress