
[BACKEND] Record Format Update #70

Merged
merged 19 commits into from
Jun 13, 2019

Conversation

rlizzo
Member

@rlizzo rlizzo commented May 30, 2019

Motivation and Context

Why is this change required? What problem does it solve?:

This PR is a rework of the backend record structure in which hangar references are stored. Prior to this, using different backends to store data on disk in the same repository was not supported. While it was always possible for a single repository to use different backends on different machines, a method to dynamically parse, assign, and switch backends was strongly needed (without having to rely on a server).

Though this is a large change, a dynamic record parser is very much required if we are to fulfill some of our core goals: intelligent backend selection, threaded data loaders, and partial data clones.

This is an active WIP, and will likely be a large change which will break backwards compatibility with the 0.1 release.

Description

Describe your changes in detail:

A new backends module is introduced which contains all functions related to:

  • parsing data location specifications
  • dynamic dispatch logic
  • backend methods which deal with disk IO

A new prefix has been added to the hashenv (Data Hash db) values:

  • a two-digit code corresponds to the choice of backend.
    • Codes in the range [00 : 49] point to some backend specification for handling data on the local disk.
    • Codes in the range [50 : 99] point to some remote backend specification.
    • Once a code is assigned to a backend specification, I intend for it to never be changed in the future. Any updates or improvements which require changing the record structure in non-backwards-compatible ways will require a new code assignment and a new parser/dispatch implementation.
  • The data following the backend code prefix (XX:) in the Data Hash db values is formatted and parsed in whatever way the backend implementation finds convenient; it does not need to be understood by any outside observer. That said, it should not be considered private in any way: the values WILL be viewed by other parts of the application.
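The prefix scheme above can be sketched in a few lines. This is an illustrative mock-up, not Hangar's actual parser; the function name and return shape are my own invention, but the "XX:payload" layout and the [00 : 49] local / [50 : 99] remote split follow the description above.

```python
def parse_hash_value(raw: str):
    """Split a Data Hash db value into (backend code, opaque payload, is_local).

    Hypothetical sketch of the 'XX:payload' layout described above.
    """
    code, sep, payload = raw.partition(':')
    if not sep or len(code) != 2 or not code.isdigit():
        raise ValueError(f'malformed backend code prefix: {raw!r}')
    is_local = int(code) <= 49  # codes 00-49 are local, 50-99 are remote
    return code, payload, is_local
```

For example, `parse_hash_value('00:some/opaque/spec')` would classify the record as local and hand the opaque payload back for the `hdf5_00` backend to interpret however it likes.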

I'm working on a common set of interfaces which should be defined for each backend. More to follow both here and in the docs as I move towards a cleaner implementation.

For this PR, I have reserved five codes for the following backends:

    CODE_BACKEND_MAP = {
        # LOCALS -> [0:50]
        '00': 'hdf5_00',
        '01': 'numpy_00',
        '02': 'tiledb_00',
        # REMOTES -> [50:100]
        '50': 'remote_00',
        '51': 'tiledb_01',
    }
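To make the dispatch idea concrete, here is a minimal sketch of how a registry keyed by those codes might route a read. The reader class and `fetch` function are hypothetical stand-ins; only `CODE_BACKEND_MAP` comes from the PR itself.

```python
CODE_BACKEND_MAP = {
    # LOCALS -> [0:50]
    '00': 'hdf5_00',
    '01': 'numpy_00',
    '02': 'tiledb_00',
    # REMOTES -> [50:100]
    '50': 'remote_00',
    '51': 'tiledb_01',
}

class DummyReader:
    """Stand-in for a real backend accessor; the interface is illustrative."""
    def __init__(self, name: str):
        self.name = name

    def read(self, spec: str) -> str:
        return f'{self.name} read {spec}'

# One reader instance per assigned backend code.
READERS = {code: DummyReader(name) for code, name in CODE_BACKEND_MAP.items()}

def fetch(raw_value: str) -> str:
    """Dispatch a Data Hash db value to the backend its prefix names."""
    code, spec = raw_value.split(':', 1)
    return READERS[code].read(spec)
```

The point of the indirection is that callers never interpret the payload; they only look at the two-digit code and hand the rest to whichever backend claimed it.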

Right now the local hdf5 and local numpy backends work as intended; it's a bit rough around the edges, but this is already showing promise.

Screenshots (if appropriate):

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Is this PR ready for review, or a work in progress?

  • Ready for review
  • Work in progress

How Has This Been Tested?

Put an x in the boxes that apply:

  • Current tests cover modifications made
  • New tests have been added to the test suite
  • Modifications were made to existing tests to support these changes
  • Tests may be needed, but they are not included when the PR was proposed
  • I don't know. Help!

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have signed (or will sign when prompted) the tensorwerk CLA.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@rlizzo rlizzo added enhancement New feature or request WIP Don't merge; Work in Progress labels May 30, 2019
@rlizzo rlizzo added this to the V0.2.0 Release milestone May 30, 2019
@rlizzo rlizzo self-assigned this May 30, 2019
@rlizzo
Copy link
Member Author

rlizzo commented May 30, 2019

@hhsecond and @lantiga, I'd appreciate any time you have to look over the changes at a thousand-foot view and provide any comments or feedback before this progresses much further.

@rlizzo rlizzo force-pushed the record-update branch 2 times, most recently from 57c624d to 6a02838 Compare June 1, 2019 12:13
@rlizzo rlizzo added the Awaiting Review Author has determined PR changes are nearly complete and ready for formal review. label Jun 2, 2019
@hhsecond hhsecond mentioned this pull request Jun 3, 2019
17 tasks
@rlizzo
Copy link
Member Author

rlizzo commented Jun 3, 2019

Hi @hhsecond, before I leave for holiday for a week I wanted to say that I would appreciate any work and review you could provide. Please check the module-level docstrings in `src/backends/selection.py` for a brief overview of the theory. All current tests pass with the PR as it sits right now, but the remote client/server implementations are horribly broken because I threw them together quickly without using standard user-facing API calls.

Please don't merge this into master before I get back, but if you have the time to do some work, you can create a dev branch, merge changes into that, and submit PRs to that branch.

If you can make a contribution, please focus on the remote client/server calls. The only calls that will need to be updated are the "PushData" and "FetchData" calls on both the client and server side. Those may need to be fully rewritten, and I'm sorry it's such a mess of logic, so don't worry if it's not feasible for you to reimplement.

To get started, read the proto definition (though you shouldn't need to modify that or the generated code). Then read the top-level (user-facing) fetch and push methods to understand the flow of commands and how the negotiation occurs. Then dig into the methods themselves.

The big change will switch calls into Hdf5_FileHandles over to the backend selection module, specifying remote_operation=True (which places file symlinks into a different directory on disk so we don't modify the staging area during a fetch, and so we can resume incomplete downloads). Just use the 00 or 01 backend right now; don't worry about the 50 one, since it is a dummy method to say "we know some data exists with this hash, but it's not on the local disk and it's not at a URL we know about, so ask the server for how to get it". That will be for partial clones, which I need to mock up separately.

@rlizzo
Copy link
Member Author

rlizzo commented Jun 3, 2019

If you have questions, please coordinate here with @lantiga or tag me directly so I can reply when I check emails once a day.

Thanks!

Member

@hhsecond hhsecond left a comment


It took some time but it was fun :). Apart from a few things I wasn't certain about in the way HDF5 works and very minor tweaks required in the code, it looks good to me.

src/hangar/backends/hdf5.py
if not remote_operation:
    if not os.path.isdir(self.STOREDIR):
        return
    store_uids = [psplitext(x)[0] for x in os.listdir(self.STOREDIR) if x.endswith('.hdf5')]
Member


Can't we just loop over the files and do the if check and the key-value assignment in that same loop? Basically I was wondering whether the number of non-hdf5 files in this folder would make a huge difference in the performance. And the same applies everywhere we have similar loop.

Member Author


Technically yes, but doing so would require that we persist the .hdf5 file extension in every data lmdb record (to avoid the following os.path.splitext function call). However, this would waste disk space (compounded for every array record stored in the database via that backend).

It's not an issue, since the only files which should ever be placed in this directory are the .hdf5 files created by hangar. We just need the checks in order to exclude OS-created "." (dot) files/directories such as MacOSX resource forks. If the user modifies the contents of this directory, the local copy of that repo would just be crazily screwed up (think what would happen if you just went about deleting or adding files in a .git directory).
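The filtering discussed above is easy to see in isolation. This sketch mirrors the quoted list comprehension with a hypothetical helper name, showing how the `.endswith('.hdf5')` check drops OS-created dot files while stripping the extension from real store files:

```python
import os.path

def store_uids(filenames):
    """Keep only hangar-created .hdf5 files and strip their extensions.

    Illustrative stand-in for the list comprehension quoted above;
    dot files like .DS_Store simply fail the endswith check.
    """
    return [os.path.splitext(f)[0] for f in filenames if f.endswith('.hdf5')]
```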

src/hangar/backends/selection.py
@rlizzo rlizzo changed the base branch from master to dev June 13, 2019 07:00
@rlizzo
Copy link
Member Author

rlizzo commented Jun 13, 2019

Ok, thanks for the review @hhsecond. I'm going to merge this into the dev branch until I can fix the remote implementation and get it working.

@rlizzo rlizzo merged commit cd6b568 into tensorwerk:dev Jun 13, 2019
@rlizzo rlizzo deleted the record-update branch July 11, 2019 19:15
Labels
Awaiting Review Author has determined PR changes are nearly complete and ready for formal review. enhancement New feature or request WIP Don't merge; Work in Progress