-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BACKEND] Record Format Update #70
Conversation
57c624d
to
6a02838
Compare
Hi @hhsecond, before I leave for holiday for a week I wanted to say that I would appreciate any work and review you could provide. Please check the module level docstrings in `src/backends/selection py' for a brief overview of the theory. All current tests pass with the PR as it sits right now, but the remote client server implementation are horribly broken because I threw them together so quick without using standard user facing API calls. Please don't merge this into master before I get back, but I think if you ha e the time to do some work, you can create. Dev branch and merge changes into that and submit PRs to that branch. If you can make a contribution, please focus on the remote client/serve calls. I'm sorry it's such a mess... The only calls that will need to be updated are the "PushData" and "FetchData" calls on both the client and server side. Those may need to be fully rewritten, and I'm sorry it's such a mess of logic, so don't worry if it's not feasible for u to reimplement. To get started, read the proto definition (though you shouldn't need to modify that or the generated code). Then read the top level (user facing fetch and push methods to understand the flow of commands and how the negotiation occurs. Then dig in to the methods. The big change will switch calls into Hdf5_FileHandles into the backend selection module and specifying |
If you have questions, please coordinate here with @lantiga or tag me directly so I can reply when I check emails once a day. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took some time but it was fun :). Apart from a few things I wasn't certain about in the way HDF5 works and very minor tweaks required in the code, it looks good to me.
if not remote_operation: | ||
if not os.path.isdir(self.STOREDIR): | ||
return | ||
store_uids = [psplitext(x)[0] for x in os.listdir(self.STOREDIR) if x.endswith('.hdf5')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just loop over the files and do the if
check and the key-value assignment in that same loop? Basically I was wondering whether the number of non-hdf5 files in this folder would make a huge difference in the performance. And the same applies everywhere we have similar loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically yes, but doing so would require that we persist the .hdf5
file extension in every data lmdb record (to avoid the for the following os.path.splitext
function call) However, this would waste disk space (compounded for every array record stored in the database via that backend).
It's not an issue, since the only files which should ever be placed in this directory are the .hdf5
files created by hangar. We just need the checks in order to exclude OS created "."
(dot
) files/directories such as MacOSX resource forks... if the user modifies the contents of this directory, the local copy of that repo would just be crazily screwed up. (think what would happen if you just went about deleting or adding files to a .git
directory).
…a few bugs for variable sized in the np backend
…module which was useless
…t in the staging area
Ok. thanks for the review @hhsecond, I'm going to merge this into the |
Motivation and Context
Why is this change required? What problem does it solve?:
This PR is a rework of the backend record structure in which hangar references are stored. Prior to this, using different backends to store data on disk in the same repository was not supported. While it was always possible for a single repository to use different backends on different machines, a method to dynamically parse, assign, and switch backends was strongly needed (without having to rely on a server).
Though this is a large change, a dynamic record parser is very much required if we are to fulfill some of our core goals: intelligent backend selection, threaded data loaders, and partial data clones.
This is an active WIP, and will likely be a large change which will break backwards compatibility with the 0.1 release.
Description
Describe your changes in detail:
A new backends module is introduced which contains all functions related to:
A new prefix has been added to the the
hashenv
(Data Hash db) values:[00 : 49]
point to some backend specification for handeling data on the local disk.[50 : 99]
point to some remote backend specification.XX:
) in the Data Hash db values are formatted and parsed in any way which the backend implementation finds convenient, they do not need to be understood by any outside observer. That said, they should not be considered private in any way; The values WILL be viewed by other parts of the application.I'm working on a common set of interfaces which should be defined for each backend. More to follow both here and in the docs as I move towards a cleaner implementation.
For this PR, I have reserved five codes for the following backends:
Right now the local hdf5 and local numpy backends work as intended; it's a bit rough around the edges, but this is already showing promise.
Screenshots (if appropriate):
Types of changes
What types of changes does your code introduce? Put an
x
in all the boxes that apply:Is this PR ready for review, or a work in progress?
How Has This Been Tested?
Put an
x
in the boxes that apply:Checklist: