A schema for collections? #40

Open
bmcfee opened this issue Jun 9, 2015 · 6 comments
Labels
enhancement question schema Issues pertaining to schema definitions

@bmcfee
Contributor

bmcfee commented Jun 9, 2015

Going back to this comment, we punted on the idea of managing extrinsic data (e.g., file paths) explicitly from within a JAMS object. Now that the dust has settled a bit on the JAMS schema, I'm wondering if we can come up with a better solution than sandboxing this stuff.

I bring this up because maintaining links between audio content and annotations is still kind of a pain, and I'd prefer to not solve it over and over again.

How do people feel about introducing an interface/schema for managing collections of jamses? At the most basic level, this would provide a simple index of audio content, jams content, and collection-level information. (It might also be useful to index which annotation namespaces are present in each jams file.) This kind of thing can spiral out of control easily, so if we do it, we should keep it tightly scoped.

@ejhumphrey
Collaborator

How's about a FileManager object that inherits from a dict or list, depending on whether or not key or integer-based indexing makes sense (I typically use, and prefer, key-based indexing so you're robust to shuffling / partitioning), and contains FileCollection objects, each consisting of fields that point to any number of file paths?

As an added bonus, we / the user could register different load / open methods with filetypes for transparent (lazy) loading, i.e. "npz" -> np.load, "jams" -> jams.load, etc.
For example...

```python
fmgr = FileManager()
fmgr['my_song'] = FileCollection(
    audio='/path/to/my/song.wav',
    annotation='/a/different/file.jams',
    features='/data/features/my_song.npz')

# Assuming 'npz' -> np.load by default
data = fmgr['my_song'].features.load()
```
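
A minimal sketch of how the proposal above might behave, written only to make the idea concrete. `FileManager` and `FileCollection` are the names used in the comment; `FileRef`, `register_loader`, and `LOADERS` are hypothetical helpers, and none of these exist in JAMS. To stay dependency-free the sketch registers a JSON loader; `np.load` for "npz" and `jams.load` for "jams" could presumably be registered the same way.

```python
import json
import os
import tempfile

# Hypothetical registry: file extension (without the dot) -> loader function.
LOADERS = {}


def register_loader(ext, func):
    """Associate a file extension with a loader function."""
    LOADERS[ext.lower()] = func


class FileRef:
    """A path whose contents are read only when load() is called."""

    def __init__(self, path):
        self.path = path

    def load(self, **kwargs):
        ext = os.path.splitext(self.path)[1].lstrip(".").lower()
        if ext not in LOADERS:
            raise KeyError("no loader registered for '%s'" % ext)
        return LOADERS[ext](self.path, **kwargs)


class FileCollection:
    """Named fields, each pointing at one file path."""

    def __init__(self, **paths):
        for field, path in paths.items():
            setattr(self, field, FileRef(path))


class FileManager(dict):
    """Key-based index of FileCollection objects."""


def load_json(path):
    with open(path) as fh:
        return json.load(fh)


register_loader("json", load_json)

# Demo: a temporary JSON file stands in for real feature data.
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "my_song.json")
with open(demo_path, "w") as fh:
    json.dump({"tempo": 120}, fh)

fmgr = FileManager()
fmgr["my_song"] = FileCollection(features=demo_path)
data = fmgr["my_song"].features.load()  # nothing is read until this call
```

Because `FileManager` subclasses `dict`, entries are addressed by key, which matches the stated preference for key-based indexing.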

Additionally, if everything inherits from JObject, then this database-style object can be saved / loaded just as easily.

Thoughts?

@bmcfee
Contributor Author

bmcfee commented Jul 14, 2015

> How's about a FileManager object that inherits from a dict or list, depending on whether or not key or integer-based indexing makes sense (I typically use, and prefer, key-based indexing so you're robust to shuffling / partitioning)

I'd argue that int-based indexing never makes sense, unless the int is actually treated as a key (e.g., in GTZAN).

It may also be worth looking at something like asdf for inspiration, since they have many of the same problems we do.

> As an added bonus, we / the user could register different load / open methods with filetypes for transparent (lazy) loading, i.e. "npz" -> np.load, "jams" -> jams.load, etc.

I like this idea, but transparent loading seems a little tricky to get exactly right. Ideally, I'd want to be able to clobber load arguments (such as the audio sampling rate). This could be supported pretty easily by setting defaults on kwargs, but the resulting API may be kind of a mess.
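
One way the "defaults on kwargs" idea could look, as a rough sketch rather than a settled design. `LoaderRegistry` and `fake_audio_load` are hypothetical names; the fake loader just echoes its arguments so the override behavior is visible without reading any audio.

```python
import os


class LoaderRegistry:
    """Hypothetical registry: extension -> (loader, default kwargs)."""

    def __init__(self):
        self._loaders = {}

    def register(self, ext, func, **defaults):
        # Defaults are stored once, at registration time.
        self._loaders[ext.lower()] = (func, defaults)

    def load(self, path, **overrides):
        ext = os.path.splitext(path)[1].lstrip(".").lower()
        func, defaults = self._loaders[ext]
        # Call-time keyword arguments clobber the registered defaults.
        kwargs = {**defaults, **overrides}
        return func(path, **kwargs)


def fake_audio_load(path, sr=22050, mono=True):
    """Stand-in for a real audio loader; returns its arguments."""
    return {"path": path, "sr": sr, "mono": mono}


registry = LoaderRegistry()
registry.register("wav", fake_audio_load, sr=44100)

default_call = registry.load("/path/to/my/song.wav")
override_call = registry.load("/path/to/my/song.wav", sr=8000)
```

The messiness mentioned above shows up here as three layers of values for `sr` (the loader's own default, the registered default, and the call-time override), which a caller has to keep straight.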

Maybe we should ponder on that a bit.

@bmcfee bmcfee modified the milestone: 0.2.1 Jul 18, 2015
@bmcfee
Contributor Author

bmcfee commented Sep 14, 2015

Circling back on this after a bit of pondering.

> ```python
> fmgr = FileManager()
> fmgr['my_song'] = FileCollection(
>     audio='/path/to/my/song.wav',
>     annotation='/a/different/file.jams',
>     features='/data/features/my_song.npz')
> ```

This looks exactly like a dataframe to me.

> ```python
> # Assuming 'npz' -> np.load by default
> data = fmgr['my_song'].features.load()
> ```

How about something a little less objecty? I like your idea of having a dispatch object that can map a key (e.g., features) to a loader function (np.load). But why does that need to be attached to the object? We could just as easily construct the dispatcher as its own object, and feed it a data frame whose index keys correspond to samples, and where each column is a field that can be loaded via dispatch.

This way, we don't have to worry about schematizing the whole thing, and it becomes much easier to import data sets on the fly. (We can also tag along non-loadable fields at the same level, such as an artist id for split filtering.)
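
A small sketch of the detached dispatcher, under stated assumptions: `Dispatcher` is a hypothetical name, and a plain dict of rows stands in for the data frame so that the example has no dependencies. With an actual pandas DataFrame indexed by sample key, the lookup would need something like `table.loc[key, field]` instead of `table[key][field]`.

```python
import json
import os
import tempfile


class Dispatcher:
    """Hypothetical dispatcher: maps a field (column) name to a loader."""

    def __init__(self, loaders):
        self.loaders = dict(loaders)

    def load(self, table, key, field, **kwargs):
        value = table[key][field]
        loader = self.loaders.get(field)
        if loader is None:
            # Non-loadable field (e.g. an artist id): return it unchanged.
            return value
        return loader(value, **kwargs)


def load_json(path):
    with open(path) as fh:
        return json.load(fh)


# Demo: a temporary JSON file stands in for an annotation file.
tmpdir = tempfile.mkdtemp()
annotation_path = os.path.join(tmpdir, "my_song.json")
with open(annotation_path, "w") as fh:
    json.dump({"annotations": []}, fh)

# Rows are samples, columns are fields; the table itself knows nothing
# about loading.
table = {
    "my_song": {"annotation": annotation_path, "artist_id": "artist_01"},
}

dispatcher = Dispatcher({"annotation": load_json})
annotation = dispatcher.load(table, "my_song", "annotation")
artist = dispatcher.load(table, "my_song", "artist_id")
```

Keeping the loaders outside the table means any existing tabular index can be used as-is, and columns without a registered loader simply pass through.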

@bmcfee
Contributor Author

bmcfee commented Feb 1, 2016

Punting this to #98

@bmcfee bmcfee modified the milestones: 0.3.0, 0.2.2 Feb 1, 2016
@bmcfee bmcfee modified the milestones: 0.3.0, 0.4.0 May 11, 2017
@bmcfee
Contributor Author

bmcfee commented May 31, 2018

Having thought on this for years at this point, I think the reasonable course of action here is as follows:

  1. Implement the unified schema refactor proposed in #178 ("RFC: more rigid, but simpler schema validation").
  2. Expose the schema over the web with proper versioning and references.
  3. Any collection-level schema can be built using references to (2). Objects can be sharded and linked by uuids at the collection-level, but the objects themselves do not need to contain identifiers. This way, the schema can be backward-compatible.
  4. Provide a standard implementation / example schema for managing jams collections in mongodb (the famed jamongo) using the above.
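
To make steps 2 and 3 more concrete, here is a rough sketch of a collection-level schema that points at a versioned JAMS schema by reference and links entries by uuid. Everything specific is an assumption for illustration: the schema URL is a placeholder (the thread does not say where the schema would be hosted), and the `entries`, `audio`, and `jams` field names and the `add_entry` helper are hypothetical.

```python
import uuid

# Placeholder URL, not a real hosting location for the JAMS schema.
JAMS_SCHEMA_URL = "https://example.org/schema/jams/v1/jams.json"

# Collection-level schema: identifiers live in the collection, and the
# JAMS object is described only by reference to the external schema.
COLLECTION_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "entries": {
            "type": "object",
            # Keys of "entries" are uuid strings.
            "additionalProperties": {
                "type": "object",
                "properties": {
                    "audio": {"type": "string"},
                    "jams": {"$ref": JAMS_SCHEMA_URL},
                },
                "required": ["jams"],
            },
        }
    },
    "required": ["entries"],
}


def add_entry(collection, jam, audio=None):
    """Store a JAMS-like object under a fresh uuid and return the key."""
    key = str(uuid.uuid4())
    entry = {"jams": jam}
    if audio is not None:
        entry["audio"] = audio
    collection["entries"][key] = entry
    return key


collection = {"entries": {}}
# Minimal stand-in for a JAMS object; note it carries no identifier.
jam = {"file_metadata": {"duration": 30.0}, "annotations": []}
key = add_entry(collection, jam, audio="/path/to/my/song.wav")
```

Since the uuid is only a key in the collection, the JAMS object itself is unchanged, which is what keeps the object-level schema backward-compatible.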

@bmcfee
Contributor Author

bmcfee commented Jun 5, 2018

> Provide a standard implementation / example schema for managing jams collections in mongodb (the famed jamongo) using the above.

Of course, it couldn't be that simple. MongoDB does not support $ref in JSON Schema (?!).
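
The thread does not say how this limitation would be worked around. One conceivable approach, sketched here purely as an assumption, is to inline local references before handing a schema to a validator that lacks `$ref`. The `flatten_schema` helper is hypothetical, handles only local `#/...` pointers, leaves remote references untouched, and would recurse forever on a self-referential schema.

```python
import copy
import json


def _inline(node, root):
    """Recursively replace local '#/...' references with their targets."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            target = root
            for part in ref[2:].split("/"):
                target = target[part]
            # Copy the target so the flattened schema shares no state.
            return _inline(copy.deepcopy(target), root)
        return {key: _inline(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [_inline(item, root) for item in node]
    return node


def flatten_schema(schema):
    """Return a copy of the schema with local references inlined."""
    flat = _inline(schema, schema)
    # The definitions block is no longer needed once inlined.
    flat.pop("definitions", None)
    return flat


schema = {
    "definitions": {
        "annotation": {
            "type": "object",
            "properties": {"namespace": {"type": "string"}},
        }
    },
    "type": "object",
    "properties": {
        "annotations": {
            "type": "array",
            "items": {"$ref": "#/definitions/annotation"},
        }
    },
}

flat = flatten_schema(schema)
```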

@bmcfee bmcfee added the schema Issues pertaining to schema definitions label Aug 12, 2019