
Support for cloud-based datastores? #23

Open
j6k4m8 opened this issue Nov 8, 2021 · 4 comments


j6k4m8 commented Nov 8, 2021

This looks like a super powerful tool and I'm looking forward to using it! I'd love to implement an API abstraction for cloud datastores like BossDB or CloudVolume so that one could, in theory, generate peta-scale segmentation without having to download the data and reformat it into n5/hdf5.

These datastores tend to have client-side libraries that support numpy-like indexing, e.g.:

# Import intern (pip install intern)
from intern import array

# Save a cutout to a numpy array in ZYX order:
em = array("bossdb://microns/minnie65_8x8x40/em")
data = em[19000:19016, 56298:57322, 79190:80214]

My understanding is that this should be a simple drop-in replacement for the ws_path and ws_key if we had a class that looked something like this:

from intern import array

class BossDBAdapterFile:
    """Wrapper that makes a remote BossDB volume look like an n5/hdf5 file object."""

    def __init__(self, filepath: str):
        # filepath is a bossdb:// URI rather than a path on disk
        self.array = array(filepath)

    def __getitem__(self, groupname: str):
        # ignore the group/dataset name and hand back the remote array,
        # which already supports numpy-style slicing
        return self.array

    ...

(I expect I've forgotten a few key APIs / organization, but the gist is this)
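Purely for illustration, usage in place of an on-disk file could then look something like this (the bounds are the same cutout as in the first snippet):

# hypothetical usage of the adapter sketched above
vol = BossDBAdapterFile("bossdb://microns/minnie65_8x8x40/em")
ds = vol["some_group"]  # the group name is ignored; we get back the remote array
cutout = ds[19000:19016, 56298:57322, 79190:80214]  # ZYX, fetched lazily on demand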

Is this something that you imagine is feasible? Desirable? My hypothesis is that this would be pretty straightforward and open up a ton of cloud-scale capability, but I may be misunderstanding. Maybe there's a better place to plug in here than "pretending" to be an n5 file?

@constantinpape (Owner)

Hi Jordan :).

Supporting BossDB or cloudvolume should indeed be relatively straightforward and would be a great addition here.
Internally I am using open_file from elf (another of my libraries, which wraps most of the "custom" algorithms used here) to open n5, hdf5 and zarr files; it also implements read-only support for some other file formats.

So the clean way would be to extend open_file such that it can return a wrapper around the cloud stores that enables read and (if necessary) write access. The extensions for open_file are implemented here. Note that open_file currently dispatches on the file extension only, see here. But it would be totally fine to add a check beforehand that tests whether the input is a URL (or whatever address you would pass for the cloud store) and, if so, returns the appropriate wrapper.
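Not elf's actual code, just a rough sketch of how such a check could sit in front of the extension-based dispatch (BossDBFile and the simplified open_file signature are made up for illustration):

from intern import array

class BossDBFile:
    """Hypothetical read-only wrapper mimicking the h5py/z5py File interface."""

    def __init__(self, uri: str, mode: str = "r"):
        self.uri = uri
        self.mode = mode

    def __getitem__(self, key: str):
        # a bossdb:// address already points at a single dataset,
        # so we just return the lazily-indexed remote array
        return array(self.uri)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        pass


def open_file(path, mode="a"):
    # check for a cloud address before the usual extension-based dispatch
    if isinstance(path, str) and path.startswith("bossdb://"):
        return BossDBFile(path, mode=mode)
    # ... fall through to elf's existing n5/hdf5/zarr handling here
    raise NotImplementedError("placeholder for the current extension-based dispatch")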


j6k4m8 commented Nov 9, 2021

Hi hi :)

Super super awesome! In that case I'll start retrofitting open_file — do you prefer I open a draft PR into elf so you can keep an eye on progress and make sure I'm not going totally off the deep end? Happy to close this issue in the meantime, or leave it open in pursuit of eventually getting cloud datasets running through these workflows!

@constantinpape (Owner)

do you prefer I open a draft PR into elf so you can keep an eye on progress and make sure I'm not going totally off the deep end?

Sure, feel free to open a draft PR and ping me in there for feedback.

Happy to close this issue in the meantime, or leave it open in pursuit of eventually getting cloud datasets running through these workflows!

Yeah, let's keep this issue open and discuss integration within cluster_tools once we can open the cloud stores in elf.
I'm sure a couple more things will come up here.


j6k4m8 commented Nov 9, 2021

Starting here! constantinpape/elf#41
