
Support for cloud-based datastores? #23

Open
j6k4m8 opened this issue Nov 8, 2021 · 4 comments


j6k4m8 commented Nov 8, 2021

This looks like a super powerful tool and I'm looking forward to using it! I'd love to implement an API abstraction for cloud datastores like BossDB or CloudVolume so that one could, in theory, generate peta-scale segmentation without having to download the data and reformat it into n5/hdf5.

These datastores tend to have client-side libraries that support numpy-like indexing, e.g.:

# Import intern (pip install intern)
from intern import array

# Save a cutout to a numpy array in ZYX order:
em = array("bossdb://microns/minnie65_8x8x40/em")
data = em[19000:19016, 56298:57322, 79190:80214]

My understanding is that this should be a simple drop-in replacement for the ws_path and ws_key if we had a class that looked something like this:

from intern import array

class BossDBAdapterFile:
    """Wrapper that makes a remote BossDB volume look like an n5/hdf5 file object."""

    def __init__(self, filepath: str):
        # filepath is a bossdb:// URI rather than a path on disk
        self.array = array(filepath)

    def __getitem__(self, groupname: str):
        # ignore the group/dataset name and hand back the remote array,
        # which already supports numpy-style slicing
        return self.array

    ...

(I expect I've forgotten a few key APIs / organization, but the gist is this)
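Purely for illustration, usage in place of an on-disk file could then look something like this (the bounds are the same cutout as in the first snippet):

# hypothetical usage of the adapter sketched above
vol = BossDBAdapterFile("bossdb://microns/minnie65_8x8x40/em")
ds = vol["some_group"]  # the group name is ignored; we get back the remote array
cutout = ds[19000:19016, 56298:57322, 79190:80214]  # ZYX, fetched lazily on demand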

Is this something that you imagine is feasible? Desirable? My hypothesis is that this would be pretty straightforward and open up a ton of cloud-scale capability, but I may be misunderstanding. Maybe there's a better place to plug in here than "pretending" to be an n5 file?

@constantinpape (Owner)

Hi Jordan :).

Supporting BossDB or cloudvolume should indeed be relatively straightforward and would be a great addition here.
Internally I am using open_file from elf (another of my libraries, which wraps most of the "custom" algorithms used here) to open n5, hdf5 and zarr files; it also implements read-only support for some other file formats.

So the clean way would be to extend open_file such that it can return a wrapper around the cloud stores that enables read and (if necessary) write access. The extensions for open_file are implemented here. Note that open_file currently dispatches on the file extension only, see here. But it would be totally fine to add a check beforehand that tests whether the input is a URL (or whatever address you would pass for the cloud store) and, if so, returns the appropriate wrapper.
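Not elf's actual code, just a rough sketch of how such a check could sit in front of the extension-based dispatch (BossDBFile and the simplified open_file signature are made up for illustration):

from intern import array

class BossDBFile:
    """Hypothetical read-only wrapper mimicking the h5py/z5py File interface."""

    def __init__(self, uri: str, mode: str = "r"):
        self.uri = uri
        self.mode = mode

    def __getitem__(self, key: str):
        # a bossdb:// address already points at a single dataset,
        # so we just return the lazily-indexed remote array
        return array(self.uri)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        pass


def open_file(path, mode="a"):
    # check for a cloud address before the usual extension-based dispatch
    if isinstance(path, str) and path.startswith("bossdb://"):
        return BossDBFile(path, mode=mode)
    # ... fall through to elf's existing n5/hdf5/zarr handling here
    raise NotImplementedError("placeholder for the current extension-based dispatch")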


j6k4m8 commented Nov 9, 2021

Hi hi :)

Super super awesome! In that case I'll start retrofitting open_file — do you prefer I open a draft PR into elf so you can keep an eye on progress and make sure I'm not going totally off the deep end? Happy to close this issue in the meantime, or leave it open in pursuit of eventually getting cloud datasets running through these workflows!

@constantinpape (Owner)

do you prefer I open a draft PR into elf so you can keep an eye on progress and make sure I'm not going totally off the deep end?

Sure, feel free to open a draft PR and ping me in there for feedback.

Happy to close this issue in the meantime, or leave it open in pursuit of eventually getting cloud datasets running through these workflows!

Yeah, let's keep this issue open and discuss integration within cluster_tools once we can open the cloud stores in elf.
I'm sure a couple more things will come up here.


j6k4m8 commented Nov 9, 2021

Starting here! constantinpape/elf#41
