
Add cloud protocol support, starting with BossDB #41

Merged: 15 commits merged from add-cloud-protocols into constantinpape:master on Nov 11, 2021

Conversation

@j6k4m8 (Contributor) commented Nov 9, 2021

This PR starts to add support for cloud datasets by allowing file paths with neuroglancer "protocol"-style prefixes (e.g. bossdb://).

It works by shimming a File and Dataset class to wrap the intern library so that it behaves like the h5py.File API:

>>> f = InternFile("bossdb://phelps_hildebrand_graham2021/FANC/em")
>>> f['data'] # numpy-like Dataset

(For reference, all public BossDB datasets are listed here; Janelia DVID data (dvid://) are listed here. As far as I know, there isn't a central repository for CloudVolume-format (precomputed://) datasets.)

Some more discussion in constantinpape/cluster_tools#23

Still to-do:

* Add optional local cache (?) Is this necessary or are slices stored temporarily in elf / cluster_tools workflows?
* Write some example workflows (maybe belongs in https://github.com/constantinpape/cluster_tools)

Feedback welcome, @constantinpape! :)

Resolved review threads: elf/io/files.py, elf/io/intern_wrapper.py, setup.py
@constantinpape (Owner) left a comment

I think this is on the right track. Now you only need to implement the classes in intern_wrapper and add tests.

Resolved review threads: elf/io/extensions.py, elf/io/files.py, elf/io/intern_wrapper.py
@constantinpape (Owner) commented:

* Add optional local cache (?) Is this necessary or are slices stored temporarily in elf / cluster_tools workflows?

I have a cache implementation in elf: https://github.com/constantinpape/elf/blob/master/elf/wrapper/cached_volume.py
But I haven't tested its performance much or used it much at all, so it might not be very useful yet.
It's also not used in cluster_tools yet, but adding it there would be easy.
I would be very interested in a contribution that either tests and improves CachedVolume or implements a better caching mechanism, but I suggest we leave that for a follow-up PR.

* Write some example workflows (maybe belongs in https://github.com/constantinpape/cluster_tools)

Yes, I think that should rather go into cluster_tools once we have a working intern wrapper here.


@j6k4m8 (Contributor, Author) commented Nov 11, 2021

@constantinpape — ready for your review and thoughts I think!

I added tests and as far as I can tell, things are playing nicely. Curious to see if the test workflow passes :)

@j6k4m8 j6k4m8 marked this pull request as ready for review November 11, 2021 00:35
@constantinpape (Owner) left a comment

Looks good overall. We could still add some chunking information to the wrapper if possible and slightly extend the tests.

Resolved review thread: elf/io/extensions.py

# TODO chunks are arbitrary, how do we handle this?
@property
def chunks(self):
constantinpape (Owner):

Is there some chunking that should be observed? (Even if it's not exposed as chunks by the intern API?)

j6k4m8 (Contributor, Author):

We ran some benchmarks and found that there's not an enormous difference between "cuboid-aligned" and "non-aligned" reads with BossDB in terms of performance, thanks to the server-side cache. We could add the 512×512×64 chunks here, but luckily chunk alignment is neither a big contributor to nor detractor from performance! I figured it made more sense to remain "agnostic" here, in the same way MRC does.

constantinpape (Owner):

And are parallel writes to data with overlapping chunks ok? If yes, and if performance is not an issue, we can return None indeed.
(Note that this will need some updates in cluster_tools then, but I think it's better to update it there so that it can deal with arbitrary chunk sizes rather than adding an artificial one here.)
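For illustration, downstream code that consumes `chunks` could tolerate a `None` value with a fallback along these lines (a hypothetical helper, not part of elf or cluster_tools; the default block shape is made up):

```python
def effective_block_shape(dataset, default=(64, 512, 512)):
    """Pick a block shape for blockwise processing.

    Uses the dataset's own chunks when it exposes them; otherwise falls
    back to `default`, clipped to the dataset shape so blocks never
    exceed the volume bounds.
    """
    chunks = getattr(dataset, "chunks", None)
    block = chunks if chunks is not None else default
    return tuple(min(b, s) for b, s in zip(block, dataset.shape))
```

A dataset that reports `chunks = None` then simply gets processed with the caller-chosen default blocking.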


ds = InternDataset("bossdb://witvliet2020/Dataset_1/em")
cutout = ds[210:211, 7000:7064, 7000:7064]
self.assertEqual(cutout.shape, (1, 64, 64))
constantinpape (Owner):

Can you also download directly via the intern API here and check for equality?
(I know that the wrapper is very thin, so this is very unlikely to fail, but I think it's better to be more careful in the tests ;))

j6k4m8 (Contributor, Author):

Definitely! I meant to ask how you wanted me to handle this to keep our tests in this repo isolated from the "in-use" database. My original plan was to download from a dataset and check a handful of voxels for equality, but that feels sloppy. We could also include a small .npy array in the tests directory to check full-array equality of a few cutouts, but that feels like adding unnecessary clutter.

My instinct is to do the former:

data = InternDataset("bossdb:// ... ") # some known, public dataset
assert data[100, 200, 300] == 137 # magic number
assert data[10, 300, 200] == 42 # magic number

How does that sound?

constantinpape (Owner):

My instinct is to do the former:

data = InternDataset("bossdb:// ... ") # some known, public dataset
assert data[100, 200, 300] == 137 # magic number
assert data[10, 300, 200] == 42 # magic number

How does that sound?

Yes, I think that's a good solution!

j6k4m8 (Contributor, Author):

(Just added this in the latest commit!)
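Such a magic-number test can be sketched as follows (the dataset URI and voxel values are placeholders; the test skips itself when the dependencies are missing, matching the skip pattern used in this PR's test suite):

```python
import unittest

# Guard on both dependencies so the test is skipped cleanly when either
# intern or the elf wrapper is unavailable.
try:
    from intern import array  # noqa: F401
    from elf.io.intern_wrapper import InternDataset
    HAS_DEPS = True
except ImportError:
    HAS_DEPS = False


@unittest.skipUnless(HAS_DEPS, "Needs intern (pip install intern)")
class TestInternWrapperSketch(unittest.TestCase):
    def test_known_voxel_values(self):
        # Placeholder URI and values: a real test points at a known public
        # BossDB dataset and hard-codes voxel values verified once against
        # a direct intern download.
        data = InternDataset("bossdb:// ...")  # some known, public dataset
        self.assertEqual(data[100, 200, 300], 137)  # magic number
        self.assertEqual(data[10, 300, 200], 42)  # magic number


if __name__ == "__main__":
    unittest.main()
```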

@j6k4m8 (Contributor, Author) commented Nov 11, 2021

I'm realizing the new tests are being skipped because there's no intern installation:

test_can_access_dataset (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'
test_can_download_dataset (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'
test_file (io_tests.test_intern_wrapper.TestInternWrapper) ... skipped 'Needs intern (pip install intern)'

Mind if I add it to the test suite? (I think the right way to do this is to use the environment.yaml dependency management and add a line like

- intern

Does that sound right?)

https://github.com/constantinpape/elf/blob/master/.github/workflows/environment.yaml#L5
(https://github.com/conda-forge/intern-feedstock)

@constantinpape (Owner) commented:

Yes, please go ahead and add it to the env file.

@constantinpape (Owner) commented:

Good that we activated the tests; it looks like there is indeed something wrong ;):

Traceback (most recent call last):
  File "/home/runner/work/elf/elf/test/io_tests/test_intern_wrapper.py", line 20, in test_can_access_dataset
    self.assertEqual(ds.shape, (300, 36000, 22000))
AssertionError: Tuples differ: (300, 26000, 22000) != (300, 36000, 22000)

First differing element 1:
26000
36000

- (300, 26000, 22000)
?       ^

+ (300, 36000, 22000)
?       ^

@j6k4m8 (Contributor, Author) commented Nov 11, 2021

Oops. The thing that's wrong is that I don't know how to copy and paste correctly! :) The test was wrong, not the code. Good catch!

@constantinpape (Owner) left a comment

Now we only need to return the dtype as a numpy dtype (I went ahead and did this directly).

Resolved review thread: elf/io/intern_wrapper.py
@constantinpape (Owner) commented:

@j6k4m8 looks like tests are passing now. This is good to be merged from my side. Anything you still want to add?

@j6k4m8 (Contributor, Author) commented Nov 11, 2021

Awesome!! Looks like it's still broken on Windows? (idk if that's expected..?)

I'm good to merge this once you're happy with it! My next step is to write a simple cloud segmentation example: "read from the cloud, segment, and write the completed seg back to the cloud"... I think in cluster_tools! :)

@constantinpape (Owner) commented:

Looks like it's still broken on Windows? (idk if that's expected..?)

Sorry, I only waited until the Linux tests passed before writing this...
Indeed, it still fails on Windows because it can't find the intern package; maybe there's something wrong with the Windows conda package?!
I can have a quick look later and will ping you here if I find something.

My next step is to write a simple cloud segmentation example next, "read from the cloud, segment, and write completed seg back to the cloud"... I think in cluster_tools! :)

Sounds good!

@constantinpape (Owner) commented:

It looks like there is some issue with the conda intern package:
Apparently `import intern` works, but `from intern import array` does not. With the change in 8f9edfd, `from intern import array` is used to check whether intern is available, and now the tests are skipped on Windows because this import fails.
I will merge this anyway, since this seems to be an upstream problem and usage on Windows is probably not so important, but I will create an issue to keep track of it.

@constantinpape constantinpape merged commit 5141cfc into constantinpape:master Nov 11, 2021
@j6k4m8 j6k4m8 deleted the add-cloud-protocols branch November 11, 2021 23:11