"Bag of Images" object for unregistered fields of view #751

Closed
ambrosejcarr opened this issue Oct 28, 2018 · 4 comments

Comments

@ambrosejcarr
Member

ambrosejcarr commented Oct 28, 2018

Because we requested registered & resliced data for the MVP, ImageStack makes the implicit assumption that contained data is registered, and thus that data can be treated as a 5-d tensor that lives in a consistent Euclidean space.

As users push back on this requirement, our attempts to use unregistered data are fragmenting our object model. For example, we are discussing the need to assert that data lives in the same coordinate space before submitting it to certain types of spot detection (see #695).

When data is not registered, it means:

  1. Propagation of physical coordinates to IntensityTable is more complex and must be done tile-by-tile.
  2. Data cannot be filtered volumetrically in more than 2D, as other axes may not align.
  3. Pixels cannot be considered aligned across axes of the ImageStack (for spot calling).

The purpose of this issue is to discuss a short- and long-term plan to solve this problem. My best guess at how to tackle this is below.

Long-term Proposal

  1. Make a "bag of images" object that provokes different and limited processing by the Filter and SpotFinder components that currently take ImageStack objects. Given that users have not had trouble storing unregistered data in SpaceTx-format, the underlying data should be stored in the same way on disk and potentially also in-memory.
  2. Registration methods that we expose in the future would then take "bag of images" and emit ImageStacks.
  3. Assert on creation of an ImageStack that all tiles hold the same physical coordinates.
  4. SpotFinder methods that require registration can take only ImageStack objects. More flexible spot finders can take either ImageStack or "bag of images".
  5. Filter methods will not offer 3d processing to "bag of images" objects.
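
The separation in points 4 and 5 could look roughly like the sketch below. Every name in it (`BagOfImages`, `RegisteredStack`, the three functions) is hypothetical and chosen for illustration; none of it is the starfish API, and tiles are represented only by string ids.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical names throughout; none of these are part of the starfish API.


@dataclass
class BagOfImages:
    """Tiles with no guarantee of a shared coordinate space."""
    tile_ids: List[str]


@dataclass
class RegisteredStack:
    """Stand-in for an ImageStack whose tiles are known to be aligned."""
    tile_ids: List[str]


ImageContainer = Union[BagOfImages, RegisteredStack]


def filter_dimensionality(data: ImageContainer, volumetric: bool) -> int:
    """Return 2 or 3, refusing 3D filtering for unregistered tiles."""
    if volumetric and not isinstance(data, RegisteredStack):
        raise TypeError("3D filtering requires registered data")
    return 3 if volumetric else 2


def pixelwise_spot_finder_units(data: RegisteredStack) -> List[List[str]]:
    """A finder that compares pixels across tiles: one joint unit of work."""
    if not isinstance(data, RegisteredStack):
        raise TypeError("this spot finder requires registered data")
    return [list(data.tile_ids)]


def per_tile_spot_finder_units(data: ImageContainer) -> List[List[str]]:
    """A flexible finder: each tile is processed independently."""
    return [[tile_id] for tile_id in data.tile_ids]
```

Under this sketch, passing a `BagOfImages` to `pixelwise_spot_finder_units` or requesting volumetric filtering on it fails early with a `TypeError`, rather than silently producing results from misaligned pixels.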

Short-term (spaceTx) Proposal:

Unregistered data is currently strictly out-of-scope for starfish. To avoid creating technical debt by introducing inappropriate flexibility into ImageStack, we should:

  1. Decide whether we will accept unsliced data + registration transformations from users. If so, work with those users to learn to apply those transformations outside of spaceTx and starfish, and then store the resliced data as the substrate to Starfish. Store the unsliced data as well. SpaceTx-format should retain the ability to store per-tile coordinates so that we can expand to work with unregistered data in the future.
  2. Assert on creation of an ImageStack that all tiles hold the same physical coordinates (all data is registered & resliced)
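
A minimal sketch of such a creation-time assertion, assuming each tile's physical extent is available as four floats keyed by (round, channel, z). The function name, the key layout, and the tolerance are assumptions made for illustration, not starfish's actual data model.

```python
import math
from typing import Dict, Tuple

# Hypothetical layout: not the starfish or SpaceTx-format data model.
TileKey = Tuple[int, int, int]              # (round, channel, z)
Extent = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max)


def assert_same_physical_coordinates(
    extents: Dict[TileKey, Extent], abs_tol: float = 1e-6
) -> None:
    """Raise ValueError if any tile's extent differs from the first tile's."""
    items = iter(extents.items())
    try:
        ref_key, ref = next(items)
    except StopIteration:
        return  # no tiles, nothing to compare
    for key, extent in items:
        same = all(
            math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
            for a, b in zip(ref, extent)
        )
        if not same:
            raise ValueError(
                f"tile {key} has extent {extent}, but tile {ref_key} has {ref}; "
                "data must be registered and resliced"
            )
```

Comparing with a small absolute tolerance rather than exact equality avoids rejecting data whose coordinates differ only by floating-point round-off; the right tolerance would depend on the units stored in the coordinates.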

cc @joshmoore @berl @kevinyamauchi @dganguli @shanaxel42 @ttung

@berl
Collaborator

berl commented Oct 28, 2018

@ambrosejcarr thanks for generating a concrete issue and proposals on this somewhat nebulous problem.

Some clarification:

> Unregistered data is currently strictly out-of-scope for starfish

From the beginning, we (and our spec) have supported registered data, meaning data that includes coordinates for each tile, not just resliced data that is already an aligned 5D tensor. Any departure from this is walking back from our original spec and should be discussed as such. Personally, I'm strongly in favor of keeping it.

Overall, I agree with @ambrosejcarr's assessment of the current situation: we have assumed that the raw data could come in with arbitrary locations for every .tif file. Fortunately, a lot of the refactoring work described above and the work put back on contributors can be avoided if we reconsider this assumption.

I propose that we modify the assumption to match real-world data generation: Within a FOV, images for each (channel, round) share the same x,y coordinates. This doesn't exactly solve the 5-D ImageStack problem, but it does:

  1. Allow 3D filtering and spotfinding
  2. Make propagation of spatial information into IntensityTables trivial
  3. Retain the ability to deal with overlap between FOVs (something yet to be discussed, and relevant even with data that has constant (x,y) values across all (r,c) pairs)
  4. Allow the processing to be flexible exactly where it needs to be for non-barcoded methods. These methods need registration precision of ~1 µm, not ~0.1 µm, and won't generate nonsense results if the registration quality isn't perfect everywhere.
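
One way to check this relaxed assumption is sketched below. It reads the proposal as "all z-planes of a given (round, channel) volume share one (x, y) origin, while different (round, channel) volumes may sit at different origins"; that reading, the function name, and the dictionary layout are assumptions for illustration rather than anything in starfish.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: not the starfish or SpaceTx-format data model.
TileKey = Tuple[int, int, int]  # (round, channel, z)
XY = Tuple[float, float]        # physical (x, y) origin of a tile


def volume_origins(
    tile_origins: Dict[TileKey, XY], abs_tol: float = 1e-6
) -> Dict[Tuple[int, int], XY]:
    """Return one (x, y) origin per (round, channel), or raise if z-planes disagree."""
    groups: Dict[Tuple[int, int], List[XY]] = defaultdict(list)
    for (r, c, _z), xy in tile_origins.items():
        groups[(r, c)].append(xy)

    result: Dict[Tuple[int, int], XY] = {}
    for rc, origins in groups.items():
        x0, y0 = origins[0]
        for x, y in origins[1:]:
            if abs(x - x0) > abs_tol or abs(y - y0) > abs_tol:
                raise ValueError(f"z-planes of (round, channel) {rc} are not aligned in x,y")
        result[rc] = (x0, y0)
    return result
```

The returned mapping holds one origin per volume, which is what a per-volume 3D filter or spot finder would need in order to place its results in physical space afterwards.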

As a contributor, I say that loading, resampling and rewriting entire data sets is too much to ask, now and in the future. Furthermore, reslicing of a data set is a lot easier to say than do. Specifically, implementing image translations, cropping, and dealing with FOV overlap is prone to introducing artifacts into downstream image processing pipelines. Also, the tools out there to do it are either moderately complicated (render) or moderately painful (FIJI).

I look forward to more discussion here. This will definitely impact priorities for the next few months as SpaceTx data generation ramps up.

@ambrosejcarr
Member Author

ambrosejcarr commented Oct 28, 2018

> From the beginning, we (and our spec) have supported registered data, meaning data that includes coordinates for each tile, not just resliced data that is already an aligned 5D tensor

I think there might have been some difference of interpretation there, but Deep and I are on your page here: we can take registered data as you describe, without reslicing. Sorry if that wasn't clear.

However, it may make more sense to reslice it ourselves outside of starfish, at least for the spaceTx experiments, to triage the work necessary to treat data as it is acquired. Treating registered data directly could mean a general interpretation of complex transformations, which opens a scary box.

> As a contributor, I say that loading, resampling and rewriting entire data sets is too much to ask, now and in the future

I agree with the "in the future" part of this, but can you help me understand why this is so burdensome? As far as I can tell, for a one-off experiment it would just mean a few dozen compute hours and an extra 1-60 TB (size on 1 FOV) of temporary storage, hence my proposal above that we (starfish team) might want to reslice it to free time to focus on infrastructure work.

Certainly it seems easier to me to take registered data from users and reslice it on S3 than it does to adapt starfish in the short term to support data that's not pixel-aligned in 3D. What am I missing here?

Thanks Brian!

@berl
Collaborator

berl commented Oct 29, 2018

> can you help me understand why this is so burdensome?

If it works perfectly and is done by the starfish team (which I didn't quite get before), then it's not burdensome. I'm most concerned about:
a) introducing artifacts (e.g. at edges due to FFT ringing, cropping, etc), some of which are prone to amplification by downstream processing
b) handling of overlap between FOVs and the loss/weirdness due to cropping to the 5D ImageStack
c) actual implementation: I'd suggest using Render for this task, but if you figure out another solution that works straight from SpaceTx format it could be broadly useful outside starfish.

These problems are completely avoided by doing per-round, per-channel processing and handling the registration in the IntensityTable. And this seems to me to be pretty easy to implement and useful for several reasons beyond just handling SpaceTx data in the short term.
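
As a rough sketch of what "handling the registration in the IntensityTable" might mean: spots are found per (round, channel) in pixel coordinates, then shifted into physical coordinates using that volume's origin. The sketch assumes registration is a pure translation with a single isotropic pixel size; `Spot`, `spots_to_physical`, and the input layout are hypothetical and not the starfish API.

```python
from typing import Dict, List, NamedTuple, Tuple

# Hypothetical names and layout: not the starfish IntensityTable.
RoundCh = Tuple[int, int]                  # (round, channel)
PixelSpot = Tuple[float, float, float]     # (column, row, intensity) in pixels


class Spot(NamedTuple):
    r: int
    c: int
    x_um: float
    y_um: float
    intensity: float


def spots_to_physical(
    pixel_spots: Dict[RoundCh, List[PixelSpot]],
    origins_um: Dict[RoundCh, Tuple[float, float]],
    pixel_size_um: float,
) -> List[Spot]:
    """Shift per-(round, channel) pixel positions by that volume's physical origin."""
    table: List[Spot] = []
    for (r, c), spots in pixel_spots.items():
        x0, y0 = origins_um[(r, c)]
        for col, row, intensity in spots:
            table.append(
                Spot(r, c, x0 + col * pixel_size_um, y0 + row * pixel_size_um, intensity)
            )
    return table
```

Because no image is resampled, cropping and interpolation artifacts never enter the pixel data; the cost is that any cross-round comparison has to happen on spot coordinates rather than on aligned pixels.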

I'm also a little concerned that a focus on building for resliced data will make registered (but not resliced) data second-class citizens in starfish. I'm biased here of course, because I'm generating this kind of data, but I'm not the only one!

@ambrosejcarr
Member Author

> I'm also a little concerned that a focus on building for resliced data will make registered (but not resliced) data second-class citizens in starfish

Don't worry. Just gathering information to figure out what makes sense to implement now (and we won't make that decision unilaterally). It's more a question of "when". I think we could get your workflows functioning in current starfish without too much trouble (as you outline, thanks!), but I'm concerned about some of the ISS and expansion approaches.

I think a separation of objects (ImageStack vs BagOfImages) might be the right way to differentiate the two types of workflows, which would keep both as first class citizens.

Your other points are very helpful. It's good to understand it's not a computational/cost issue, but an algorithm one.

I generally think your point about processing the data the way it is generated is a very compelling one, and I will advocate that starfish accommodate that.


3 participants