Let LightningDataModule return spatiotemporal metadata #66
Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.
# Get date
date: str = pathlib.Path(filepath).name[15:25]  # YYYY-MM-DD format
Don't like that we have to parse the date from the filename in a hardcoded way. @yellowcap, I hinted at this in #54 (comment), but would it be possible to save the datetime information in the GeoTIFF's metadata somehow?
Addressed in #72
After the next pipeline run you'll be able to get the date using
dataset.tags()["date"]
Awesome, can't wait!
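To illustrate the two approaches discussed in this thread, here is a minimal sketch that prefers a `date` tag when one is available and otherwise falls back to searching the filename with a regular expression instead of a fixed slice. The function name `get_date`, the `tags` argument (assumed to be the dict returned by `dataset.tags()`), and the example filenames are all hypothetical and not taken from the repository.

```python
import datetime
import pathlib
import re
from typing import Optional

# Matches an ISO-style date anywhere in a string, e.g. 2023-05-17
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def get_date(filepath: str, tags: Optional[dict] = None) -> str:
    """Return a YYYY-MM-DD string, preferring embedded tags over the filename.

    `tags` is assumed to be the mapping returned by `dataset.tags()`.
    """
    if tags and "date" in tags:
        return tags["date"]

    # Fallback: find the date wherever it sits in the filename
    match = DATE_PATTERN.search(pathlib.Path(filepath).name)
    if match is None:
        raise ValueError(f"No YYYY-MM-DD date found in {filepath!r}")

    # Raises ValueError for impossible dates such as 2023-13-45
    datetime.date.fromisoformat(match.group())
    return match.group()


# Hypothetical filename whose date happens to sit at characters 15 to 25
example_path = "data/tile_x001_y002_2023-05-17.tif"
example_date = get_date(example_path)
```

Unlike the `[15:25]` slice, the regex fallback does not depend on the length of the filename prefix, and the tag lookup avoids filename parsing entirely once the GeoTIFFs carry the date.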
Reduce duplicate code by using pytest.mark.parametrize, looping over fit and predict stages.
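A rough sketch of what parametrizing a test over the two stages can look like is shown below. `StubDataModule` is a hypothetical stand-in so the example runs without the project's real datamodule or any GeoTIFF files; the actual test in the repository may be structured differently.

```python
import pytest


class StubDataModule:
    """Hypothetical stand-in for the project's LightningDataModule."""

    def setup(self, stage: str) -> None:
        self.stage = stage

    def _batches(self) -> list:
        # One fake batch with the four keys described in this pull request
        return [
            {"image": [[0.0]], "bbox": [0.0, 0.0, 1.0, 1.0], "crs": 32633, "date": "2023-05-17"}
        ]

    def train_dataloader(self) -> list:
        return self._batches()

    def predict_dataloader(self) -> list:
        return self._batches()


@pytest.mark.parametrize("stage", ["fit", "predict"])
def test_datamodule_returns_metadata(stage: str) -> None:
    """Run the same checks for the fit and predict stages."""
    datamodule = StubDataModule()
    datamodule.setup(stage=stage)

    # Pick the dataloader that belongs to the current stage
    loader_name = "train_dataloader" if stage == "fit" else "predict_dataloader"
    batch = next(iter(getattr(datamodule, loader_name)()))

    assert set(batch) == {"image", "bbox", "crs", "date"}
```

When collected by pytest, this single function runs once per stage, which replaces two near-identical test functions.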
Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.
src/datamodule.py
Outdated
# Get date
date: str = pathlib.Path(filepath).name[15:25]  # YYYY-MM-DD format

return {"image": tensor, "bbox": bbox, "crs": crs, "date": date}
Tempted to store this in an Arrow Table with FixedShapeTensorArray for the image and bbox 'columns'. Possibly revisit this in the future.
Gonna leave this up for review for a day or so before merging. Once merged, I'll proceed to work on part 2/2, which is to get the model to output embeddings with spatiotemporal metadata columns!
Looks good to me. Just one small comment on variable name.
src/datamodule.py
Outdated
bbox: torch.Tensor = torch.as_tensor(  # xmin, ymin, xmax, ymax
    data=dataset.bounds, dtype=torch.float64
)
crs: int = torch.as_tensor(data=dataset.crs.to_epsg(), dtype=torch.int32)
This variable could be called epsg to make clear it's a CRS stored as an EPSG integer.
Was considering this actually, I've changed it in 6a0e8cf!
Since we're storing the EPSG integer and not the CRS representation.
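As a small illustration of the renamed variable, the sketch below builds the bounding box and EPSG code as tensors, with the type hints reflecting that both are `torch.Tensor` objects. The helper `spatial_metadata`, the stand-in `BoundingBox` namedtuple (modelled on the bounds object used in the diff above), and the sample coordinates are assumptions made for this example only.

```python
from collections import namedtuple

import torch

# Stand-in for the dataset bounds object, so the example runs without a GeoTIFF
BoundingBox = namedtuple("BoundingBox", ["left", "bottom", "right", "top"])


def spatial_metadata(bounds: BoundingBox, epsg_code: int) -> dict[str, torch.Tensor]:
    """Pack the bounding box and EPSG integer into tensors."""
    bbox: torch.Tensor = torch.as_tensor(  # xmin, ymin, xmax, ymax
        data=tuple(bounds), dtype=torch.float64
    )
    # A plain integer code, not a full CRS object, so it fits in a 0-d tensor
    epsg: torch.Tensor = torch.as_tensor(data=epsg_code, dtype=torch.int32)
    return {"bbox": bbox, "epsg": epsg}


# Made-up UTM coordinates and EPSG code for illustration
sample = spatial_metadata(
    bounds=BoundingBox(499980.0, 5590200.0, 509780.0, 5600040.0), epsg_code=32633
)
```

Naming the key `epsg` tells downstream code that it holds an integer code that would need converting back to a CRS object before any reprojection.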
Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.
* ✨ Save embeddings with spatiotemporal metadata to GeoParquet
  Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.
* 📝 Document how embeddings are generated and saved to geoparquet
  Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table of what the tabular data looks like.
* 📝 Mention in main README.md that embeddings are saved to geoparquet
  Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.
* 🎨 Update type hint of batch inputs, and add some inline comments
  Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.
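For readers unfamiliar with the two encodings named in the first commit message, the following standard-library sketch shows what they amount to: date32 is a count of days since 1970-01-01, and WKB is a compact binary layout for geometries. This is only an illustration of the formats, not how the pull request does it; there, geopandas handles the encoding. The function names and sample values are hypothetical.

```python
import datetime
import struct

UNIX_EPOCH = datetime.date(1970, 1, 1)


def to_date32(date_str: str) -> int:
    """Days since 1970-01-01, the integer that a date32 value stores."""
    return (datetime.date.fromisoformat(date_str) - UNIX_EPOCH).days


def bbox_to_wkb(xmin: float, ymin: float, xmax: float, ymax: float) -> bytes:
    """Encode a bounding box as a little-endian WKB polygon with one closed ring."""
    ring = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax), (xmin, ymin)]
    # Byte order flag 1 = little endian, geometry type 3 = Polygon, then ring count
    header = struct.pack("<BII", 1, 3, 1)
    # Point count for the ring, followed by each (x, y) pair as two float64 values
    points = struct.pack("<I", len(ring)) + b"".join(
        struct.pack("<dd", x, y) for x, y in ring
    )
    return header + points


days = to_date32("2023-05-17")
wkb = bbox_to_wkb(499980.0, 5590200.0, 509780.0, 5600040.0)
```

Both values are fixed-size and language-neutral, which is part of why they suit a columnar file that other tools need to read back.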
* 🗃️ Let LightningDataModule return spatiotemporal metadata
  Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.
* ♻️ Refactor test_geotiffdatapipemodule to use parametrization
  Reduce duplicate code by using pytest.mark.parametrize, looping over fit and predict stages.
* 📝 Document returned outputs from _array_to_torch function
  Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.
* 🚚 Rename crs to epsg
  Since we're storing the EPSG integer and not the CRS representation.
Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date.
Note: This is part 1/2 of adding spatiotemporal metadata to the output embedding table later, as mentioned at #35 (comment).