Save embeddings with spatiotemporal metadata to GeoParquet #73

weiji14 · 2023-12-07T04:16:03Z

What I am changing

Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file!

How I did it

In the LightningModule's predict_step, use geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. A sample table would look like this:

date	embeddings	geometry
2021-01-01	[0.1, 0.4, ... x768]	POLYGON(...)
2021-06-30	[0.2, 0.5, ... x768]	POLYGON(...)
2021-12-31	[0.3, 0.6, ... x768]	POLYGON(...)
The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray (TODO), and geometry is in WKB.
Each row stores the embedding for a single 256x256 chip, and the entire table could realistically store N rows for an entire MGRS tile (10000x10000) across different dates.
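
As a rough back-of-the-envelope check on table size, the sketch below counts how many whole 256x256 chips fit in one tile and how many bytes the raw embedding values take up. It assumes a 10000x10000 pixel tile and float32 values of length 768; the constant and function names are made up for illustration and are not part of this PR.

```python
CHIP_SIZE = 256        # pixels per chip side, as described above
TILE_SIZE = 10_000     # ASSUMPTION: MGRS tile side length in pixels
EMBEDDING_DIM = 768    # values per embedding, as in the sample table
BYTES_PER_FLOAT32 = 4  # ASSUMPTION: embeddings are stored as float32


def chips_per_tile(tile_size: int = TILE_SIZE, chip_size: int = CHIP_SIZE) -> int:
    """Number of whole, non-overlapping chips that fit in a square tile."""
    per_side = tile_size // chip_size  # partial chips at the edge are ignored
    return per_side * per_side


def embedding_bytes(num_rows: int, dim: int = EMBEDDING_DIM) -> int:
    """Raw size of the embedding values only (no date, geometry or metadata)."""
    return num_rows * dim * BYTES_PER_FLOAT32


print(chips_per_tile())      # 39 * 39 = 1521 chips per tile and date
print(embedding_bytes(755))  # 2319360 bytes, i.e. about 2.3 MB for 755 rows
```

Under these assumptions, a 755-row table holds about 2.3 MB of raw embedding values before the date and geometry columns and any Parquet compression are taken into account.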

TODO in this PR:

Save embeddings to GeoParquet
Improve docstring

TODO in the future:

Ensure embeddings are saved as FixedShapeTensorArray? (see #73 (comment))

How you can test it

Locally, download some GeoTIFF data into the data/ folder, and then run:

python trainer.py fit --trainer.max_epochs=10 --trainer.precision=bf16-mixed --data.data_path=data/46REU --data.num_workers=4  # train the model
python trainer.py predict --ckpt_path=checkpoints/last.ckpt --data.batch_size=1024 --trainer.precision=bf16-mixed --data.num_workers=0  # generate embeddings

This should produce an embeddings_0.gpq file under the data/embeddings/ folder.
- Sample file (need to unzip, about 3.0MB uncompressed): embeddings_0.gpq.zip
Extra configuration options can be found using python trainer.py predict --help

To load the embeddings from the geoparquet file:

import geopandas as gpd

geodataframe: gpd.GeoDataFrame = gpd.read_parquet(path="embeddings_0.gpq")
assert geodataframe.shape == (755, 3)  # 755 rows, 3 columns (date, embeddings, geometry)
print(geodataframe)

        date	        embeddings	                                        geometry
0	2022-12-12	[-1.1094263, 1.0212796, -0.58915687, -1.144523...	POLYGON ((93.02647 30.71001, 93.02648 30.73311...
1	2022-12-12	[-1.1253564, 1.0260286, -0.5860151, -1.1528502...	POLYGON ((93.34729 30.70955, 93.34738 30.73265...
2	2022-12-12	[-1.1190275, 1.0268829, -0.59865385, -1.147052...	POLYGON ((93.74777 30.63856, 93.74794 30.66166...
3	2022-12-12	[-1.1115837, 1.0286477, -0.60599935, -1.143061...	POLYGON ((93.80119 30.63824, 93.80138 30.66134...
4	2022-12-12	[-1.1172316, 1.0246403, -0.59833527, -1.143900...	POLYGON ((93.82790 30.63808, 93.82810 30.66118...
...	...	...	...
750	2022-12-12	[-1.11294, 1.0265714, -0.6015097, -1.1443343, ...	POLYGON ((93.40048 30.64010, 93.40057 30.66320...
751	2022-12-12	[-1.1207774, 1.029693, -0.5964609, -1.1490294,...	POLYGON ((93.45391 30.63992, 93.45402 30.66302...
752	2022-12-12	[-1.1309807, 1.0274287, -0.57653224, -1.162805...	POLYGON ((93.58748 30.63939, 93.58762 30.66249...
753	2022-12-12	[-1.1268965, 1.0305986, -0.59025705, -1.154876...	POLYGON ((93.61420 30.63926, 93.61434 30.66236...
754	2022-12-12	[-1.1171025, 1.0268872, -0.60177326, -1.146309...	POLYGON ((93.69434 30.63886, 93.69450 30.66196...

755 rows × 3 columns

If you have a newer version of QGIS, it's also possible to load the GeoParquet file directly. The below screenshot shows the bounding box locations of the 755 embeddings (1 embedding for each 256x256 chip):

Related Issues

Extends #56, continuation of #66.

Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.

Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table of what the tabular data looks like.

weiji14 · 2023-12-07T05:06:55Z

src/model_vit.py

+                "embeddings": pa.FixedShapeTensorArray.from_numpy_ndarray(
+                    embeddings_mean.cpu().detach().__array__()
+                ),


Although we've converted the embedding into a FixedShapeTensorArray here, pandas/geopandas still interprets this column as an object dtype, and this is saved as an object dtype to the parquet file too (see the unit test). Need to see if there's a way to preserve the dtype.

Found a way to save this embeddings column as a FixedShapeTensorArray dtype instead of an object dtype like so:

Suggested change

-                "embeddings": pa.FixedShapeTensorArray.from_numpy_ndarray(
-                    embeddings_mean.cpu().detach().__array__()
-                ),
+                "embeddings": gpd.pd.arrays.ArrowExtensionArray(
+                    values=pa.FixedShapeTensorArray.from_numpy_ndarray(embeddings)
+                ),

However, while we can save this FixedShapeTensorArray to GeoParquet, loading this embeddings column as a FixedShapeTensorArray is challenging, and might involve code that looks like this:

geodataframe: gpd.GeoDataFrame = gpd.read_parquet(
    path="data/embeddings/embeddings_0.gpq",
    schema=pa.schema(
        fields=[
            pa.field(
                name="embeddings",
                type=pa.fixed_shape_tensor(value_type=pa.float32(), shape=[768]),
            ),
            pa.field(name="geometry", type=pa.binary()),
        ]
    ),
)

But this technically still results in an embeddings column with object dtype... Also, QGIS can load this geoparquet file with FixedShapeTensorArray, but would crash when you try to open the attribute table, because it can't handle FixedShapeTensorArray yet. So probably best to keep it in object dtype for now.

weiji14 · 2023-12-07T05:10:59Z

src/model_vit.py

+        outpath = f"{outfolder}/embeddings_{batch_idx}.gpq"
+        gdf.to_parquet(path=outpath, schema_version="1.0.0")
+        print(f"Saved embeddings of shape {tuple(embeddings_mean.shape)} to {outpath}")


It is possible to save several rows' worth of embeddings to a single GeoParquet file now, so we can decide how to group embeddings together, e.g. save all the embeddings for one MGRS tile in one year together.
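
One possible way to do that grouping is sketched below, purely as an illustration: records are bucketed by MGRS tile code and calendar year, so that each bucket could later become one file. The record layout (plain dicts with hypothetical "mgrs" and "date" keys) and the function name are assumptions; the table in this PR does not carry an MGRS column yet.

```python
import datetime
from collections import defaultdict


def group_by_tile_and_year(records: list[dict]) -> dict[tuple[str, int], list[dict]]:
    """Bucket embedding records by (MGRS tile code, calendar year)."""
    groups: dict[tuple[str, int], list[dict]] = defaultdict(list)
    for record in records:
        # "mgrs" and "date" are hypothetical keys used for this sketch only
        key = (record["mgrs"], record["date"].year)
        groups[key].append(record)
    return dict(groups)


records = [
    {"mgrs": "46REU", "date": datetime.date(2021, 1, 1)},
    {"mgrs": "46REU", "date": datetime.date(2021, 6, 30)},
    {"mgrs": "46REU", "date": datetime.date(2022, 12, 12)},
]
for key, rows in group_by_tile_and_year(records).items():
    print(key, len(rows))
```

Each bucket would then map to a single GeoParquet file; whether a year is the right time window is still an open choice.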

New 512x512 image chips are being processed now-ish, see #76 (comment). Will use a new filename convention (with the MGRS code in it) in a follow-up PR once we've got a new model trained on that new dataset.

weiji14 · 2023-12-08T00:35:56Z

There are a couple of things that can be improved as mentioned above, such as the file naming scheme and streamlining how the embeddings are saved to the GeoParquet file, but will merge this in first and handle those nice-to-haves in follow-up PRs.

* ✨ Save embeddings with spatiotemporal metadata to GeoParquet
  Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.
* 📝 Document how embeddings are generated and saved to geoparquet
  Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table of what the tabular data looks like.
* 📝 Mention in main README.md that embeddings are saved to geoparquet
  Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.
* 🎨 Update type hint of batch inputs, and add some inline comments
  Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.

Generate embeddings from CLAYModule trained with latlon/time encodings (#96)

* 🍻 Implement CLAYModule's predict_step to generate embeddings table
  Output embeddings to a geopandas.GeoDataFrame with columns 'source_url', 'date', 'embeddings', and 'geometry'. Essentially copying and adapting the code from a767164 in #73, but modifying how the encoder's masking is disabled, and how the mean/average of the embeddings is computed over a slice of the raw embeddings.
* 🚚 Rename output file to {MGRS}_{MINDATE}_{MAXDATE}_v{VERSION}.gpq
  The output GeoParquet file now has a filename with a format like "{MGRS:5}_{MINDATE:8}_{MAXDATE:8}_v{VERSION:3}.gpq", e.g. "12ABC_20210101_20231231_v001.gpq". Have implemented this in model_vit.py, and copied over the same `on_predict_epoch_end` method to model_clay.py. Also, we are no longer saving out the index column to the GeoParquet file.
* ✅ Fix failing test by updating to new output filename
  Forgot to update the filename in the unit test to conform to the new `{MGRS}_{MINDATE}_{MAXDATE}_v{VERSION}.gpq` format. Patches f19cf8f.
* ✅ Parametrized test to check CLAYModule's predict loop
  Splitting the previous integration test on the neural network model into separate fit and predict unit tests. Only testing the prediction loop of CLAYModule, because training/validating the model might be too much for CPU-based Continuous Integration. Also for testing CLAYModule, we are using 32-true precision instead of bf16-mixed, because `torch.cat` doesn't work with float16 tensors on the CPU, see pytorch/pytorch#100932 (should be fixed with PyTorch 2.2).
* ⏪ Save index column to GeoParquet file
  Decided that the index column might be good to keep for now, since it might help to speed up row counts? But we are resetting the index first before saving it. Partially reverts f19cf8f.
* ✅ Fix unit test to include index column
  After f1439e3, need to ensure that the index column is checked in the output geodataframe.
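
To make the "{MGRS:5}_{MINDATE:8}_{MAXDATE:8}_v{VERSION:3}.gpq" convention concrete, here is a small sketch of how such a filename could be assembled. The function name and its validation are assumptions made for illustration; the actual logic lives in `on_predict_epoch_end` and may differ.

```python
import datetime


def embeddings_filename(
    mgrs: str, min_date: datetime.date, max_date: datetime.date, version: int
) -> str:
    """Build a name like 12ABC_20210101_20231231_v001.gpq (hypothetical helper)."""
    if len(mgrs) != 5:
        raise ValueError(f"Expected a 5-character MGRS code, got {mgrs!r}")
    if min_date > max_date:
        raise ValueError("min_date must not be later than max_date")
    # Dates are written as 8-digit YYYYMMDD, the version as 3 zero-padded digits
    return f"{mgrs}_{min_date:%Y%m%d}_{max_date:%Y%m%d}_v{version:03d}.gpq"


print(
    embeddings_filename(
        mgrs="12ABC",
        min_date=datetime.date(2021, 1, 1),
        max_date=datetime.date(2023, 12, 31),
        version=1,
    )
)  # 12ABC_20210101_20231231_v001.gpq
```

Fixed-width fields keep the names sortable and easy to parse back by splitting on underscores.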

weiji14 self-assigned this Dec 7, 2023

weiji14 mentioned this pull request Dec 7, 2023

Send early sample of embeddings #35

Closed

📝 Mention in main README.md that embeddings are saved to geoparquet

f743b53

Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.

weiji14 marked this pull request as ready for review December 8, 2023 00:05

🎨 Update type hint of batch inputs, and add some inline comments

384650c

Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.

weiji14 merged commit decea30 into main Dec 8, 2023
2 checks passed

weiji14 deleted the embed-spatiotemporal-metadata branch December 8, 2023 00:36

This was referenced Dec 10, 2023

Embeddings with land use land cover fields, or other attributes #84

Closed

Rename embeddings file to include MGRS code and store GeoTIFF source_url #86

Merged

weiji14 mentioned this pull request Dec 20, 2023

Generate embeddings from CLAYModule trained with latlon/time encodings #96

Merged

6 tasks
