Save embeddings with spatiotemporal metadata to GeoParquet #73

Merged
weiji14 merged 4 commits into main from embed-spatiotemporal-metadata on Dec 8, 2023

Conversation

@weiji14 weiji14 commented Dec 7, 2023

What I am changing

  • Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file!

How I did it

  • In the LightningModule's predict_step, use geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. A sample table would look like this:

    date        embeddings            geometry
    2021-01-01  [0.1, 0.4, ... x768]  POLYGON(...)
    2021-06-30  [0.2, 0.5, ... x768]  POLYGON(...)
    2021-12-31  [0.3, 0.6, ... x768]  POLYGON(...)
  • The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray (TODO), and geometry is in WKB.

  • Each row would store the embedding for a single 256x256 chip, and the entire table could realistically store N rows for an entire MGRS tile (10000x10000) across different dates.

TODO in this PR:

  • Save embeddings to GeoParquet
  • Improve docstring

TODO in the future:

How you can test it

  • Locally, download some GeoTIFF data into the data/ folder, and then run:
python trainer.py fit --trainer.max_epochs=10 --trainer.precision=bf16-mixed --data.data_path=data/46REU --data.num_workers=4  # train the model
python trainer.py predict --ckpt_path=checkpoints/last.ckpt --data.batch_size=1024 --trainer.precision=bf16-mixed --data.num_workers=0  # generate embeddings
  • This should produce an embeddings_0.gpq file under the data/embeddings/ folder
  • Extra configuration options can be found using python trainer.py predict --help

To load the embeddings from the geoparquet file:

import geopandas as gpd

geodataframe: gpd.GeoDataFrame = gpd.read_parquet(path="embeddings_0.gpq")
assert geodataframe.shape == (755, 3)
print(geodataframe)
        date	        embeddings	                                        geometry
0	2022-12-12	[-1.1094263, 1.0212796, -0.58915687, -1.144523...	POLYGON ((93.02647 30.71001, 93.02648 30.73311...
1	2022-12-12	[-1.1253564, 1.0260286, -0.5860151, -1.1528502...	POLYGON ((93.34729 30.70955, 93.34738 30.73265...
2	2022-12-12	[-1.1190275, 1.0268829, -0.59865385, -1.147052...	POLYGON ((93.74777 30.63856, 93.74794 30.66166...
3	2022-12-12	[-1.1115837, 1.0286477, -0.60599935, -1.143061...	POLYGON ((93.80119 30.63824, 93.80138 30.66134...
4	2022-12-12	[-1.1172316, 1.0246403, -0.59833527, -1.143900...	POLYGON ((93.82790 30.63808, 93.82810 30.66118...
...	...	...	...
750	2022-12-12	[-1.11294, 1.0265714, -0.6015097, -1.1443343, ...	POLYGON ((93.40048 30.64010, 93.40057 30.66320...
751	2022-12-12	[-1.1207774, 1.029693, -0.5964609, -1.1490294,...	POLYGON ((93.45391 30.63992, 93.45402 30.66302...
752	2022-12-12	[-1.1309807, 1.0274287, -0.57653224, -1.162805...	POLYGON ((93.58748 30.63939, 93.58762 30.66249...
753	2022-12-12	[-1.1268965, 1.0305986, -0.59025705, -1.154876...	POLYGON ((93.61420 30.63926, 93.61434 30.66236...
754	2022-12-12	[-1.1171025, 1.0268872, -0.60177326, -1.146309...	POLYGON ((93.69434 30.63886, 93.69450 30.66196...

755 rows × 3 columns

If you have a newer version of QGIS, it's also possible to load the GeoParquet file directly. The screenshot below shows the bounding box locations of the 755 embeddings (1 embedding for each 256x256 chip):

[Screenshot: QGIS map view of the 755 embedding bounding boxes]

Related Issues

Extends #56, continuation of #66.

Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.
@weiji14 weiji14 self-assigned this Dec 7, 2023
Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table of what the tabular data looks like.
Comment on lines +211 to +213
"embeddings": pa.FixedShapeTensorArray.from_numpy_ndarray(
embeddings_mean.cpu().detach().__array__()
),

@weiji14 weiji14 Dec 7, 2023

Although we've converted the embedding into a FixedShapeTensorArray here, pandas/geopandas still interprets this column as an object dtype, and this is saved as an object dtype to the parquet file too (see the unit test). Need to see if there's a way to preserve the dtype.

@weiji14 weiji14 Dec 7, 2023

Found a way to save this embeddings column as a FixedShapeTensorArray dtype instead of an object dtype like so:

Suggested change
- "embeddings": pa.FixedShapeTensorArray.from_numpy_ndarray(
-     embeddings_mean.cpu().detach().__array__()
- ),
+ "embeddings": gpd.pd.arrays.ArrowExtensionArray(
+     values=pa.FixedShapeTensorArray.from_numpy_ndarray(embeddings)
+ ),

However, while we can save this FixedShapeTensorArray to GeoParquet, loading this embeddings column as a FixedShapeTensorArray is challenging, and might involve code that looks like this:

geodataframe: gpd.GeoDataFrame = gpd.read_parquet(
    path="data/embeddings/embeddings_0.gpq",
    schema=pa.schema(
        fields=[
            pa.field(
                name="embeddings",
                type=pa.fixed_shape_tensor(
                    value_type=pa.float32(), shape=[768]
                ),
            ),
            pa.field(name="geometry", type=pa.binary()),
        ]
    ),
)

But this technically still results in an embeddings column with object dtype... Also, QGIS can load this geoparquet file with FixedShapeTensorArray, but would crash when you try to open the attribute table, because it can't handle FixedShapeTensorArray yet. So probably best to keep it in object dtype for now.
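
If the column stays in object dtype, one possible way to get a 2-D array back for downstream work is to stack the per-row vectors. This is only a sketch: the pandas Series below is a hypothetical stand-in for the embeddings column of the loaded GeoDataFrame, filled with constant values.

```python
import numpy as np
import pandas as pd

# Stand-in for geodataframe["embeddings"]: one 768-long vector per row,
# held in an object-dtype column.
column = pd.Series(
    [
        np.full(768, 0.1, dtype=np.float32),
        np.full(768, 0.2, dtype=np.float32),
    ],
    name="embeddings",
)
print(column.dtype)  # object

# Stack the per-row vectors into one (rows, 768) float32 matrix.
matrix = np.stack(column.to_numpy())
print(matrix.shape, matrix.dtype)
```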

Comment on lines +228 to +230
outpath = f"{outfolder}/embeddings_{batch_idx}.gpq"
gdf.to_parquet(path=outpath, schema_version="1.0.0")
print(f"Saved embeddings of shape {tuple(embeddings_mean.shape)} to {outpath}")

It is now possible to save several rows' worth of embeddings to a single geoparquet file, so we can decide how to lump embeddings together, e.g. save all the embeddings for one MGRS tile in one year together.

New 512x512 image chips are being processed now-ish, see #76 (comment). Will use a new filename convention in a follow up PR (with the MGRS code in it) once we've got a new model trained on that new dataset.

Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.
@weiji14 weiji14 marked this pull request as ready for review December 8, 2023 00:05
Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.

weiji14 commented Dec 8, 2023

There are a couple of things that can be improved as mentioned above, such as the file naming scheme and streamlining how the embeddings are saved to the GeoParquet file, but I will merge this in first and handle those nice-to-haves in follow-up PRs.

@weiji14 weiji14 merged commit decea30 into main Dec 8, 2023
2 checks passed
@weiji14 weiji14 deleted the embed-spatiotemporal-metadata branch December 8, 2023 00:36
weiji14 added a commit that referenced this pull request Dec 20, 2023
Output embeddings to a geopandas.GeoDataFrame with columns 'source_url', 'date', 'embeddings', and 'geometry'. Essentially copying and adapting the code from a767164 in #73, but modifying how the encoder's masking is disabled, and how the mean/average of the embeddings is computed over a slice of the raw embeddings.
brunosan pushed a commit that referenced this pull request Dec 27, 2023
* ✨ Save embeddings with spatiotemporal metadata to GeoParquet

* 📝 Document how embeddings are generated and saved to geoparquet

* 📝 Mention in main README.md that embeddings are saved to geoparquet

* 🎨 Update type hint of batch inputs, and add some inline comments

weiji14 added a commit that referenced this pull request Jan 12, 2024
#96)

* 🍻 Implement CLAYModule's predict_step to generate embeddings table

* 🚚 Rename output file to {MGRS}_{MINDATE}_{MAXDATE}_v{VERSION}.gpq

The output GeoParquet file now has a filename with a format like "{MGRS:5}_{MINDATE:8}_{MAXDATE:8}_v{VERSION:3}.gpq", e.g. "12ABC_20210101_20231231_v001.gpq". Have implemented this in model_vit.py, and copied over the same `on_predict_epoch_end` method to model_clay.py. Also, we are no longer saving out the index column to the GeoParquet file.

* ✅ Fix failing test by updating to new output filename

Forgot to update the filename in the unit test to conform to the new `{MGRS}_{MINDATE}_{MAXDATE}_v{VERSION}.gpq` format. Patches f19cf8f.

* ✅ Parametrized test to check CLAYModule's predict loop

Splitting the previous integration test on the neural network model into separate fit and predict unit tests. Only testing the prediction loop of CLAYModule, because training/validating the model might be too much for CPU-based Continuous Integration. Also for testing CLAYModule, we are using 32-true precision instead of bf16-mixed, because `torch.cat` doesn't work with float16 tensors on the CPU, see pytorch/pytorch#100932 (should be fixed with PyTorch 2.2).

* ⏪ Save index column to GeoParquet file

Decided that the index column might be good to keep for now, since it might help to speed up row counts? But we are resetting the index first before saving it. Partially reverts f19cf8f.

* ✅ Fix unit test to include index column

After f1439e3, need to ensure that the index column is checked in the output geodataframe.
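
The `{MGRS}_{MINDATE}_{MAXDATE}_v{VERSION}.gpq` naming convention mentioned in the commit list above could be sketched roughly like this. The function name and its parameters are hypothetical and not taken from the repository; only the filename pattern and the example output come from the commit message.

```python
import datetime


def make_embeddings_filename(
    mgrs: str, mindate: datetime.date, maxdate: datetime.date, version: int
) -> str:
    """Build a filename like 12ABC_20210101_20231231_v001.gpq (hypothetical helper)."""
    # Dates are written as 8 digits (YYYYMMDD), the version as 3 zero-padded digits.
    return f"{mgrs}_{mindate:%Y%m%d}_{maxdate:%Y%m%d}_v{version:03d}.gpq"


filename = make_embeddings_filename(
    mgrs="12ABC",
    mindate=datetime.date(2021, 1, 1),
    maxdate=datetime.date(2023, 12, 31),
    version=1,
)
print(filename)  # 12ABC_20210101_20231231_v001.gpq
```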