
Rename embeddings file to include MGRS code and store GeoTIFF source_url #86

Merged
merged 5 commits into main from embeddings/rename-file-with-mgrs-code on Dec 19, 2023

Conversation


@weiji14 weiji14 commented Dec 11, 2023

What I am changing

  • Improve usability of the GeoParquet embeddings file by:
    1. Renaming the file from a generic embeddings_0.gpq to a format like {MGRS}_v{VERSION}.gpq as suggested at Send early sample of embeddings #35 (comment)
    2. Storing a URL to the source GeoTIFF file used to create the embedding, e.g. s3://.../.../claytile_32VLM_20221119_v02_0200.tif, for better provenance

How I did it

  • In the LightningDataModule's datapipe, return a source_url for each GeoTIFF file being loaded

  • In the LightningModule's predict_step, create a source_url column in the geopandas.GeoDataFrame (in addition to the three columns previously added in Save embeddings with spatiotemporal metadata to GeoParquet #73). A sample table would look like this:

    | source_url                  | date       | embeddings           | geometry     |
    |------------------------------|------------|----------------------|--------------|
    | s3://.../.../claytile_*.tif  | 2021-01-01 | [0.1, 0.4, ... x768] | POLYGON(...) |
    | s3://.../.../claytile_*.tif  | 2021-06-30 | [0.2, 0.5, ... x768] | POLYGON(...) |
    | s3://.../.../claytile_*.tif  | 2021-12-31 | [0.3, 0.6, ... x768] | POLYGON(...) |
  • The source_url column is stored in the string[pyarrow] format (which will be the default in Pandas 3.0 per PDEP-10); see the sketch after this list

  • Each row stores the embeddings for a single 512x512 chip, and the entire table could realistically hold N rows for an entire MGRS tile (roughly 10000x10000 pixels) across different dates.
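
A minimal sketch of how such a table could be assembled inside predict_step (the column names follow this PR; the sample values, CRS, and per-batch inputs are illustrative assumptions rather than the actual implementation):

import geopandas as gpd
import pandas as pd
import shapely.geometry

# Hypothetical per-batch inputs: source_urls and dates flow through the
# datapipe alongside the imagery, embeddings come out of the model.
source_urls = ["s3://.../.../claytile_32VLM_20221119_v02_0200.tif"]
dates = ["2022-11-19"]
embeddings = [[0.1, 0.4, 0.7]]  # in practice, 768-dimensional vectors
geometries = [shapely.geometry.box(minx=5.46, miny=60.34, maxx=5.56, maxy=60.39)]

gdf = gpd.GeoDataFrame(
    data={
        # string[pyarrow] dtype, the future default per PDEP-10
        "source_url": pd.Series(data=source_urls, dtype="string[pyarrow]"),
        "date": pd.to_datetime(dates),
        "embeddings": embeddings,
    },
    geometry=geometries,
    crs="OGC:CRS84",  # the GeoParquet default CRS
)
gdf.to_parquet(path="32VLM_v01.gpq", schema_version="1.0.0", compression="ZSTD")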

TODO in this PR:

  • Save source_url column to GeoParquet file
  • Rename embeddings file to a format like {MGRS}_{VERSION}.gpq
  • Refactor to allow multiple workers instead of 1 worker

TODO in the future:

  • Sort by ascending date, and remove extra index column?
  • Improve the logic of the LightningModule's prediction loop to enable appending to an existing MGRS geoparquet file?

How you can test it

  • Set up credentials to access the AWS S3 bucket at s3://clay-tiles-02/02/
  • Run the following commands (ideally in an AWS EC2 instance on us-east-1 where the GeoTIFF files are stored):
    # Train the model
    python trainer.py fit --trainer.max_epochs=10 \
                          --trainer.precision=bf16-mixed \
                          --data.data_path=s3://clay-tiles-02/02/32VLM \
                          --data.num_workers=8
    # Generate embeddings GeoParquet file
    python trainer.py predict --ckpt_path=checkpoints/last.ckpt \
                              --trainer.precision=bf16-mixed \
                              --data.batch_size=1024 \
                              --data.data_path=s3://clay-tiles-02/02/32VLM \
                              --data.num_workers=0
    • This should produce a geoparquet file named 32VLM_v01.gpq under the data/embeddings/ folder
  • Extra configuration options can be found using python trainer.py predict --help

To load the embeddings from the geoparquet file:

import geopandas as gpd

geodataframe: gpd.GeoDataFrame = gpd.read_parquet(path="32VLM_v01.gpq")
assert geodataframe.shape == (823, 4)  # 823 rows, 4 columns
print(geodataframe)
	source_url	                                        date	        embeddings	                                        geometry
0	s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...	2017-05-19	[-1.0804343, -1.1861055, 0.2579711, -1.1242834...	POLYGON ((5.46822 60.34364, 5.46324 60.38953, ...
1	s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...	2017-05-19	[-1.081955, -1.1901798, 0.2592258, -1.1241777,...	POLYGON ((5.56081 60.34607, 5.55596 60.39196, ...
2	s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...	2017-05-19	[-1.0853468, -1.1995519, 0.26269174, -1.127272...	POLYGON ((5.65341 60.34844, 5.64870 60.39433, ...
3	s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...	2017-05-19	[-1.0773537, -1.1837404, 0.25767463, -1.119480...	POLYGON ((5.74603 60.35074, 5.74145 60.39663, ...
4	s3://clay-tiles-02/02/32VLM/2017-05-19/claytil...	2017-05-19	[-1.0771247, -1.187013, 0.26040226, -1.124507,...	POLYGON ((5.83867 60.35297, 5.83421 60.39887, ...
...	...	...	...	...
818	s3://clay-tiles-02/02/32VLM/2019-08-27/claytil...	2019-08-27	[-1.0937738, -1.1862404, 0.26832822, -1.123034...	POLYGON ((7.18770 59.45848, 7.18524 59.50443, ...
819	s3://clay-tiles-02/02/32VLM/2019-08-27/claytil...	2019-08-27	[-1.0931807, -1.1811237, 0.26974052, -1.117826...	POLYGON ((7.27798 59.45970, 7.27564 59.50566, ...
820	s3://clay-tiles-02/02/32VLM/2019-08-27/claytil...	2019-08-27	[-1.0908315, -1.1857345, 0.26635545, -1.121208...	POLYGON ((7.36827 59.46086, 7.36605 59.50682, ...
821	s3://clay-tiles-02/02/32VLM/2022-11-19/claytil...	2022-11-19	[-1.0904396, -1.2076643, 0.26954767, -1.134142...	POLYGON ((7.24451 60.10306, 7.24206 60.14901, ...
822	s3://clay-tiles-02/02/32VLM/2022-11-19/claytil...	2022-11-19	[-1.0872881, -1.2177591, 0.27005005, -1.140369...	POLYGON ((6.42729 59.95170, 6.42372 59.99763, ...

823 rows × 4 columns

Related Issues

Follow-up to #73, addresses #35 (comment)

Passing the URL or path of the GeoTIFF file through the datapipe, and into the model's prediction loop. The geopandas.GeoDataFrame now has an extra 'source_url' string column, and this is saved to the GeoParquet file too.
@weiji14 weiji14 added this to the v0 Release milestone Dec 11, 2023
@weiji14 weiji14 self-assigned this Dec 11, 2023
For each MGRS code (e.g. 12ABC), save a GeoParquet file with a name formatted like `{MGRS:5}_v{VERSION:2}.gpq`, e.g. 12ABC_v01.gpq. Have updated the unit test to check that rows with different MGRS codes are saved to different files.
Using ZStandard compression instead of Parquet's default Snappy compression. This should result in slightly smaller filesizes, and slightly faster data transfer and compression (especially over the network). Also changed an assert statement to an if-then-raise instead.
Speed up embedding generation by enabling multiple workers to fetch and load mini-batches of GeoTIFF files independently and run the prediction. The predictions or generated embeddings from each worker (a geopandas.GeoDataFrame) are then concatenated together row-wise before being passed to the GeoParquet output step. This is done via the LightningModule's `on_predict_epoch_end` hook. Also documented these new processing steps in the docstring.
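
A rough sketch of that gather-and-concatenate pattern (the two Lightning hooks are real, assuming Lightning 2.x imports; the helper methods and the list attribute are assumptions standing in for the actual module code):

import geopandas as gpd
import lightning as L
import pandas as pd


class EmbeddingPredictor(L.LightningModule):
    """Sketch: collect per-batch GeoDataFrames, then save them once per epoch."""

    def __init__(self):
        super().__init__()
        self.gdfs: list[gpd.GeoDataFrame] = []

    def predict_step(self, batch, batch_idx):
        gdf = self.embeddings_to_geodataframe(batch)  # hypothetical helper
        self.gdfs.append(gdf)
        return gdf

    def on_predict_epoch_end(self):
        # Concatenate the mini-batch results row-wise before writing GeoParquet file(s)
        gdf = gpd.GeoDataFrame(pd.concat(objs=self.gdfs, ignore_index=True))
        self.save_to_geoparquet(gdf=gdf)  # hypothetical output routine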
Comment on lines +292 to +293
# Output to a GeoParquet filename like {MGRS:5}_v{VERSION:2}.gpq
outpath = f"{outfolder}/{mgrs_code}_v01.gpq"
weiji14 (Contributor Author)

@mattpaul, are you ok with a filename like 12ABC_v01.gpq? We could also add a prefix like embedding_ or MGRS_ if you prefer.

@mattpaul mattpaul Dec 13, 2023

This convention looks good. Will all the chips for a given MGRS tile be contained in a single GeoParquet file? Assuming something like date or build number goes into outfolder?

weiji14 (Contributor Author)

Cool, I'll stick with that filename convention then!

> Will all the chips for a given MGRS tile be contained in a single GeoParquet file? Assuming something like date or build number goes into outfolder?

Yes, each file (e.g. 32VLM_v01.gpq) would contain all the embeddings for every chip within that MGRS tile. It will also contain multiple dates, so you could have overlapping chips in any one spatial area due to images taken on different dates. The easiest way might be to show you what this looks like in QGIS:

[animated GIF "Peek 2023-12-13 18-19": screen recording of the chip geometries displayed in QGIS]

I've set 75% transparency for each green chip, and you'll notice that some chip areas are lighter in colour (so only 1 embedding from 1 date), whereas others are darker in colour from multiple overlaps (2 embeddings from 2 dates).

Not sure what you mean by build number. Do you mean the _v01 part?

@mattpaul mattpaul Dec 13, 2023

OK, cool, I see... the animated gif helps, thanks!

@weiji14 from our previous discussion around having {DATE} in the file name I thought that embeddings from the same tile/chip geometry taken at different dates would be serialized into separate vector embedding files which might be preferable considering we may want to incrementally add to the set of vector embeddings over time.

@leothomas thoughts re: pros vs cons of combining multiple embeddings from different dates / times into a single GeoParquet export file per tile across all time? I don't know if we plan to ever want to train the model on all 7 years of available Sentinel-2 data but as a thought experiment for the sake of argument that would yield a rather large file that increases in size over time. Contrast that with an incremental approach in which the number of GeoParquet file objects in S3 would grow over time as the model learns more and more data but the individual file size is expected to remain static which makes it easier to scale ingestion horizontally.

re: build number - I am thinking about the model training lifecycle. If we zoom out so to speak and consider Clay's roadmap over time w.r.t. model training considering the availability of new Sentinel data products, etc. we will likely want to re-train the model periodically over time.

Perhaps once a month or once a quarter we may want to refresh the training dataset with the latest data available from Sentinel-1,2 or perhaps train on new sensor data as well in addition to the Sentinel data products.

We will want to be able to distinguish between old builds or old model version vs new versions because parts of our distributed system will need to continue working with old version of data, embeddings and map image tile assets while the next version is being processed.

Hence we'll want something like a build number or model version to uniquely identify each along the way. Similar to how OpenAI has tagged versions of GPT, etc.

I assumed that _v01 was meant to track our embedding file format (which has already changed quite a few times if I'm not mistaken), or was that intended to track model version?

weiji14 (Contributor Author)

> @weiji14 from our previous discussion around having {DATE} in the file name I thought that embeddings from the same tile/chip geometry taken at different dates would be serialized into separate vector embedding files which might be preferable considering we may want to incrementally add to the set of vector embeddings over time.
>
> @leothomas thoughts re: pros vs cons of combining multiple embeddings from different dates / times into a single GeoParquet export file per tile across all time? I don't know if we plan to ever want to train the model on all 7 years of available Sentinel-2 data but as a thought experiment for the sake of argument that would yield a rather large file that increases in size over time. Contrast that with an incremental approach in which the number of GeoParquet file objects in S3 would grow over time as the model learns more and more data but the individual file size is expected to remain static which makes it easier to scale ingestion horizontally.

Just to do some quick math. One embedding file for an MGRS tile (roughly 3 dates, ~1000 rows in total) is about 3.0MB. Let's say there are 73 dates in a year (365/5-day revisit), and 7 years (2017-2023), so $3.0\text{MB} \times 73 \times 7 = 1533\text{MB} = 1.5\text{GB}$. This is likely an overestimate since Parquet's columnar nature allows for good compression, but let's assume 1.5GB, is that too big a file to ingest? We'll have 56984 MGRS tiles for the whole world, so that would be $1.5\text{GB} \times 56984 \text{ files} = 85.5\text{TB}$ of embeddings in total to process (again, likely a gross overestimate, as we won't process everything).

What if we named the file like {MGRS}_{MINDATE}_{MAXDATE}_{VERSION}.gpq, e.g. 32VLM_20170517_20221119_v01.gpq? Storing 1 date per file would result in too many files, but we can at least store a range of dates (e.g. annual, every two years, etc.). Using a range-based {MINDATE}_{MAXDATE} naming scheme (see the sketch after this list) would:

  1. Allow someone to quickly find the relevant files to process based on a YYYYMMDD date just by looking at the filenames, and
  2. Be fairly compatible with any changes in the range of dates we want to store in one file, compared to say, a naming scheme with just one date like {YYYYMMDD}.
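
A minimal sketch (assuming the date column can be parsed with pandas, and with a hard-coded MGRS code purely for illustration) of deriving such a {MGRS}_{MINDATE}_{MAXDATE}_{VERSION}.gpq filename from an existing embeddings file:

import geopandas as gpd
import pandas as pd

gdf: gpd.GeoDataFrame = gpd.read_parquet(path="32VLM_v01.gpq")

mgrs_code = "32VLM"  # illustrative; in practice parsed from the filename or source_url
dates = pd.to_datetime(gdf.date)
outpath = f"{mgrs_code}_{dates.min():%Y%m%d}_{dates.max():%Y%m%d}_v01.gpq"
print(outpath)  # e.g. 32VLM_20170517_20221119_v01.gpq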

> re: build number - I am thinking about the model training lifecycle. If we zoom out so to speak and consider Clay's roadmap over time w.r.t. model training considering the availability of new Sentinel data products, etc. we will likely want to re-train the model periodically over time.
>
> Perhaps once a month or once a quarter we may want to refresh the training dataset with the latest data available from Sentinel-1,2 or perhaps train on new sensor data as well in addition to the Sentinel data products.
>
> We will want to be able to distinguish between old builds or old model version vs new versions because parts of our distributed system will need to continue working with old version of data, embeddings and map image tile assets while the next version is being processed.
>
> Hence we'll want something like a build number or model version to uniquely identify each along the way. Similar to how OpenAI has tagged versions of GPT, etc.
>
> I assumed that _v01 was meant to track our embedding file format (which has already changed quite a few times if I'm not mistaken), or was that intended to track model version?

Gotcha, so we want a way to track both model versions, and schema revisions to the embedding file itself. I was maybe thinking of using v01 to track both, but we could probably do either:

  1. In the filename, have two parts like {MODEL-VERSION}-{SCHEMA-REVISION}, e.g. _vit-1-abcde_v001.gpq, where vit-1-abcde would mean Vision Transformer model 1 (with abcde being the hash of the trained model), and v001 would be the schema version (i.e. what the columns and metadata inside the parquet file look like). The three digits could almost be SemVer-like, so 001 would mean schema v0.0.1.
  2. Have just {MODEL_VERSION} in the filename e.g. _vit-1-abcde.gpq, and store the schema revision number internally in the Parquet file's metadata.
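
For option 2, a minimal sketch (assuming pyarrow, with a hypothetical metadata key and value) of storing the schema revision in the Parquet file's own key-value metadata:

import pyarrow.parquet as pq

# Read an existing GeoParquet file, add a custom key to the schema-level
# metadata (alongside the 'geo' key that GeoParquet itself uses), and rewrite.
table = pq.read_table(source="32VLM_v01.gpq")
metadata = dict(table.schema.metadata or {})
metadata[b"embedding_schema_version"] = b"0.0.1"  # hypothetical key and value
pq.write_table(
    table=table.replace_schema_metadata(metadata),
    where="32VLM_v01.gpq",
    compression="ZSTD",
)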

What do you think?

@mattpaul mattpaul

Thanks for the quick back-of-the-napkin math. While there probably isn't a hard ceiling on what constitutes "too big" for our primary use case (vector ingestion service), remember too that folks might want to open and work with these GeoParquet files from within Jupyter notebooks or browser-based JavaScript environments with resource limitations.

I would say yes, let's include both {MODEL-VERSION} and {SCHEMA-REVISION} in the Parquet metadata for sure. It might also make sense to include {MODEL-VERSION} in the filename, or rather the pathname, like so:

{outfolder}/{MODEL-VERSION}/{MGRS_CODE}.gpq

We could add DATE to the pathname as well; perhaps after MODEL-VERSION is best:

{MODEL-VERSION}/{DATE}/{MGRS_CODE}.gpq

I like your idea around date range. Hopefully your code wouldn't have to do a table scan of all embeddings to determine MINDATE, MAXDATE... remember that could be 1.5GB!

If date range is available / doable, that could look like:

{MODEL-VERSION}/{MINDATE}_{MAXDATE}/{MGRS_CODE}.gpq

Yeah, that's looking good to me!

Which begs the question - what are the typical use cases we foresee w.r.t. exporting vector embeddings? I imagine during development you might hard-code things to only work with a subset such as one-tile, but how do we expect to do things in production on a regular basis?

Would export be something done on demand by an engineer/operator from the command line or API request, etc. specifying parameters for desired MGRS tile(s) or date range?

Also, an aside, are we generating the GeoParquet file to scratch disk space and then uploading to S3? or is it generated in memory? The reason I ask is that when I make "file name" suggestions I am specifically thinking of the S3 object's file name to be clear. Feel free to name things on disk any which way you like or makes the most sense.

@mattpaul mattpaul

Hmm... or does it make more sense for {MGRS} to come before {DATE}?

Not sure how often someone will be browsing the S3 bucket we use for ingestion to look for a specific MGRS tile's embeddings, but that might be a viable use case for data scientists who know the MGRS code but not the date range a priori; in which case, yeah, MGRS should probably come first left-to-right.

{MODEL-VERSION}/{MGRS}/{MINDATE}_{MAXDATE}.gpq

weiji14 (Contributor Author)

Sorry for the late reply. To summarize real quick, I think going with {MGRS}_{MINDATE}_{MAXDATE}.gpq for the filename would be best for now, and {MODEL-VERSION} will definitely be used at the folder level (i.e. a path like {MODEL-VERSION}/{MGRS}_{MINDATE}_{MAXDATE}.gpq). I'll need to have a think about whether to include the model version and schema revision in the filename too, but will do that in a separate PR.

> Which begs the question - what are the typical use cases we foresee w.r.t. exporting vector embeddings? I imagine during development you might hard-code things to only work with a subset such as one-tile, but how do we expect to do things in production on a regular basis?
>
> Would export be something done on demand by an engineer/operator from the command line or API request, etc. specifying parameters for desired MGRS tile(s) or date range?
>
> Also, an aside, are we generating the GeoParquet file to scratch disk space and then uploading to S3? or is it generated in memory? The reason I ask is that when I make "file name" suggestions I am specifically thinking of the S3 object's file name to be clear. Feel free to name things on disk any which way you like or makes the most sense.

Just to be clear, the neural network model is not likely something we update frequently, since it costs a lot of time and money to train a Foundation Model on lots of data. As such, you might only see a new model version every 3 months, or every 6 months, depending on what budget there is.

As for generating the embeddings (aka the embedding factory), this can be done in multiple ways, depending on what vector database you've decided upon and what downstream applications you have in mind:

  • Currently we are processing each MGRS tile in batches, and doing a bulk export of embeddings to a GeoParquet file. For applications such as similarity search where you need a vector database of indexed embeddings to search over, a bulk batch method like this makes sense. E.g. if you want to find images of all the solar panels in the world.
  • Alternatively, if you want to generate embeddings on demand via an API request, that may not even require setting up a vector database capable of storing millions of rows. For example, if you want to visualize embeddings for 1 MGRS tile over 5 years as a time-series, it might be better to set up something like HuggingFace's Inference API to do this, and just get a hundred rows out.

So my question is, are you looking at bulk scale embedding generation (100k+ rows), or small scale embedding generation (100s of rows)?

# Output to a GeoParquet filename like {MGRS:5}_v{VERSION:2}.gpq
outpath = f"{outfolder}/{mgrs_code}_v01.gpq"
_gdf: gpd.GeoDataFrame = gdf.loc[mgrs_codes == mgrs_code]
_gdf.to_parquet(path=outpath, schema_version="1.0.0", compression="ZSTD")
weiji14 (Contributor Author)

Changed the Parquet compression from the default 'SNAPPY' to 'ZSTD' here. I ran some quick benchmarks using the code at https://github.com/jtmiclat/geoparquet-compression-benchmark/blob/main/benchmark.py, and it seems like ZSTD results in a smaller file size and slightly faster read speeds (decompression).

We can also tweak the compression ratio and other options (see https://arrow.apache.org/docs/13.0/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table), but this seems to be good enough for now, given that we'll be ingesting these GeoParquet files into some vector database like Postgres+pgvector.
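
As an illustrative sketch only (this PR keeps the other pyarrow defaults), extra pyarrow.parquet.write_table options such as compression_level can be forwarded through GeoDataFrame.to_parquet's keyword arguments:

import geopandas as gpd

gdf: gpd.GeoDataFrame = gpd.read_parquet(path="32VLM_v01.gpq")

# ZSTD levels range from 1 (fastest) to 22 (smallest output); 3 is zstd's default.
gdf.to_parquet(
    path="32VLM_v01.gpq",
    schema_version="1.0.0",
    compression="ZSTD",
    compression_level=3,
)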

@weiji14 weiji14 marked this pull request as ready for review December 19, 2023 00:13
@weiji14 weiji14 (Contributor Author) commented Dec 19, 2023

Gonna merge this in first, and work on integrating with the new model (#47) next.

@weiji14 weiji14 merged commit 776cce8 into main Dec 19, 2023
2 checks passed
@weiji14 weiji14 deleted the embeddings/rename-file-with-mgrs-code branch December 19, 2023 00:15
brunosan pushed a commit that referenced this pull request Dec 27, 2023
Rename embeddings file to include MGRS code and store GeoTIFF source_url (#86)

* 🗃️ Store source_url of GeoTIFF to GeoParquet file
* 🚚 Save one GeoParquet file for each unique MGRS tile
* ⚡ Save GeoParquet file with ZSTD compression
* ♻️ Predict with multiple workers and gather results to save