
Batch setup #54

Merged
merged 6 commits into from
Dec 6, 2023

Conversation

yellowcap
Member

@yellowcap yellowcap commented Nov 24, 2023

This is the version that we used to run our first deployment at scale.

What changes

Main pipeline logic changes:

  • Only concatenate S2, DEM, and S1 arrays for each tile during tiling. This saves a lot of memory, as S2 can be kept in Uint16.
  • Use indexing on the arrays instead of the .sel function. Indexing is almost instant, while .sel is slow for large arrays.
  • Make the target bucket name an argument.
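The indexing change can be sketched in isolation. This is a minimal illustration, not the pipeline's actual code; the array shape and slice bounds are made up:

```python
import numpy as np
import xarray as xr

# Hypothetical band stack; sizes are illustrative only.
da = xr.DataArray(
    np.zeros((3, 512, 512), dtype="uint16"),
    dims=("band", "y", "x"),
)

# Positional indexing: near-instant, no coordinate lookup,
# and the uint16 dtype is preserved.
y_start, y_end, x_start, x_end = 0, 256, 128, 384
subset = da[:, y_start:y_end, x_start:x_end]
print(subset.shape)  # (3, 256, 256)
```

The equivalent label-based `da.sel(...)` call has to resolve coordinate labels first, which is where the slowdown on large arrays comes from.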

Organisational changes:

  • Add info about how to run the pipeline on batch.
  • Add mgrs sample data as geopackage.

Comment on lines +81 to +84
parts = [part[:, y_start:y_end, x_start:x_end] for part in stack]

# Only concat here to save memory, it converts S2 data to float
tile = xr.concat(parts, dim="band").rename("tile")
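The dtype promotion noted in the comment can be demonstrated in isolation (a minimal sketch with made-up shapes, not the pipeline's arrays):

```python
import numpy as np
import xarray as xr

# Mixing a uint16 array (S2) with a float32 array (e.g. S1) promotes
# the concatenated result to float, doubling memory for the S2 bands.
s2 = xr.DataArray(np.ones((2, 4, 4), dtype="uint16"), dims=("band", "y", "x"))
s1 = xr.DataArray(np.ones((1, 4, 4), dtype="float32"), dims=("band", "y", "x"))

tile = xr.concat([s2, s1], dim="band")
print(tile.dtype)  # float32
```

Deferring the concat until the tile is already cropped keeps the promoted float array small.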
Contributor

@yellowcap or @lillythomas, can you double check the Sentinel-1 pixel values? I've been getting NaN loss values, and tracked it down to the Sentinel-1 VV and VH bands being all NaNs. Sentinel-2 and DEM bands are ok. Sample code:

import os
import matplotlib.pyplot as plt
import rioxarray

os.environ["AWS_PROFILE"] = "xxx"

da = rioxarray.open_rasterio(
    filename="s3://clay-tiles-01/01/56HKH/2022-06-18/claytile-56HKH-2022-06-18-01-1.tif"
)
da = da.compute()

# %%
# Plot different bands
da.sel(band=1).plot.imshow()  # Blue
plt.show()
da.sel(band=10).plot.imshow()  # SWIR2
plt.show()
da.sel(band=11).plot.imshow()  # VH
plt.show()
da.sel(band=12).plot.imshow()  # VV
plt.show()
da.sel(band=13).plot.imshow()  # DEM
plt.show()

[Plot outputs: Blue, SWIR2, VH, VV, and DEM bands]

Contributor

Tried a few other tiles in other locations, like MGRS 29SMD over Lisbon, and it seems the DEM can contain NaNs as well (in places over the ocean?). So maybe check through all of those bands.

Member Author

Good catch! Looks like the nodata filter is missing checks on S1 and DEM. That should be easy to correct.

Contributor

@weiji14 weiji14 Dec 4, 2023

Ok, looks like Lilly will be fixing this in #60.

Member Author

Converted to an issue for tracking: #68

@weiji14 weiji14 added the data-pipeline Pull Requests about the data pipeline label Nov 29, 2023
@weiji14 weiji14 added this to the v0 Release milestone Nov 29, 2023
Comment on lines +89 to +91
counter += 1
if counter % 100 == 0:
    print(f"Counted {counter} tiles")
Contributor

@weiji14 weiji14 Dec 4, 2023

When saving the TIF files, can the counter part of the filename be zero-padded to a consistent length? The previous batch had filenames like claytile-......-1.tif, claytile-......-2.tif, ... claytile-......-20.tif, claytile-......-300.tif. Having a consistent length with numbers like 001, 002, 020, 300 would make sorting easier, and allow us to parse out the date using a reverse index (e.g. filepath[-21:-11]).
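The zero-padding and reverse-index parsing can be sketched as follows (the MGRS code, date, and version here are example values, not an actual tile):

```python
# Hypothetical filename pieces; the zero-padded counter is the point.
mgrs, date, version, counter = "56HKH", "2022-06-18", "01", 20

name = f"claytile-{mgrs}-{date}-{version}-{counter:03d}.tif"
print(name)  # claytile-56HKH-2022-06-18-01-020.tif

# With a fixed-width counter, the date sits at a fixed reverse index.
print(name[-21:-11])  # 2022-06-18
```

Without the padding, `name[-21:-11]` shifts whenever the counter gains a digit, which is what breaks the reverse-index parse.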

Contributor

Alternatively, if we can store the actual timestamp in the metadata of the GeoTIFF, that would be even better!


Contributor

Should this geopackage be stored on git? Or just on the s3 bucket?

Member Author

#71

@@ -115,12 +105,8 @@ def tiler(stack, date, mgrs):
with rasterio.open(name, "r+") as rst:
Contributor

Regarding the filename at about L102, @mattpaul had some suggestions at #35 (comment) to:

  1. Use underscore as the delimiter between MGRS/DATE/VERSION/COUNTER
  2. Preface the VERSION with a v (e.g. v0)

Also mentioned at #44 (comment) that we could drop the hyphen in the date, i.e. YYYY-MM-DD becomes YYYYMMDD, but that didn't seem to have been applied in the initial batch.
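Taken together, the suggested convention might look like this (all values here are hypothetical, and the final format is still up for discussion):

```python
# Underscore delimiters, v-prefixed version, compact date, padded counter.
mgrs, date, version, counter = "56HKH", "20220618", "v01", 42

name = f"claytile_{mgrs}_{date}_{version}_{counter:04d}.tif"
print(name)  # claytile_56HKH_20220618_v01_0042.tif
```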

Member Author

Ok, thanks, will take that into consideration before the next run.

@yellowcap
Member Author

Converted remaining questions to issues and will merge now for the sake of moving fast.

@yellowcap yellowcap closed this Dec 6, 2023
@yellowcap yellowcap reopened this Dec 6, 2023
@yellowcap yellowcap merged commit bff007c into main Dec 6, 2023
3 checks passed
@yellowcap yellowcap deleted the batch-setup branch December 6, 2023 09:53
brunosan pushed a commit that referenced this pull request Dec 27, 2023
* Add bucket as argument to cli

* Improve efficiency of datacube

Keep S2 in Uint16 as long as possible, subset using indexing instead of sel

* Simplify print statements

* Add and document batch setup

* Add sample as geopackage

GeoJSON was too big for the linter to be happy

* Small edit on README