
Batch setup #54

Merged
merged 6 commits into from
Dec 6, 2023

Conversation

yellowcap
Member

@yellowcap yellowcap commented Nov 24, 2023

This is the version that we used to run our first deployment at scale.

What changes

Main pipeline logic changes:

  • Only concatenate S2, DEM, and S1 arrays for each tile during tiling. This saves a lot of memory, as S2 can be kept in Uint16.
  • Use indexing on the arrays instead of the .sel function. Indexing is almost instant, while .sel is slow for large arrays.
  • Make the target bucket name an argument.
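The indexing change can be sketched in isolation. This is a minimal illustration, not the pipeline's actual code; the array shape and slice bounds are made up:

```python
import numpy as np
import xarray as xr

# Hypothetical band stack; sizes are illustrative only.
da = xr.DataArray(
    np.zeros((3, 512, 512), dtype="uint16"),
    dims=("band", "y", "x"),
)

# Positional indexing: near-instant, no coordinate lookup,
# and the uint16 dtype is preserved.
y_start, y_end, x_start, x_end = 0, 256, 128, 384
subset = da[:, y_start:y_end, x_start:x_end]
print(subset.shape)  # (3, 256, 256)
```

The equivalent label-based `da.sel(...)` call has to resolve coordinate labels first, which is where the slowdown on large arrays comes from.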

Organisational changes:

  • Add info about how to run the pipeline on batch.
  • Add mgrs sample data as geopackage.

Comment on lines +81 to +84
parts = [part[:, y_start:y_end, x_start:x_end] for part in stack]

# Only concat here to save memory, it converts S2 data to float
tile = xr.concat(parts, dim="band").rename("tile")
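The dtype promotion noted in the comment can be demonstrated in isolation (a minimal sketch with made-up shapes, not the pipeline's arrays):

```python
import numpy as np
import xarray as xr

# Mixing a uint16 array (S2) with a float32 array (e.g. S1) promotes
# the concatenated result to float, doubling memory for the S2 bands.
s2 = xr.DataArray(np.ones((2, 4, 4), dtype="uint16"), dims=("band", "y", "x"))
s1 = xr.DataArray(np.ones((1, 4, 4), dtype="float32"), dims=("band", "y", "x"))

tile = xr.concat([s2, s1], dim="band")
print(tile.dtype)  # float32
```

Deferring the concat until the tile is already cropped keeps the promoted float array small.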
Contributor

@yellowcap or @lillythomas, can you double check the Sentinel-1 pixel values? I've been getting NaN loss values, and tracked it down to the Sentinel-1 VV and VH bands being all NaNs. Sentinel-2 and DEM bands are ok. Sample code:

import os
import matplotlib.pyplot as plt
import rioxarray

os.environ["AWS_PROFILE"] = "xxx"

da = rioxarray.open_rasterio(
    filename="s3://clay-tiles-01/01/56HKH/2022-06-18/claytile-56HKH-2022-06-18-01-1.tif"
)
da = da.compute()

# %%
# Plot different bands
da.sel(band=1).plot.imshow()  # Blue
plt.show()
da.sel(band=10).plot.imshow()  # SWIR2
plt.show()
da.sel(band=11).plot.imshow()  # VH
plt.show()
da.sel(band=12).plot.imshow()  # VV
plt.show()
da.sel(band=13).plot.imshow()  # DEM
plt.show()

[Plot outputs: Blue, SWIR2, VH, VV, and DEM bands]

Contributor

Tried a few other tiles in other locations, like MGRS 29SMD over Lisbon, and it seems the DEM can contain NaNs as well (in places over the ocean?). So maybe check through all of those bands.

Member Author

Good catch! Looks like the nodata filter is missing checks on S1 and DEM. That should be easy to correct.

Contributor

@weiji14 weiji14 Dec 4, 2023

Ok, looks like Lilly will be fixing this in #60.

Member Author

Converted to an issue for tracking: #68

@weiji14 weiji14 added the data-pipeline Pull Requests about the data pipeline label Nov 29, 2023
@weiji14 weiji14 added this to the v0 Release milestone Nov 29, 2023
Comment on lines +89 to +91
counter += 1
if counter % 100 == 0:
    print(f"Counted {counter} tiles")
Contributor

@weiji14 weiji14 Dec 4, 2023

When saving the TIF files, can the counter part of the filename be zero-padded to a consistent length? The previous batch had filenames like claytile-......-1.tif, claytile-......-2.tif, ... claytile-......-20.tif, claytile-......-300.tif. Having a consistent length with numbers like 001, 002, 020, 300 would make sorting easier, and allow us to parse out the date using a reverse index (e.g. filepath[-21:-11]).
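The zero-padding and reverse-index parsing can be sketched as follows (the MGRS code, date, and version here are example values, not an actual tile):

```python
# Hypothetical filename pieces; the zero-padded counter is the point.
mgrs, date, version, counter = "56HKH", "2022-06-18", "01", 20

name = f"claytile-{mgrs}-{date}-{version}-{counter:03d}.tif"
print(name)  # claytile-56HKH-2022-06-18-01-020.tif

# With a fixed-width counter, the date sits at a fixed reverse index.
print(name[-21:-11])  # 2022-06-18
```

Without the padding, `name[-21:-11]` shifts whenever the counter gains a digit, which is what breaks the reverse-index parse.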

Contributor

Alternatively, if we can store the actual timestamp in the metadata of the GeoTIFF, that would be even better!


Contributor

Should this geopackage be stored on git? Or just on the s3 bucket?

Member Author

#71

@@ -115,12 +105,8 @@ def tiler(stack, date, mgrs):
with rasterio.open(name, "r+") as rst:
Contributor

Regarding the filename at about L102, @mattpaul had some suggestions at #35 (comment) to:

  1. Use underscore as the delimiter between MGRS/DATE/VERSION/COUNTER
  2. Preface the VERSION with a v (e.g. v0)

Also mentioned at #44 (comment) that we could drop the hyphen in the date, i.e. YYYY-MM-DD becomes YYYYMMDD, but that didn't seem to have been applied in the initial batch.
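Taken together, the suggested convention might look like this (all values here are hypothetical, and the final format is still up for discussion):

```python
# Underscore delimiters, v-prefixed version, compact date, padded counter.
mgrs, date, version, counter = "56HKH", "20220618", "v01", 42

name = f"claytile_{mgrs}_{date}_{version}_{counter:04d}.tif"
print(name)  # claytile_56HKH_20220618_v01_0042.tif
```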

Member Author

Ok, thanks, will take that into consideration before the next run.

@yellowcap
Member Author

Converted remaining questions to issues and will merge now for the sake of moving fast.

@yellowcap yellowcap closed this Dec 6, 2023
@yellowcap yellowcap reopened this Dec 6, 2023
@yellowcap yellowcap merged commit bff007c into main Dec 6, 2023
3 checks passed
@yellowcap yellowcap deleted the batch-setup branch December 6, 2023 09:53
brunosan pushed a commit that referenced this pull request Dec 27, 2023
* Add bucket as argument to cli

* Improve efficiency of datacube

Keep S2 in Uint16 as long as possible, subset using indexing instead of sel

* Simplify print statements

* Add and document batch setup

* Add sample as geopackage

GeoJSON was too big for the linter to be happy

* Small edit on README