-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Batch setup #54
Batch setup #54
Conversation
Keep S2 in Uint16 as long as possible, subset using indexing instead of sel
Geojson was too big for the linter to be happy
parts = [part[:, y_start:y_end, x_start:x_end] for part in stack] | ||
|
||
# Only concat here to save memory, it converts S2 data to float | ||
tile = xr.concat(parts, dim="band").rename("tile") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yellowcap or @lillythomas, can you double check the Sentinel-1 pixel values? I've been getting NaN loss values, and tracked it down to the Sentinel-1 VV and VH bands being all NaNs. Sentinel-2 and DEM bands are ok. Sample code:
import os
import matplotlib.pyplot as plt
import rioxarray
import xarray as xr
os.environ["AWS_PROFILE"] = "xxx"
da = rioxarray.open_rasterio(
filename="s3://clay-tiles-01/01/56HKH/2022-06-18/claytile-56HKH-2022-06-18-01-1.tif"
)
da = da.compute()
# %%
# Plot different bands
da.sel(band=1).plot.imshow() # Blue
plt.show()
da.sel(band=10).plot.imshow() # SWIR2
plt.show()
da.sel(band=11).plot.imshow() # VH
plt.show()
da.sel(band=12).plot.imshow() # VV
plt.show()
da.sel(band=13).plot.imshow() # DEM
plt.show()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried a few other tiles in other locations like MGRS 29SMD over Lisbon, and it seems like the DEM can come up with NaNs as well (in places over the ocean?). So maybe check through all those bands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Looks like the nodata filter is missing checks on S1 and DEM. That should be easy to correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, looks like Lilly will be fixing this in #60.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
converted to issue for tracking #68
counter += 1 | ||
if counter % 100 == 0: | ||
print(f"Counted {counter} tiles") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When saving the TIF files, can the counter part of the filename be zero-padded to a consistent length? The previous batch had filenames like claytile-......-1.tif
, claytile-......-2.tif
, ... claytile-......-20.tif
, claytile-......-300.tif
. Having a consistent length with numbers like 001
, 002
, 020
, 300
would make sorting easier, and allow us to parse out the date using a reverse index (e.g. filepath[-21:-11]
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, if we can store the actual timestamp in the metadata of the GeoTIFF, that would be even better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scripts/mgrs_sample.gpkg
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this geopackage be stored on git? Or just on the s3 bucket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -115,12 +105,8 @@ def tiler(stack, date, mgrs): | |||
with rasterio.open(name, "r+") as rst: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At about L102 on the filename, @mattpaul had some suggestions at #35 (comment) to:
- Use underscore as the delimiter between MGRS/DATE/VERSION/COUNTER
- Preface the VERSION with a v (e.g.
v0
)
Also mentioned at #44 (comment) that we could drop the hyphen in the date, i.e. YYYY-MM-DD becomes YYYYMMDD, but that didn't seem to have been applied in the initial batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok thanks will take that into consideration before next run.
Converted remaining questions to issues and will merge now for the sake of moving fast. |
* Add bucket as argument to cli * Improve efficency of datacube Keep S2 in Uint16 as long as possible, subset using indexing instead of sel * Simplify print statements * Add and document batch setup * Add sample as geopackage Geojson was too big for the linter to be happy * Small edit on README
This is the version that we used to run our first deployment at scale.
What changes
Main pipeline logic changes:
.sel
function. Indexing is almost instant, andsel
is slow for large arrays.Organisational changes: