From 0e34498fbde6b4671250b2fda649aec56a3ea0d3 Mon Sep 17 00:00:00 2001
From: Daniel Wiesmann
Date: Tue, 16 Jan 2024 14:52:47 +0000
Subject: [PATCH] Document data sampling strategy (#124)

* Document data sampling strategy

* Fix typo in specification file
---
 docs/_toc.yml         |  2 ++
 docs/data_sampling.md | 47 +++++++++++++++++++++++++++++++++++++++++++
 docs/specification.md |  2 +-
 3 files changed, 50 insertions(+), 1 deletion(-)
 create mode 100644 docs/data_sampling.md

diff --git a/docs/_toc.yml b/docs/_toc.yml
index dc20a3e5..299721a6 100644
--- a/docs/_toc.yml
+++ b/docs/_toc.yml
@@ -22,6 +22,8 @@ parts:
       file: data_datacube
     - title: Benchmark dataset labels
       file: data_labels
+    - title: Data sampling strategy
+      file: data_sampling
   - caption: Running the model
     chapters:
     - title: Run over a region
diff --git a/docs/data_sampling.md b/docs/data_sampling.md
new file mode 100644
index 00000000..7d3bd5af
--- /dev/null
+++ b/docs/data_sampling.md
@@ -0,0 +1,47 @@
+# Data sampling strategy
+
+To create a balanced dataset for model training, we used a sampling strategy
+based on land cover classes from the [ESA WorldCover](https://esa-worldcover.org/)
+layer.
+
+Our unit of analysis for sampling was the MGRS tile, the global tiling scheme
+that is used for distributing Sentinel-2 imagery. For each MGRS tile, we
+computed land cover statistics for all the classes in the WorldCover layer. To
+speed up processing, we used the third-level overview in the WorldCover layer,
+which has a spatial resolution of 80 meters.
+
+The goal of the land cover sampling was to ensure coverage of each class at
+a reasonable level. For each class, we selected a number of random MGRS tiles
+out of the subset of MGRS tiles with the highest fraction of that class present.
+
+As an example, for "Herbaceous wetland" we selected 50 random tiles out of the
+500 MGRS tiles with the highest wetland fraction globally. For the Built-up class,
+on the other hand, we selected the 400 most urban MGRS tiles.
+
+In addition to the land cover classes, we also added diversity by selecting 500
+tiles out of the 3000 tiles with the highest count of land cover classes present
+in the tile.
+
+The following table summarizes the selection criteria for each class.
+
+| Class | Number of tiles | Out of highest-ranked |
+|---|---|---|
+| Diversity | 500 | 3000 |
+| Built-up | 400 | 400 |
+| Herbaceous wetland | 50 | 500 |
+| Mangroves | 50 | 500 |
+| Moss and lichen | 50 | 500 |
+| Cropland | 100 | 500 |
+| Tree cover | 100 | 500 |
+| Shrubland | 50 | 500 |
+| Grassland | 50 | 500 |
+| Bare / sparse vegetation | 50 | 500 |
+| Snow and Ice | 50 | 500 |
+| Permanent water bodies | 100 | 1000 |
+
+After selecting MGRS tiles for each of these criteria, we removed duplicates.
+This resulted in a final sample of 1517 MGRS tiles.
+
+The resulting sample file can be downloaded from the following link:
+
+https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample.fgb
diff --git a/docs/specification.md b/docs/specification.md
index cdf2d55e..eed00ac3 100644
--- a/docs/specification.md
+++ b/docs/specification.md
@@ -21,7 +21,7 @@ We also generated embeddings for all trainning data, which can be found on Sourc
 
 ## Model Architecture
 
-Clat is a Unet, with a modified ViT encoder down to embeddings, and a decoder to reconstruct the masked parts of the original image. The loss function is the MSE between the original image and the reconstructed image.
+Clay is an MAE, with a modified ViT encoder down to embeddings, and a decoder to reconstruct the masked parts of the original image. The loss function is the MSE between the original image and the reconstructed image.
 
 For details, check the source code [here](https://github.com/Clay-foundation/model/blob/v0.0.1/src/model_clay.py).
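
The selection procedure documented in `docs/data_sampling.md` (rank tiles by a per-class score, keep the top-ranked pool, draw a random subset, then remove duplicates across classes) can be sketched roughly as below. This is a minimal illustration and not the code used to produce `mgrs_sample.fgb`: the function names `add_diversity` and `sample_tiles`, the `stats` dictionary layout, the fixed random seed, and the toy tile IDs are all assumptions made for the example.

```python
import random


def add_diversity(stats):
    """Return a copy of stats with a "Diversity" score added per tile.

    Assumption: diversity is the number of classes with a non-zero fraction.
    """
    return {
        tile: {**fractions, "Diversity": sum(1 for v in fractions.values() if v > 0)}
        for tile, fractions in stats.items()
    }


def sample_tiles(stats, criteria, seed=0):
    """Select tiles per criterion and return the de-duplicated union.

    stats:    dict mapping tile ID -> dict of criterion name -> score
    criteria: dict mapping criterion name -> (tiles to select, pool size)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    selected = set()           # a set removes duplicates across criteria
    for name, (n_select, pool_size) in criteria.items():
        # Rank all tiles by their score for this criterion, highest first.
        ranked = sorted(stats, key=lambda tile: stats[tile].get(name, 0.0), reverse=True)
        pool = ranked[:pool_size]
        # Draw a random subset of the pool; if n_select equals the pool size
        # (as for Built-up, 400 out of 400), the whole pool is kept.
        selected.update(rng.sample(pool, min(n_select, len(pool))))
    return selected


# Toy input with hypothetical tile IDs and class fractions.
toy_stats = add_diversity({
    "T1": {"Built-up": 0.90, "Mangroves": 0.0},
    "T2": {"Built-up": 0.70, "Mangroves": 0.1},
    "T3": {"Built-up": 0.10, "Mangroves": 0.8},
    "T4": {"Built-up": 0.00, "Mangroves": 0.5},
    "T5": {"Built-up": 0.05, "Mangroves": 0.3},
})
toy_criteria = {"Built-up": (2, 2), "Mangroves": (1, 3)}
toy_sample = sample_tiles(toy_stats, toy_criteria)
```

With the real table, `criteria` would hold entries such as `"Diversity": (500, 3000)` and `"Permanent water bodies": (100, 1000)`. Because tiles can rank highly for several classes, the union is smaller than the sum of the per-class counts, which is consistent with the documented total of 1517 tiles.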