Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document data sampling strategy #124

Merged
merged 2 commits into from
Jan 16, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/_toc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@ parts:
file: data_datacube
- title: Benchmark dataset labels
file: data_labels
- title: Data sampling strategy
file: data_sampling
- caption: Running the model
chapters:
- title: Run over a region
Expand Down
47 changes: 47 additions & 0 deletions docs/data_sampling.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# Data sampling strategy

To create a balanced dataset for model training, we used a sampling strategy
based on land cover classes from the [ESA WorldCover](https://esa-worldcover.org/)
layer.

Our unit of analysis for sampling was the MGRS tile, the global tiling scheme
that is used for distributing Sentinel-2 imagery. For each MGRS tile, we
computed landcover statistics for all the classes in the WorldCover layer. To
speed up processing, we used the third level overview in the WorldCover layer,
which has a spatial resolution of 80 meters.

The goal of the landcover sampling was to ensure coverage of each class at
a reasonable level. For each class, we selected a number of random MGRS tiles
out of the subset of MGRS tiles with the highest fraction of that class present.

As an example, for "Wetlands" we selected 50 random ones out of the MGRS tiles
with the highest wetland fraction globally. For the Built-up class on the other
hand we selected the 400 most urban MGRS tiles.

In addition to the landcover classes, we also added diversity by selecting 500
tiles out of the 3000 tiles with the highest count of land cover classes present
in the tile.

The following table summarizes the selection criteria for each class.

| Class | Nr of Tiles | From highest |
|---|---|---|
Diversity | 500 | 3000
Built-up | 400 | 400
Herbaceous wetland | 50 | 500
Mangroves | 50 | 500
Moss and lichen | 50 | 500
Cropland | 100 | 500
Tree cover | 100 | 500
Shrubland | 50 | 500
Grassland | 50 | 500
Bare / sparse vegetation | 50 | 500
Snow and Ice | 50 | 500
Permanent water bodies | 100 | 1000

After selecting MGRS tiles for each of these criteria, we removed duplicates.
This resulted in a sample of 1517 MGRS tiles total in our sample.

The resulting sample file can be downloaded from the following link

https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample.fgb
2 changes: 1 addition & 1 deletion docs/specification.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ We also generated embeddings for all trainning data, which can be found on Sourc

## Model Architecture

Clat is a Unet, with a modified ViT encoder down to embeddings, and a decoder to reconstruct the masked parts of the original image. The loss function is the MSE between the original image and the reconstructed image.
Clay is a MAE, with a modified ViT encoder down to embeddings, and a decoder to reconstruct the masked parts of the original image. The loss function is the MSE between the original image and the reconstructed image.

For details, check the source code [here](https://github.com/Clay-foundation/model/blob/v0.0.1/src/model_clay.py).

Expand Down