Landcover based sampling strategy. #29
The v0 sampling scripts are complete and should be fully reproducible. The current script selects 950 tiles based on landcover, making sure we capture diverse tiles and a good representation of every class. The resulting tiles are visualized in the map below. Magenta tiles are the selected ones; the others are colored by the number of land cover classes present in them.
100 samples from all tiles with water between 30% and 70% (making sure we
capture some, but exclude purely water tiles so we catch coasts)
"""
data = geopandas.read_file(Path(wd, "mgrs_stats.fgb"))
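The water rule described in the docstring above can be sketched as a filter followed by a random draw. This is a minimal sketch, not the script in this PR: the column name `water_fraction`, the helper `sample_water_tiles`, the seed, and the toy tile names are all hypothetical, and it assumes the stats table stores water cover as a fraction between 0 and 1.

```python
import pandas as pd


def sample_water_tiles(stats: pd.DataFrame, n: int = 100, seed: int = 42) -> pd.DataFrame:
    """Randomly pick up to `n` tiles whose water fraction is between 30% and 70%."""
    # "water_fraction" is an assumed column name, not taken from mgrs_stats.fgb.
    mask = stats["water_fraction"].between(0.3, 0.7)  # inclusive on both ends
    candidates = stats[mask]
    # Never ask for more rows than there are candidates.
    return candidates.sample(n=min(n, len(candidates)), random_state=seed)


# Toy stand-in for the real stats table (values invented for illustration).
stats = pd.DataFrame(
    {
        "name": ["T1", "T2", "T3", "T4", "T5"],
        "water_fraction": [0.0, 0.35, 0.5, 0.65, 1.0],
    }
)
water = sample_water_tiles(stats, n=2)
```

Filtering on a range rather than taking the highest water fractions is what keeps fully submerged tiles out while still catching coastlines.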
Could you share this mgrs_stats.fgb file please? I got a permission error trying to access s3://esa-worldcover/v200/2021/map for some reason, and downloading from https://worldcover2021.esa.int/downloader is taking a long time!
Strange, s3 sync s3://esa-worldcover/v200/2021/map $wd/esa-worldcover-v200-2021-map --no-sign-request works perfectly for me. But yes, it's quite a lot of data (120GB), so no need for everyone to reproduce the stats file! Attaching below.
result = pandas.concat(
    [
        diversity,
        urban,
        wetland,
        mangroves,
        moss,
        cropland,
        trees,
        shrubland,
        grassland,
        bare,
        snow,
        water,
    ]
)
Just to understand, this sampling function is independently getting the highest values for each category (plus some extra MGRS tiles for diversity and water areas), and then concatenating those rows together into a single dataframe?
Plotting your mgrs_sample.geojson file from #29 (comment), I see a few cases where the same MGRS tile is sampled more than once. E.g.:
MGRS tile 56VLM, sampled 3 times
MGRS tile 17RMQ, sampled 2 times
MGRS tile 32TLP, sampled 2 times
The duplicates might be due to the independent random sampling per-category and then concatenation. Perhaps we could remove such duplicate rows before saving out the GeoJSON file?
Yes, your assessment is correct. I had an earlier version that would split the selected rows, but that got dropped along the way. So, good catch, I will drop duplicates as part of the script.
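One way the duplicate removal discussed in this thread could look is sketched below. It is an illustration under assumptions, not the fix that landed in the script: the column names `name` and `picked_for` are hypothetical, and the per-category frames are toy data arranged only to mirror the repeat counts reported above (which category picked which tile is invented).

```python
import pandas as pd

# Toy per-category selections; the tile IDs are the ones reported in the
# review, but the category assignments are made up for illustration.
urban = pd.DataFrame({"name": ["56VLM", "17RMQ", "32TLP"], "picked_for": "urban"})
trees = pd.DataFrame({"name": ["56VLM", "17RMQ"], "picked_for": "trees"})
water = pd.DataFrame({"name": ["56VLM", "32TLP"], "picked_for": "water"})

# Independent per-category picks are concatenated, so a tile that ranks
# highly in several categories appears several times.
result = pd.concat([urban, trees, water], ignore_index=True)

# Count how often each tile was selected and keep only the repeats.
counts = result["name"].value_counts()
repeated = counts[counts > 1]

# Keep the first occurrence of each tile before writing the sample out.
deduped = result.drop_duplicates(subset="name", keep="first").reset_index(drop=True)
```

Note that dropping duplicates shrinks the sample below the requested size, so the per-category counts would need topping up if an exact total matters.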
Sharing the stats file here @weiji14
Attaching the updated sampled mgrs tiles as
The current sampling strategy is kept purposefully simple. We can expand it to criteria beyond landcover, and potentially use a cluster-based approach to replace the human-biased perspective in the current implementation.
Refs #28