Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worldcover embeddings conus #153

Merged
merged 25 commits into from
Feb 28, 2024
Merged

Worldcover embeddings conus #153

merged 25 commits into from
Feb 28, 2024

Conversation

yellowcap
Copy link
Member

This is work to run embedding inference for all of the US Conus region using worldcover composites from 2020 and 2021.

I am currently pushing this into Batch to make it work there. Once the first jobs run through I will kickoff the processing at US level.

This uses partial inputs from RGBNIR, the 4 bands available from worldcover.

@yellowcap yellowcap marked this pull request as ready for review February 9, 2024 09:15
@yellowcap
Copy link
Member Author

For fun, here some search results from a local sim search test.

We have 2021 (almost) fully processed, and will run 2020 just after that.

Figure_3

@yellowcap
Copy link
Member Author

The Contiguous United states are fully processed for 2020 and 2021! 🇺🇸 🌎 🦅 🎉 🎆

The version that is complete for both years is 002, the embeddings are stored in the bucket uri below

s3://clay-worldcover-embeddings/v002/

Copy link
Collaborator

@srmsoumya srmsoumya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good, nice use of pack operations.

@yellowcap yellowcap merged commit 472c92f into main Feb 28, 2024
5 checks passed
@yellowcap yellowcap deleted the worldcover-embeddings-conus branch February 28, 2024 09:29
from skimage import io

# Set working directory
wd = "/home/usr/Desktop/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yellowcap, I know this is already merged, but can you avoid such absolute/hardcoded paths?

Copy link
Member Author

@yellowcap yellowcap Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feeback @chuckwondo you are right, this isn't great, but is supposed to be a placeholder. Sometimes I use fake paths like /path/to/your/working/directory, to show what this is supposed to be so that people running the script could replace it.

But I am very happy to learn about better ways to do this, what is your favorite solution for this kind of thing? In cases like this with scripts, maybe env vars could be an option?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to propose some ideas, but I need some context first. How is this script intended to be used? Is the intent to have the user first run the aws s3 sync command shown in the code comment below this line, and then just directly call this script?

Copy link
Member Author

@yellowcap yellowcap Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly, the idea is that someone with access to the embeddings downloads them using aws s3 sync to a local folder, and then runs the script pointing to the embedding files and to a folder where the lancedb data should be stored.

I.e. the script needs two folders

  1. A place to download the raw data
  2. A folder to create / store the lancedb data

I made many scripts like this where some local workding directories are needed. Never really found a very satisfactory way of handling this. Env vars seem a bit cluncky and are not always easy to set up. Constants work, but then the script requires a hard coded default value.

So if you have good ideas on how to approach this issue, they are most welcome!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add command-line arguments. For very simple scripts like this one, simply use Python's standard argparse module, so you don't have to add any dependencies. In this case, it sounds like you might want to use argparse to support a syntax like the following for running the script:

scripts/worldcover/embeddings_db.py --input-dir path/to/input/dir --db-dir path/to/db/dir

Both options should be required, and the script should also create the dir specified for --db-dir, so the user doesn't have to do so manually beforehand.

Copy link

@antymoro antymoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi, just put a few questions - would love to hear your feedback, thank you!

# CKPT_PATH = "https://huggingface.co/made-with-clay/Clay/resolve/main/Clay_v0.1_epoch-24_val-loss-0.46.ckpt"
VERSION = "002"
BUCKET = "clay-worldcover-embeddings"
URL = "https://esa-worldcover-s2.s3.amazonaws.com/rgbnir/{year}/N{yidx}/ESA_WorldCover_10m_{year}_v{version}_N{yidx}W{xidx}_S2RGBNIR.tif"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those "N" and "W" coordinate values should not be hard-coded here, took me a while to spot why the code was fetching wrong tif files

y_size = min(CHIP_SIZE, TILE_SIZE - y_local_off)
y_another = y_size < CHIP_SIZE

tile_id = f"N{y_tile_index}W{str(x_tile_index).zfill(3)}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue here with the hard-coded "N" and "W" coordinates

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the path creation should probably be generalized. It was hardcoded here because the source file is for the US which is in the NW quadrant.

Generalizing will take some tinkering to make sure that this still works even if you are looking at an area that crosses these regions.

yoff = 0
embeddings = []
all_bounds = []
while yoff < RASTER_Y_SIZE:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this most likely is a non-issue but I was wondering why the code is not looping through the X axis as well? Is it because of how the worklow is designed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the script was ran in parallel for every "chip" in the x axis separately on AWS batch. The xoff value is computed based on a single index value that goes from left to right on the map until the area is completed. The index value is an env var for each separate job.

index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", 2))

xoff = index * CHIP_SIZE

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @yellowcap

index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", 2))

I assume the starting index is 0, correct?

Also, last last question: could you please also let me know what are the XORIGIN and YORIGIN values and why they differ slightly from the starting coordinates?

E_W_INDEX_START = 67
E_W_INDEX_END = 125
N_S_INDEX_START = 24
N_S_INDEX_END = 49
YORIGIN = 50.0
XORIGIN = -125.0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the starting index is 0, correct?

yes

let me know what are the XORIGIN and YORIGIN values and why they differ slightly from the starting coordinates?

the E_W_INDEX_START, E_W_INDEX_END, N_S_INDEX_START, N_S_INDEX_END are the integer indexes of the worlcover grid reference.

X and YORIGIN are in lat/lon coordinates. They are the upper left corner of the area covered by the tiles in the US layer. I.e. the upper left corner of the worldcover tile with index (67, 24).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants