Do a global run of embeddings #277

Open
brunosan opened this issue Jun 11, 2024 · 8 comments
@brunosan
Member

We've been using Clay v1 embeddings directly, and via the Build/Explore apps. We've also done several types of partial benchmarking, so we are starting to feel comfortable with the quality of the model. We therefore should think about making large runs of existing open data and create embeddings, for our benefit to continue learning about Clay, but also to enable the community to leverage these open embeddings.

We still need to make decisions once we decide to make large runs:

  • Instrument? Sentinel?, NAIP? One Instrument, couple?
  • Unit of schema? Should do them at the file-level? Or spatial reference?
  • Spatial resolution? We've seen that many applications need the highest possible spatial resolution. Hence if Sentinel, a small tile size (but not too small to make the embeddings of lower quality). 128x128?
  • Locations, time? Large coverage seems most important, but many users also request temporal changes. So I suggest either only wide spatial coverage, or 80% on a large-coverage run and the remaining 20% on many snapshots over time.
  • What format? I propose we wait and follow guidance from @cholmes on https://github.com/cloudnativegeo/geo-embeddings-survey
  • Hosted? source.coop
  • License? Open. Is CC-by best? OpenRail-M?
  • What is the cost of creation? It would be great to come up with a number.

Ideally, we can wrap this code to execute easily down the line, e.g. taking a STAC list and a spec file for chip_size, ...
Note: Do not over-scope here, since we have the build app.

Probably out of scope, but the end-state at some point this year could be:

  • Sentinel-2 annual composites for EU
  • Sentinel-2 Level-2 files for a deforestation basin in the Amazon with as many dates as possible.
  • Same as above but Sentinel-1 files, or Landsat composites.
  • NAIP for whole states once.
  • NAIP for one state as many years as available.

Filing this also early to allow community requests, but we should aim to set a date for such run, e.g. end of June.

@brunosan
Member Author

brunosan commented Jul 2, 2024

I was talking with @konstantinklemmer and asked for his help making decisions here. Also pinging @cholmes and @bengmstrong @BradNeuberg @Clay-foundation/all and please ping others.

We will need to make decisions within a month for the "Big Embeddings run". This will imply lots of decisions that are free to make now and VERY EXPENSIVE to correct later.

My questions, and my suggestions, but none of my suggestions are strongly held.

  1. What instruments?
    I suggest Sentinel-2 annual composite.
  2. What locations/time?
    I suggest starting somewhere in South America with the latest composite, plus the Amazon basin for all available years. Then increase as budget allows.
  3. What chip size?
    128x128, with small sections at 64x64, and 256x256.
  4. What output?
    Start with the average of the patch embeddings, and maybe a separate file with all patch embeddings and feature maps (see "Saving Raw embeddings + feature maps", #291).
  5. What format?
    Geoparquet. I'd follow Earth Index Columns and Metadata here https://github.com/cloudnativegeo/geo-embeddings-survey/blob/main/data/earth_index/readme.md
    I'll also add losses. Either training loss, or a simple loss.
  6. Host/License
    Hosted on source, with CC-By license.
  7. How much budget to put for this?
    Let's start churning and see costs. If it deviates a lot from estimates, we rethink. Assuming a $1/hour g5.xlarge instance with an NVIDIA A10, processing batches of 10 Sentinel-2 inputs takes 10 seconds. Each 128x128 chip covers 2.5 square kilometers. This means we can process roughly 360 inputs per dollar. With a $10,000 budget to start with, that translates to a coverage of 9,000,000 square kilometers. Let's put a 50% penalty just because, and it should give us enough for South America?

What are your thoughts @yellowcap @srmsoumya? How much effort to pull this off on your side?
Should we continue training v1 first (#283)?

Let's aim to kick this compute off July 15th?

@yellowcap
Member

If we use worldcover I would suggest a chip size of 100x100 or 200x200, then the chips fit nicely into their 10k x 10k source files. Maybe for Sentinel-2 we would use 100x100 to have a more fine grained resolution. Not sure what kind of features we hope to find based on the embeddings.
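
The "fits nicely" argument can be checked with integer division. This is a minimal sketch: the 10,000-pixel file edge comes from the comment above, while the helper name `chips_per_file` and the extra chip sizes (128, 256) are only illustrative.

```python
def chips_per_file(file_px: int, chip_px: int) -> tuple[int, int]:
    """Return (whole chips per side, leftover pixels per side)."""
    return divmod(file_px, chip_px)


FILE_PX = 10_000  # edge length of a source file, as quoted in the comment

for chip_px in (100, 128, 200, 256):
    per_side, leftover = chips_per_file(FILE_PX, chip_px)
    print(f"{chip_px}px chips: {per_side} x {per_side} per file, "
          f"{leftover}px left over per side")
```

Chip sizes of 100 and 200 leave no remainder, while 128 and 256 both leave a 16-pixel strip per side that would need padding, overlap, or discarding.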

Regarding feature maps output, there are 4 feature maps of 32x32 pixels for 768 embeddings, stored as float32. If we assume the input is 4 bands of Sentinel-2 imagery at uint16, then the feature maps are much heavier than the original data. So I would not advise to store the feature maps and rely on running the model at inference time when doing segmentation tasks (did I get this correctly @srmsoumya ?)
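
A rough size comparison makes the storage argument concrete. The feature-map shape (4 maps, 32x32, 768 dimensions, float32) and the 4-band uint16 input are taken from the comment. The 256x256 chip size is an assumption, chosen because 32x32 patches at the 8px patch size mentioned elsewhere in this thread give 256 pixels per side.

```python
import math

DTYPE_BYTES = {"float32": 4, "uint16": 2}


def array_bytes(shape, dtype):
    """Uncompressed size in bytes of a dense array of the given shape."""
    return math.prod(shape) * DTYPE_BYTES[dtype]


# From the comment: 4 feature maps of 32x32 pixels, 768 values each, float32.
feature_map_bytes = array_bytes((4, 32, 32, 768), "float32")

# Assumption: one 256x256 chip with 4 Sentinel-2 bands stored as uint16.
chip_bytes = array_bytes((4, 256, 256), "uint16")

ratio = feature_map_bytes / chip_bytes
print(f"feature maps: {feature_map_bytes / 2**20:.1f} MiB")
print(f"input chip:   {chip_bytes / 2**20:.2f} MiB")
print(f"ratio:        {ratio:.0f}x")
```

Under these assumptions the uncompressed feature maps are 24 times larger than the chip they were computed from, before any compression on either side.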

Regarding cost we would have to do more test runs to understand it better. We were able to do US level runs already with a reasonable budget, so I think doing some continental scale processing or even global processing should be doable.

Note that the Sentinel-2 composites have limited quality in tropical areas, they are mostly cloud free, but not without haze, and there are small nodata gaps here and there. At least for the Worldcover composites. Happy to look at other sources for composite imagery if people have suggestions.

Finally, I would add at least one NAIP run for all of the US to the wish-list as well.

@konstantinklemmer

konstantinklemmer commented Jul 4, 2024

After discussing with @brunosan and thinking a bit more about it, here is my rough "wishlist":

  • Global coverage embedding map. Chip size is less relevant as long as the whole, continuous planet is covered.
  • Sentinel-2 would be the preferred sensor; should of course be cloud free.
  • Ideally two time steps for each location; e.g. January and July (to roughly cover seasonality), but that's secondary.
  • Major TOM Sentinel 2A might be an option: https://github.com/ESA-PhiLab/Major-TOM

For each observation, ideally we'd have the following data (roughly sketched out):
[chip_centroid_lon, chip_centroid_lat, timestamp, chip_thumbnail, clay_embedding, clay_loss]

This "wishlist" is motivated mostly by me wanting to dissect Clay embeddings and see what it learns. Guiding questions are e.g. How does the complexity of embeddings change over space? How representative are embeddings of environmental and human-activity measures? Can Clay embeddings be used as geographic priors?

This would also create a dense embedding database to be used in arbitrary downstream tasks. This allows direct comparison to competitors like MOSAIKS or SatCLIP. The approach would be as follows: Download Clay Embedding with lon/lat closest to downstream location -> Train model y_lonlat = f(ClayEmbedding_lonlat) -> Evaluate.
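
The lookup step of that workflow could be sketched as below. The field names follow the record sketched above (the thumbnail is left out). The `ChipRecord` class, the helper names, and the toy coordinates and embedding values are hypothetical, and a brute-force haversine search stands in for whatever spatial index a real database would use.

```python
import math
from dataclasses import dataclass


@dataclass
class ChipRecord:
    chip_centroid_lon: float
    chip_centroid_lat: float
    timestamp: str
    clay_embedding: list[float]
    clay_loss: float


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_record(records, lon, lat):
    """Return the record whose centroid is closest to (lon, lat)."""
    return min(
        records,
        key=lambda r: haversine_km(r.chip_centroid_lon, r.chip_centroid_lat, lon, lat),
    )


# Toy records with made-up values, only to exercise the lookup.
records = [
    ChipRecord(-60.0, -3.0, "2023-01-01", [0.1, 0.2, 0.3], 0.05),
    ChipRecord(10.0, 50.0, "2023-01-01", [0.4, 0.5, 0.6], 0.07),
]

features = nearest_record(records, lon=-60.2, lat=-3.1).clay_embedding
print(features)
```

The returned embedding would then be the feature vector for the downstream model `f` at that location. Training and evaluation are left out of the sketch.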

@bengmstrong

Very cool that you're gearing up for a global run! Would love to pull/play with your embeddings. I agree that Sentinel-2 annual composites are the right starting point for global embeddings. To enable comparisons with other models it would be nice to use the same public free imagery. We've created/shared global sentinel-2 L2A composites for 2023 here which you are welcome to use (https://beta.source.coop/repositories/earthgenome/sentinel2-temporal-mosaics/description/) but they're a work in progress and do have some quality issues.

One other note @brunosan: I think you dropped a factor of 10 in your back-of-the-envelope math. Looks like you should be able to get through 3600 inputs per dollar, right? (batch of 10 inputs / 10 seconds * 3600 sec/hour * 1hr/$) So it might be more affordable than you think!!
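
The corrected arithmetic, written out as a small sketch. The batch size, batch time, and hourly price are the figures quoted in the thread. The 2.5 km² per chip and the $10,000 budget are also taken from the earlier comment as given, and the function name is illustrative.

```python
def inputs_per_dollar(batch_size, seconds_per_batch, dollars_per_hour):
    """Inputs processed per dollar at a steady batch rate."""
    inputs_per_hour = batch_size / seconds_per_batch * 3600
    return inputs_per_hour / dollars_per_hour


rate = inputs_per_dollar(batch_size=10, seconds_per_batch=10, dollars_per_hour=1.0)
print(rate)  # 3600.0

# Figures quoted earlier in the thread, taken as given.
KM2_PER_CHIP = 2.5
BUDGET_USD = 10_000

coverage_km2 = rate * KM2_PER_CHIP * BUDGET_USD
print(f"{coverage_km2:,.0f} km^2")              # 90,000,000 km^2
print(f"{coverage_km2 * 0.5:,.0f} km^2 after the 50% penalty")
```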

@brunosan changed the title from "Investigate global runs of embeddings" to "Do a global run of embeddings" on Jul 5, 2024
@brunosan
Member Author

brunosan commented Jul 5, 2024

Thanks everyone. I love that we are getting momentum here.

TLDR; So far
I'm leaning for a

  1. global run of the Sentinel-2 year composite, at 100px most recent year available with EG Sentinel-2 all-bands composite.
  2. NAIP for CONUS. Latest, with 100px chip size too.
  3. maybe? Selected locations (the training set?) to enable temporal and cross instrument studies.
  4. maybe? Satellogic set

Released as CC-By (inheriting EG CC-By)

Still TBD format and adding what losses.

Source imagery

Thanks @bengmstrong and EG team for the data release. It seems to fit perfectly. Besides the files and the STAC endpoint, this blog post explains the method.

It meets the criteria of:

  • Fully open license. (CC-By)
  • Global. (There are mentions of "errors"; should we get a blacklist of these and run them when fixed? I've spot-checked and I only see the usual hard places, like permanently cloudy locations.)
  • Recent (2023) (only global open composite this recent).

Notes:

  • This seems to be a median reduction of the "best" 16 scenes per location. What does "best" mean? (Least cloudy of the ~35/year?)

Chip Size

Boils down to 50px or 100px in my opinion.
Cost grows quadratically since it's an area. Also, very small chips approach the patch size of 8px, which leaves the self-attention fewer chances to learn about the surroundings.

Since we use the average, we can recreate embeddings at bigger chip sizes by just averaging the smaller ones. It won't be the same, since the patches computed at smaller chip sizes will not have paid "self-attention" to the patches outside of that small chip.
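
A minimal sketch of that averaging, assuming four 50px chips tile one 100px chip and each embedding is a plain list of floats. The function name and the toy 3-dimensional vectors are illustrative only. As noted above, the result approximates, but does not equal, the embedding the model would produce from the full 100px chip.

```python
def mean_embedding(embeddings):
    """Element-wise mean of equally sized embedding vectors."""
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]


# Toy embeddings for the four 50px chips inside one 100px chip (made-up values).
small_chip_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]

approx_large_chip = mean_embedding(small_chip_embeddings)
print(approx_large_chip)  # [2.0, 2.0, 2.0]
```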

It worries me that an area of 1km^2 is too big for many potential uses, limiting the usefulness of this large and expensive run, but doing smaller sizes everywhere is too expensive. We can do smaller chip sizes for selected places.

Cost estimates

From our "build" workers (the ones on the co-code app, which we might or might not use for this run), we see that in reality we are getting ~10k chips/h/worker (we use H100 GPUs, so this is the approach of a big GPU with a large batch rather than a cheaper GPU or CPUs). A worker costs $1800/month ($2.5/h). Most of the time is spent on downloading, so chip size doesn't seem to be a strong factor.

This would mean ~4k chips/$ per worker (10k chips/h divided by $2.5/h).

Uncanny that we get pretty much the same result as the napkin exercise (we should become consultants).

| chip size (x10m/px) | cost unit (km^2/$/h/worker) | cost to run the world |
|---|---|---|
| 50x50 px | 1000 | $510K |
| 100x100 px | 4000 | $127K |
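
The table can be reproduced from the worker numbers above. In this sketch the throughput (10k chips/h), price ($2.5/h), and 10 m/px resolution come from the comment. The 510 million km² figure is an assumption: it is the Earth's total surface (land and ocean), which is the value that matches the table. Function names are illustrative.

```python
EARTH_SURFACE_KM2 = 510e6   # assumption: total surface, land plus ocean
CHIPS_PER_HOUR = 10_000     # per worker, from the comment
DOLLARS_PER_HOUR = 2.5      # per worker, from the comment
GSD_M = 10                  # metres per pixel


def chip_area_km2(chip_px, gsd_m=GSD_M):
    """Ground area of a square chip in km^2."""
    side_km = chip_px * gsd_m / 1000
    return side_km ** 2


def km2_per_dollar(chip_px):
    chips_per_dollar = CHIPS_PER_HOUR / DOLLARS_PER_HOUR
    return chips_per_dollar * chip_area_km2(chip_px)


def world_cost_usd(chip_px):
    return EARTH_SURFACE_KM2 / km2_per_dollar(chip_px)


for chip_px in (50, 100):
    print(f"{chip_px}x{chip_px} px: {km2_per_dollar(chip_px):,.0f} km^2/$, "
          f"${world_cost_usd(chip_px):,.0f} to run the world")
```

Restricting the run to land only would lower both totals proportionally, since the per-dollar coverage does not change.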

50px is too expensive, 100px is doable. I'm hopeful that @yellowcap's optimism holds for this run.
Let's just start and assess what coverage/$ we get.

@BradNeuberg

BradNeuberg commented Jul 5, 2024 via email

@brunosan
Member Author

Update here.
We are going to do another v1 training run before the global embeddings run. Follow #283 for details.

@brunosan
Member Author

An update with @yellowcap. We are building the pipelines and testing the Earth Genome Sentinel-2 composite data:

  • Data comes in Web mercator projection, which is certainly great for map visualization, but it comes at a cost in terms of data to process. We've trained Clay in a projection that keeps the GSD constant across the tile. In Web mercator, e.g. Norway takes about 77% more pixels than the same area near the equator: more pixels for the same feature. Not sure how embeddings will suffer when classifying the same object at high latitudes (something to check).
  • The projection change also implies that there is a nodata boundary around each scene, and that the scene edges are not exactly horizontal / vertical.

image

  • The nominal resolution of the pixels is 9.8 and 19.1 for the 10m and 20m bands (i.e. the resolution of web mercator zoom levels 14 and 13). But this is not the real resolution once you move away from the equator, hence the changes in number of pixels. So when using a 256x256 pixel image for an ML application, one is looking at different sized areas in reality.
  • In the STAC items some property of the proj extension are missing, for example the proj:shape property, which is required for stacchip. We can work around this. (CC @bengmstrong)
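
The latitude dependence described in the bullets above follows from the standard spherical Web Mercator formulas, sketched here. The 256-pixel tile size and the example latitude of 60° are assumptions, and the function names are illustrative. This is not taken from the Clay or Earth Genome pipelines.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator
TILE_PX = 256               # assumption: 256-pixel tiles


def nominal_resolution_m(zoom):
    """Metres per pixel at the equator for a given zoom level."""
    return 2 * math.pi * EARTH_RADIUS_M / (TILE_PX * 2 ** zoom)


def ground_resolution_m(zoom, lat_deg):
    """Real metres per pixel at a given latitude."""
    return nominal_resolution_m(zoom) * math.cos(math.radians(lat_deg))


def pixel_inflation(lat_deg):
    """Pixels covering a fixed ground area, relative to the equator."""
    return 1 / math.cos(math.radians(lat_deg)) ** 2


for zoom in (13, 14):
    print(f"zoom {zoom}: {nominal_resolution_m(zoom):.2f} m/px at the equator, "
          f"{ground_resolution_m(zoom, 60):.2f} m/px at 60 degrees")

# Ground footprint of a 256x256 image at zoom 14, equator versus 60 degrees.
for lat in (0, 60):
    side_m = 256 * ground_resolution_m(14, lat)
    print(f"lat {lat}: {side_m:.0f} m per side, "
          f"{pixel_inflation(lat):.2f}x pixels per unit area")
```

Under this formula the equatorial resolution is about 19.1 m at zoom 13 and about 9.55 m at zoom 14, and a fixed 256x256 window covers half the ground distance per side at 60° latitude compared with the equator.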
