
Instance creation fails if checkpoint/timeframe are passed #65

Open

vgeorge opened this issue May 22, 2023 · 7 comments
@vgeorge (Member) commented May 22, 2023

At the staging API, if the client makes a request to POST /project/:id/instance with checkpoint_id and timeframe_id in the payload, an instance is created and starts running, but it never reaches the active=true state and it is not returned in the GET /project/:id/instance list. It is still possible to get its status at GET /project/:id/instance/:id.
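For reference, a minimal reproduction of the request in Python (the base URL, auth header, and the placeholder IDs below are assumptions, not the real staging values):

import requests

API = "https://<staging-api>"  # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder auth
project_id = 1

# Create an instance with a checkpoint/timeframe attached (IDs are placeholders).
payload = {"checkpoint_id": 123, "timeframe_id": 456}
instance = requests.post(f"{API}/project/{project_id}/instance",
                         json=payload, headers=HEADERS).json()

# The single-instance endpoint does return its status...
print(requests.get(f"{API}/project/{project_id}/instance/{instance['id']}",
                   headers=HEADERS).json())

# ...but the instance never reaches active=true and is missing from this list.
print(requests.get(f"{API}/project/{project_id}/instance",
                   headers=HEADERS).json())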

cc @ingalls @geohacker

@ingalls (Member) commented May 22, 2023

@vgeorge Per our conversation this morning, it is eventually set to active=true; it's just very slow.

@ingalls (Member) commented May 22, 2023

Context

Models are now ~100x bigger than the NAIP model. The initially deployed NAIP model is approximately 1.5 MB, while the currently deployed Sentinel models appear to be in the 150 MB range. This large increase in model size led to substantially longer init times in our worker instances. Unfortunately we do not currently surface that an instance has started but is still in the process of configuring itself. Instances are only marked as active once they have downloaded the necessary background assets (models/checkpoints/etc.) and are able to accept websocket messages.

As such the frontend has no way of knowing that an instance is in a "starting" phase except to poll and wait for the active field to be set to true. If the user refreshes the page, then any way of polling for instances is lost, as the InstanceID is no longer retained by the frontend and the instance list endpoint does not return k8s pod status.
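In the meantime, the only client-side option is a poll loop along these lines (a sketch; it assumes the single-instance endpoint returns an active field as described above and that the InstanceID is still known):

import time
import requests

API = "https://<staging-api>"  # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder auth

def wait_for_active(project_id, instance_id, timeout=600, interval=5):
    """Poll GET /project/:id/instance/:id until active=true or timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = requests.get(
            f"{API}/project/{project_id}/instance/{instance_id}",
            headers=HEADERS,
        ).json()
        if status.get("active"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"instance {instance_id} never became active")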

We have two options to move forward here:

  1. The most straightforward (but most hacky) option is to simply add a started boolean to the database. This boolean would be set to true when the instance registers with the API but is not yet active (where active means it can accept websocket messages). The frontend could then filter on it to determine which instances have been created but are not yet in an active state. This is a far less desirable option as it would only surface that an instance started, not its current state. If an instance stopped for any reason, this value would not be updated, leading to an inability to filter stopped instances on the frontend.

  2. The preferred solution would be to include pod status in the list instances endpoint, allowing the frontend to filter to pods that are currently started but not yet active. I spent several hours today playing around with this method, but without the ability to connect to a k8s cluster for testing I was unable to get the endpoint working; deploying to staging and then testing took too long to get to a working MVP. In this method we would do the following (see the sketch after this list):

    • Add a list pods call to the List Instance API before the Instances.list database call is made.
    • Add the pod status information to the instances list returned by the database where possible.
    • Add a query param called something like running=true that would filter the pod list to only running instances and then use Instance.from to append instance data from the database, returning only instances that match the query params.
    • This would need to be implemented by @geohacker, or @ingalls would need a sync to get set up with a k8s cluster in order to implement it locally.
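To make the merge concrete, here is a rough sketch of the pod-status lookup using the official kubernetes Python client. The real API service is not written in Python, and the namespace, label selector, and the label carrying the instance id are all assumptions; this only illustrates fetching pod phases first and attaching them to (or filtering) the database instance list.

from kubernetes import client, config

def list_pod_phases(namespace="default", selector="app=gpu-instance"):
    """Return {instance_id: pod_phase} for the instance worker pods.

    The namespace, label selector, and "instance-id" label are placeholders;
    the real deployment may label its pods differently.
    """
    config.load_incluster_config()  # or config.load_kube_config() when local
    pods = client.CoreV1Api().list_namespaced_pod(namespace, label_selector=selector)
    return {
        p.metadata.labels.get("instance-id"): p.status.phase  # e.g. "Pending", "Running"
        for p in pods.items
    }

def list_instances(db_instances, running=False):
    """Merge pod phase into the DB rows; with running=True keep only Running pods."""
    phases = list_pod_phases()
    for inst in db_instances:
        inst["pod_status"] = phases.get(str(inst["id"]))
    if running:
        db_instances = [i for i in db_instances if i["pod_status"] == "Running"]
    return db_instances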

@vgeorge (Member, Author) commented May 23, 2023

@ingalls thanks for writing this, it describes the issue very well. Please let me know if you need anything from me.

@geohacker (Member) commented May 26, 2023

@ingalls @vgeorge I'm able to reproduce this ~5 min init time for instances. @ingalls I think we can do what you suggest in option 2 and can talk through how to test locally. There's already a getPodStatus method that we can use. Before we go that route, I'd like to see if we can address the checkpoint size problem.

@srmsoumya could you look into this briefly and suggest options we have? Of course a 5min wait time for retraining is not a great experience while inference is relatively quick.

@ingalls in the init time, what's the specific action (https://github.com/developmentseed/pearl-backend/blob/develop/services/gpu/lib/ModelSrv.py#L658-L662) that's time consuming? It can't be the download, right?
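For rough context, even a conservative download throughput makes a 150 MB checkpoint a matter of seconds, not minutes (the throughput figures below are assumptions):

size_mb = 150
for throughput in (25, 50):  # MB/s, assumed intra-Azure throughput
    print(f"{throughput} MB/s -> ~{size_mb / throughput:.0f}s to download {size_mb} MB")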

@srmsoumya (Member) commented

We are using a DeepLabv3 framework with EfficientNet-b5 as the backbone.

EfficientNet-b5 has 28M parameters. I can try changing that to EfficientNet-b2 or EfficientNet-b3, which have 7M and 10M parameters respectively; we can go further down, but there will be an accuracy tradeoff.
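For a quick comparison of backbone sizes, something like the following works (a sketch using segmentation_models_pytorch; whether the training code builds the model this way, and the in_channels/classes values, are assumptions):

import segmentation_models_pytorch as smp

for encoder in ("efficientnet-b2", "efficientnet-b3", "efficientnet-b5"):
    # in_channels/classes are placeholder values, not the real training config.
    model = smp.DeepLabV3(encoder_name=encoder, encoder_weights=None,
                          in_channels=4, classes=9)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{encoder}: {n_params / 1e6:.1f}M parameters")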

Inside the model.zip folder we have 2 files:

7.9M  Mar 27 19:52 model.npz  # seed dataset
114M Mar 27 19:52 model.pt  # model weights

We can reduce the number of data points in the seed dataset, which will reduce the space (I am currently using 16_000 embeddings). Apart from this, we can look at possible quantization or pruning options for the model, but that would be experimental.
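As a sketch of the two cheapest reductions mentioned above (the layout of model.pt and the array names inside model.npz are assumptions):

import numpy as np
import torch

# 1. Store the weights in half precision, roughly halving model.pt on disk.
#    Assumes model.pt holds a plain state dict of tensors.
state = torch.load("model.pt", map_location="cpu")
state_fp16 = {k: (v.half() if torch.is_floating_point(v) else v)
              for k, v in state.items()}
torch.save(state_fp16, "model_fp16.pt")

# 2. Subsample the seed embeddings, e.g. 16_000 -> 4_000 rows.
#    Assumes every array in model.npz shares the same first dimension.
seed = np.load("model.npz")
keep = np.random.default_rng(0).choice(seed["embeddings"].shape[0], 4_000, replace=False)
np.savez_compressed("model_small.npz", **{name: seed[name][keep] for name in seed.files})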

For NAIP, I can see two models in the Azure blob storage.

  • model-1: 1.5 MB
  • model-2: 121 MB

From the code and Martha's notes, I infer they are using a DeepLabv3 framework with ResNet18 backbone (11M parameters). I am not sure what is being used for model-1, though.

cc' @geohacker @ingalls

@srmsoumya (Member) commented

I just pulled the model-1.zip file and it looks like it is a tiny model; I am surprised 🩺

In [9]: for e,v in model.items():
   ...:     print(e, v.shape)
   ...: 
conv1.weight torch.Size([64, 4, 3, 3])
conv1.bias torch.Size([64])
conv2.weight torch.Size([64, 64, 3, 3])
conv2.bias torch.Size([64])
conv3.weight torch.Size([64, 64, 3, 3])
conv3.bias torch.Size([64])
conv4.weight torch.Size([64, 64, 3, 3])
conv4.bias torch.Size([64])
conv5.weight torch.Size([64, 64, 3, 3])
conv5.bias torch.Size([64])
last.weight torch.Size([9, 64, 1, 1])
last.bias torch.Size([9])
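Summing those shapes (continuing in the same session, where model is the loaded state dict) confirms how small it is:

# ~150k parameters, i.e. roughly 0.6 MB of fp32 weights, which is in the
# right ballpark for the ~1.5 MB model-1.zip.
total = sum(v.numel() for v in model.values())
print(total, f"{total * 4 / 1e6:.2f} MB as fp32")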

@geohacker (Member) commented

@srmsoumya thanks for taking a look. I'm surprised too about the init time. Before we do any optimisations, let's get some numbers on which actions are time consuming. I suspect it's actually the k8s instance creation and not the checkpoint loading. If it's the instance creation, that should be similar to the cold start for any CPU/GPU and we can work around it by running some placeholders.
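One lightweight way to get those numbers would be to wrap each init step in ModelSrv.py with a timer, along these lines (the step names below are illustrative, not the actual function names):

import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def timed(step):
    """Log wall-clock time for one init step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.1fs", step, time.perf_counter() - start)

# Illustrative usage inside the instance init path:
# with timed("download checkpoint"):
#     download_checkpoint(...)   # hypothetical step
# with timed("load model weights"):
#     load_model(...)            # hypothetical step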
