
PR Idea: Document a minimal cost deployment #717

Closed
consideRatio opened this issue Jun 11, 2018 · 9 comments
Labels
pr-idea Pull Request ideas should have concrete and seemingly viable changes.

Comments

@consideRatio
Member

consideRatio commented Jun 11, 2018

@UsDAnDreS and I have both attempted to minimize the amount of money spent by reducing the number of nodes and the CPU cores available on them during the initial cluster setup. It is not obvious how to do this, and I think it would be good to share some experience, as well as update the guide to be less resource-greedy.

I aim to look into how the z2jh guide can allow for a cheaper default setup and will elaborate more in due time. My current setup has a single one-CPU node (n1-standard-1), and I have configured Google Cloud to scale up more nodes when required by the singleuser-server pods.
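
For reference, a setup along those lines can be created roughly like this (cluster name, zone, and the autoscaling bounds are placeholders, not my exact values):

```bash
# Rough sketch only; cluster name, zone, and the autoscaling bounds are placeholders.
# Start with a single n1-standard-1 node and let GKE add nodes when the
# singleuser-server pods need more capacity.
gcloud container clusters create jupyterhub-cluster \
  --zone europe-west1-b \
  --machine-type n1-standard-1 \
  --num-nodes 1 \
  --enable-autoscaling --min-nodes 1 --max-nodes 4
```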

@betatim
Member

betatim commented Jun 12, 2018

What is your experience with the user experience when you need to scale up? For mybinder.org it takes quite a while to spawn a new node, which makes me think you don't want to scale "too often". However, for clusters that are empty a lot (teaching, for example), it is attractive to have one very small node with the hub on it and then scale up as soon as someone logs in.

@UsDAnDreS

UsDAnDreS commented Jun 12, 2018

@betatim Yes, I need it for weekly workshops (with under 100 potential users), hence a weekly scale-up dynamic with emptiness in between. I'm still at the stage of figuring out how to configure the environment on the server, and will start thinking about scaling right after that, but any advice on optimization and efficient use is much appreciated (as dollars are ticking away from that Google Cloud free trial, haha).

@choldgraf
Member

I'm +1 on language that makes scaling down/up easier for people in general (and maybe some "user stories" from different usage scenarios where scaling up/down would help)

@minrk
Member

minrk commented Jun 19, 2018

I'm supporting a workshop of 50 students right now, and a traditional JupyterHub deployment doesn't suffer nearly as much on a scale-up event as Binder for two reasons:

  1. there aren't as many concurrent spawns
  2. there's only one image to pull, not several, so multiple pending spawns on one new node don't impede each other much

I haven't had a single spawn failure and it scales up from 0 to 3 nodes each morning as students show up.

My setup is similar to @consideRatio's: a 4-CPU node with the hub, proxy, prometheus, etc., and a 0-N autoscaling pool of 8-CPU nodes for users (I started with 16-CPU nodes, but ran into the GKE hard limit of 16 users per node when using persistent volumes #732). That way it gets a pretty good chance of scaling all the way back to 0 at the end of the day without my help.
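
Roughly like this in gcloud terms (cluster/pool names, zone, and the autoscaling bounds are illustrative, not my exact commands):

```bash
# Illustrative only; names, zone, and autoscaling bounds are placeholders.
# Core pool: a single n1-standard-4 node for the hub, proxy, prometheus, etc.
gcloud container node-pools create core-pool \
  --cluster workshop-cluster --zone us-central1-a \
  --machine-type n1-standard-4 \
  --num-nodes 1

# User pool: 0-N autoscaling n1-standard-8 nodes, so it can drop back to 0 overnight.
gcloud container node-pools create user-pool \
  --cluster workshop-cluster --zone us-central1-a \
  --machine-type n1-standard-8 \
  --num-nodes 0 \
  --enable-autoscaling --min-nodes 0 --max-nodes 5
```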

@minrk
Member

minrk commented Jun 21, 2018

This is continuing to work well, and I'm very happy! Completely unattended scale-down to just the one node at the end of the day, then back up to five in the morning:

[screenshot from 2018-06-21 09:45: cluster node count over the day]

(and a couple of people, I'm assuming instructors preparing for the next day, showing up around midnight, but still fitting on the first node.)

@consideRatio
Member Author

@minrk If you have 5 nodes and 5 users, one on each node, then the cluster won't scale down, right?

I'm amazed that this works so well when the nodes are massive; you have these super-sized nodes, right?

@minrk
Member

minrk commented Jul 1, 2018

My user nodes aren't huge; they are n1-standard-8. The limiting factor for my nodes is Google Cloud's very low persistent-volume-per-node limit, so I can't have more than 16 users per node. As a result, I have ~12 users per node.

The scale-down works very well for me because these are all students in a class in the same timezone, so they all go home at the end of the day and it scales back to 0.

Another thing that helps me scale down is that while I assign all non-user pods to one pool, I don't actually assign users strictly to the user pool. This is perhaps not the most prudent, but it allows my first few users to run without allocating a node in the user pool. That means that when one or two users show up after everybody has stopped for the day, instead of scaling up a user node, they stay on the always-on node with the hub.
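
In chart-config terms this corresponds to a preferred rather than required node affinity for user pods; a minimal sketch using the scheduling option added in later z2jh versions (not necessarily what my deployment sets):

```bash
# Minimal sketch, not my exact config; assumes the matchNodePurpose option
# available in later z2jh versions. "prefer" lets the first user pods fall back
# to the always-on core node instead of forcing a user-pool scale-up.
cat > config.yaml <<EOF
scheduling:
  userPods:
    nodeAffinity:
      matchNodePurpose: prefer  # "require" would pin users strictly to the user pool
EOF
```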

Reliable scale-down when you have more constant activity (like Binder) is going to be harder until we get an actually reliable pod-packing scheduler running. As it is now, we have to fake packing by cordoning a node and hoping it drains before the next scale-up event.
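
The manual workaround looks roughly like this (the node name is a made-up example):

```bash
# "Fake packing" by hand: stop new user pods from landing on a node, then wait
# and hope its remaining pods stop before the next scale-up event.
kubectl cordon gke-jhub-user-pool-1a2b3c4d-xyz1          # node name is made up
kubectl get pods --all-namespaces -o wide | grep gke-jhub-user-pool-1a2b3c4d-xyz1
```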

@consideRatio
Member Author

consideRatio commented Jul 1, 2018

@minrk Ah, excellent! I'm very happy to have spent a lot of time this week implementing a lot of what, it seems, you will appreciate.

Zero-to-jupyterhub-k8s 0.7+ easy setup (a rough sketch follows at the end of this comment):

  • Two node pools, labeled "hub.jupyter.org/node-purpose": "core" and "user" respectively, with a NoSchedule taint on the user pool that the user pods tolerate. This prevents kube-dns or one of the core pods from happening to schedule there and causing issues, as they otherwise can if, for example, you run an upgrade and the core node lacks the resources to handle two simultaneous hubs / proxies etc., so they end up on a user node.
    • The "core" node-pool requires only a single 1-core node
    • The "user" node-pool: autoscaling, with 4- or 8-core machines perhaps, and perhaps preemptible nodes as well.
  • Evictable placeholder user pods (with low PodPriority) create a configurable amount of headroom
  • A continuous image puller pulls images whenever the placeholders trigger a scale-up
  • A custom scheduler packs the user pods tightly

Btw: Google has increased, or will increase, that PVC restriction to 128 or similar, as far as I recall.
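
Put together, a deployment along these lines could look roughly like the sketch below; the pool names, zone, machine types, chart version, and the replica / autoscaling numbers are placeholders rather than a verified recipe.

```bash
# Rough sketch of the setup described above; names, zone, chart version, and
# the numbers are placeholders, not a verified recipe.

# Node pools labeled by purpose, with a NoSchedule taint on the user pool that
# only the user pods tolerate.
gcloud container node-pools create core-pool \
  --cluster jhub-cluster --zone europe-west1-b \
  --machine-type n1-standard-1 --num-nodes 1 \
  --node-labels hub.jupyter.org/node-purpose=core

gcloud container node-pools create user-pool \
  --cluster jhub-cluster --zone europe-west1-b \
  --machine-type n1-standard-4 --preemptible \
  --num-nodes 0 --enable-autoscaling --min-nodes 0 --max-nodes 10 \
  --node-labels hub.jupyter.org/node-purpose=user \
  --node-taints hub.jupyter.org_dedicated=user:NoSchedule

# config.yaml enabling the 0.7+ features listed above: pod priority, placeholder
# pods for headroom, the continuous image puller, and the pod-packing user scheduler.
cat > config.yaml <<EOF
scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 2
  userScheduler:
    enabled: true
  userPods:
    nodeAffinity:
      matchNodePurpose: require
prePuller:
  continuous:
    enabled: true
EOF

helm upgrade --install jhub jupyterhub/jupyterhub \
  --version 0.7.0 --namespace jhub --values config.yaml
```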

@consideRatio changed the title from "Reduce cost of the default z2jh cluster setup" to "PR: Document a minimal cost deployment" on Sep 11, 2019
@consideRatio changed the title from "PR: Document a minimal cost deployment" to "PR Idea: Document a minimal cost deployment" on Sep 11, 2019
@consideRatio added the pr-idea label on Sep 11, 2019
@consideRatio
Member Author

Instead of documenting a minimalistic deployment, we are now documenting general optimizations for any deployment, and I think that is a better approach. Closing this issue!
