
Use Rook Ceph for Jupyterhub and Conda Store drives #2541

Merged: 73 commits merged into develop on Aug 21, 2024

Conversation

@Adam-D-Lewis (Member) commented Jun 25, 2024

Reference Issues or PRs

Related to #2534
Deploys the rook operator and rook-cluster Helm charts, sets up StorageClasses backed by a Rook Ceph cluster, and uses those storage classes for what were previously the JupyterHub and conda-store NFS drives.

Developed on Azure at the moment.

What does this implement/fix?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@Adam-D-Lewis linked an issue on Jun 25, 2024 that may be closed by this pull request.
@Adam-D-Lewis (Member, author) commented Jun 25, 2024

  • See TODOs in code for what is still left.
  • Also, try to reduce storage node size.
  • Make using NFS or Ceph configurable in nebari-config.yaml
  • Test migration



DEFAULT_DO_NODE_GROUPS = {
-    "general": DigitalOceanNodeGroup(instance="g-8vcpu-32gb", min_nodes=1, max_nodes=1),
+    "general": DigitalOceanNodeGroup(instance="g-8vcpu-32gb", min_nodes=1, max_nodes=5),
@Adam-D-Lewis (Member, author) commented on the diff above: No need to keep max_nodes as 5.

@Adam-D-Lewis (Member, author) commented Jun 28, 2024

One issue I'm hitting is that the Ceph cluster deployment takes a long time, but Helm thinks it's done because it only deploys a CephCluster object (which is quick); the Rook operator then handles the longer task of creating the rest of the Rook resources in successive steps.

What this means in practice is that the deploy step times out, and Ceph still isn't ready for another 5-10 minutes after the timeout.

I'm not sure of the best way to address this. We could add some sort of readiness check (in Kubernetes or in Python, sketched below), increase the timeout, or deploy Ceph in an earlier Nebari stage so it has extra time to become ready.
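
As a rough illustration of the readiness-check idea, here is a minimal Python sketch that polls the CephCluster custom resource until Rook reports it Ready. The resource name, namespace, and timeout are placeholders, and availability of the kubernetes Python client is assumed; this is not code from the PR.

# Hypothetical readiness check: poll the CephCluster CR until its status.phase is "Ready".
import time
from kubernetes import client, config

def wait_for_cephcluster(name: str, namespace: str, timeout_s: int = 1800) -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    api = client.CustomObjectsApi()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        obj = api.get_namespaced_custom_object(
            group="ceph.rook.io", version="v1",
            namespace=namespace, plural="cephclusters", name=name,
        )
        # Rook sets status.phase to "Ready" once the cluster has fully converged.
        if obj.get("status", {}).get("phase") == "Ready":
            return
        time.sleep(15)
    raise TimeoutError(f"CephCluster {namespace}/{name} not Ready after {timeout_s}s")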

@Adam-D-Lewis (Member, author) commented Jun 28, 2024

Currently, there are 3 storage nodes that each require about 1 CPU / 2-6 GB of RAM. On Azure, I added 3 nodes with 2 CPU / 8 GB RAM for storage, adding $210/month to the operating cost of Nebari (on top of the pre-existing $280/month). We may be able to use a single node by adapting cluster-test.yaml from here (roughly sketched below), but we likely won't get the same performance speedup and would lose the redundancy benefit that a more typical multi-node Ceph deployment provides.
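
For context only, a hypothetical sketch of single-node-style overrides for the rook-ceph-cluster Helm chart, loosely modeled on Rook's cluster-test.yaml and written here as a Python dict of values. The exact keys, names, and replica counts are assumptions, not what this PR ships.

# Hypothetical single-node overrides for the rook-ceph-cluster Helm chart values.
# Loosely based on Rook's cluster-test.yaml; every key, name, and count here is an assumption.
single_node_ceph_values = {
    "cephClusterSpec": {
        "mon": {"count": 1, "allowMultiplePerNode": True},
        "mgr": {"count": 1},
        "dashboard": {"enabled": False},
    },
    "cephFileSystems": [
        {
            "name": "ceph-filesystem",  # placeholder name
            "spec": {
                # replica size 1 trades redundancy for a single-node footprint
                "metadataPool": {"replicated": {"size": 1}},
                "dataPools": [{"name": "data0", "replicated": {"size": 1}}],
            },
        }
    ],
}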

@Adam-D-Lewis (Member, author) commented Jun 28, 2024

Because there isn't a pressing need for this, and the benefits of using Ceph seem to come only with significantly increased cost, I'm planning to put this on the back burner for the moment.

Questions we need to resolve:

  • Can you run Ceph on a single server, without high availability and disaster recovery, so the cost doesn't increase beyond current levels? Is there a benefit in doing so? Speeds may no longer be faster with a single node, though consolidating block, object, and file system storage under Ceph may still be helpful.
  • Do we want to make Ceph configurable (support NFS and Ceph) or only allow Ceph from now on?

@dcmcand (Contributor) commented Jul 2, 2024

I think we can make Ceph the first stage after the k8s cluster (so stage 3?) and we can see if that is enough.

As for costs, we could make single-node Ceph the default and have an HA flag in the config to move to a multi-node, HA setup. That way we still get the abstracted storage interface, but not the increased cost.

@Adam-D-Lewis can you test with a single node ceph and see how it runs?

@Adam-D-Lewis (Member, author) commented Jul 2, 2024

@dcmcand Yeah, I'll test with a single node.

@Adam-D-Lewis (Member, author) commented Jul 31, 2024

This PR is ready for review. You can do a deployment by adding storage.type = cephfs to the Nebari config file. You shouldn't switch an existing Nebari deployment from nfs (the default) to cephfs because you'll lose all your files and conda envs. I'll write up a docs PR marking this as experimental. I think we should start using this on Quansight's deployment (after we do a backup) to get a feel for it, and eventually make it the default if all goes well.

storage:
  type: cephfs

@Adam-D-Lewis added the project: JATIC label (Work item needed for the JATIC project) on Aug 1, 2024
@viniciusdc (Contributor) commented:

@Adam-D-Lewis did you work out the destroy bits?

@Adam-D-Lewis (Member, author) replied:

@Adam-D-Lewis did you work out the destroy bits?

No, not yet. The destroy still completes, but it takes longer at the moment and prints a message saying stage 07 had problems destroying; however, only stages 01 and 02 are usually necessary for a successful destroy.

@Adam-D-Lewis (Member, author) commented Aug 1, 2024

Update: Fixed. I deployed on Azure today and it timed out. Redeploying resulted in a successful deployment. I'll work on fixing that, but I'd appreciate an initial review even before then.

@Adam-D-Lewis (Member, author) commented Aug 5, 2024

I attempted to just have the operator wait before it was destroyed, using a local-exec provisioner, but that didn't work. There are all kinds of rook-ceph protections stopping the filesystem from being deleted. Attempting to delete the CephCluster results in a message along the lines of:
"failed to reconcile CephCluster "prod/prod". failed to clean up CephCluster "prod/prod": failed until dependent objects deleted CephFileSystemSubVolumeGroup and CephFileSystem."

The only way I could delete the CephFileSystemSubVolumeGroup and CephFileSystem was by manually deleting the finalizers on them. I tried adding the rook.io/force-deletion="true" annotation, but that didn't work. I also tried setting cephClusterSpec.cephConfig.cleanupPolicy.confirmation = "yes-really-destroy-data" in the cephcluster Helm chart values, but that didn't seem to have an effect either.

Even after deleting the CephFileSystemSubVolumeGroup and CephFileSystem by removing their finalizers, the CephCluster still won't delete. The operator logs show:
"failed to reconcile CephCluster "prod/prod". failed to clean up CephCluster "prod/prod": failed to check if volumes exist for CephCluster in namespace "prod": waiting for csi volume attachments in cluster "prod" to be cleaned up".

Overriding the CephCluster finalizers also deleted it. So, worst case, we could override the finalizers on those 3 objects in the destroy step of Nebari stage 07 and nebari destroy would succeed (roughly sketched below). It's a bit hacky, but it's unlikely to cause a problem since we would only do so when we are deleting the cluster anyway.
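
For illustration, a minimal Python sketch of that finalizer-override workaround, assuming the kubernetes Python client is available; the object names and namespace below are placeholders, not the actual resource names created by this PR.

# Hypothetical sketch: clear finalizers on the Rook objects that block deletion,
# so Kubernetes can garbage-collect them during nebari destroy. Names are placeholders.
from kubernetes import client, config

ROOK_OBJECTS = [
    ("cephfilesystemsubvolumegroups", "ceph-filesystem-csi"),  # placeholder name
    ("cephfilesystems", "ceph-filesystem"),                    # placeholder name
    ("cephclusters", "rook-ceph"),                             # placeholder name
]

def strip_rook_finalizers(namespace: str) -> None:
    config.load_kube_config()
    api = client.CustomObjectsApi()
    for plural, name in ROOK_OBJECTS:
        # A merge patch that nulls metadata.finalizers removes the finalizer list,
        # letting the object be deleted even though the operator hasn't finished cleanup.
        api.patch_namespaced_custom_object(
            group="ceph.rook.io", version="v1",
            namespace=namespace, plural=plural, name=name,
            body={"metadata": {"finalizers": None}},
        )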

My view is still that, while this is not ideal, we should merge it in, start testing rook-ceph, and fix this issue before moving Ceph from experimental to general availability.

@dcmcand (Contributor) commented Aug 19, 2024

@Adam-D-Lewis can you merge the latest develop in?

@dcmcand (Contributor) left a review comment:

@Adam-D-Lewis I think this is great work.

Some questions though:

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?
  2. Is there a ticket for moving node storage to ceph?
  3. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?
  4. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vice versa. I think we will definitely need that; is there a follow on ticket for that?
  5. Let's spend some time thinking about how to migrate from nfs to ceph and back
  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

@Adam-D-Lewis (Member, author) commented Aug 19, 2024

Thanks for the review @dcmcand. I responded to your questions below.

@Adam-D-Lewis I think this is great work.

Some questions though:

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?

This PR provides feature parity with the existing NFS storage state in Nebari. I'm happy to work on allowing the use of additional nodes to Ceph assuming that work remains in scope for JATIC. I have created a follow on ticket to make the number of RookCeph backing nodes variable - #2573. I assumed we would validate that Ceph storage works for a single node and then make it the default for future deployments, and only after that point would we work on increasing the number of rook/ceph nodes.

  2. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

  3. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?

This does use dynamic provisioning of PVs since PVCs are provisioned from storage classes. I believe the PVC should be able to increase in size without data loss, but I'll test and get back to you. I'm not sure how increasing the size of the disks that back the RookCeph storage would work. I'll look into that a bit more and get back to you as well.

UPDATE: I tested increasing the shared directory and the conda storage sizes on a GCP deployment. The PVC for the disk that backs the RookCeph storage was increased without a problem (verified in GCP Console as well) and the PVCs for the shared directory and conda env storage also increased without a problem.
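
As a small illustration of what that resize amounts to, a hedged Python sketch that patches a PVC's requested size. This only works when the backing StorageClass has allowVolumeExpansion enabled, and the PVC name, namespace, and size below are placeholders rather than values from this PR.

# Hypothetical sketch: request a larger size on an existing PVC.
from kubernetes import client, config

def expand_pvc(name: str, namespace: str, new_size: str) -> None:
    config.load_kube_config()
    # Patch only spec.resources.requests.storage; the CSI driver handles the actual resize.
    body = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    client.CoreV1Api().patch_namespaced_persistent_volume_claim(name, namespace, body)

# e.g. expand_pvc("conda-store-storage", "dev", "300Gi")  # placeholder names and size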

  4. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vice versa. I think we will definitely need that; is there a follow on ticket for that?

Currently, the only warning is in the documentation PR. We don't have a way to recognize changes to the nebari config file during a redeploy, which makes this difficult. I did open this ticket, which explains the issue further and proposes a way to add this ability. I wasn't originally planning on working on that before this PR is merged, but I'd be happy to if you think it's needed. As a workaround, we could require user confirmation on any deploy that has cephfs set as the storage type, but I think it's fair to just put the warning in the docs and list it there as an alpha feature.
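
To make that workaround concrete, here is a hypothetical confirmation gate of the kind described above, sketched with typer-style prompting; the function name, message, and hook point are assumptions, not code from this PR.

# Hypothetical confirmation gate before deploying with experimental cephfs storage.
import typer

def confirm_cephfs_storage(storage_type: str) -> None:
    if storage_type == "cephfs":
        typer.confirm(
            "storage.type is 'cephfs' (experimental). Switching an existing deployment "
            "from nfs will lose user files and conda environments. Continue?",
            abort=True,  # raises typer.Abort if the user answers no
        )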

  5. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

@dcmcand (Contributor) commented Aug 20, 2024

  1. Are you planning on a HA multiple ceph node option for this in the future? Is there a follow on ticket for that?

This PR provides feature parity with the existing NFS storage state in Nebari. I'm happy to work on allowing the use of additional nodes to Ceph assuming that work remains in scope for JATIC. I have created a follow on ticket to make the number of RookCeph backing nodes variable - #2573. I assumed we would validate that Ceph storage works for a single node and then make it the default for future deployments, and only after that point would we work on increasing the number of rook/ceph nodes.

👍

  2. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

The benefit would be allowing the general node to spin up in a different AZ, enabling multi-AZ deployments. That has caused errors for a number of folks when they upgraded.
I misspoke when I said node storage; I was remembering the issue incorrectly. Basically, any PVCs made by pods in the cluster would need to move to Ceph. Right now we have at least one PVC backed by EBS, which can't cross AZs.

  3. It appears this is using dynamic pvc provisioning for Ceph which would allow expanding conda-store volumes in the future, correct?

This does use dynamic provisioning of PVs since PVCs are provisioned from storage classes. I believe the PVC should be able to increase in size without data loss, but I'll test and get back to you. I'm not sure how increasing the size of the disks that back the RookCeph storage would work. I'll look into that a bit more and get back to you as well.

👍

UPDATE: I tested increasing the shared directory and the conda storage sizes on a GCP deployment. The PVC for the disk that backs the RookCeph storage was increased without a problem (verified in GCP Console as well) and the PVCs for the shared directory and conda env storage also increased without a problem.

  4. I didn't see any validation or data loss warning for changing an existing nfs deployment to ceph or vice versa. I think we will definitely need that; is there a follow on ticket for that?

Currently, the only warning is in the documentation PR. We don't have a way to recognize changes to the nebari config file during a redeploy, which makes this difficult. I did open this ticket, which explains the issue further and proposes a way to add this ability. I wasn't originally planning on working on that before this PR is merged, but I'd be happy to if you think it's needed. As a workaround, we could require user confirmation on any deploy that has cephfs set as the storage type, but I think it's fair to just put the warning in the docs and list it there as an alpha feature.

I think that is fine for an alpha feature, but given the risk of user data loss, this is a problem we need to solve before we go mainstream with Ceph, either by figuring out how to migrate or by blocking switching.

  5. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

I think that is fair now, but I would prefer to have a switching path in the future, even if it is to use backup and restore.

  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

Basically what I talked about in the meeting: removing dedicated PVs and instead creating an NFS storage class that dynamically provisions for each cloud provider. That way NFS would be expandable for all providers in the future.
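
For illustration, a minimal sketch of the kind of dynamically provisioning NFS StorageClass this could mean, created with the Python kubernetes client and the csi-driver-nfs provisioner. The provisioner choice, class name, and server/share values are assumptions, not part of this PR.

# Hypothetical sketch: a StorageClass that dynamically provisions NFS-backed PVCs.
# Assumes csi-driver-nfs is installed in the cluster; all names and paths are placeholders.
from kubernetes import client, config

def create_nfs_storage_class(server: str, share: str) -> None:
    config.load_kube_config()
    sc = client.V1StorageClass(
        metadata=client.V1ObjectMeta(name="nebari-nfs-dynamic"),  # placeholder name
        provisioner="nfs.csi.k8s.io",
        parameters={"server": server, "share": share},
        reclaim_policy="Retain",
        allow_volume_expansion=True,
        mount_options=["nfsvers=4.1"],
    )
    client.StorageV1Api().create_storage_class(sc)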

@Adam-D-Lewis (Member, author) commented Aug 20, 2024

  2. Is there a ticket for moving node storage to ceph?

I don't have a ticket for this. I'm not sure how we would do this, and I can't think of any strong benefits. Can you explain your thoughts further and/or what you're looking for here?

The benefit would be allowing the general node to spin up in a different AZ, enabling multi-AZ deployments. That has caused errors for a number of folks when they upgraded. I misspoke when I said node storage; I was remembering the issue incorrectly. Basically, any PVCs made by pods in the cluster would need to move to Ceph. Right now we have at least one PVC backed by EBS, which can't cross AZs.

Here is the issue about moving other PVCs to use ceph for storage.

  5. Let's spend some time thinking about how to migrate from nfs to ceph and back

I think it's fair to not support that transition, at least initially while we prove that RookCeph is going to work well. So in that case, users would only be expected to make the decision on the initial deployment and not change it afterwards. I do think we should run this on the Quansight deployment to help test it though. In a case like that, I would tell an admin to copy all the files in the NFS drives, tar them and copy to cloud storage. Then copy them to the new disk and untar them after we redeploy similar to what we do in the manual backup docs. We should test this before doing it (particularly for the conda-store NFS drive), but I think it should work fine.

I think that is fair now, but I would prefer to have a switching path in the future, even if it is to use backup and restore.

Yeah, I'm happy for backup and restore to be the switching plan. I'll test it out after this is merged, since I'll want us to try Rook Ceph on the Quansight Nebari deployment at that point.

  6. Let's spend some time thinking about how to use this same design pattern for nfs with dynamic provisioning, etc

Can you elaborate here? What else are you looking for?

Basically what I talked about in the meeting: removing dedicated PVs and instead creating an NFS storage class that dynamically provisions for each cloud provider. That way NFS would be expandable for all providers in the future.

I doubt it would be very hard to make that change, and I'd be happy to comment on an issue with more thoughts.

@Adam-D-Lewis (Member, author) commented:

Thanks again for the review, @dcmcand. Is there anything else you want to see before this is merged?

@dcmcand (Contributor) left a review: 🎉

@Adam-D-Lewis (Member, author) commented:

The failing test in the Local Integration Tests has been seen in other PRs. @viniciusdc is working on a fix, and it is unrelated to this PR, so I will go ahead and merge.

@Adam-D-Lewis merged commit 3695eb4 into develop on Aug 21, 2024 (27 of 28 checks passed) and deleted the rook branch on August 21, 2024 at 16:44.
Labels: project: JATIC (Work item needed for the JATIC project)
Projects: Status: Done 💪🏾
Linked issues that may be closed by this pull request: Investigate Ceph for use in Nebari
5 participants