
fix: [DENA-602] Wait for cockroach cluster to be ready #51

Merged: 32 commits into main on Jul 22, 2024

Conversation

MarcinGinszt (Collaborator):

Without waiting for the cluster to be ready, the scripts were entering "CrashLoopBackOff" on cluster start.

MarcinGinszt requested a review from a team as a code owner on July 15, 2024 08:16

linear bot commented Jul 15, 2024

MarcinGinszt changed the title from "[DENA-602] Wait for cockroach cluster to be ready" to "fix: [DENA-602] Wait for cockroach cluster to be ready" on Jul 15, 2024
@@ -16,7 +15,27 @@ data:
echo "DB_NAME envvar not set"
exit 1
fi
# Wait for cluster to be ready
Contributor:

Could we move this to a lifecycle hook on the container, and let k8s handle the retrying/back-off behaviour?

Why doesn't the main cockroach container in the STS suffer from the same issues? Can we just re-use whatever it uses to figure out 'readiness'?

I'm also not a fan of this just being copied across the two scripts; it feels like it would be very easy to update one and forget to update the other.

MarcinGinszt (Collaborator, Author):

Lifecycle hooks execute synchronously with the container entrypoint, and the same applies to readiness probes.

Trying to make it nicer, I put it into a separate script that runs before the main scripts. Not sure if it's much better, but I can't find a better way to do it.
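
As the kustomize build diff further down shows, the db-init job container then effectively runs the two scripts chained with bash -c:

    /bin/bash -c "/opt/scripts/wait-for-cluster-to-be-ready.sh && /opt/scripts/user-schema-bootstrap.sh"

(the backup job does the same with backup-bootstrap.sh).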

@@ -5,7 +5,6 @@ metadata:
data:
user-schema-bootstrap.sh: |
#!/bin/bash
set -e
Contributor:

Do we really want to remove this? It could mean the script exiting with 0 even if e.g. an SQL command failed.
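
A minimal illustration of the concern, with hypothetical commands standing in for the real SQL calls:

    #!/bin/bash
    # Without `set -e`, a failing command does not stop the script,
    # so the job still exits 0 and is treated as successful.
    false                      # stands in for a failing SQL command
    echo "carrying on anyway"  # still reached; overall exit status is 0
    # With `set -e`, the script would stop at `false` and exit 1 instead.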

MarcinGinszt requested review from matthewhughes-uw and a team on July 18, 2024 13:06
Contributor:

Post kustomize build diff:

diff --git a/root-manifests/cockroachdb/manifests-cert-manager/db-init/manifests.yaml b/built-manifests/cockroachdb/manifests-cert-manager/db-init/manifests.yaml
index 21999bc..0b0d8cc 100644
--- a/root-manifests/cockroachdb/manifests-cert-manager/db-init/manifests.yaml
+++ b/built-manifests/cockroachdb/manifests-cert-manager/db-init/manifests.yaml
@@ -9,9 +9,11 @@ spec:
   template:
     spec:
       containers:
-      - command:
+      - args:
+        - -c
+        - /opt/scripts/wait-for-cluster-to-be-ready.sh && /opt/scripts/user-schema-bootstrap.sh
+        command:
         - /bin/bash
-        - /opt/scripts/user-schema-bootstrap.sh
         env:
         - name: COCKROACH_HOST
           valueFrom:
@@ -61,6 +63,7 @@ spec:
                 path: client.root.key
               name: cockroachdb.client.root
       - configMap:
+          defaultMode: 511
           name: cockroachdb-scripts
         name: cockroachdb-scripts
 ---
diff --git a/root-manifests/cockroachdb/manifests-cert-manager/manifests.yaml b/built-manifests/cockroachdb/manifests-cert-manager/manifests.yaml
index d0e7ea6..2cf6859 100644
--- a/root-manifests/cockroachdb/manifests-cert-manager/manifests.yaml
+++ b/built-manifests/cockroachdb/manifests-cert-manager/manifests.yaml
@@ -14,7 +14,7 @@ data:
         exit 1
     fi
 
-    # todo: wait for the cockroachdb cluster to be ready
+    # execute command
     SQL_CMD="/cockroach/cockroach sql" > /dev/null
 
     $SQL_CMD << EOF
@@ -29,14 +29,20 @@ data:
     EOF
   user-schema-bootstrap.sh: "#!/bin/bash\nset -e\n\nif [ -z \"$DB_USER\" ]; then\n
     \   echo \"DB_USER envvar is not set\"\n    exit 1\nfi\n\nif [ -z \"$DB_NAME\"
-    ]; then\n    echo \"DB_NAME envvar not set\"\n    exit 1\nfi\n\nSQL_CMD=\"/cockroach/cockroach
-    -d system sql\" > /dev/null\n\n# try creating user, database, grant database permissions
-    to user\n$SQL_CMD << EOF\n  CREATE USER IF NOT EXISTS $DB_USER;\n  CREATE DATABASE
-    IF NOT EXISTS $DB_NAME;\n  GRANT ALL ON DATABASE $DB_NAME TO $DB_USER;\nEOF\n\n#
-    If database already exists and contains tables, grant access to them for user.
-    \n# This command will fail for freshly initialized databases.\n$SQL_CMD --execute
-    \"GRANT ALL ON TABLE $DB_NAME.* TO $DB_USER;\" 2>1 || echo \"no tables in database
-    yet\"\n"
+    ]; then\n    echo \"DB_NAME envvar not set\"\n    exit 1\nfi\n\n# execute command
+    \nSQL_CMD=\"/cockroach/cockroach -d system sql\" > /dev/null\n\n# try creating
+    user, database, grant database permissions to user\n$SQL_CMD << EOF\n  CREATE
+    USER IF NOT EXISTS $DB_USER;\n  CREATE DATABASE IF NOT EXISTS $DB_NAME;\n  GRANT
+    ALL ON DATABASE $DB_NAME TO $DB_USER;\nEOF\n\n# If database already exists and
+    contains tables, grant access to them for user. \n# This command will fail for
+    freshly initialized databases.\n$SQL_CMD --execute \"GRANT ALL ON TABLE $DB_NAME.*
+    TO $DB_USER;\" 2>1 || echo \"no tables in database yet\"\n"
+  wait-for-cluster-to-be-ready.sh: "#!/bin/bash\n\nMAX_RETRIES=40\nRETRY_INTERVAL=3\n\nis_cluster_ready()
+    {\n  cockroach sql --execute=\"SELECT 1;\" &> /dev/null\n}\n\nuntil is_cluster_ready;
+    do\n  ((retry_count++))\n\n  if [ $retry_count -ge $MAX_RETRIES ]; then\n      echo
+    \"Reached maximum retries. Cluster is not ready.\"\n      exit 1\n  fi\n\n  echo
+    \"Cluster is not ready yet. Retry $retry_count/$MAX_RETRIES. Waiting for $RETRY_INTERVAL
+    seconds...\"\n  sleep $RETRY_INTERVAL\ndone \n"
 kind: ConfigMap
 metadata:
   name: cockroachdb-scripts
@@ -303,9 +309,11 @@ spec:
   template:
     spec:
       containers:
-      - command:
+      - args:
+        - -c
+        - /opt/scripts/wait-for-cluster-to-be-ready.sh && /opt/scripts/backup-bootstrap.sh
+        command:
         - /bin/bash
-        - /opt/scripts/backup-bootstrap.sh
         env:
         - name: COCKROACH_HOST
           valueFrom:
@@ -361,6 +369,7 @@ spec:
                 path: client.root.key
               name: cockroachdb.client.root
       - configMap:
+          defaultMode: 511
           name: cockroachdb-scripts
         name: cockroachdb-scripts
   ttlSecondsAfterFinished: 600

=============================================
k8s objects: 0 to add, 0 to destroy
6 changed hunks
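
For readability, the wait-for-cluster-to-be-ready.sh entry embedded in the ConfigMap above unescapes to roughly the script below (retry_count is never initialised explicitly; bash arithmetic treats the unset variable as 0 on the first increment):

    #!/bin/bash

    MAX_RETRIES=40
    RETRY_INTERVAL=3

    is_cluster_ready() {
      cockroach sql --execute="SELECT 1;" &> /dev/null
    }

    until is_cluster_ready; do
      ((retry_count++))  # unset on the first pass; treated as 0 by bash arithmetic

      if [ $retry_count -ge $MAX_RETRIES ]; then
          echo "Reached maximum retries. Cluster is not ready."
          exit 1
      fi

      echo "Cluster is not ready yet. Retry $retry_count/$MAX_RETRIES. Waiting for $RETRY_INTERVAL seconds..."
      sleep $RETRY_INTERVAL
    done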

sbuliarca left a comment:

I think this is nicer than what we had before and would result in fewer job pod failures.
In the past I think we were relying on k8s retrying within the backoffLimit.

Base automatically changed from DENA-603 to main July 22, 2024 07:08
MarcinGinszt merged commit 36b2333 into main on Jul 22, 2024
1 of 2 checks passed
MarcinGinszt deleted the DENA-602 branch on July 22, 2024 07:21