Skip to content
This repository has been archived by the owner on May 30, 2023. It is now read-only.

app-admin/locksmith: bump commit ID #1161

Merged
merged 1 commit into from
Aug 27, 2021
Merged

Conversation

@tormath1 tormath1 self-assigned this Aug 3, 2021
@tormath1
Copy link
Contributor Author

tormath1 commented Aug 3, 2021

getting random failure on the etcd client when the endpoint is not explicit - not sure why but it might be related to the etcd upgrade maybe:

core@localhost ~ $ for i in {1..10}; do locksmithctl status && sleep 1; done
Error initializing etcd client: dial tcp 127.0.0.1:4001: connect: connection refused
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Error initializing etcd client: dial tcp 127.0.0.1:4001: connect: connection refused
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Error initializing etcd client: dial tcp 127.0.0.1:4001: connect: connection refused
Error initializing etcd client: dial tcp 127.0.0.1:4001: connect: connection refused
core@localhost ~ $ for i in {1..10}; do locksmithctl -endpoint http://localhost:2379 status && sleep 1; done
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1
Available: 1
Max: 1

@jepio
Copy link
Contributor

jepio commented Aug 3, 2021

The default endpoints list is [localhost:2379,localhost:4001]. Looking at the code it seems to that "go.etcd.io/etcd/client" shuffles the passed in endpoints. But the comments say that if one is unreachable it will try another one, which is not happening correctly.

@tormath1
Copy link
Contributor Author

tormath1 commented Aug 3, 2021

yeah I arrived to the very same conclusion - it's very weird, I'll try tomorrow to revert / upgrade the etcd lib to see if it changes something.

Reverting flatcar/locksmith@1b5eb50 fixes the issue; this PR seems to be the cause of the trouble: etcd-io/etcd#5888 but it seems it has been resolved in etcd-io/etcd#8515.

The issue is the following: when we create the etcd client, we call the getClient which initialize a lock etcd client. In the init, it calls a Create (a Set) methods which provides the isOneShotCtxValue -> if there is an error, the action won't be retry. So basically we don't have anymore the endpoints resilience - we could update the tests to run on a specific endpoint but it keeps the main issue: user with multiple endpoints can experiencing this kind of issue.

@tormath1 tormath1 added the on-hold this PR or this issue is on-hold - it's blocked by something external label Aug 13, 2021
@tormath1
Copy link
Contributor Author

tormath1 commented Aug 13, 2021

This PR is currently on-hold because locksmithctl behaves not correctly due to the etcd upgrade.

We are currently exploring various fixes, like a custom patch (flatcar/locksmith#11) or upgrade to etcd/v3 (flatcar/locksmith#12).

EDIT: we merged the following patch flatcar/locksmith#11 in order to properly make the upgrade to etcd/v3; this PR is no more on-hold.

@tormath1 tormath1 removed the on-hold this PR or this issue is on-hold - it's blocked by something external label Aug 26, 2021
@tormath1 tormath1 force-pushed the tormath1/update-locksmith-commit branch from dadb202 to f362762 Compare August 26, 2021 07:54
closes flatcar/Flatcar#407

Signed-off-by: Mathieu Tortuyaux <mathieu@kinvolk.io>
@tormath1 tormath1 force-pushed the tormath1/update-locksmith-commit branch from 5a29d24 to a8fcf2d Compare August 27, 2021 14:28
@tormath1 tormath1 marked this pull request as ready for review August 27, 2021 14:28
@tormath1 tormath1 requested a review from a team August 27, 2021 14:29
@tormath1 tormath1 merged commit 07cdc2f into main Aug 27, 2021
@tormath1 tormath1 deleted the tormath1/update-locksmith-commit branch August 27, 2021 15:17
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

locksmithd will reboot outside of reboot windows, if no semaphore was acquired before
3 participants