ETCD-199: bump etcd v3.4.16 #83

Merged
merged 85 commits into from
Jun 18, 2021

@hexfusion hexfusion commented Jun 16, 2021

This PR bumps etcd from v3.4.14 to v3.4.16 using Go 1.12. It is part of a multi-PR migration [1] to get to etcd 3.5.0.

v3.4.16 (2021-05-11)

See code changes and v3.4 upgrade guide for any breaking changes.

etcd server

Metrics


v3.4.15 (2021-02-26)

See code changes and v3.4 upgrade guide for any breaking changes.

etcd server

Package fileutil

Dependency


[1] #85

dbavatar and others added 30 commits September 16, 2019 11:49
In case of URLs that are synonyms, the current lexicographic sorting
and comparison of the URLs fails with frustrating errors. Make sure to
do a full comparison between every set of PeerURLs before failing.

Fixes etcd-io#11013
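
The idea of comparing every set of PeerURLs before failing can be sketched as below. This is a minimal illustration, not the etcd implementation: `canonical`, `sameURLSet` and `matchMembers` are hypothetical names, and treating `localhost` as a synonym of `127.0.0.1` is an assumed stand-in for whatever resolution etcd actually performs.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"sort"
)

// canonical reduces a URL to a comparable form.
// ASSUMPTION: only the synonym "localhost" == "127.0.0.1" is handled;
// a real implementation would resolve hostnames properly.
func canonical(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	if host == "localhost" {
		host = "127.0.0.1"
	}
	return u.Scheme + "://" + net.JoinHostPort(host, u.Port()), nil
}

// sameURLSet reports whether a and b hold the same canonical URLs,
// ignoring order.
func sameURLSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	ca, cb := make([]string, len(a)), make([]string, len(b))
	for i := range a {
		var err error
		if ca[i], err = canonical(a[i]); err != nil {
			return false
		}
		if cb[i], err = canonical(b[i]); err != nil {
			return false
		}
	}
	sort.Strings(ca)
	sort.Strings(cb)
	for i := range ca {
		if ca[i] != cb[i] {
			return false
		}
	}
	return true
}

// matchMembers tries every unmatched existing member for each local
// member before giving up, instead of pairing members by sorted position.
func matchMembers(local, existing [][]string) bool {
	if len(local) != len(existing) {
		return false
	}
	used := make([]bool, len(existing))
	for _, l := range local {
		found := false
		for j, e := range existing {
			if !used[j] && sameURLSet(l, e) {
				used[j], found = true, true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	// Sorted lexicographically, these lists line up differently
	// ("192..." sorts before "localhost" but after "127..."), so a
	// positional comparison would report a mismatch.
	local := [][]string{{"http://localhost:2380"}, {"http://192.168.0.2:2380"}}
	existing := [][]string{{"http://127.0.0.1:2380"}, {"http://192.168.0.2:2380"}}
	fmt.Println(matchMembers(local, existing))
}
```

Greedy matching is sufficient in this sketch because `sameURLSet` is an equivalence relation: any existing member that matches a local member is interchangeable with any other that matches it.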
Use golang.org/x/sys/unix for F_OFD_* constants.

This fixes the issue that F_OFD_GETLK was defined incorrectly,
resulting in bugs such as moby/moby#31182

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
[3.4 backport] pkg/fileutil: fix F_OFD_ constants
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
This fixes etcd being unable to send any message longer than 64 KB as
a notification over the websocket. This was because an older version
of grpc-websocket-proxy was used and the WithMaxRespBodyBufferSize
option wasn't set.
etcdserver: Fix 64 KB websocket notification message limit
[Backport-3.4] etcdserver/api/etcdhttp: log successful etcd server side health check in debug level
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
There are situations where we don't wish to fsync but we do want to
write the data.

Typically this occurs in clusters where fsync latency (often the
result of firmware) transiently spikes.  For Kubernetes clusters this
causes (many) elections, which have knock-on effects such that the API
server will transiently fail, causing other components to fail in turn.

By writing the data (buffered and asynchronously flushed, so in most
situations the write is fast) and avoiding the fsync we no longer
trigger this situation and opportunistically write out the data.

Anecdotally:
  Because the fsync is missing, there is the argument that certain
  types of failure events will cause data corruption or loss; in
  testing this wasn't seen.  If this were to occur, the expectation is
  that the member can be re-added to a cluster or, worst case, restored
  from a robust persisted snapshot.

  The etcd members are deployed across isolated racks with different
  power feeds.  An instantaneous failure of all of them simultaneously
  is unlikely.

  Testing was usually of the form:
   * create (Kubernetes) etcd write-churn by creating replicasets of
     some 1000s of pods
   * break/fail the leader

  Failure testing included:
   * hard node power-off events
   * disk removal
   * orderly reboots/shutdown

  In all cases when the node recovered it was able to rejoin the
  cluster and synchronize.
When using --unsafe-no-fsync still write out the data
The integration jobs fail with timeouts slightly over 3s; increase
this marginally so false failures are less prevalent.
integration: relax leader timeout from 3s to 4s
etcdserver: fix incorrect metrics generated when clients cancel watches
Go 1.12.2 is what is tested in CI and what builds are recommended to
use, so we should also pin to this in the go directive version.
[release-3.4]: Pin go version in go.mod to 1.12
Currently in CI the tests are only run with Go v1.12; this also adds Go
v1.15.11.

Excludes certain variants for v1.15.
This patch is needed due to go 1.15 erroring on:

"Setctty set but Ctty not valid in child".
hexfusion and others added 7 commits June 17, 2021 11:49
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
This makes it easier to root-cause a failing /health check.
For example, we use a load balancer to health check
each etcd instance, and when one etcd node gets terminated,
it's hard to tell whether the etcd "server" was really failing
or the client (or load balancer) failed to reach the etcd cluster,
which also counts as a failure in the load balancer health check.

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
…s not part of member list and dataDir exists

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
@hexfusion hexfusion changed the title from "bump etcd v3.4.16" to "ETCD-199: bump etcd v3.4.16" on Jun 17, 2021
@hexfusion
Author

infra route53

/test configmap-scale

@hexfusion
Author

failed to acquire lease for "aws-quota-slice": resources not found

/test e2e-aws-upgrade

@hexfusion
Author

failed to acquire lease for "aws-quota-slice": resources not found

/test e2e-aws-serial

@hexfusion hexfusion removed the do-not-merge/hold label Jun 17, 2021
@hexfusion
Author

/test configmap-scale

@lilic

lilic commented Jun 18, 2021

/retest

@rsevilla87
Member

/test configmap-scale

1 similar comment
@rsevilla87
Member

/test configmap-scale

@lilic

lilic commented Jun 18, 2021

Sadly, we reached an AWS limit and they are still working on it, so I'm not sure when this will pass. Maybe it's worth waiting until Monday to retest?

@lilic lilic left a comment

/lgtm
/approve

🎉

@openshift-ci openshift-ci bot added the lgtm label Jun 18, 2021
@openshift-ci

openshift-ci bot commented Jun 18, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, lilic

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hexfusion
Author

/test configmap-scale

@hexfusion
Author

route53 ...
/test configmap-scale

@hexfusion
Author

infra ..............

/test configmap-scale

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.
