Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

externalize snapshot catchup entries to etcd flag #15033

Merged
merged 1 commit into from
Jan 5, 2023

Conversation

chaochn47
Copy link
Member

@chaochn47 chaochn47 commented Dec 20, 2022

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

Motivation

  1. It can benefit one-node cluster with limited resource (e.g memory) without any performance impact.
  2. For multi-node cluster, setting a small value for DefaultSnapshotCatchUpEntries can reduce the memory usage while less tolerant of network events that will cause snapshot transfer (reduced availability).
  3. It can benefit the E2E test cases to speed up test time and improve development velocity.

Questions

  1. open to discussion if we want to mark this flag as experimental.
  2. should we set a lower bound limit on the snapshot catchup entries count?

Related #14819 and #15013 (comment)

@chaochn47
Copy link
Member Author

@ahrtr @lavacat Could you please take a look and provide guidance? I am not sure about the answer of the above two questions.

/cc @serathius

Copy link

@lavacat lavacat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please update changelog.

server/etcdmain/config.go Outdated Show resolved Hide resolved
@lavacat
Copy link

lavacat commented Dec 20, 2022

I think this setting can be qualified as stable feature. So, it's ok to drop experimental. When it was added, there was a comment about making it tunable #2403 (comment)

@ahrtr
Copy link
Member

ahrtr commented Dec 20, 2022

1.open to discussion if we want to mark this flag as experimental.

YES, a new flag usually is experimental. See features.md#experimental. It isn't super clear what's the exact impact for now. So keeping it experimental is a better choice.

2.should we set a lower bound limit on the snapshot catchup entries count?

I would expect NO for now. For single node cluster for edge case, I even want to set a very small value, e.g. 10. But we need to provide general guidance on this, usually it should be expected latency * max throughput for multi-node cluster.

cc @ptabor & @spzala for opinion as well.

server/etcdmain/help.go Outdated Show resolved Hide resolved
tests/e2e/etcd_config_test.go Outdated Show resolved Hide resolved
@chaochn47 chaochn47 force-pushed the snapshotCatchupEntries branch 2 times, most recently from 24313cc to 98a6aff Compare December 21, 2022 07:14
@codecov-commenter
Copy link

codecov-commenter commented Dec 21, 2022

Codecov Report

Merging #15033 (2c46b2b) into main (e73f55d) will decrease coverage by 0.26%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #15033      +/-   ##
==========================================
- Coverage   74.85%   74.58%   -0.27%     
==========================================
  Files         415      415              
  Lines       34287    34288       +1     
==========================================
- Hits        25666    25575      -91     
- Misses       6997     7072      +75     
- Partials     1624     1641      +17     
Flag Coverage Δ
all 74.58% <100.00%> (-0.27%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/config/config.go 79.76% <ø> (ø)
server/embed/config.go 74.48% <ø> (ø)
server/etcdmain/config.go 85.01% <100.00%> (+0.05%) ⬆️
server/etcdserver/api/etcdhttp/types/errors.go 60.00% <0.00%> (-13.34%) ⬇️
server/etcdserver/api/rafthttp/peer_status.go 87.87% <0.00%> (-12.13%) ⬇️
server/etcdserver/api/rafthttp/peer.go 87.01% <0.00%> (-8.45%) ⬇️
client/pkg/v3/testutil/leak.go 60.17% <0.00%> (-7.08%) ⬇️
client/pkg/v3/fileutil/purge.go 68.85% <0.00%> (-6.56%) ⬇️
server/storage/mvcc/watchable_store.go 89.13% <0.00%> (-3.63%) ⬇️
server/etcdserver/api/v3election/election.go 66.66% <0.00%> (-2.78%) ⬇️
... and 17 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you @chaochn47

Copy link
Contributor

@ptabor ptabor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Shell we have a test that checks whether etcd instance works in stable way when the setting is unreasonably small (e.g. 1) ?

  • Do we disallow setting this to 0 or negative ?

@ahrtr
Copy link
Member

ahrtr commented Dec 30, 2022

  • Do we disallow setting this to 0 or negative ?

Negative value is definitely NOT possible, because its type is uint64. 0 is the default value, and etcd will use the default value 5000 if it's 0.

  • Shell we have a test that checks whether etcd instance works in stable way when the setting is unreasonably small (e.g. 1) ?

Agreed. @chaochn47 please add test cases to intentionally to set a small value (e.g 1, 10). Please also resolve comment #15033 (comment), and also rebase this PR.

@chaochn47 chaochn47 force-pushed the snapshotCatchupEntries branch 2 times, most recently from 8450f72 to 7f56ecd Compare January 5, 2023 02:09
Signed-off-by: Chao Chen <chaochn@amazon.com>
@chaochn47
Copy link
Member Author

chaochn47 commented Jan 5, 2023

Thanks for the review and suggestion!!

Added a test TestEtcdHealthyWithTinySnapshotCatchupEntries to set snapshot catchup entry to 1 and validate 3 node etcd cluster is healthy under put traffic.

Is it sufficient as an experimental flag?

@ahrtr
Copy link
Member

ahrtr commented Jan 5, 2023

LGTM, thank you. Merging...

@ahrtr ahrtr merged commit 6200b22 into etcd-io:main Jan 5, 2023
@chaochn47 chaochn47 deleted the snapshotCatchupEntries branch January 5, 2023 05:49
@halegreen
Copy link
Contributor

@chaochn47 hi, will this change be picked to v3.5 too?

@chaochn47
Copy link
Member Author

chaochn47 commented Jan 7, 2023

Hi @halegreen

I am afraid that this experimental flag will not be back-ported to 3.5 which is meant to be stable release.

On the other way, 3.5 node cannot send snapshot to 3.6 node immediately without current PR. To unblock your test, as an alternative pointed out in #14807, you may want to add a 3.6 node to a 3.5 cluster to trigger snapshot transfer easily.

/cc @ahrtr @serathius for thoughts

@ahrtr
Copy link
Member

ahrtr commented Jan 7, 2023

No plan to backport this PR to 3.5. Please see #15013 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

7 participants