Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow writes panic when run with github.com/golang/protobuf v1.4.0+ #12197

Closed
liggitt opened this issue Aug 3, 2020 · 13 comments · Fixed by #12239
Closed

Slow writes panic when run with github.com/golang/protobuf v1.4.0+ #12197

liggitt opened this issue Aug 3, 2020 · 13 comments · Fixed by #12239
Assignees
Labels

Comments

@liggitt
Copy link
Contributor

liggitt commented Aug 3, 2020

When the etcd server integration test library is used with github.com/golang/protobuf v1.4.0+, slow writes hit this path:

defer warnOfExpensiveRequest(s.getLogger(), time.Now(), r, nil, nil)

stringifying the request:

zap.String("request", reqStringer.String()),

which calls to

func (m *Request) String() string { return proto.CompactTextString(m) }

With github.com/golang/protobuf < 1.4.0, this works fine.

With github.com/golang/protobuf >= 1.4.0, this panics with:

panic: field etcdserverpb.Request.ID has invalid type: got uint64, want pointer

goroutine 1 [running]:
google.golang.org/protobuf/internal/impl.fieldInfoForScalar(0x1677e40, 0xc00030e000, 0x14fa413, 0x2, 0x0, 0x0, 0x1677f60, 0x15188e0, 0x14fa417, 0x29, ...)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect_field.go:228 +0x7d7
google.golang.org/protobuf/internal/impl.(*MessageInfo).makeKnownFieldsFunc(0xc000218a00, 0xffffffffffffffff, 0xffffffffffffffff, 0x98, 0xffffffffffffffff, 0xc0001f9320, 0xc0001f9350, 0xc0001f9380, 0xc0001f93b0)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect.go:67 +0x97d
google.golang.org/protobuf/internal/impl.(*MessageInfo).makeReflectFuncs(0xc000218a00, 0x1677f60, 0x15c2800, 0xffffffffffffffff, 0xffffffffffffffff, 0x98, 0xffffffffffffffff, 0xc0001f9320, 0xc0001f9350, 0xc0001f9380, ...)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect.go:36 +0x65
google.golang.org/protobuf/internal/impl.(*MessageInfo).initOnce(0xc000218a00)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message.go:90 +0x192
google.golang.org/protobuf/internal/impl.(*MessageInfo).init(...)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message.go:72
google.golang.org/protobuf/internal/impl.(*messageReflectWrapper).Has(0xc0002dc3a0, 0x1677e40, 0xc00030e000, 0xc00030e000)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect_gen.go:185 +0xf9
github.com/golang/protobuf/proto.(*textWriter).writeMessage(0xc0001f9290, 0x1676ec0, 0xc0002dc3a0, 0x0, 0x0)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:278 +0x935
github.com/golang/protobuf/proto.(*TextMarshaler).marshal(0x193249e, 0x1670380, 0xc0000cfd90, 0x1007fa5, 0xc000096058, 0x0, 0x0, 0x0)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:86 +0x190
github.com/golang/protobuf/proto.(*TextMarshaler).Text(...)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:44
github.com/golang/protobuf/proto.CompactTextString(...)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:106
go.etcd.io/etcd/etcdserver/etcdserverpb.(*Request).String(...)

Reproducer:
go.mod:

module etcdpanictest

go 1.15

require (
       go.etcd.io/etcd v0.5.0-alpha.5.0.20200716221620-18dfb9cca345 // HEAD of release-3.4, also reproducible on master
       github.com/golang/protobuf v1.4.2
)

main.go:

package main

import (
       "fmt"

       etcdpb "go.etcd.io/etcd/etcdserver/etcdserverpb"
)

func main() {
       request := &etcdpb.Request{}
       fmt.Println(request.String())
}
go run main.go 
panic: field etcdserverpb.Request.ID has invalid type: got uint64, want pointer

goroutine 1 [running]:
google.golang.org/protobuf/internal/impl.fieldInfoForScalar(0x1677e40, 0xc000294000, 0x14fa413, 0x2, 0x0, 0x0, 0x1677f60, 0x15188e0, 0x14fa417, 0x29, ...)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect_field.go:228 +0x7d7
google.golang.org/protobuf/internal/impl.(*MessageInfo).makeKnownFieldsFunc(0xc0001a0a00, 0xffffffffffffffff, 0xffffffffffffffff, 0x98, 0xffffffffffffffff, 0xc000181320, 0xc000181350, 0xc000181380, 0xc0001813b0)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect.go:67 +0x97d
google.golang.org/protobuf/internal/impl.(*MessageInfo).makeReflectFuncs(0xc0001a0a00, 0x1677f60, 0x15c2800, 0xffffffffffffffff, 0xffffffffffffffff, 0x98, 0xffffffffffffffff, 0xc000181320, 0xc000181350, 0xc000181380, ...)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect.go:36 +0x65
google.golang.org/protobuf/internal/impl.(*MessageInfo).initOnce(0xc0001a0a00)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message.go:90 +0x192
google.golang.org/protobuf/internal/impl.(*MessageInfo).init(...)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message.go:72
google.golang.org/protobuf/internal/impl.(*messageReflectWrapper).Has(0xc0002643a0, 0x1677e40, 0xc000294000, 0xc000294000)
	/Users/liggitt/go/pkg/mod/google.golang.org/protobuf@v1.23.0/internal/impl/message_reflect_gen.go:185 +0xf9
github.com/golang/protobuf/proto.(*textWriter).writeMessage(0xc000181290, 0x1676ec0, 0xc0002643a0, 0x0, 0x0)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:278 +0x935
github.com/golang/protobuf/proto.(*TextMarshaler).marshal(0x193249e, 0x1670380, 0xc0000d7e40, 0x1007fa5, 0xc00009e058, 0x0, 0x0, 0x0)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:86 +0x190
github.com/golang/protobuf/proto.(*TextMarshaler).Text(...)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:44
github.com/golang/protobuf/proto.CompactTextString(...)
	/Users/liggitt/go/pkg/mod/github.com/golang/protobuf@v1.4.2/proto/text_encode.go:106
go.etcd.io/etcd/etcdserver/etcdserverpb.(*Request).String(...)
	/Users/liggitt/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20200716221620-18dfb9cca345/etcdserver/etcdserverpb/etcdserver.pb.go:158
main.main()
	/Users/liggitt/go/src/k8s.io/kubernetes/etcdpanictest/main.go:11 +0x59
exit status 2
@liggitt
Copy link
Contributor Author

liggitt commented Aug 3, 2020

It is getting harder to avoid pulling in github.com/golang/protobuf@v1.4.0+ version via transitive dependencies, but doing so makes the integration etcd server flake/break under load

@liggitt
Copy link
Contributor Author

liggitt commented Aug 3, 2020

cc @jpbetz @jingyih

@wenjiaswe
Copy link
Contributor

cc @ptabor

@jingyih
Copy link
Contributor

jingyih commented Aug 6, 2020

@liggitt etcdserverpb.Request is a V2 request (etcd2), which should already be deprecated. Could you point me to the test that panics with this error?

@liggitt
Copy link
Contributor Author

liggitt commented Aug 6, 2020

I added this patch to figure out what was triggering the v2 call:

diff --git a/vendor/go.etcd.io/etcd/etcdserver/server.go b/vendor/go.etcd.io/etcd/etcdserver/server.go
index a341625dccb..21b8090030c 100644
--- a/vendor/go.etcd.io/etcd/etcdserver/server.go
+++ b/vendor/go.etcd.io/etcd/etcdserver/server.go
@@ -2196,11 +2196,13 @@ func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
 		var r pb.Request
 		rp := &r
 		pbutil.MustUnmarshal(rp, e.Data)
+		panic("!pbutil.MaybeUnmarshal: " + rp.Method + " " + rp.Path)
 		s.w.Trigger(r.ID, s.applyV2Request((*RequestV2)(rp)))
 		return
 	}
 	if raftReq.V2 != nil {
 		req := (*RequestV2)(raftReq.V2)
+		panic("raftReq.V2 != nil: " + req.Method + " " + req.Path)
 		s.w.Trigger(req.ID, s.applyV2Request(req))
 		return
 	}
diff --git a/vendor/go.etcd.io/etcd/etcdserver/v2_server.go b/vendor/go.etcd.io/etcd/etcdserver/v2_server.go
index 9238b2dc580..23c8ef2106f 100644
--- a/vendor/go.etcd.io/etcd/etcdserver/v2_server.go
+++ b/vendor/go.etcd.io/etcd/etcdserver/v2_server.go
@@ -119,6 +119,7 @@ func (a *reqV2HandlerEtcdServer) processRaftRequest(ctx context.Context, r *Requ
 }
 
 func (s *EtcdServer) Do(ctx context.Context, r pb.Request) (Response, error) {
+	panic("EtcdServer#Do:" + r.Method + " " + r.Path)
 	r.ID = s.reqIDGen.Next()
 	h := &reqV2HandlerEtcdServer{
 		reqV2HandlerStore: reqV2HandlerStore{
@@ -139,6 +140,7 @@ func (s *EtcdServer) Do(ctx context.Context, r pb.Request) (Response, error) {
 // respective operation. Do will block until an action is performed or there is
 // an error.
 func (r *RequestV2) Handle(ctx context.Context, v2api RequestV2Handler) (Response, error) {
+	panic("RequestV2#Handle:" + r.Method + " " + r.Path)
 	if r.Method == "GET" && r.Quorum {
 		r.Method = "QGET"
 	}

I'm seeing this panic:

panic: EtcdServer#Do:PUT /0/members/2f0ff1f18e31aeff/attributes

which looks like it is this code

release-3.4 version:

etcd/etcdserver/server.go

Lines 2002 to 2031 in 18dfb9c

// publish registers server information into the cluster. The information
// is the JSON representation of this server's member struct, updated with the
// static clientURLs of the server.
// The function keeps attempting to register until it succeeds,
// or its server is stopped.
//
// Use v2 store to encode member attributes, and apply through Raft
// but does not go through v2 API endpoint, which means even with v2
// client handler disabled (e.g. --enable-v2=false), cluster can still
// process publish requests through rafthttp
// TODO: Deprecate v2 store
func (s *EtcdServer) publish(timeout time.Duration) {
b, err := json.Marshal(s.attributes)
if err != nil {
if lg := s.getLogger(); lg != nil {
lg.Panic("failed to marshal JSON", zap.Error(err))
} else {
plog.Panicf("json marshal error: %v", err)
}
return
}
req := pb.Request{
Method: "PUT",
Path: membership.MemberAttributesStorePath(s.id),
Val: string(b),
}
for {
ctx, cancel := context.WithTimeout(s.ctx, timeout)
_, err := s.Do(ctx, req)

master version:

etcd/etcdserver/server.go

Lines 1832 to 1858 in f395f82

// publish registers server information into the cluster. The information
// is the JSON representation of this server's member struct, updated with the
// static clientURLs of the server.
// The function keeps attempting to register until it succeeds,
// or its server is stopped.
//
// Use v2 store to encode member attributes, and apply through Raft
// but does not go through v2 API endpoint, which means even with v2
// client handler disabled (e.g. --enable-v2=false), cluster can still
// process publish requests through rafthttp
// TODO: Deprecate v2 store in 3.6
func (s *EtcdServer) publish(timeout time.Duration) {
lg := s.getLogger()
b, err := json.Marshal(s.attributes)
if err != nil {
lg.Panic("failed to marshal JSON", zap.Error(err))
return
}
req := pb.Request{
Method: "PUT",
Path: membership.MemberAttributesStorePath(s.id),
Val: string(b),
}
for {
ctx, cancel := context.WithTimeout(s.ctx, timeout)
_, err := s.Do(ctx, req)

@liggitt
Copy link
Contributor Author

liggitt commented Aug 19, 2020

Workaround open at #12238 (master) and #12239 (3.4 pick)

@jingyih
Copy link
Contributor

jingyih commented Aug 19, 2020

Sorry I lost track of this issue. And thanks for fixing it!

@knight42
Copy link

knight42 commented Aug 19, 2020

@jingyih @liggitt I strongly suspect this issue is caused by incompatibility between github.com/gogo/protobuf and github.com/golang/protobuf.

In etcdserver.pb.go, github.com/gogo/protobuf was imported

_ "github.com/gogo/protobuf/gogoproto"
, but github.com/golang/protobuf was used for serialisation:

Furthermore, If I change the reproducer @liggitt provided to the following:
main.go

package main

import (
	"fmt"

	"github.com/gogo/protobuf/proto"
	etcdpb "go.etcd.io/etcd/etcdserver/etcdserverpb"
)

func main() {
	request := &etcdpb.Request{}
	fmt.Println(proto.CompactTextString(request))
}

go.mod

module etcdpanictest

go 1.15

require (
        github.com/gogo/protobuf v1.2.1
        github.com/golang/protobuf v1.4.2
        go.etcd.io/etcd v0.5.0-alpha.5.0.20200716221620-18dfb9cca345 // HEAD of release-3.4, also reproducible on master
)

It works well:

$ go run .
ID:0 Method:"" Path:"" Val:"" Dir:false PrevValue:"" PrevIndex:0 Expiration:0 Wait:false Since:0 Recursive:false Sorted:false Quorum:false Time:0 Stream:false

@ptabor
Copy link
Contributor

ptabor commented Oct 15, 2020

Etcd was using very old protobuf generator (gogo/protobuf-1.0.0).
#12397 changes it to use newer for branch 3.5.

@kjin97
Copy link

kjin97 commented Nov 5, 2020

I am seeing this in the following environment:

require (
	github.com/coreos/etcd v3.3.25+incompatible
	github.com/gogo/protobuf v1.3.1 // indirect
...
	github.com/golang/protobuf v1.4.3
	google.golang.org/protobuf v1.25.0 // indirect
)

replace github.com/coreos/bbolt => go.etcd.io/bbolt v1.3.5
replace google.golang.org/grpc v1.33.1 => google.golang.org/grpc v1.26.0

Should I still be seeing this issue if I'm on 3.3.25, which includes the aforementioned workaround?

@ptabor
Copy link
Contributor

ptabor commented Jan 21, 2021

This is definitely consequence of mixing gogo-generated protobuf's with protobuf-1.4 runtime, that is unsupported (and not working).

We need to somehow exit that glitch.

@dhermes: Sadly https://blog.golang.org/protobuf-apiv2 makes no mention of faster marshalling and unmarshalling without reflection.

From the data I collected so far, significant incremental improvement in google-protobuf implementation was done over time, so the gogo vs. google difference is not that significant. Kubernetes team is going to do some benchmarking there (kubernetes/kubernetes#96564).

From quick scan of RAFT code, I think there could be a performance improvement from transering to google-proto-gen, as it would enforce us to change a lot of pass-proto-by-value into pass-proto-by-pointer. Don't know to what degree golang optimizer can deal with that, but there is an opportunity.

The real challenge for transition gogo->google comes from coordinating the operation across many projects/libraries:

  • You cannot mix 1.4 runtime with any 1.3.x-gogo generated protobuf in a single binary
    (but you can mix 1.4 runtime with 1.3.x-google generated protobuf)

So every library would need to transfer to 1.3.x-google generated protobuf first (without taking benefits of transferring to 1.4 in foreseeable future) and only when all dependencies are ready on 1.3.x-google, the runtime of 1.4.x can be imported (so e.g. grpc 1.30+)

The API generated by gogo vs. Google is slightly different and requites pretty boring pass over code and adding &, * and changing proto.Foo into proto.GetFoo():

  • This is pretty significant and grungy work to do.
  • is breaking change on API surface when proto is exposed to library users (like in etcd client.v3),
    so shouldn't be done without new major version or package name change.

Assuming lack of upstream support the exit work would need to be following:

  1. [optional] Collect data to make the decision:
    a) either wait for kubernetees perf data
    b) or make the experimental google-proto-gen-1.3.x transition in separate branch AND run benchmarks
  2. Assuming performance is OK (or we don't have other choice):
    . a) either change the generator as part of 3.5 release (piggy-backing on package name change that already happens in 3.5 due to go-modules)
    b) or wait for etcd release of v4 (or v4 client)...
    c) ignore the import compatibility rule (https://blog.golang.org/v2-go-modules) that might cause problems to upstream components.
  3. At this point etcd client is able to coorperate with protobuf-1.4 runtime (but is not yet pushing that dependency upstream)
  4. When critical customers ready (e.g. kubernetes), regenerate using 1.4.x generator and take the dependency on proto-runtime-1.4.

@stale
Copy link

stale bot commented Apr 24, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 24, 2021
@ptabor
Copy link
Contributor

ptabor commented Apr 25, 2021

The protobuf runtime 1.5.1. (instead of 1.4.x) solves the problem.

@ptabor ptabor closed this as completed Apr 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

6 participants