Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sudden stall at 100% CPU #858

Closed
Stebalien opened this issue Mar 27, 2020 · 19 comments
Closed

Sudden stall at 100% CPU #858

Stebalien opened this issue Mar 27, 2020 · 19 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@Stebalien
Copy link
Member

Stebalien commented Mar 27, 2020

Knowns:

  • Started happening after 7ba3222 but before 0.6.1.
  • The node is completely unresponsive, even to pprof.

Likely: golang/go#38070

@Stebalien Stebalien added the kind/bug A bug in existing code (including security flaws) label Mar 27, 2020
@Stebalien
Copy link
Member Author

Stebalien commented Mar 27, 2020

Suspects:

  • yamux. We re-enabled write coalescing in both yamux and mplex in this range of commits. We should try downgrading yamux.
  • Go. Yes, go itself.

@Stebalien
Copy link
Member Author

Traces before the issue. View the trace with go tool trace ./traces/trace. View the profile with go tool pprof ./traces/profile.

traces.tar.gz

@Stebalien
Copy link
Member Author

Stackdump: Process with stackparse

stackdump.txt

@Stebalien
Copy link
Member Author

@alanshaw could you post the go version and linux version?

@Stebalien
Copy link
Member Author

Stebalien commented Mar 27, 2020

The go suspect:

goroutine 0 [idle]:
runtime.futex(0x152d0e8, 0x80, 0x0, 0x0, 0x0, 0x46a6f5, 0x6f5fc, 0x3b72265d, 0x7ffdee735cb8, 0x40caef, ...)
        /usr/local/go/src/runtime/sys_linux_amd64.s:567 +0x21
runtime.futexsleep(0x152d0e8, 0x0, 0xffffffffffffffff)
        /usr/local/go/src/runtime/os_linux.go:45 +0x46
runtime.notesleep(0x152d0e8)
        /usr/local/go/src/runtime/lock_futex.go:151 +0x9f
runtime.stopm()
        /usr/local/go/src/runtime/proc.go:1828 +0xc0
runtime.findrunnable(0xc00006a000, 0x0)
        /usr/local/go/src/runtime/proc.go:2360 +0xa0d
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:2520 +0x2fc
runtime.park_m(0xc0008e8780)
        /usr/local/go/src/runtime/proc.go:2690 +0x9d
runtime.mcall(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:318 +0x5b

and

goroutine 496872 [runnable]:
runtime.resettimer(...)
        /usr/local/go/src/runtime/time.go:488
time.resetTimer(0xc006487868, 0x19ee734323e21)
        /usr/local/go/src/runtime/time.go:223 +0x62
time.(*Timer).Reset(0xc006487860, 0x186a0, 0x3)
        /usr/local/go/src/time/sleep.go:126 +0x7d
github.com/libp2p/go-yamux.(*Session).sendLoop(0xc005e35b20, 0x0, 0x0)
        /go/pkg/mod/github.com/libp2p/go-yamux@v1.3.3/session.go:513 +0x895
github.com/libp2p/go-yamux.(*Session).send(0xc005e35b20)
        /go/pkg/mod/github.com/libp2p/go-yamux@v1.3.3/session.go:413 +0x2b
created by github.com/libp2p/go-yamux.newSession
        /go/pkg/mod/github.com/libp2p/go-yamux@v1.3.3/session.go:130 +0x3d7

Clearly there is work to run, but it looks like findrunnable isn't finding anything. We may need to take a crack at this with gdb to confirm that this.

This is why the Linux version and go version matter.

@Stebalien
Copy link
Member Author

Looking deeper, I'm seeing streams stuck reading 30 minutes in the DHT. They're supposed to be reset after 10 minutes.

@alanshaw
Copy link
Member

go 1.14

I’ll check the linux version asap

alanshaw added a commit to libp2p/hydra-booster that referenced this issue Mar 27, 2020
@alanshaw
Copy link
Member

$ hostnamectl
   Static hostname: n/a
Transient hostname: gke-cluster-1-default-pool-b49ccc89-w9n3
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 224bd461662f618bdb8f842af3ce3abf
           Boot ID: 9f7917a78a30441e8a2fed11f51bde86
    Virtualization: kvm
  Operating System: Container-Optimized OS from Google
            Kernel: Linux 4.14.138+
      Architecture: x86-64

@Stebalien
Copy link
Member Author

@alanshaw could we try upgrading libp2p and upgrading yamux: GOPROXY=direct go get github.com/libp2p/go-libp2p@fix/re-disable-write-coalescing?

@alanshaw
Copy link
Member

git log 7ba322..76c1f95 --pretty=oneline
76c1f95caa7a7e22a1d971e8e186d9129ba77a74 (tag: v0.6.1) Merge pull request #846 from libp2p/chore/dep-update
7b280dd74dc58f65e312b3bdb8fbde6b857af1a9 chore: update yamux
891aeda705813f93d2d0b8108aa93e7ef141b61f Merge pull request #843 from libp2p/dependabot/go_modules/github.com/libp2p/go-libp2p-yamux-0.2.4
71e2d4ed5ba1f0b8247e668a17995bb4d2b42ddb build(deps): bump github.com/libp2p/go-libp2p-yamux from 0.2.2 to 0.2.4
463fa2a0dcbd5ee139011f36eb4975af63e05b1a Merge pull request #838 from libp2p/dependabot/go_modules/github.com/multiformats/go-multiaddr-net-0.1.3
ca57cf904e462e0977b41c7cb382c75d673778d9 build(deps): bump github.com/multiformats/go-multiaddr-net
9ad477ba8b7a369cad709790eca366e8afe2d3e0 (tag: v0.6.0) Merge pull request #836 from libp2p/feat/reliable-mock-notif
5f75aa2068f276cc5140879cda7fed1f0b131afe feat(mock): reliable notifications
4bbf43e8b2deeeb3f5db32cb94ae0286a95812b5 Merge pull request #835 from libp2p/fix/autorelay-doc
61634630e8114d4883fb1881c39554b26adbf3fa Merge pull request #833 from libp2p/dependabot/go_modules/github.com/libp2p/go-libp2p-mplex-0.2.2
8248da96b0a406501d72a8442a545eff6997e280 doc(options): fix autorelay documentation
241a61a246fab62af3341f022d1788829074fe12 build(deps): bump github.com/libp2p/go-libp2p-mplex from 0.2.1 to 0.2.2
8a52e862ee20881eb7e4c4cf6b33dd617664b873 Merge pull request #832 from libp2p/dependabot/go_modules/github.com/libp2p/go-libp2p-yamux-0.2.2
15fa3033c50088580b692871540401839a58b1f1 Merge pull request #834 from libp2p/dependabot/go_modules/github.com/libp2p/go-libp2p-peerstore-0.2.0
b5853020f816dad92542a0a03d33258ffdc0d0b5 build(deps): bump github.com/libp2p/go-libp2p-peerstore
a3bbeef22e33e1fefefdb52bc73e9c1a03a93618 build(deps): bump github.com/libp2p/go-libp2p-yamux from 0.2.1 to 0.2.2
e641f58681af49b132f38c2b7053eaea2addb0d6 Merge pull request #817 from libp2p/dependabot/go_modules/github.com/multiformats/go-multiaddr-0.2.1
34e69d491c213cd4566ee505b9adab85ff10a8e5 Merge pull request #796 from libp2p/generalize-private-network
cd4c6d3d115a7713ed6fe609c63da197c06788f1 Merge pull request #827 from libp2p/update-readme
4ccf0171045a60476b5c05d0e844fe60d3ebb3e4 Merge pull request #820 from libp2p/fix/remove-goroutine
d6d208197105b2287ff1aba14a23573490c4a62c Merge pull request #810 from libp2p/fix/relays-without-content-routing
68f9fd0e13bce74a8743fc394d54cca860b6b03c change PrivateNetwork to accept a PSK, update constructor magic
ebb9cb8af12faae59ec1cae4c8198cfea26cb422 Merge pull request #815 from Wondertan/fix/reset-error
5a6c36d346dda699f0bd99ff3b32cb9e1d23fbdd Merge pull request #826 from vasco-santos/docs/uniform
969eedc98f44ff9c362e6452dc8a842ed2ea0dc9 docs: remove note to non cancelled dev calls
b55baf18e967d8dd509c3b430b4217fa50a6a1e1 docs: remove deprecated packages
fe2c13c71b1c091293c712d6d0fef4bdfdf985f0 docs: uniform comment sentences
6cdc40418044dc96a9a90aefb0cc24faff647ffe fix: remove an unnecessary goroutine
6b1179649ff1276f10a8a6fe8b384959a281d743 build(deps): bump github.com/multiformats/go-multiaddr
5e30cdeaacdfdb1394048e6aa9668a4bfe78cb36 use of mux.ErrReset
78ca1bf57c421b79ce278075c7a311f7775f72b8 fix: can enable autorelay without content routing as long as there are static relays configured

I'm running a Hydra with e641f58681af49b132f38c2b7053eaea2addb0d6 now and stable so far...will leave it running for longer and report back.

@alanshaw
Copy link
Member

@alanshaw could we try upgrading libp2p and upgrading yamux: GOPROXY=direct go get github.com/libp2p/go-libp2p@fix/re-disable-write-coalescing?

Sorry I missed this - I'll deploy this in a few hours and let you know.

@Stebalien
Copy link
Member Author

Tell me how this test fairs first.

@Stebalien
Copy link
Member Author

Ok, found a go bug that looks like the issue: golang/go#38070. I have a repro:

package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(time.Millisecond)
	defer ticker.Stop()
	exit := make(chan struct{})
	running := 0
	for {
		for i := 0; i < 100; i++ {
			running++
			go func() {
				timer := time.NewTimer(1 * time.Microsecond)
				for {
					for k := 0; k < 100; k++ {
						timer.Reset(100 * time.Microsecond)
					}
					select {
					case <-timer.C:
					case exit <- struct{}{}:
						return
					}
				}
			}()
			if running > 4000 {
				<-ticker.C
				select {
				case <-exit:
					running--
				default:
				}
			}
		}
		fmt.Printf("running: %d\n", running)
	}
}

This works on go 1.13 but grinds to a halt on go 1.14.1 (even with the preemptive scheduler disabled).

@Stebalien
Copy link
Member Author

Issue submitted upstream: golang/go#38119.

It looks like the code above "works" on the latest master (which has fixed a timer reset deadlock). However, it still doesn't work well.

Stebalien added a commit to Stebalien/quic-go that referenced this issue Mar 27, 2020
This reverts commit c88a690.

Unfortunately, go 1.14's new scheduler and timer changes have a few bugs that
need to be ironed out before it's ready for production use.

See:
* libp2p/go-libp2p#858
* golang/go#38119
@alanshaw
Copy link
Member

Still stable running e641f58681af49b132f38c2b7053eaea2addb0d6

@Stebalien
Copy link
Member Author

Stebalien commented Mar 27, 2020

Yeah, that sounds like the write coalescing is causing it. Try the latest go-lib2p release with the patches I've posted.

@RubenKelevra
Copy link

might be related or not.

There's a complete debug log attached from the start until it freezes plus the stack trace after kill (if this helps?)

ipfs/kubo#7049

@Stebalien
Copy link
Member Author

It's not related.

@Stebalien
Copy link
Member Author

Resolved by updating to go 1.14.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

No branches or pull requests

3 participants