slow parallel processing #37

Closed
tobowers opened this issue Jul 17, 2018 · 22 comments

@tobowers
Contributor

tobowers commented Jul 17, 2018

Hi, I'm still researching this so it's a work-in-progress, but I've narrowed it down enough that I thought it was worth discussing. I've been using #30 in a pre-production project and noticed some bottlenecks in parallel code. I started investigating and narrowed it down to WrapObject creating blocking-like behavior.

In a real-world project with hundreds of parallel tasks, I was seeing 200-500ms just to wrap simple objects.

I narrowed it down to https://github.com/ipfs/go-ipld-cbor/blob/refmt/node.go#L184 but haven't yet found the root cause.

The rest of that function does not cause long blocking behavior.

Here's a failing test:

package cbortest

import (
	"testing"
	"time"

	cbornode "github.com/ipfs/go-ipld-cbor"
	multihash "github.com/multiformats/go-multihash"
	"github.com/stretchr/testify/assert"
)

type testEncodeStruct struct {
	Id     string
	Number int
}

// max returns the largest element of the slice.
func max(xs []int) int {
	m := xs[0]
	for _, x := range xs {
		if x > m {
			m = x
		}
	}
	return m
}

func TestParallelWrap(t *testing.T) {
	cbornode.RegisterCborType(testEncodeStruct{})

	times := 10000

	obj := &testEncodeStruct{Id: "test", Number: 100}

	responses := make(chan int, times)

	// Wrap the same object from many goroutines at once, recording how long each call takes.
	for i := 0; i < times; i++ {
		go func() {
			now := time.Now()
			node, err := cbornode.WrapObject(obj, multihash.SHA2_256, -1)
			end := int(time.Now().Sub(now))
			assert.Nil(t, err)
			assert.Len(t, node.RawData(), 18)
			responses <- end
		}()
	}

	respSlice := make([]int, times)

	for i := 0; i < times; i++ {
		respSlice[i] = <-responses
	}
	t.Logf("max was: %v", max(respSlice))

	// Fails if any single WrapObject call took longer than 100ms.
	assert.True(t, max(respSlice) < int(100*time.Millisecond))
}
@whyrusleeping
Member

@tobowers Thanks for reporting! Getting usage on that PR definitely helps us move the ball forward

@Stebalien
Member

@tobowers could you try my PR #38?

@tobowers
Contributor Author

So... this could totally be my own shitty parallel code causing context switching, so I really hope I'm not leading us down a rabbit hole :). Anyway: using this code https://github.com/QuorumControl/encodeexplore I do not see an overall speed improvement with the new refmt-pool branch ( https://github.com/QuorumControl/encodeexplore/compare/test/refmt-pool ).

I'm still seeing spikes in decode/encode... these lines: https://github.com/QuorumControl/encodeexplore/blob/master/gossip/gossip.go#L234-L239

producing:

ERROR[07-18|09:47:58.545] decode                                   took=1.051360211s
ERROR[07-18|09:47:58.622] decode                                   took=1.128933855s
ERROR[07-18|09:47:58.693] decode                                   took=1.191881103s
ERROR[07-18|09:47:58.716] decode                                   took=1.190810543s
ERROR[07-18|09:47:58.717] decode                                   took=1.191089245s
ERROR[07-18|09:47:58.735] decode                                   took=1.19506691s

@tobowers
Contributor Author

FYI: these encodes/decodes are getting called tens of thousands of times.

@Stebalien
Member

So, WrapObject is CPU-heavy because it computes a SHA-256 hash. We could do that lazily, but we figured we'd always want it anyway so we might as well compute it up-front. However, it shouldn't take a second.

I'd try using pprof: https://blog.golang.org/profiling-go-programs. A CPU profile should give you a good idea of what's happening.
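
For example, something like the following writes a profile you can open with go tool pprof. This is only a minimal sketch; the output path and the object being wrapped are placeholders, not taken from your project:

package main

import (
	"log"
	"os"
	"runtime/pprof"

	cbornode "github.com/ipfs/go-ipld-cbor"
	multihash "github.com/multiformats/go-multihash"
)

func main() {
	// Write a CPU profile for the duration of the run.
	f, err := os.Create("cpu.pprof") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Placeholder hot loop: hammer WrapObject the way the real code does.
	obj := map[string]interface{}{"id": "test", "number": 100}
	for i := 0; i < 100000; i++ {
		if _, err := cbornode.WrapObject(obj, multihash.SHA2_256, -1); err != nil {
			log.Fatal(err)
		}
	}
}

Then running go tool pprof on the binary plus cpu.pprof and using the top or svg commands shows where the time goes.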

@tobowers
Contributor Author

When I was looking before it wasn't the sha256, but the refmt decode. I will check again with the profile. Thanks for the link!

@Stebalien
Member

Thanks. The profile will be really helpful.

@tobowers
Contributor Author

Interesting:

      flat  flat%   sum%        cum   cum%
     3.84s  6.82%  6.82%      3.84s  6.82%  runtime.memmove
     3.70s  6.57% 13.39%     14.71s 26.13%  runtime.mallocgc
     2.71s  4.81% 18.21%      4.79s  8.51%  runtime.scanobject
     2.65s  4.71% 22.91%      2.94s  5.22%  runtime.heapBitsForObject
     2.29s  4.07% 26.98%      2.29s  4.07%  runtime.usleep
     1.61s  2.86% 29.84%      2.49s  4.42%  runtime.bulkBarrierPreWrite
     1.60s  2.84% 32.68%      1.60s  2.84%  runtime.memclrNoHeapPointers
     1.33s  2.36% 35.04%      1.33s  2.36%  syscall.Syscall
     1.29s  2.29% 37.34%      1.50s  2.66%  runtime.heapBitsSetType
     1.22s  2.17% 39.50%      4.55s  8.08%  runtime.gcWriteBarrier
     1.14s  2.02% 41.53%      1.14s  2.02%  runtime.nextFreeFast (inline)
     1.08s  1.92% 43.45%      6.36s 11.30%  fmt.(*pp).doPrintf
     0.99s  1.76% 45.20%      0.99s  1.76%  github.com/quorumcontrol/encodeexplore/vendor/github.com/minio/sha256-simd.blockAvx2
     0.97s  1.72% 46.93%     22.84s 40.57%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.traverse
     0.97s  1.72% 48.65%      2.54s  4.51%  reflect.Value.Slice
     0.95s  1.69% 50.34%      3.81s  6.77%  runtime.wbBufFlush1
     0.93s  1.65% 51.99%      4.78s  8.49%  fmt.(*pp).printArg
     0.80s  1.42% 53.41%      1.52s  2.70%  fmt.(*fmt).fmt_integer
     0.63s  1.12% 54.53%      0.63s  1.12%  runtime.heapBits.bits (inline)
     0.63s  1.12% 55.65%      0.63s  1.12%  runtime.mach_semaphore_signal
     0.54s  0.96% 56.61%      0.54s  0.96%  runtime.(*mspan).refillAllocCache
     0.52s  0.92% 57.53%      0.52s  0.92%  runtime.acquirem (inline)
     0.51s  0.91% 58.44%      1.11s  1.97%  fmt.(*buffer).WriteString (inline)
     0.50s  0.89% 59.33%      3.11s  5.52%  runtime.convT2Estring
     0.49s  0.87% 60.20%      0.54s  0.96%  runtime.(*gcBits).bitp (inline)
     0.48s  0.85% 61.05%      0.48s  0.85%  runtime.releasem (inline)
     0.47s  0.83% 61.88%      3.75s  6.66%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Marshaller).Step
     0.47s  0.83% 62.72%      1.24s  2.20%  reflect.Value.assignTo
     0.42s  0.75% 63.46%      3.75s  6.66%  reflect.Value.Set
     0.41s  0.73% 64.19%         1s  1.78%  sync.(*Pool).Put
     0.38s  0.67% 64.87%     13.37s 23.75%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Unmarshaller).Step
     0.37s  0.66% 65.52%      2.07s  3.68%  runtime.mapassign_fast64
     0.36s  0.64% 66.16%      0.76s  1.35%  runtime.evacuate_fast64
     0.35s  0.62% 66.79%      0.35s  0.62%  runtime.(*bmap).setoverflow (inline)
     0.35s  0.62% 67.41%      0.75s  1.33%  sync.(*Pool).pin
     0.34s   0.6% 68.01%      2.85s  5.06%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Marshaller).Recurse
     0.34s   0.6% 68.61%      0.80s  1.42%  reflect.Value.Bytes
     0.32s  0.57% 69.18%      0.32s  0.57%  runtime.mach_semaphore_wait
     0.31s  0.55% 69.73%      0.61s  1.08%  runtime.gcmarknewobject
     0.30s  0.53% 70.27%     12.81s 22.75%  fmt.Sprintf

@tobowers
Contributor Author

(pprof) top20 -cum
Showing nodes accounting for 6.56s, 11.65% of 56.30s total
Dropped 279 nodes (cum <= 0.28s)
Showing top 20 nodes out of 192
      flat  flat%   sum%        cum   cum%
         0     0%     0%     43.50s 77.26%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.WrapObject
     0.01s 0.018% 0.018%     25.01s 44.42%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).Start.func1.1
         0     0% 0.018%     24.91s 44.25%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).OnMessage
         0     0% 0.018%     23.65s 42.01%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).Start.func1.2
     0.01s 0.018% 0.036%     23.01s 40.87%  github.com/quorumcontrol/encodeexplore/gossip.(*GNode).doGossip
     0.04s 0.071%  0.11%     22.88s 40.64%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.newObject
     0.97s  1.72%  1.83%     22.84s 40.57%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.traverse
     0.17s   0.3%  2.13%     20.43s 36.29%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/shared.TokenPump.Run
         0     0%  2.13%     15.32s 27.21%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor/encoding.(*PooledCloner).Clone
         0     0%  2.13%     15.27s 27.12%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt.(*cloner).Clone
         0     0%  2.13%     15.26s 27.10%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt.cloner.Clone
     0.29s  0.52%  2.65%     15.07s 26.77%  runtime.systemstack
     3.70s  6.57%  9.22%     14.71s 26.13%  runtime.mallocgc
     0.38s  0.67%  9.89%     13.37s 23.75%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*Unmarshaller).Step
     0.25s  0.44% 10.34%     13.23s 23.50%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineWildcard).Step
     0.10s  0.18% 10.52%     12.89s 22.90%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineSliceWildcard).Step
     0.30s  0.53% 11.05%     12.81s 22.75%  fmt.Sprintf
     0.06s  0.11% 11.15%     12.77s 22.68%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineSliceWildcard).(github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.step_AcceptValue)-fm
     0.28s   0.5% 11.65%     12.71s 22.58%  github.com/quorumcontrol/encodeexplore/vendor/github.com/polydawn/refmt/obj.(*unmarshalMachineSliceWildcard).step_AcceptValue
         0     0% 11.65%     12.44s 22.10%  github.com/quorumcontrol/encodeexplore/vendor/github.com/ipfs/go-ipld-cbor.compLinks
(pprof) 

@Stebalien
Member

Could you post a graph? Try running the svg command.

@Stebalien
Member

It looks like we really should consider traversing lazily.

@tobowers
Contributor Author

(pprof graph attached as an image; not reproduced here)

@tobowers
Contributor Author

(pprof graph attached as an image; not reproduced here)

@Stebalien
Member

Yeah. Calculating the tree up-front is slowing us down. There's no reason we really need to do that.

@whyrusleeping
Member

@Stebalien the reason we do it up front is so that the other methods that rely on it having been computed don't need to return an error.

@Stebalien
Member

I see. The performance issue is concatenating the strings, so we could always do that later. However, I think a more complete fix is to make two node interfaces: a lazy one for resolving, computing the CID, and deserialising into an object, and an eager one for introspecting.
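
Very roughly, something like this (a sketch only; the names are illustrative, not an actual go-ipld-cbor API):

package ipldsketch

import (
	"github.com/ipfs/go-cid"
)

// LazyNode does only the cheap work up front: decode and hash. Anything that
// needs the tree is computed on demand and may return an error.
type LazyNode interface {
	Cid() cid.Cid
	RawData() []byte
	Decode(into interface{}) error
	Resolve(path []string) (interface{}, []string, error)
}

// EagerNode runs the full traversal at construction time, so the tree-shaped
// accessors can be error-free, at the cost WrapObject currently pays for everyone.
type EagerNode interface {
	LazyNode
	Tree(path string, depth int) []string
	Links() []cid.Cid
}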

@tobowers
Contributor Author

tobowers commented Jul 19, 2018

In that second one, the concatenation doesn't seem to take up as much time (I just switched the fmt.Sprintf to plain string concatenation and a strconv for the integer).
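
Roughly this kind of change (the variable names are illustrative, not the exact go-ipld-cbor code):

package main

import (
	"fmt"
	"strconv"
)

func main() {
	prefix, i := "obj/links", 7

	// before: fmt.Sprintf boxes its arguments into interfaces and walks the
	// format string on every call, which shows up under heavy traversal
	slow := fmt.Sprintf("%s/%d", prefix, i)

	// after: plain concatenation plus strconv.Itoa stays on the fast path
	fast := prefix + "/" + strconv.Itoa(i)

	fmt.Println(slow == fast) // prints: true
}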

@Stebalien
Member

I guess calling that "the" performance issue is incorrect. However, it still takes up 10% of the time.

@whyrusleeping
Member

a more complete fix is to make two node interfaces

Yeah, that's what @warpfork was saying too. It sounds like a good idea to me.

@warpfork
Member

This is not related to parallel processing, but since I've got the issue open, and it's about performance...

I've been nudging around some benchmarks of go-ipld-cbor, and one thing that's come up as a big time sink on some of my pprof graphs is the handling of CIDs. CIDs use the refmt 'transform' function feature... and that's one of those features that is powerful and expressive but by nature does some things that are not very friendly to the GC. That's something I'm going to look at trying to optimize better.

In general GC can be a big factor in performance, and also distressingly hard to figure out the impact of, because it can quietly make other code seem slower. The graph view and the weights of the arrows heading into the GC nodes are probably the most informative; I don't really know of a great way to get a textual list of that from the other pprof commands (although if anyone has suggestions... ❤️ )
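
One crude way to at least put aggregate numbers on allocation and GC-pause cost around a hot section, outside of pprof, is runtime.ReadMemStats. A minimal sketch, with placeholder work standing in for the real encode/decode loop:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	// Placeholder work; substitute the real encode/decode loop here.
	buf := make([][]byte, 0, 100000)
	for i := 0; i < 100000; i++ {
		buf = append(buf, make([]byte, 64))
	}
	_ = buf

	runtime.ReadMemStats(&after)
	fmt.Printf("allocs=%d bytes=%d gcPause=%dns\n",
		after.Mallocs-before.Mallocs,
		after.TotalAlloc-before.TotalAlloc,
		after.PauseTotalNs-before.PauseTotalNs)
}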

@aschmahmann
Contributor

Any IPLD folks have thoughts on this @warpfork?

@warpfork
Member

I think we could probably mark this closed. Some performance observations were made here, but I'm not sure what concrete next actions this issue describes. (More efforts to reduce allocations and GC pressure, maybe? But how we would actually do that probably needs other discussions than this one.)

We've also done an enormous amount of work in the redesigned go-ipld-prime interfaces to be sensitive to GC pressure and designed to avoid unnecessary allocations at all levels. (In essence, that library also avoids all the precomputation of tree structure that seems to have been flagged as problematic here(?) -- and we needed to tackle that with a new library and new interfaces, because there was really no incremental way to change something that central.) I'd suggest looking to that library for improvements now and in the future.
