
TarWriter.WriteFile stuck on select when writing to io.pipe #8957

Closed · EnchanterIO opened this issue May 9, 2022 · 14 comments · Fixed by #8973
Labels
kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization


@EnchanterIO
Contributor

Checklist

Installation method

built from source

Version

0.11

Config

-

Description

While debugging a memory leak on Infura nodes, we noticed there are hundreds of stuck goroutines trying to process get commands.

Stack trace example:

goroutine 448012757 [select, 3993 minutes]:
io.(*pipe).Write(0xc0f60ecd80, {0xc1ef3d4000, 0x100000, 0x100000})
	/usr/local/go/src/io/pipe.go:94 +0x21a
io.(*PipeWriter).Write(0xc10fc49b72, {0xc1ef3d4000, 0x5, 0x4})
	/usr/local/go/src/io/pipe.go:163 +0x25
bufio.(*Writer).Flush(0xc0f68f88c0)
	/usr/local/go/src/bufio/bufio.go:607 +0x62
bufio.(*Writer).Write(0xc0f68f88c0, {0xc01ccf8000, 0xc061b905a0, 0xc10fc49af0})
	/usr/local/go/src/bufio/bufio.go:643 +0xd8
github.com/ipfs/go-ipfs/core/commands.(*identityWriteCloser).Write(0x50, {0xc01ccf8000, 0xc10fc49b70, 0x14d6dcc})
	go-ipfs/core/commands/get.go:258 +0x26
archive/tar.(*regFileWriter).Write(0xc0e3b9d380, {0xc01ccf8000, 0x100000, 0xc0499b4400})
	/usr/local/go/src/archive/tar/writer.go:497 +0x69
archive/tar.(*Writer).Write(0xc0657ddc00, {0xc01ccf8000, 0xc10fc49b01, 0xc10fc49bb0})
	/usr/local/go/src/archive/tar/writer.go:435 +0x4a
bytes.(*Reader).WriteTo(0xc0fa4714a0, {0x32dcc80, 0xc0657ddc00})
	/usr/local/go/src/bytes/reader.go:144 +0x87
github.com/ipfs/go-unixfs/io.(*dagReader).writeNodeDataBuffer(0xc0f60ecd20, {0x32dcc80, 0xc0657ddc00})
	pkg/mod/github.com/ipfs/go-unixfs@v0.3.1/io/dagreader.go:257 +0x33
github.com/ipfs/go-unixfs/io.(*dagReader).WriteTo.func1({0x32f27f0, 0xc102a2fa40})
	pkg/mod/github.com/ipfs/go-unixfs@v0.3.1/io/dagreader.go:314 +0xd9
github.com/ipfs/go-ipld-format.(*Walker).visitActiveNode(0xc044be0b40, 0x14d6f02)
	pkg/mod/github.com/ipfs/go-ipld-format@v0.2.0/walker.go:359 +0x4a
github.com/ipfs/go-ipld-format.(*Walker).down(0xc044be0b40, 0x0)
	pkg/mod/github.com/ipfs/go-ipld-format@v0.2.0/walker.go:306 +0x20f
github.com/ipfs/go-ipld-format.(*Walker).Iterate(0xc044be0b40, 0xc0657ddc00)
	pkg/mod/github.com/ipfs/go-ipld-format@v0.2.0/walker.go:195 +0x3f
github.com/ipfs/go-unixfs/io.(*dagReader).WriteTo(0x2a67fc0, {0x32dcc80, 0xc0657ddc00})
	pkg/mod/github.com/ipfs/go-unixfs@v0.3.1/io/dagreader.go:297 +0xf8
io.copyBuffer({0x32dcc80, 0xc0657ddc00}, {0x7f85b43f3d48, 0xc132b8f7c0}, {0x0, 0x0, 0x0})
	/usr/local/go/src/io/io.go:405 +0x16e
io.Copy(...)
	/usr/local/go/src/io/io.go:382
github.com/ipfs/go-ipfs-files.(*TarWriter).writeFile(0xc19bb81248, {0x7f85b43f3d10, 0xc132b8f7c0}, {0xc0cf864a06, 0x3b})
	pkg/mod/github.com/ipfs/go-ipfs-files@v0.0.9/tarwriter.go:46 +0xad
github.com/ipfs/go-ipfs-files.(*TarWriter).WriteFile(0xc19bb81248, {0x32f2bd8, 0xc132b8f7c0}, {0xc0cf864a06, 0x3b})
	pkg/mod/github.com/ipfs/go-ipfs-files@v0.0.9/tarwriter.go:59 +0x192
github.com/ipfs/go-ipfs/core/commands.fileArchive.func4()
	go-ipfs/core/commands/get.go:322 +0x45
created by github.com/ipfs/go-ipfs/core/commands.fileArchive
	go-ipfs/core/commands/get.go:320 +0x5da

Seems like if the original reader is gone, the writer can't proceed? Maybe a bug when handling a context timeout / writing error / pipe closing?

Interestingly, the stuck get cmds aren't visible in the ipfs diag cmds -v list, as if they had already been partially cleaned up.

PS: our /get cmds are running with an explicit timeout= option configured by a reverse proxy making HTTP requests to the go-ipfs node.
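
For illustration, a minimal sketch of the suspected failure mode (plain io.Pipe, not the go-ipfs code): a writer goroutine parks forever in io.(*pipe).Write once nothing ever reads from, or closes, the other end.

package main

import (
	"fmt"
	"io"
	"time"
)

func main() {
	// The read end is dropped without ever being read from or closed,
	// similar to a client that has gone away.
	_, pw := io.Pipe()

	done := make(chan struct{})
	go func() {
		defer close(done)
		// Stands in for the tar/bufio write chain in the stack trace:
		// this Write blocks in a select inside io.(*pipe).Write forever.
		_, err := pw.Write(make([]byte, 1<<20))
		fmt.Println("write returned:", err)
	}()

	select {
	case <-done:
	case <-time.After(2 * time.Second):
		fmt.Println("writer goroutine is still stuck, like the trace above")
	}
}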

@EnchanterIO EnchanterIO added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 9, 2022
@BigLep
Contributor

BigLep commented May 10, 2022

@EnchanterIO : thanks for reporting this. What is the impact of this issue on your service?

Next steps that can be taken:

  1. (Infura) explicitly add timeouts at the go-ipfs level to confirm the issue (rather than relying on nginx/LB timeouts)
  2. (Either side) see if we can reproduce this issue locally by making ipfs get requests for content that does and doesn't exist.
  3. (Maintainers) inspect the code to see if we can reason about this from the stack trace

Maintainer @ajnavarro will attempt #2.

@EnchanterIO
Contributor Author

EnchanterIO commented May 10, 2022

Hi @BigLep, the impact is a crashing node: the heap allocation keeps growing because of stuck goroutines and IO pipes.

Regarding 1), we already send an explicit ?timeout= flag to the go-ipfs nodes for every API call, as well as having a proxy timeout, so the goroutine definitely seems to be stuck due to some resource mishandling.

Thanks for looking into it!

@schomatis
Contributor

(Maybe related:) We never plumb the command context to the fileArchive that started the stack.

@ajnavarro
Member

(Maybe related:) We never plumb the command context to the fileArchive that started the stack.

I think the context is set on the file when calling api.Unixfs().Get. After that, the context is also set inside the dagReader and used by the dagWalker when calling WriteTo. dagWalker uses the context when calling NavigableIPLDNode.FetchChild, and there the context can be canceled correctly.

One thing that seems strange to me is that NavigableIPLDNode tries to preload using an already canceled context:

switch err {
case nil:
case context.DeadlineExceeded, context.Canceled:
	if ctx.Err() != nil {
		return nil, ctx.Err()
	}

	// In this case, the context used to *preload* the node (in a previous
	// `FetchChild` call) has been canceled. We need to retry the load with
	// the current context and we might as well preload some extra nodes
	// while we're at it.
	nn.preload(ctx, childIndex)
	child, err = nn.getPromiseValue(ctx, childIndex)
	if err != nil {
		return nil, err
	}
default:
	return nil, err
}

I think that can be a problem.
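
In isolation, the pattern in that snippet looks roughly like the following (a generic sketch with hypothetical names, not the go-unixfs API): the retry only happens when the cancellation error came from the older preload context, because the caller's own context is checked first.

package walkexample

import "context"

// getWithRetry: load may return the cancellation error of an *older*
// preload context. If the caller's own ctx is still live, retry once with
// it; otherwise propagate ctx.Err(). `load` is a stand-in for the promise
// lookup, purely for illustration.
func getWithRetry(ctx context.Context, load func(context.Context) ([]byte, error)) ([]byte, error) {
	v, err := load(ctx)
	switch err {
	case nil:
		return v, nil
	case context.Canceled, context.DeadlineExceeded:
		if ctx.Err() != nil {
			// Our own context is done; give up.
			return nil, ctx.Err()
		}
		// The error belonged to the stale preload context: retry with ours.
		return load(ctx)
	default:
		return nil, err
	}
}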

@BigLep
Contributor

BigLep commented May 12, 2022

2022-05-12 conversation:
@ajnavarro hasn't been able to reproduce the error locally. He will post the approaches he took and what he found. Hopefully others will spot what we might be doing wrong.
Basically, we need to get a repro case so we can further debug.

@schomatis
Contributor

The context passed in is only used during fetching in our API, but the pipe we create is not closed when that context ends, so the write gets stuck.
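
One possible shape for a mitigation (a hedged sketch only, not necessarily what the eventual fix looks like): tie the pipe to the request context, so that a canceled request releases any Write still parked on the pipe.

package main

import (
	"context"
	"fmt"
	"io"
	"time"
)

// pipeWithContext is a hypothetical helper: it returns an io.Pipe whose
// blocked Writes are released with ctx.Err() once ctx is canceled.
func pipeWithContext(ctx context.Context) (*io.PipeReader, *io.PipeWriter) {
	pr, pw := io.Pipe()
	go func() {
		<-ctx.Done()
		// Closing the read end makes a pending pw.Write return ctx.Err()
		// instead of blocking forever.
		pr.CloseWithError(ctx.Err())
	}()
	return pr, pw
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	_, pw := pipeWithContext(ctx)

	go func() {
		time.Sleep(100 * time.Millisecond)
		cancel() // simulate the request timing out / the client going away
	}()

	// Nothing ever reads, so without the helper this Write would block forever.
	_, err := pw.Write(make([]byte, 1<<20))
	fmt.Println("write returned:", err) // prints: write returned: context canceled
}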

@ajnavarro
Member

ajnavarro commented May 13, 2022

I tried these two approaches to reproduce the error:

  1. Add a file, remove some random block from it, and try to fetch it. Steps:
ipfs daemon
ipfs add --pin=false Downloads/go1.17.9.linux-amd64.tar.gz # add go.17 file
ipfs ls QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC # find a block to delete
ipfs block rm QmVwPxYFpbu96gi4pqk3f4xsHnjag79MdFbjKP7RnPHHTT # remove one random block
ipfs repo gc
ipfs get QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC # execute and cancel command
ipfs diag profile

All get operations were canceled correctly.

  2. Force a bunch of get operations and see if some of them get stuck:
ipfs daemon
ipfs add --pin=false Downloads/go1.17.9.linux-amd64.tar.gz # add go.17 file
ipfs ls QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC # find a block to delete
ipfs block rm QmVwPxYFpbu96gi4pqk3f4xsHnjag79MdFbjKP7RnPHHTT # remove one random block
ipfs repo gc
for i in $(seq 1 1000); do (ipfs get QmUhYYjETKjwFCGVuU16e6TVftBJfry9WSMyXm5GHSGogC --timeout=1m &); done
ipfs diag profile

I tried with CIDs pointing to files and folders too.

@EnchanterIO
Contributor Author

Would it make a difference if the get requests were made over HTTP? And if the original client terminated before a get operation under timeout finishes?

@MichaelMure
Contributor

When following the original stack trace, one spot I found quite suspicious is fileArchive(). In that stack trace, the flow goes through that function to set up some writers and an io.Pipe, goes down the DAG fetching/handling, then comes back into that io code, where it gets stuck regardless of the context status.

I wasn't able to really pinpoint how that could get stuck, but generally this code, and how the teardown is executed on error, is quite confusing. I wouldn't be surprised if something is incorrect in there. This could be combined with the following gotcha of io.Pipe:

// Reads and Writes on the pipe are matched one to one
// except when multiple Reads are needed to consume a single Write.
// That is, each Write to the PipeWriter --> blocks <-- until it has satisfied
// one or more Reads from the PipeReader that fully consume
// the written data.

So, if somehow:

  • a request gets cancelled
  • ... while data was already read and piped into that buffer
  • ... but for some reason the write was big enough to require more than one read
  • ... and the request gets cancelled right between those reads
  • ... killing the command output (that is, no more reads) while data is still in the io.Pipe, including, I suppose, the final error

... then that can block the whole thing?
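
A tiny sketch of exactly that gotcha (generic io.Pipe code, not the go-ipfs path): a single large Write stays blocked if the reader consumes only part of it and then stops reading without closing the pipe.

package main

import (
	"fmt"
	"io"
	"time"
)

func main() {
	pr, pw := io.Pipe()

	done := make(chan struct{})
	go func() {
		defer close(done)
		// One big Write: it only returns once Reads have consumed all of it.
		_, err := pw.Write(make([]byte, 1<<20))
		fmt.Println("write returned:", err)
	}()

	// The reader consumes a small chunk, then "gets cancelled" and stops
	// reading without closing pr.
	buf := make([]byte, 64<<10)
	pr.Read(buf)

	select {
	case <-done:
	case <-time.After(2 * time.Second):
		fmt.Println("write still blocked after the partial read")
	}
}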

Jorropo added a commit to Jorropo/go-ipfs that referenced this issue May 13, 2022
Fixes ipfs#8957

The context was only checked while reading data, not while writing data to the HTTP connection.
Since the data flows through an io.Pipe, the cancellation didn't flow through, which left the writer hanging open.

Co-authored-by: Antonio Navarro Perez <antnavper@gmail.com>
Jorropo added a commit that referenced this issue May 13, 2022 (Fixes #8957)
@BigLep
Contributor

BigLep commented May 13, 2022

Great job @Jorropo and @ajnavarro in finding/fixing the issue. A few things:

  1. I see fix: hanging goroutine in get fileArchive handler #8973 doesn't have any test. What would it take to add one?
  2. @EnchanterIO : can you confirm this addresses your issue?
  3. Assuming we have confirmation that this fixes the issue, we'll presumably include it in the 0.13 release (Release v0.13 #8640).

@BigLep mentioned this issue May 13, 2022
@Jorropo
Contributor

Jorropo commented May 13, 2022

doesn't have any test. What would it take to add one?

@BigLep
No, the test would be very VERY flaky:

  • It requires the http API to be really slow.
    (the cancel must happen on an http write, which is unlikely to begin with)
  • It requires scanning the goroutine stacks to not find a certain function (which has billions of ways to fail (or not)).

You could add:

var b [1]byte
rand.Read(b[:]) // assuming "crypto/rand" (or "math/rand") is imported
if b[0]&1 == 0 {
	panic("failed")
}

and it would likely be about as reliable.

@schomatis
Contributor

Some thoughts (nothing important, feel free to ignore):

  • Testing a leak is not that painful (or brittle); we can try tools like https://github.com/fortytw2/leaktest or https://github.com/uber-go/goleak (see the sketch after this list).

  • Triggering the leak is rather straightforward, and it's not about the HTTP API: we just don't consume the returned reader.

  • The tricky part is that our commands library as it stands doesn't allow us to easily abstract and test a command on its own (that's why we rely so much on bash/sharness tests, not because we care that much about integration testing but because we can't easily test our commands in native Go).

  • On this particular issue the testing costs do seem to outweigh the gains. Eventually we should have a good framework to test for any leaks in a running daemon after a battery of commands, but a regression on this particular issue seems unlikely (given that the OP confirms the issue as resolved; we did close this one too early).
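
For reference, a hedged sketch of what a goroutine-leak check with go.uber.org/goleak could look like (generic, not wired into the commands library, since that wiring is the hard part noted above). The leakyArchive helper is a deliberately buggy, hypothetical stand-in, so this test fails and shows the detection working.

package get_test

import (
	"io"
	"testing"

	"go.uber.org/goleak"
)

// leakyArchive stands in for the code under test: it starts a writer
// goroutine on an io.Pipe and (buggily) never closes either end.
func leakyArchive() {
	_, pw := io.Pipe()
	go func() {
		pw.Write(make([]byte, 1<<20)) // blocks forever: nothing ever reads
	}()
}

func TestNoGoroutineLeak(t *testing.T) {
	// Fails the test if any unexpected goroutines are still running when it ends.
	defer goleak.VerifyNone(t)

	leakyArchive()
}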

@EnchanterIO
Contributor Author

This looks promising. Thanks for the effort. I applied the patch to our go-ipfs nodes and rolled them out to production this morning.

I'll post results by the end of the week on whether the memory problem and hanging goroutines are solved or not.

@EnchanterIO
Contributor Author

Thanks a ton for looking into this @BigLep, @schomatis , @ajnavarro , @Jorropo.

Over a longer period I will be more sure, but after a week it seems like we no longer have any stuck get cmds on any instance and the memory heap seems healthy again. 🥳

guseggert pushed commits that referenced this issue Jun 8, 2022 (Fixes #8957, cherry picked from commit 7892cc9)