
Out of memory error on deployment with millions of files in PGDATA #738

Closed
MannerMan opened this issue Aug 24, 2020 · 42 comments

@MannerMan

WAL-G version: 0.2.15 release binary

PostgreSQL version: 9.5

Operating system/version: CentOS 7.8

Hi,

I'm evaluating WAL-G as a replacement for my company's current backup system, WAL-E. We have a schema-level style of sharding our tenants, where each of our database servers hosts 4000 customer schemas spread over 16 databases. This has worked great for scalability, but presents a challenge for many backup systems, since this layout results in a lot of files in the PostgreSQL data directory.

Output of ls -1RL /var/lib/pgsql/9.5/data/ | wc -l:
6602805

So over 6 million files, 77 GB in total. Around 5 years ago we deployed WAL-E since it was one of the few backup systems that could handle so many files without a problem. However, since WAL-E is no longer maintained, we're looking for alternatives. When testing WAL-G, I run out of memory when performing a full backup. It seems to be some kind of Go-internal memory error, since the server's memory is not fully utilized. See graphs:

Screenshot from 2020-08-24 12-28-14

Server specs:

  • 12 GB of RAM
  • shared_buffers set to 3 GB
  • 1800 hugepages (~3.6 GB, to fit shared_buffers)
  • vm.overcommit = 2 (not allowed to overcommit)
  • 7.8 GB regular memory limit
  • CentOS 7
  • PostgreSQL 9.5
  • wal-g 0.2.15

Chunk of the error log:

INFO: 2020/08/24 11:49:19.354150 Calling pg_stop_backup()
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x193bfd8, 0x16)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/panic.go:774 +0x72
runtime.sysMap(0xc110000000, 0x10000000, 0x271a7f8)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x27017c0, 0xfbe0000, 0x150b986, 0x288007575f7)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:701 +0x1cd
runtime.(*mheap).grow(0x27017c0, 0x7df0, 0xffffffff)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1255 +0xa3
runtime.(*mheap).allocSpanLocked(0x27017c0, 0x7df0, 0x271a808, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1170 +0x266
runtime.(*mheap).alloc_m(0x27017c0, 0x7df0, 0x7f20f7020100, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1022 +0xc2
runtime.(*mheap).alloc.func1()
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1093 +0x4c
runtime.(*mheap).alloc(0x27017c0, 0x7df0, 0x7f2109010100, 0x7f20f7027ff8)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1092 +0x8a
runtime.largeAlloc(0xfbdfd80, 0x9e0001, 0x7f20f7027ff8)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:1138 +0x97
runtime.mallocgc.func1()
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:1033 +0x46
runtime.systemstack(0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:1146

Attached the full log output from performing the backup, including the error:
walg_log.txt

Target datastore is a local MinIO S3 instance. I tried WALG_UPLOAD_DISK_CONCURRENCY set to 1 and 4, with the same result. I see no kernel-level OOM-killer logs in syslog; it appears to fail internally.

@x4m
Collaborator

x4m commented Aug 24, 2020

Hi! Can you please rerun the upload with WALG_LOG_LEVEL=DEBUG?
I think we've tried to format JSON for the sentinel here:

dtoBody, err := json.Marshal(sentinelDto)

There are probably better ways to format and send large JSON.
But I'm not entirely sure that it's the actual problem at hand. It would be perfect if you could recompile wal-g with log lines before and after this statement. If we were entirely sure it's a JSON serialization problem, we could find a workaround.
Actually, you can even upload an empty sentinel - the basebackup will work anyway.

@x4m
Collaborator

x4m commented Aug 24, 2020

If this marshaling is the only thing preventing your backup, we should use the streaming Encoder https://golang.org/pkg/encoding/json/#Encoder
It should be as simple as a few lines of code.
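
For illustration, a minimal sketch of that idea, assuming the sentinel is an ordinary Go value and the uploader can consume an io.Reader (uploadSentinel and the upload callback are hypothetical names, not WAL-G's actual API):

package sentinelupload

import (
	"encoding/json"
	"io"
)

// uploadSentinel streams the sentinel JSON through a pipe instead of
// materializing the whole document in memory with json.Marshal.
func uploadSentinel(sentinelDto interface{}, upload func(name string, r io.Reader) error) error {
	pr, pw := io.Pipe()
	go func() {
		// Encode writes to the pipe while the uploader reads from it; any
		// encoding error is handed to the reader via CloseWithError.
		pw.CloseWithError(json.NewEncoder(pw).Encode(sentinelDto))
	}()
	return upload("backup_sentinel.json", pr)
}

(As it turns out later in the thread, encoding/json's Encoder still buffers the whole value before writing, so this alone does not bound memory.)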

@MannerMan
Author

@x4m Sounds promising!

Would you like me to re-run with WALG_LOG_LEVEL=DEBUG on release 0.2.15, or compile your latest commit b02f986 and run that? Unfortunately I'm not too familiar with Go, but I'll manage :)

@x4m
Collaborator

x4m commented Aug 24, 2020

It would be best to run #740, if the test results here come back green: https://travis-ci.org/github/wal-g/wal-g/builds/720630361
Otherwise any codebase would do; this place has not changed for years.

I'm off to the bar at the moment. Not sure if I'll take my laptop with me.

@x4m
Collaborator

x4m commented Aug 24, 2020

#740 seems to be broken still...

@MannerMan
Author

@x4m I think I managed to build your branch

-bash-4.2$ wal-g --version
wal-g version   4ebb408 2020.08.25_06:16:15     PostgreSQL

Running backup now

@MannerMan
Author

@x4m It failed once more unfortunately; actually, it seemed to fail quicker and use more memory this time:

Screenshot from 2020-08-25 09-28-56

Log attached. Re-running with the DEVEL log level now, in case that gives additional clues.
walg_fail1.txt

@x4m
Collaborator

x4m commented Aug 25, 2020

Can you plz try backup-push with WALG_UPLOAD_CONCURRENCY set to 1?
I have no idea why the memory curve is much steeper now, but we changed the AWS SDK in recent commits.

@x4m
Collaborator

x4m commented Aug 25, 2020

Also, plz use WALG_COMPRESSION_METHOD=brotli.
I see some lz4 stacks; maybe there's a leak there.

@x4m
Collaborator

x4m commented Aug 25, 2020

I've checked the logs; it seems that the backup upload was 2x faster than last time.
That's why allocations grew out of control...

@MannerMan
Author

@x4m Will do, running a backup now with WALG_UPLOAD_CONCURRENCY set to 1 and WALG_COMPRESSION_METHOD set to brotli. Including the DEVEL log-level stack trace from the previous re-run here as well, in case it helps.

walg_fail2.txt

@MannerMan
Author

It got to the end this time (calling pg_stop_backup()) but still spiked in memory and failed.

Memory graph:

Screenshot from 2020-08-25 12-31-54

Stack trace:
walg_fail3.txt

@x4m
Collaborator

x4m commented Aug 25, 2020

Meh.... the Encoder buffers everything...
golang/go#7872
As far as I understand, it will be fixed here: https://go-review.googlesource.com/c/go/+/135595/

I'm looking for other workarounds...

@x4m
Collaborator

x4m commented Aug 25, 2020

We could probably switch to https://github.com/francoispqt/gojay...

@x4m
Collaborator

x4m commented Aug 25, 2020

plz test if green https://travis-ci.org/github/wal-g/wal-g/builds/720967236
otherwise ping me
thanks!

@MannerMan
Author

@x4m Awesome! That's some quick work - running a backup now, compiled with your latest changes:

-bash-4.2$ ./wal-g --version
wal-g version   b115e98 2020.08.25_12:02:59     PostgreSQL

@x4m
Collaborator

x4m commented Aug 25, 2020

@MannerMan sorry, #740 is broken again.
I'm a bit new to this JSON lib and made a mistake (and I don't know exactly what's wrong),
but we really need this lib to avoid formatting the backup manifest in memory.

@MannerMan
Author

@x4m Alright, no problem - I stopped the running backup in the meantime.

@x4m
Collaborator

x4m commented Aug 25, 2020

So far I've considered easyjson and that strange library; none of them are robust enough to fit the sentinel of every DB supported in wal-g...
Thinking further.

@x4m
Collaborator

x4m commented Aug 26, 2020

@MannerMan can you plz try #741 ?

@MannerMan
Author

@x4m Testing now, running backup

-bash-4.2$ ./wal-g --version
wal-g version   11bba03 2020.08.26_05:59:12     PostgreSQL

@MannerMan
Author

@x4m Did not help unfortunately.

Screenshot from 2020-08-26 10-17-49

walg_fail4.txt

@x4m
Collaborator

x4m commented Aug 27, 2020

I'm spamming json lib maintainers a bit. Actually, I've unsuccessfully tried all 3 libs, but maybe I'm missing something...

@x4m
Collaborator

x4m commented Aug 27, 2020

Dunno, maybe it would be useful to bug the golang json maintainers too, so they could finally merge the changelist about streaming?..

@jonaz

jonaz commented Aug 28, 2020

Can we manually loop over the data? I've written some data export tools that can successfully export at least 5 GB of data without using extra memory. They use this approach:

		// createTempFile, logrus, and the result channel c come from the
		// surrounding program; the point is that the JSON array is streamed
		// element by element instead of being built in memory first.
		tempFile, err := createTempFile()
		if err != nil {
			logrus.Error(err)
			return
		}

		e := json.NewEncoder(tempFile)
		io.WriteString(tempFile, "[")
		i := 0
		for result := range c {
			for _, v := range result.Data {
				// Comma-separate elements; Encode itself adds a newline,
				// which is legal whitespace inside a JSON array.
				if i != 0 {
					io.WriteString(tempFile, ",")
				}
				e.Encode(v)
				i++
			}
		}
		io.WriteString(tempFile, "]")

		tempFile.Close()

@x4m
Collaborator

x4m commented Aug 28, 2020

@jonaz we can, but this routine is reused by many other DBs with different sentinel types.
We could just take the reflective JSON encoder out of golang and add flush points where it writes data to the network instead of buffering it; see the rough sketch below...
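
As a rough illustration of the flush-point idea only (not the generic reflective-encoder change described above), bounded-memory serialization of just a files map could look like the sketch below; FileDescription and the field layout are made-up stand-ins, not WAL-G's real sentinel types:

package sentinelfiles

import (
	"encoding/json"
	"io"
)

// FileDescription is a made-up stand-in for one file's metadata entry.
type FileDescription struct {
	MTime string `json:"MTime"`
	Size  int64  `json:"Size"`
}

// StreamFiles writes {"Files":{...}} one entry at a time, so memory use is
// bounded by a single entry rather than by the whole map's JSON.
func StreamFiles(w io.Writer, files map[string]*FileDescription) error {
	enc := json.NewEncoder(w)
	if _, err := io.WriteString(w, `{"Files":{`); err != nil {
		return err
	}
	first := true
	for name, desc := range files {
		if !first {
			if _, err := io.WriteString(w, ","); err != nil {
				return err
			}
		}
		first = false
		// Encode emits a trailing newline; that is legal whitespace in JSON.
		if err := enc.Encode(name); err != nil {
			return err
		}
		if _, err := io.WriteString(w, ":"); err != nil {
			return err
		}
		if err := enc.Encode(desc); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "}}")
	return err
}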

@x4m
Collaborator

x4m commented Aug 28, 2020

We will have to create a repo wal-g/json :)
Would anyone volunteer to do this? :)

We could even take Google's changelist, merge it, and keep it in our repo.

@jonaz

jonaz commented Aug 28, 2020

Does https://github.com/json-iterator/go work? I saw some encoder memory-related buffer issues among their fixed issues.
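
For context, json-iterator is meant to be a drop-in replacement for encoding/json, so the swap in the sentinel path could look roughly like the sketch below (writeSentinel is an illustrative name, not WAL-G's API, and whether jsoniter's Encoder actually avoids buffering the whole value is exactly what would need testing here):

package sentineljsoniter

import (
	"io"

	jsoniter "github.com/json-iterator/go"
)

// json mimics the standard library's behaviour through jsoniter.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

// writeSentinel encodes the sentinel directly to the destination writer.
func writeSentinel(w io.Writer, sentinelDto interface{}) error {
	return json.NewEncoder(w).Encode(sentinelDto)
}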

@x4m
Collaborator

x4m commented Aug 28, 2020

@jonaz wow, cool, thanks! Looks promising.

@MannerMan I've pushed a new commit to #740, plz test if CI comes back green

@MannerMan
Author

@x4m Great! Running a backup with a new binary compiled from your oom branch now. Fingers crossed

./wal-g --version
wal-g version   67df11c 2020.08.28_10:22:59     PostgreSQL

@MannerMan
Author

Nope, still crashed with 'out of memory' 😕
walg_fail5.txt

@nh-nmurk

Hi guys,
just to let you know, we are also experiencing the same problem/errors as reported by @MannerMan.

System specs:
Operating system/version: CentOS 7.8
RAM: 62 GB
WAL-G version: 0.2.15 release binary
PostgreSQL version: 11.7
shared_buffers: 20 GB (hugepages set accordingly)
vm.overcommit = 2

WALG_FILE_PREFIX is set to a local disk (not SSD)

We do not have that many files in the PGDATA folder; in our case it is circa 82,000 files across 52 databases, with an instance size of circa 330 GB.
The difference is also that we get this error after archive_command is executed; the system works normally for some time,
but then all of a sudden:

fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x193bfd8, 0x16)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/panic.go:774 +0x72 fp=0x7ffc567c1578 sp=0x7ffc567c1548 pc=0x9bca02
runtime.sysMap(0xc000000000, 0x4000000, 0x271a7f8)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mem_linux.go:169 +0xc5 fp=0x7ffc567c15b8 sp=0x7ffc567c1578 pc=0x9a6c65
runtime.(*mheap).sysAlloc(0x27017c0, 0x2000, 0x0, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:701 +0x1cd fp=0x7ffc567c1660 sp=0x7ffc567c15b8 pc=0x999d5d
runtime.(*mheap).grow(0x27017c0, 0x1, 0xffffffff)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1255 +0xa3 fp=0x7ffc567c16b0 sp=0x7ffc567c1660 pc=0x9b4d13
runtime.(*mheap).allocSpanLocked(0x27017c0, 0x1, 0x271a808, 0x1)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1170 +0x266 fp=0x7ffc567c1730 sp=0x7ffc567c16b0 pc=0x9b4ba6
runtime.(*mheap).alloc_m(0x27017c0, 0x1, 0x9e002a, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1022 +0xc2 fp=0x7ffc567c1780 sp=0x7ffc567c1730 pc=0x9b4442
runtime.(*mheap).alloc.func1()
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1093 +0x4c fp=0x7ffc567c17b8 sp=0x7ffc567c1780 pc=0x9e5a8c
runtime.(*mheap).alloc(0x27017c0, 0x1, 0x1002a, 0x40000)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1092 +0x8a fp=0x7ffc567c1808 sp=0x7ffc567c17b8 pc=0x9b472a
runtime.(*mcentral).grow(0x27025f8, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mcentral.go:255 +0x7b fp=0x7ffc567c1848 sp=0x7ffc567c1808 pc=0x9a668b
runtime.(*mcentral).cacheSpan(0x27025f8, 0x7f6a6b447000)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mcentral.go:106 +0x2fe fp=0x7ffc567c18a8 sp=0x7ffc567c1848 pc=0x9a61ae
runtime.(*mcache).refill(0x7f6a6b447008, 0x2a)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mcache.go:138 +0x85 fp=0x7ffc567c18c8 sp=0x7ffc567c18a8 pc=0x9a5c55
runtime.(*mcache).nextFree(0x7f6a6b447008, 0x271a82a, 0x7f6a6b447008, 0x8, 0xfffffffffffffff8)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:854 +0x87 fp=0x7ffc567c1900 sp=0x7ffc567c18c8 pc=0x99a587
runtime.mallocgc(0x180, 0x191cfa0, 0x1, 0x271a860)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:1022 +0x793 fp=0x7ffc567c19a0 sp=0x7ffc567c1900 pc=0x99aec3
runtime.newobject(0x191cfa0, 0x4000)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:1151 +0x38 fp=0x7ffc567c19d0 sp=0x7ffc567c19a0 pc=0x99b2b8
runtime.malg(0x1516a00008000, 0x2703e68)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:3232 +0x31 fp=0x7ffc567c1a10 sp=0x7ffc567c19d0 pc=0x9c5da1
runtime.mpreinit(...)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/os_linux.go:324
runtime.mcommoninit(0x26fc280)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:626 +0xc2 fp=0x7ffc567c1a48 sp=0x7ffc567c1a10 pc=0x9bf7c2
runtime.schedinit()
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/proc.go:543 +0x72 fp=0x7ffc567c1aa0 sp=0x7ffc567c1a48 pc=0x9bf422
runtime.rt0_go(0x7ffc567c1ba8, 0x3, 0x7ffc567c1ba8, 0x0, 0x7f6a6a7a3555, 0x2000000000, 0x7ffc567c1ba8, 0x300000000, 0x9e7d20, 0x0, ...)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/asm_amd64.s:214 +0x125 fp=0x7ffc567c1aa8 sp=0x7ffc567c1aa0 pc=0x9e7e55
        2020-08-30 01:37:52.313 UTC [2204] LOG:  archive command failed with exit code 2
2020-08-30 01:37:52.313 UTC [2204] DETAIL:  The failed archive command was: /usr/pgsql-11/bin/archive_command.sh pg_wal/0000000100010EFC000000A7
2020-08-30 01:37:52.480 UTC [2195] LOG:  could not fork autovacuum worker process: Cannot allocate memory
2020-08-30 01:37:52.480 UTC [2195] LOG:  could not fork autovacuum worker process: Cannot allocate memory

Also, there are no OOM killer logs in syslog.

Br,
Nikola

@MannerMan
Author

@x4m I noticed your activity in json-iterator/go#488 - is there a branch with the patched library applied that I can compile and test with?

@x4m
Collaborator

x4m commented Aug 31, 2020

@nh-nmurk I'm not sure your problem originates from WAL-G... do you have atop logs or something similar to identify the process consuming a lot of memory? The stack trace that you provided does not show the JSON problems that @MannerMan has.

@MannerMan to test the patch proposed there, you need to compile the patched WAL-G from #740 with the jsoniter proposed in that discussion. Meanwhile, the author of that patch is not a maintainer of jsoniter...

@x4m
Collaborator

x4m commented Aug 31, 2020

I have no idea why this patch hasn't even been submitted as a pull request. The patch itself looks OK.

@nh-nmurk

@x4m we can't find any additional information in the PostgreSQL logs or syslog that would indicate the system is running out of memory. You can see a list of archived WAL segments and then ...out of memory.

Since we switched back to using test and cp in archive_command, we do not have any problems with memory.

On another production system, we also use wal-g and do not have such problems.

More from the PostgreSQL logs...

INFO: 2020/08/30 01:35:28.070103 FILE PATH: 0000000100010EFC000000A0.lz4
INFO: 2020/08/30 01:35:56.702466 FILE PATH: 0000000100010EFC000000A1.lz4
2020-08-30 01:36:08.092 UTC [10833] LOG:  incomplete startup packet
INFO: 2020/08/30 01:36:27.667880 FILE PATH: 0000000100010EFC000000A2.lz4
2020-08-30 01:36:56.487 UTC [11157] LOG:  incomplete startup packet
INFO: 2020/08/30 01:36:57.670646 FILE PATH: 0000000100010EFC000000A3.lz4
INFO: 2020/08/30 01:37:25.479836 FILE PATH: 0000000100010EFC000000A4.lz4
INFO: 2020/08/30 01:37:35.502408 FILE PATH: 0000000100010EFC000000A5.lz4
INFO: 2020/08/30 01:37:40.082181 FILE PATH: 0000000100010EFC000000A6.lz4
2020-08-30 01:37:44.900 UTC [11568] LOG:  incomplete startup packet
fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x193bfd8, 0x16)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/panic.go:774 +0x72 fp=0x7ffc567c1578 sp=0x7ffc567c1548 pc=0x9bca02
runtime.sysMap(0xc000000000, 0x4000000, 0x271a7f8)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mem_linux.go:169 +0xc5 fp=0x7ffc567c15b8 sp=0x7ffc567c1578 pc=0x9a6c65
runtime.(*mheap).sysAlloc(0x27017c0, 0x2000, 0x0, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/malloc.go:701 +0x1cd fp=0x7ffc567c1660 sp=0x7ffc567c15b8 pc=0x999d5d
runtime.(*mheap).grow(0x27017c0, 0x1, 0xffffffff)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1255 +0xa3 fp=0x7ffc567c16b0 sp=0x7ffc567c1660 pc=0x9b4d13
runtime.(*mheap).allocSpanLocked(0x27017c0, 0x1, 0x271a808, 0x1)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/mheap.go:1170 +0x266 fp=0x7ffc567c1730 sp=0x7ffc567c16b0 pc=0x9b4ba6
runtime.(*mheap).alloc_m(0x27017c0, 0x1, 0x9e002a, 0x0)
...

or

2020-08-30 17:04:29.471 UTC [12967] LOG:  incomplete startup packet
INFO: 2020/08/30 17:04:29.604918 FILE PATH: 0000000100010F02000000F3.lz4
INFO: 2020/08/30 17:05:04.284755 FILE PATH: 0000000100010F02000000F4.lz4
2020-08-30 17:05:17.922 UTC [13486] LOG:  incomplete startup packet
INFO: 2020/08/30 17:05:27.314168 FILE PATH: 0000000100010F02000000F5.lz4
INFO: 2020/08/30 17:05:59.111989 FILE PATH: 0000000100010F02000000F6.lz4
2020-08-30 17:06:08.360 UTC [13909] LOG:  incomplete startup packet
2020-08-30 17:06:31.434 UTC [27498] LOG:  could not fork autovacuum worker process: Cannot allocate memory
TopMemoryContext: 346896 total in 11 blocks; 39728 free (15 chunks); 307168 used
TopMemoryContext: 448648 total in 15 blocks; 105944 free (330 chunks); 342704 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used
  TopTransactionContext: 8192 total in 1 blocks; 7744 free (0 chunks); 448 used
    Prepared Queries: 16384 total in 2 blocks; 6664 free (3 chunks); 9720 used
  TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used
TopTransactionContext: 8192 total in 1 blocks; 7744 free (1 chunks); 448 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
    TopMemoryContext: 615936 total in 19 blocks; 236256 free (663 chunks); 379680 usedOperator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 usedpgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used

this part of the log is, presumably, just a consequence of the out-of-memory condition, not the cause...

goroutine 1 [syscall]:
github.com/spf13/cobra.(*Command).Execute(...)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:864
github.com/wal-g/wal-g/cmd/pg.Execute()
        /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/pg/pg.go:29 +0x31
main.main()
        /home/travis/gopath/src/github.com/wal-g/wal-g/main/pg/main.go:8 +0x20

goroutine 19 [syscall]:
os/signal.signal_recv(0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()os.statNolog(0xc0003eed60, 0x18, 0x18, 0x18, 0x18, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/stat_unix.go:30 +0x31
os.Stat(0xc0003eed60, 0x18, 0x0, 0x0, 0x18, 0xc0003eed60)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/stat.go:13 +0x4d
github.com/spf13/afero.OsFs.Stat(...)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/afero/os.go:87
github.com/spf13/viper.exists(0x1c10b00, 0x2718200, 0xc0003eed60, 0x18, 0x18, 0x1926c8a, 0x3)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/util.go:119 +0x49
github.com/spf13/viper.(*Viper).searchInPath(0xc0000d7b00, 0xc0002cdb50, 0xe, 0x2, 0x3b)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1973 +0x344
github.com/spf13/viper.(*Viper).findConfigFile(0xc0000d7b00, 0xc00012ea50, 0x22, 0x0, 0x0)
        /home/travis/
goroutine 1 [syscall]:
os.statNolog(0xc0003eed60, 0x18, 0x18, 0x18, 0x18, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/stat_unix.go:30 +0x31
os.Stat(0xc0003eed60, 0x18, 0x0, 0x0, 0x18, 0xc0003eed60)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/stat.go:13 +0x4d
github.com/spf13/afero.OsFs.Stat(...)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/afero/os.go:87
github.com/spf13/viper.exists(0x1c10b00, 0x2718200, 0xc0003eed60, 0x18, 0x18, 0x1926c8a, 0x3)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/util.go:119 +0x49
github.com/spf13/viper.(*Viper).searchInPath(0xc0000d7b00, 0xc0002cdb50, 0xe, 0x2, 0x3b)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1973 +0x344
github.com/spf13/viper.(*Viper).findConfigFile(0xc0000d7b00, 0xc00012ea50, 0x22, 0x0, 0x0)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1992 +0x128
github.com/spf13/viper.(*Viper).getConfigFile(0xc0000d7b00, 0xc000477b48, 0x1, 0x1, 0xe9ea0f)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1960 +0x5a
github.com/spf13/viper.(*Viper).ReadInConfig(0xc0000d7b00, 0x194af92, 0x21)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1307 +0x8f
github.com/spf13/viper.ReadInConfig(...)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1304
github.com/wal-g/wal-g/internal.InitConfig()
        /home/travis/gopath/src/github.com/wal-g/wal-g/internal/config.go:249 +0x163
github.com/spf13/cobra.(*Command).preRun(0x267df40)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:856 +0x49
github.com/spf13/cobra.(*Command).execute(0x267df40, 0xc0004d6100, 0x1, 0x1, 0x267df40, 0xc0004d6100)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:792 +0x145
github.com/spf13/cobra.(*Command).ExecuteC(0x267d7c0, 0x9ca6fa, 0x2691b40, 0xc000000180)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:792 +0x145
github.com/spf13/cobra.(*Command).ExecuteC(0x267d7c0, 0x9ca6fa, 0x2691b40, 0xc000000180)
goroutine 1 [syscall]:
os.statNolog(0xc0003eed60, 0x18, 0x18, 0x18, 0x18, 0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/stat_unix.go:30 +0x31
os.Stat(0xc0003eed60, 0x18, 0x0, 0x0, 0x18, 0xc0003eed60)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/stat.go:13 +0x4d
github.com/spf13/afero.OsFs.Stat(...)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/afero/os.go:87
github.com/spf13/viper.exists(0x1c10b00, 0x2718200, 0xc0003eed60, 0x18, 0x18, 0x1926c8a, 0x3)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/util.go:119 +0x49
github.com/spf13/viper.(*Viper).searchInPath(0xc0000d7b00, 0xc0002cdb50, 0xe, 0x2, 0x3b)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1973 +0x344
github.com/spf13/viper.(*Viper).findConfigFile(0xc0000d7b00, 0xc00012ea50, 0x22, 0x0, 0x0)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1992 +0x128
github.com/spf13/viper.(*Viper).getConfigFile(0xc0000d7b00, 0xc000477b48, 0x1, 0x1, 0xe9ea0f)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1960 +0x5a
github.com/spf13/viper.(*Viper).ReadInConfig(0xc0000d7b00, 0x194af92, 0x21)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1307 +0x8f
github.com/spf13/viper.ReadInConfig(...)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/viper/viper.go:1304
github.com/wal-g/wal-g/internal.InitConfig()
        /home/travis/gopath/src/github.com/wal-g/wal-g/internal/config.go:249 +0x163
github.com/spf13/cobra.(*Command).preRun(0x267df40)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:856 +0x49
github.com/spf13/cobra.(*Command).execute(0x267df40, 0xc0004d6100, 0x1, 0x1, 0x267df40, 0xc0004d6100)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:792 +0x145
github.com/spf13/cobra.(*Command).ExecuteC(0x267d7c0, 0x9ca6fa, 0x2691b40, 0xc000000180)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/github.com/spf13/cobra/command.go:864
github.com/wal-g/wal-g/cmd/pg.Execute()
        /home/travis/gopath/src/github.com/wal-g/wal-g/cmd/pg/pg.go:29 +0x31
main.main()
        /home/travis/gopath/src/github.com/wal-g/wal-g/main/pg/main.go:8 +0x20

goroutine 19 [syscall]:
os/signal.signal_recv(0x0)
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
        /home/travis/.gimme/versions/go1.13.9.linux.amd64/src/os/signal/signal_unix.go:29 +0x41

goroutine 21 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0003b2690)
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/go.opencensus.io/stats/view/worker.go:154 +0x100
created by go.opencensus.io/stats/view.init.0
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/go.opencensus.io/stats/view/worker.go:154 +0x100
created by go.opencensus.io/stats/view.init.0
        /home/travis/gopath/src/github.com/wal-g/wal-g/vendor/go.opencensus.io/stats/view/worker.go:32 +0x57

rax    0x0
rbx    0x7f3c00959868
rcx    0xffffffffffffffff
rdx    0x6
rdi    0x37cb
rsi    0x37d1
rbp    0x1c3ef87
rsp    0x7f3bf77fd978
r8     0xa
r9     0x7f3bf77fe700
r10    0x8
r11    0x202
r12    0x7f3be80008c0
r13    0x0
r14    0x1ba3c7c
r15    0x0
rip    0x7f3c005c7387
rflags 0x202
cs     0x33
fs     0x0
gs     0x0
2020-08-30 17:07:04.105 UTC [27526] LOG:  archive command failed with exit code 2
2020-08-30 17:07:04.105 UTC [27526] DETAIL:  The failed archive command was: /usr/pgsql-11/bin/archive_command.sh pg_wal/0000000100010F02000000F8
2020-08-30 17:07:05.216 UTC [27498] LOG:  could not fork autovacuum worker process: Cannot allocate memory
runtime/cgo: runtime/cgo: pthread_create failed: Resource temporarily unavailablepthread_create failed: Resource temporarily unavailable

SIGABRT: abort

Thank you for your time
Nikola

@MannerMan
Author

@x4m I think I was able to apply the json-iterator patch to your oom branch and compile it. The patch failed to apply to the stream_str.go file, but since it was a one-liner I applied it manually. Running a backup with the new build now.

@MannerMan
Author

MannerMan commented Sep 8, 2020

@x4m Hm, not sure what's going on, but the backup still failed. Different error this time though:

INFO: 2020/09/07 20:53:03.218762 Calling pg_stop_backup()
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x1cab9cd, 0x16)
        /usr/local/go/src/runtime/panic.go:774 +0x72
runtime.sysMap(0xc0e0000000, 0x20000000, 0x2c89278)
        /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x2c6ff00, 0x1ed00000, 0x182cf01, 0xffffffff008d6431)
        /usr/local/go/src/runtime/malloc.go:701 +0x1cd
runtime.(*mheap).grow(0x2c6ff00, 0xf680, 0xffffffff)
        /usr/local/go/src/runtime/mheap.go:1255 +0xa3
runtime.(*mheap).allocSpanLocked(0x2c6ff00, 0xf680, 0x2c89288, 0x0)
        /usr/local/go/src/runtime/mheap.go:1170 +0x266
runtime.(*mheap).alloc_m(0x2c6ff00, 0xf680, 0x7f7b10640100, 0x7f7b0fceb340)
        /usr/local/go/src/runtime/mheap.go:1022 +0xc2
runtime.(*mheap).alloc.func1()
        /usr/local/go/src/runtime/mheap.go:1093 +0x4c
runtime.(*mheap).alloc(0x2c6ff00, 0xf680, 0x7f7b3d010100, 0x7f7b10648f88)
        /usr/local/go/src/runtime/mheap.go:1092 +0x8a
runtime.largeAlloc(0x1ed00000, 0xaf0001, 0x7f7b10648f88)
        /usr/local/go/src/runtime/malloc.go:1138 +0x97
runtime.mallocgc.func1()
        /usr/local/go/src/runtime/malloc.go:1033 +0x46
runtime.systemstack(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1146

goroutine 1 [running]:
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc00383b248 sp=0xc00383b240 pc=0xafcb30
runtime.mallocgc(0x1ed00000, 0x1b83940, 0xc0002b6e01, 0x7f7b6664db28)
        /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc00383b2e8 sp=0xc00383b248 pc=0xaac2f5
runtime.newarray(0x1b83940, 0x110000, 0xc0d540fa90)
        /usr/local/go/src/runtime/malloc.go:1173 +0x63 fp=0xc00383b318 sp=0xc00383b2e8 pc=0xaac733
runtime.makeBucketArray(0x1ac6860, 0xc06182da14, 0x0, 0xc00383b390, 0xab3444)
        /usr/local/go/src/runtime/map.go:362 +0x183 fp=0xc00383b350 sp=0xc00383b318 pc=0xaad623
runtime.hashGrow(0x1ac6860, 0xc09967f5f0)
        /usr/local/go/src/runtime/map.go:1033 +0x89 fp=0xc00383b3a0 sp=0xc00383b350 pc=0xaaf299
runtime.mapassign_faststr(0x1ac6860, 0xc09967f5f0, 0xc065342ac0, 0x13, 0xc0d879e410)
        /usr/local/go/src/runtime/map_faststr.go:272 +0x122 fp=0xc00383b408 sp=0xc00383b3a0 pc=0xab2cf2
github.com/wal-g/wal-g/internal.(*BackupSentinelDto).setFiles.func1(0x1a2a480, 0xc0653aacc0, 0x1ba6440, 0xc0675d2090, 0x1)
        /wal-g/internal/backup_sentinel_dto.go:54 +0xdd fp=0xc00383b490 sp=0xc00383b408 pc=0x182007d
sync.(*Map).Range(0xc00040a660, 0xc00383b578)
        /usr/local/go/src/sync/map.go:333 +0x1c3 fp=0xc00383b568 sp=0xc00383b490 pc=0xb1a9d3
github.com/wal-g/wal-g/internal.(*BackupSentinelDto).setFiles(0xc00030e780, 0xc00040a660)
        /wal-g/internal/backup_sentinel_dto.go:51 +0x6b fp=0xc00383b598 sp=0xc00383b568 pc=0x17ec87b
github.com/wal-g/wal-g/internal.createAndPushBackup(0xc0002efc50, 0xc0001902e0, 0x17, 0x1ca3296, 0x10, 0x0, 0x0, 0x0, 0x0, 
0x0, ...)
        /wal-g/internal/backup_push_handler.go:175 +0xcfd fp=0xc00383ba38 sp=0xc00383b598 pc=0x17e9f4d
github.com/wal-g/wal-g/internal.HandleBackupPush(0xc0002efc50, 0x7ffc9b4bb124, 0x17, 0x0)
        /wal-g/internal/backup_push_handler.go:257 +0x274 fp=0xc00383bcd0 sp=0xc00383ba38 pc=0x17eb784
github.com/wal-g/wal-g/cmd/pg.glob..func4(0x2be78c0, 0xc000181d50, 0x1, 0x1)
        /wal-g/cmd/pg/backup_push.go:34 +0xd1 fp=0xc00383bd40 sp=0xc00383bcd0 pc=0x182db11
github.com/spf13/cobra.(*Command).execute(0x2be78c0, 0xc000181d20, 0x1, 0x1, 0x2be78c0, 0xc000181d20)
        /wal-g/vendor/github.com/spf13/cobra/command.go:830 +0x2aa fp=0xc00383be18 sp=0xc00383bd40 pc=0xc8350a
github.com/spf13/cobra.(*Command).ExecuteC(0x2be6740, 0xadcdba, 0x2bfb8e0, 0xc000000180)
        /wal-g/vendor/github.com/spf13/cobra/command.go:914 +0x2fb fp=0xc00383bef0 sp=0xc00383be18 pc=0xc8410b
github.com/spf13/cobra.(*Command).Execute(...)
        /wal-g/vendor/github.com/spf13/cobra/command.go:864
github.com/wal-g/wal-g/cmd/pg.Execute()
        /wal-g/cmd/pg/pg.go:36 +0x31 fp=0xc00383bf50 sp=0xc00383bef0 pc=0x182cf11
main.main()
        /wal-g/main/pg/main.go:8 +0x20 fp=0xc00383bf60 sp=0xc00383bf50 pc=0x182efc0
runtime.main()
        /usr/local/go/src/runtime/proc.go:203 +0x206 fp=0xc00383bfe0 sp=0xc00383bf60 pc=0xad0836
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc00383bfe8 sp=0xc00383bfe0 pc=0xafec01

goroutine 6 [syscall, 73 minutes]:
os/signal.signal_recv(0x0)
        /usr/local/go/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
        /usr/local/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
        /usr/local/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 10 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000e4be0)
        /wal-g/vendor/go.opencensus.io/stats/view/worker.go:154 +0x100
created by go.opencensus.io/stats/view.init.0
        /wal-g/vendor/go.opencensus.io/stats/view/worker.go:32 +0x57

goroutine 12779990 [IO wait]:
internal/poll.runtime_pollWait(0x7f7b665b0ac8, 0x72, 0xffffffffffffffff)
        /usr/local/go/src/runtime/netpoll.go:184 +0x55
internal/poll.(*pollDesc).wait(0xc08fe4e398, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc08fe4e380, 0xc087e8b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:169 +0x1cf
net.(*netFD).Read(0xc08fe4e380, 0xc087e8b000, 0x1000, 0x1000, 0xc0124c6420, 0xc0b7056dd0, 0xc0b7056be8)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc078d46e50, 0xc087e8b000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:184 +0x68
net/http.(*persistConn).Read(0xc000316a20, 0xc087e8b000, 0x1000, 0x1000, 0xc0b7056ed8, 0xafc030, 0xc0b7056ed8)
        /usr/local/go/src/net/http/transport.go:1758 +0x75
bufio.(*Reader).fill(0xc08fa1e720)
        /usr/local/go/src/bufio/bufio.go:100 +0x103
bufio.(*Reader).Peek(0xc08fa1e720, 0x1, 0x2, 0x0, 0x0, 0xc087e88f00, 0x0)
        /usr/local/go/src/bufio/bufio.go:138 +0x4f
net/http.(*persistConn).readLoop(0xc000316a20)
        /usr/local/go/src/net/http/transport.go:1911 +0x1d6
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:1580 +0xb0d

goroutine 12779991 [select]:
net/http.(*persistConn).writeLoop(0xc000316a20)
        /usr/local/go/src/net/http/transport.go:2210 +0x123
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:1581 +0xb32

goroutine 13206650 [select]:
net/http.(*persistConn).writeLoop(0xc0002c8a20)
        /usr/local/go/src/net/http/transport.go:2210 +0x123
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:1581 +0xb32

goroutine 13206649 [IO wait]:
internal/poll.runtime_pollWait(0x7f7b665b0928, 0x72, 0xffffffffffffffff)
        /usr/local/go/src/runtime/netpoll.go:184 +0x55
internal/poll.(*pollDesc).wait(0xc08fe4e418, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc08fe4e400, 0xc057c2e000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:169 +0x1cf
net.(*netFD).Read(0xc08fe4e400, 0xc057c2e000, 0x1000, 0x1000, 0xc0002e9380, 0xc00007fdd0, 0xc00007fbe8)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc0980c06e8, 0xc057c2e000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:184 +0x68
net/http.(*persistConn).Read(0xc0002c8a20, 0xc057c2e000, 0x1000, 0x1000, 0xc00007fed8, 0xafc030, 0xc00007fed8)
        /usr/local/go/src/net/http/transport.go:1758 +0x75
bufio.(*Reader).fill(0xc03fff3980)
        /usr/local/go/src/bufio/bufio.go:100 +0x103
bufio.(*Reader).Peek(0xc03fff3980, 0x1, 0x2, 0x0, 0x0, 0xc0000b2600, 0x0)
        /usr/local/go/src/bufio/bufio.go:138 +0x4f
net/http.(*persistConn).readLoop(0xc0002c8a20)
        /usr/local/go/src/net/http/transport.go:1911 +0x1d6
created by net/http.(*Transport).dialConn
        /usr/local/go/src/net/http/transport.go:1580 +0xb0d

I have run the backup twice and got this error both times. Not sure if it actually makes it as far as previously, or if it now fails at an earlier stage.

@jonaz

jonaz commented Sep 8, 2020

@MannerMan can you paste the whole output? The error itself seems to be missing; only part of the stack trace is there.

@MannerMan
Author

@jonaz Indeed, I was running a screen-within-a-screen, which gobbled up part of the output. I've updated the comment above with the full stack trace. Still an out-of-memory error, as it turns out.

@x4m
Collaborator

x4m commented Sep 28, 2021

JFYI we are working on this problem at #1103 too

@usernamedt
Member

Fixed in #1101. The WALG_WITHOUT_FILES_METADATA setting allows disabling files metadata tracking, so backups with millions of files can be created without OOM errors. The only limitation is that it does not support delta backups.
