repository/index: Avoid creating an error when checking if an id is in the index. #1538

MJDSys · 2018-01-08T18:48:58Z

What is the purpose of this change? What does it change?

When backing up several million files (>14M tested here) with few changes,
a large amount of time is spent failing to find an id in an index and creating
an error to signify this. Since this is checked using the Has method,
which doesn't use this error, this time creating the error is wasted.

Instead, create a new function lookupInternal that doesn't create this error.
Change Has to use this function and change Lookup to use lookupInternal and
create the error if it fails to find the id.

This results in saving about one hour of wall time (about half the backup
time) and about six hours of cpu time on my system.

I'm not sure if you want Lookup just changed to work with my lookupInternal, and then fixup all the reverse dependencies. If this is wanted, I'll update my PR appropriately. I also wasn't sure if you wanted a changelog entry, but I'm happy to add that too if wanted.

Was the change discussed in an issue or in the forum before?

No.

Checklist

I have read the Contribution Guidelines
I have added tests for all changes in this PR
I have added documentation for the changes (in the manual)
There's a new file in a subdir of changelog/x.y.z that describe the changes for our users (template here)
I have run gofmt on the code in all commits
All commit messages are formatted in the same style as the other commits in the repo
I'm done, this Pull Request is ready for review

fd0

Hey, thanks for proposing the enhancemente! What I'd love to see is a benchmark showing the improvement... I can already imagine that creating the error takes CPU time (and memory), since we're using the github.com/pkg/errors library, which capture stack traces.

I'd like to propose a slightly different approach, which will use even less CPU time and memory: Undo your changes, then modify the Has() method to try the lookup in idx.pack directly (after locking of course). That will be even more efficient because bundling the locations where blobs are stored isn't needed.

What are your thoughts?

codecov-io · 2018-01-08T19:34:02Z

Codecov Report

Merging #1538 into master will decrease coverage by 5.24%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #1538      +/-   ##
=========================================
- Coverage   51.55%   46.3%   -5.25%     
=========================================
  Files         138     138              
  Lines       11080   11081       +1     
=========================================
- Hits         5712    5131     -581     
- Misses       4503    5139     +636     
+ Partials      865     811      -54

Impacted Files	Coverage Δ
internal/repository/index.go	`67.82% <100%> (+0.11%)`	⬆️
internal/backend/b2/b2.go	`0% <0%> (-78.84%)`	⬇️
internal/backend/azure/azure.go	`0% <0%> (-77.85%)`	⬇️
internal/backend/swift/swift.go	`0% <0%> (-76.03%)`	⬇️
internal/backend/gs/gs.go	`0% <0%> (-72%)`	⬇️
internal/backend/swift/config.go	`36.95% <0%> (-54.35%)`	⬇️
internal/backend/s3/s3.go	`65.76% <0%> (+1.15%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99f0fce...539599d. Read the comment docs.

MJDSys · 2018-01-08T19:40:38Z

Regarding the benchmark, I posted a note about the speed up I get running a backup of my system (saving about an hour of time in a 2-3 hour backup). Is there a more established way to do the benchmark? If it helps, I found this by doing some CPU profiling of my backup.

Regarding your suggestion, that does sound great. If I understand correctly, I just have the one outer if statement? That sounds like it could help some more (lookupInternal still dominates the cpu profile). I'll give it a test here and see how it goes.

fd0 · 2018-01-08T20:16:59Z

Hey, for Go software it's very easy to write a micro benchmark of just the Has() function for the index:

const maxPackSize = 16 * 1024 * 1024

func createRandomIndex() (idx *repository.Index, lookupID restic.ID) {
	idx = repository.NewIndex()

	// create index with 200k pack files
	for i := 0; i < 200000; i++ {
		packID := restic.NewRandomID()
		offset := 0
		for offset < maxPackSize {
			size := 2000 + rand.Intn(4*1024*1024)
			id := restic.NewRandomID()
			idx.Store(restic.PackedBlob{
				PackID: packID,
				Blob: restic.Blob{
					Type:   restic.DataBlob,
					ID:     id,
					Length: uint(size),
					Offset: uint(offset),
				},
			})

			offset += size

			if rand.Float32() < 0.001 && lookupID.IsNull() {
				lookupID = id
			}
		}
	}

	return idx, lookupID
}

func BenchmarkIndexHasUnknown(b *testing.B) {
	idx, _ := createRandomIndex()
	lookupID := restic.NewRandomID()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		idx.Has(lookupID, restic.DataBlob)
	}
}

func BenchmarkIndexHasKnown(b *testing.B) {
	idx, lookupID := createRandomIndex()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		idx.Has(lookupID, restic.DataBlob)
	}
}

You can then run just this benchmark and look at the memory allocation:

$ go test -v -run xxxxx -bench IndexHas -benchmem | tee before
goos: linux
goarch: amd64
pkg: github.com/restic/restic/internal/repository
BenchmarkIndexHasUnknown-4   	 1000000	      1464 ns/op	     608 B/op	      10 allocs/op
BenchmarkIndexHasKnown-4     	 5000000	       280 ns/op	     128 B/op	       5 allocs/op
PASS
ok  	github.com/restic/restic/internal/repository	20.428s

So, we can see that each call of Has() takes about 300ns and allocates 128 Byte for known blobs, and 1500ns and 600 byte for unknown blobs.

With your change cherry picked, this looks already much better:

$ go test -v -run xxxxx -bench IndexHas -benchmem | tee after
goos: linux
goarch: amd64
pkg: github.com/restic/restic/internal/repository
BenchmarkIndexHasUnknown-4   	10000000	       143 ns/op	      16 B/op	       2 allocs/op
BenchmarkIndexHasKnown-4     	 5000000	       276 ns/op	     128 B/op	       5 allocs/op
PASS
ok  	github.com/restic/restic/internal/repository	22.466s

We can use benchcmp to compare it:

$ benchcmp before after
benchmark                      old ns/op     new ns/op     delta
BenchmarkIndexHasUnknown-4     1464          143           -90.23%
BenchmarkIndexHasKnown-4       280           276           -1.43%

benchmark                      old allocs     new allocs     delta
BenchmarkIndexHasUnknown-4     10             2              -80.00%
BenchmarkIndexHasKnown-4       5              5              +0.00%

benchmark                      old bytes     new bytes     delta
BenchmarkIndexHasUnknown-4     608           16            -97.37%
BenchmarkIndexHasKnown-4       128           128           +0.00%

With my suggestions, we can even improve further:

$ go test -v -run xxxxx -bench IndexHas -benchmem | tee after2
goos: linux
goarch: amd64
pkg: github.com/restic/restic/internal/repository
BenchmarkIndexHasUnknown-4   	20000000	        76.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkIndexHasKnown-4     	20000000	        72.9 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/restic/restic/internal/repository	22.371s

And benchcmp agrees: 😄

$ benchcmp before after2
benchmark                      old ns/op     new ns/op     delta
BenchmarkIndexHasUnknown-4     1464          76.4          -94.78%
BenchmarkIndexHasKnown-4       280           72.9          -73.96%

benchmark                      old allocs     new allocs     delta
BenchmarkIndexHasUnknown-4     10             0              -100.00%
BenchmarkIndexHasKnown-4       5              0              -100.00%

benchmark                      old bytes     new bytes     delta
BenchmarkIndexHasUnknown-4     608           0             -100.00%
BenchmarkIndexHasKnown-4       128           0             -100.00%

So, if it's okay for you, I'd like to take over from here. Is that okay for you?

Thank you very very much for pointing me to this improvement opportunity!

MJDSys · 2018-01-08T20:26:16Z

Ah, I didn't realize you meant that kind of benchmark. Sorry about that.

Feel free to take over this pr if you want. I pushed my changes to the Has function (undoing my other changes). I also whipped up a quick test for the Has function (though it seems thoroughly tested by all the other tests), so feel free to grab that as well. I have checked the "Allow Maintainers to make changes" button if you want to just add to this pr.

I think there is still some optimizations available to get my backup to run faster, so I'll look into those as well. If I find any, I'll try to add a mirco-benchmark for those as well next time.

fd0 · 2018-01-08T20:30:06Z

Cool, thanks! I'll reformat the title of your commit, but otherwise leave it intact so that you get the credit for the change. Cheers!

When backing up several million files (>14M tested here) with few changes, a large amount of time is spent failing to find an id in an index and creating an error to signify this. Since this is checked using the Has method, which doesn't use this error, this time creating the error is wasted. Instead, directly check if the given id and type are present in the index. This also avoids reporting all the packs containing this blob, further reducing cpu usage.

repository/index: Avoid creating an error when checking if an id is in the index.

jojomi · 2018-01-10T18:27:45Z

Thanks to @fd0 and @MJDSys for so nicely working together in the way only OSS can enable to the benefit of us all. Nice improvement you found there and I would love to see more of these "little things" improved :).

Looking at the speed of restic it's incredible to see that this can still be a 50% improvement on backup time! (see @fd0's comment).

fd0 · 2018-01-11T20:24:14Z

@MJDSys FYI, this PR cut my backup time on a larger server (~250GB data size, ~400GB repo size) in half: from ~45 min to ~20 min. Thanks again!

fd0 requested changes Jan 8, 2018

View reviewed changes

Add benchmark for Index.Has()

d77a326

MJDSys force-pushed the make_lookup_internal branch from ed01fca to 063e883 Compare January 8, 2018 20:22

fd0 force-pushed the make_lookup_internal branch from 063e883 to 539599d Compare January 8, 2018 20:48

fd0 merged commit 539599d into restic:master Jan 9, 2018

fd0 added a commit that referenced this pull request Jan 9, 2018

Merge pull request #1538 from MJDSys/make_lookup_internal

5632ca4

repository/index: Avoid creating an error when checking if an id is in the index.

This was referenced Jan 12, 2018

More optimizations to avoid calling Index.Lookup() #1549

Merged

Improving restics performance on large mostly unchanging datasets #1550

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

repository/index: Avoid creating an error when checking if an id is in the index. #1538

repository/index: Avoid creating an error when checking if an id is in the index. #1538

MJDSys commented Jan 8, 2018

fd0 left a comment

codecov-io commented Jan 8, 2018 •

edited

Loading

MJDSys commented Jan 8, 2018

fd0 commented Jan 8, 2018 •

edited

Loading

MJDSys commented Jan 8, 2018

fd0 commented Jan 8, 2018

jojomi commented Jan 10, 2018 •

edited

Loading

fd0 commented Jan 11, 2018

repository/index: Avoid creating an error when checking if an id is in the index. #1538

repository/index: Avoid creating an error when checking if an id is in the index. #1538

Conversation

MJDSys commented Jan 8, 2018

What is the purpose of this change? What does it change?

Was the change discussed in an issue or in the forum before?

Checklist

fd0 left a comment

Choose a reason for hiding this comment

codecov-io commented Jan 8, 2018 • edited Loading

Codecov Report

MJDSys commented Jan 8, 2018

fd0 commented Jan 8, 2018 • edited Loading

MJDSys commented Jan 8, 2018

fd0 commented Jan 8, 2018

jojomi commented Jan 10, 2018 • edited Loading

fd0 commented Jan 11, 2018

codecov-io commented Jan 8, 2018 •

edited

Loading

fd0 commented Jan 8, 2018 •

edited

Loading

jojomi commented Jan 10, 2018 •

edited

Loading