Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repository/index: Avoid creating an error when checking if an id is in the index. #1538

Merged
merged 2 commits into from
Jan 9, 2018

Conversation

MJDSys
Copy link
Contributor

@MJDSys MJDSys commented Jan 8, 2018

What is the purpose of this change? What does it change?

When backing up several million files (>14M tested here) with few changes,
a large amount of time is spent failing to find an id in an index and creating
an error to signify this. Since this is checked using the Has method,
which doesn't use this error, this time creating the error is wasted.

Instead, create a new function lookupInternal that doesn't create this error.
Change Has to use this function and change Lookup to use lookupInternal and
create the error if it fails to find the id.

This results in saving about one hour of wall time (about half the backup
time) and about six hours of cpu time on my system.

I'm not sure if you want Lookup just changed to work with my lookupInternal, and then fixup all the reverse dependencies. If this is wanted, I'll update my PR appropriately. I also wasn't sure if you wanted a changelog entry, but I'm happy to add that too if wanted.

Was the change discussed in an issue or in the forum before?

No.

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in a subdir of changelog/x.y.z that describe the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

Copy link
Member

@fd0 fd0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for proposing the enhancemente! What I'd love to see is a benchmark showing the improvement... I can already imagine that creating the error takes CPU time (and memory), since we're using the github.com/pkg/errors library, which capture stack traces.

I'd like to propose a slightly different approach, which will use even less CPU time and memory: Undo your changes, then modify the Has() method to try the lookup in idx.pack directly (after locking of course). That will be even more efficient because bundling the locations where blobs are stored isn't needed.

What are your thoughts?

@codecov-io
Copy link

codecov-io commented Jan 8, 2018

Codecov Report

Merging #1538 into master will decrease coverage by 5.24%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #1538      +/-   ##
=========================================
- Coverage   51.55%   46.3%   -5.25%     
=========================================
  Files         138     138              
  Lines       11080   11081       +1     
=========================================
- Hits         5712    5131     -581     
- Misses       4503    5139     +636     
+ Partials      865     811      -54
Impacted Files Coverage Δ
internal/repository/index.go 67.82% <100%> (+0.11%) ⬆️
internal/backend/b2/b2.go 0% <0%> (-78.84%) ⬇️
internal/backend/azure/azure.go 0% <0%> (-77.85%) ⬇️
internal/backend/swift/swift.go 0% <0%> (-76.03%) ⬇️
internal/backend/gs/gs.go 0% <0%> (-72%) ⬇️
internal/backend/swift/config.go 36.95% <0%> (-54.35%) ⬇️
internal/backend/s3/s3.go 65.76% <0%> (+1.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99f0fce...539599d. Read the comment docs.

@MJDSys
Copy link
Contributor Author

MJDSys commented Jan 8, 2018

Regarding the benchmark, I posted a note about the speed up I get running a backup of my system (saving about an hour of time in a 2-3 hour backup). Is there a more established way to do the benchmark? If it helps, I found this by doing some CPU profiling of my backup.

Regarding your suggestion, that does sound great. If I understand correctly, I just have the one outer if statement? That sounds like it could help some more (lookupInternal still dominates the cpu profile). I'll give it a test here and see how it goes.

@fd0
Copy link
Member

fd0 commented Jan 8, 2018

Hey, for Go software it's very easy to write a micro benchmark of just the Has() function for the index:

const maxPackSize = 16 * 1024 * 1024

func createRandomIndex() (idx *repository.Index, lookupID restic.ID) {
	idx = repository.NewIndex()

	// create index with 200k pack files
	for i := 0; i < 200000; i++ {
		packID := restic.NewRandomID()
		offset := 0
		for offset < maxPackSize {
			size := 2000 + rand.Intn(4*1024*1024)
			id := restic.NewRandomID()
			idx.Store(restic.PackedBlob{
				PackID: packID,
				Blob: restic.Blob{
					Type:   restic.DataBlob,
					ID:     id,
					Length: uint(size),
					Offset: uint(offset),
				},
			})

			offset += size

			if rand.Float32() < 0.001 && lookupID.IsNull() {
				lookupID = id
			}
		}
	}

	return idx, lookupID
}

func BenchmarkIndexHasUnknown(b *testing.B) {
	idx, _ := createRandomIndex()
	lookupID := restic.NewRandomID()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		idx.Has(lookupID, restic.DataBlob)
	}
}

func BenchmarkIndexHasKnown(b *testing.B) {
	idx, lookupID := createRandomIndex()

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		idx.Has(lookupID, restic.DataBlob)
	}
}

You can then run just this benchmark and look at the memory allocation:

$ go test -v -run xxxxx -bench IndexHas -benchmem | tee before
goos: linux
goarch: amd64
pkg: github.com/restic/restic/internal/repository
BenchmarkIndexHasUnknown-4   	 1000000	      1464 ns/op	     608 B/op	      10 allocs/op
BenchmarkIndexHasKnown-4     	 5000000	       280 ns/op	     128 B/op	       5 allocs/op
PASS
ok  	github.com/restic/restic/internal/repository	20.428s

So, we can see that each call of Has() takes about 300ns and allocates 128 Byte for known blobs, and 1500ns and 600 byte for unknown blobs.

With your change cherry picked, this looks already much better:

$ go test -v -run xxxxx -bench IndexHas -benchmem | tee after
goos: linux
goarch: amd64
pkg: github.com/restic/restic/internal/repository
BenchmarkIndexHasUnknown-4   	10000000	       143 ns/op	      16 B/op	       2 allocs/op
BenchmarkIndexHasKnown-4     	 5000000	       276 ns/op	     128 B/op	       5 allocs/op
PASS
ok  	github.com/restic/restic/internal/repository	22.466s

We can use benchcmp to compare it:

$ benchcmp before after
benchmark                      old ns/op     new ns/op     delta
BenchmarkIndexHasUnknown-4     1464          143           -90.23%
BenchmarkIndexHasKnown-4       280           276           -1.43%

benchmark                      old allocs     new allocs     delta
BenchmarkIndexHasUnknown-4     10             2              -80.00%
BenchmarkIndexHasKnown-4       5              5              +0.00%

benchmark                      old bytes     new bytes     delta
BenchmarkIndexHasUnknown-4     608           16            -97.37%
BenchmarkIndexHasKnown-4       128           128           +0.00%

With my suggestions, we can even improve further:

$ go test -v -run xxxxx -bench IndexHas -benchmem | tee after2
goos: linux
goarch: amd64
pkg: github.com/restic/restic/internal/repository
BenchmarkIndexHasUnknown-4   	20000000	        76.4 ns/op	       0 B/op	       0 allocs/op
BenchmarkIndexHasKnown-4     	20000000	        72.9 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/restic/restic/internal/repository	22.371s

And benchcmp agrees: 😄

$ benchcmp before after2
benchmark                      old ns/op     new ns/op     delta
BenchmarkIndexHasUnknown-4     1464          76.4          -94.78%
BenchmarkIndexHasKnown-4       280           72.9          -73.96%

benchmark                      old allocs     new allocs     delta
BenchmarkIndexHasUnknown-4     10             0              -100.00%
BenchmarkIndexHasKnown-4       5              0              -100.00%

benchmark                      old bytes     new bytes     delta
BenchmarkIndexHasUnknown-4     608           0             -100.00%
BenchmarkIndexHasKnown-4       128           0             -100.00%

So, if it's okay for you, I'd like to take over from here. Is that okay for you?

Thank you very very much for pointing me to this improvement opportunity!

@MJDSys
Copy link
Contributor Author

MJDSys commented Jan 8, 2018

Ah, I didn't realize you meant that kind of benchmark. Sorry about that.

Feel free to take over this pr if you want. I pushed my changes to the Has function (undoing my other changes). I also whipped up a quick test for the Has function (though it seems thoroughly tested by all the other tests), so feel free to grab that as well. I have checked the "Allow Maintainers to make changes" button if you want to just add to this pr.

I think there is still some optimizations available to get my backup to run faster, so I'll look into those as well. If I find any, I'll try to add a mirco-benchmark for those as well next time.

@fd0
Copy link
Member

fd0 commented Jan 8, 2018

Cool, thanks! I'll reformat the title of your commit, but otherwise leave it intact so that you get the credit for the change. Cheers!

When backing up several million files (>14M tested here) with few changes,
a large amount of time is spent failing to find an id in an index and creating
an error to signify this.  Since this is checked using the Has method,
which doesn't use this error, this time creating the error is wasted.

Instead, directly check if the given id and type are present in the index.
This also avoids reporting all the packs containing this blob, further
reducing cpu usage.
@fd0 fd0 merged commit 539599d into restic:master Jan 9, 2018
fd0 added a commit that referenced this pull request Jan 9, 2018
repository/index: Avoid creating an error when checking if an id is in the index.
@jojomi
Copy link

jojomi commented Jan 10, 2018

Thanks to @fd0 and @MJDSys for so nicely working together in the way only OSS can enable to the benefit of us all. Nice improvement you found there and I would love to see more of these "little things" improved :).

Looking at the speed of restic it's incredible to see that this can still be a 50% improvement on backup time! (see @fd0's comment).

@fd0
Copy link
Member

fd0 commented Jan 11, 2018

@MJDSys FYI, this PR cut my backup time on a larger server (~250GB data size, ~400GB repo size) in half: from ~45 min to ~20 min. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants