Skip to content
This repository has been archived by the owner on Apr 12, 2019. It is now read-only.

Faster commit lookup #91

Merged
merged 4 commits into from
Dec 10, 2017
Merged

Faster commit lookup #91

merged 4 commits into from
Dec 10, 2017

Conversation

ethantkoenig
Copy link
Member

@ethantkoenig ethantkoenig commented Nov 19, 2017

A faster implementation of GetCommitsInfo, particularly for repositories with many files. Addresses go-gitea/gitea#491 and go-gitea/gitea#502.

A revival of the failed #53 (which was reverted by #73).

BENCHMARKS

Old implementations:

BenchmarkEntries_GetCommitsInfo/gitea-4         	     200	  89001816 ns/op
BenchmarkEntries_GetCommitsInfo/manyfiles-4     	       1	47869780305 ns/op
BenchmarkEntries_GetCommitsInfo/moby-4          	     100	 174120584 ns/op
BenchmarkEntries_GetCommitsInfo/go-4            	      20	 727309260 ns/op
BenchmarkEntries_GetCommitsInfo/linux-4         	      20	 853282140 ns/op

New implementation:

BenchmarkEntries_GetCommitsInfo/gitea-4         	     200	  79968242 ns/op
BenchmarkEntries_GetCommitsInfo/manyfiles-4     	      50	 261940680 ns/op
BenchmarkEntries_GetCommitsInfo/moby-4          	     100	 154584374 ns/op
BenchmarkEntries_GetCommitsInfo/go-4            	      20	 715422795 ns/op
BenchmarkEntries_GetCommitsInfo/linux-4         	      20	 833090406 ns/op

In summary, we see a >150x improvement in repos with many files, and mild improvements in other repos.

IMPLEMENTATION DETAILS:
Runs targeted and untargeted searches in parallel. The untargeted search (1 thread) sifts through the output of a git log --name-status -c -- <tree-path> command, seeing which files the most recent commits have affected. The targeted search (multiple threads) runs git rev-list -1 HEAD -- <entry> commands to get the latest commit for specific file that have not been found by the untargeted search. The targeted search is similar to the old implementation, with a few enhancements (e.g. uses git rev-list instead of git log).

If there are only a few remaining entries, then the untargeted search stops to allow the targeted-search threads to finish without any interference (in my tests, this interference had a non-negligible impact on performance).

TESTING:
I have not been able to test this new implementation on a Windows machine. If someone could test the new implementation on Windows (to check for forward-slash vs. backslash bugs), that would be great.

@ethantkoenig ethantkoenig force-pushed the fast_commits branch 2 times, most recently from 141bc21 to f366cad Compare November 19, 2017 03:03
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
defer cancel()

args := []string{"log", getCommitsInfoPretty, "--name-status", "-c"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this command ever be faster than the targetedSearch ?

Copy link
Member Author

@ethantkoenig ethantkoenig Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is faster for repos with many files (which would otherwise require many calls to targetedSearch vs. one untargeted search). See the benchmark results for the manyfiles repo.

Copy link
Member

@bkcsoft bkcsoft Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe have 2 benchmarks, one for each type of search, just to convince me 😛

Copy link
Member Author

@ethantkoenig ethantkoenig Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkcsoft I believe the existing benchmarks are telling on their own. Using only the targetedSearch is essentially the same as the old implementation (aside from a few small tweaks). The only explanation for the very large (>150x) performance improvement in the manyfiles repo between the old and new implementations is that running an "untargeted" search in parallel is helpful (in this case, very helpful).

The advantage of using a "untargeted" approach in repos with many files is also reflected in the benchmarks for #53 (which solely used an untargeted approach)

commit_info.go Outdated
}
if line[0] >= 'A' && line[0] <= 'X' { // a file was changed by the current commit
tabIndex := strings.IndexByte(line, '\t')
if tabIndex < 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MM is valid for Modified files in a Merge commit 🤔

c757efaaa0267ec823ddc0858cf6ec930e93a074 1510393589 Merge branch 'gitlab-git-f7537ce03a29b' into 'master'
MM      CHANGELOG.md

Copy link
Member

@bkcsoft bkcsoft Nov 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you have moves like this: (no clue what the digits are for 🤔 )

R088    internal/server/auth.go internal/server/auth/auth.go
R070    internal/ref/helper.go  internal/git/ref.go
R055    ruby/vendor/gitlab_git/lib/gitlab/git/committer.rb      ruby/vendor/gitlab_git/lib/gitlab/git/user.rb

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarity-score apparently:

  <X><score>  The rename or copy score (denoting the percentage
of similarity between the source and target of the
move or copy). For example "R100" or "C75".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkcsoft The code will handle MM correctly (as far as I can tell, since the line still begins with an upper case character). I'll update the code to correctly handle R... and C... (where there will be two filepaths, and we should look at the second one)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, I read tabIndex > 1 which would fail. But it's tabIndex < 1 😅

commit_info.go Outdated
func (state *getCommitsInfoState) numRemainingEntries() int {
state.lock.Lock()
numRemaining := len(state.entries) - len(state.commits)
state.lock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to move this just after lock and use defer state.lock.Unlock()

Copy link
Member Author

@ethantkoenig ethantkoenig Dec 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally did not use defer for performance. This function will be called many times, so small changes can have a meaningful impact on overall performance. See golang/go#6980

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will go on that with @ethantkoenig defer add some overhead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lafriks friendly ping

commit_info.go Outdated
state.targetedPaths[entryPath] = struct{}{}
break
}
state.lock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - move this just after lock and use defer state.lock.Unlock()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here since no return in between.

commit_info.go Outdated
state.commits[entryPath] = commit
updated = true
}
state.lock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here with defer

commit_info.go Outdated
func (state *getCommitsInfoState) numRemainingEntries() int {
state.lock.Lock()
numRemaining := len(state.entries) - len(state.commits)
state.lock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will go on that with @ethantkoenig defer add some overhead

commit_info.go Outdated
state.targetedPaths[entryPath] = struct{}{}
break
}
state.lock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here since no return in between.

@sapk
Copy link
Member

sapk commented Dec 1, 2017

LGTM

@lafriks
Copy link
Member

lafriks commented Dec 9, 2017

@ethantkoenig Is it really worth risking locking up if unexpected error occurs? Can you retest test how big performance penalty in this case it is with defer?

@ethantkoenig
Copy link
Member Author

@lafriks Overhead of using defer was less than I expected (essentially neglible), so I've switched to using defer.

@lafriks
Copy link
Member

lafriks commented Dec 10, 2017

LGTM

@lafriks lafriks merged commit 02eccf8 into go-gitea:master Dec 10, 2017
@lafriks
Copy link
Member

lafriks commented Dec 10, 2017

@ethantkoenig please update git dep in gitea repository

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants