Implement APIs for Git serialized commit graphs #965

filipnavara · 2018-09-26T15:07:56Z

The Git serialized commit graph should offer speed-up for certain types of history traversal. It was added as experimental feature in Git 2.18 and is greatly detailed in the blog series by Derrick Stolee. It would be nice if go-git could offer access to the feature, both at low-level (verifying, building and querying the commit graph file) and high-level (speeding up log traversals with it, if applicable).

I've started a proof-of-concept code at https://github.com/filipnavara/go-git/compare/perf-read...filipnavara:commitgraph?expand=1. However, the API is far from nice, the code is not tested (besides running it on my own repository) and it is generally messy.

Since go-git currently defines Commit objects as struct I was left with no other choice but to introduce a CommitNode interface, which is watered down version of Commit as present in the serialized commit graph. In ideal world there would be only one Commit interface and the commit graph implementation of it would lazy-load the real Commit objects if necessary.

Here's a rough approch of my implementation:

At the lowest level there is commitgraph package, which provides the Node and Index interfaces representing the data at the file level (https://github.com/git/git/blob/2d3b1c576c85b7f5db1f418907af00ab88e0c303/Documentation/technical/commit-graph-format.txt). There is implementation of the interface using random-access files / memory-mapped files (FileIndex), in-memory implementation (MemoryIndex) and an Encoder, which can write down the memory index into new file.

Building a memory index from all commits reachable from an existing commit could be accomplished by a code like this:

func buildCommitGraph(c *object.Commit) (*commitgraph.MemoryIndex, error) {
	idx := commitgraph.NewMemoryIndex()
	seen := make(map[plumbing.Hash]bool)
	return idx, addCommitToIndex(idx, c, seen)
}

func addCommitToIndex(idx *commitgraph.MemoryIndex, c *object.Commit, seen map[plumbing.Hash]bool) error {
	if seen[c.Hash] {
		return nil
	}
	seen[c.Hash] = true

	// Recursively add parents first
	err := c.Parents().ForEach(func(parent *object.Commit) error {
		return addCommitToIndex(idx, parent, seen)
	})
	if err != nil {
		return err
	}

	// Add this commit if it hasn't been done already
	node := &commitgraph.Node{
		TreeHash:     c.TreeHash,
		ParentHashes: c.ParentHashes,
		When:         c.Committer.When,
	}
	return idx.Add(c.Hash, node)
}

The access to the index is exposed through CommitGraphStorer interface in the storer package that is implemented by the memory storage and filesystem storage. The file system storage implements it using memory-mapped files for reading and using the encoder for writing. [Note to myself: The memory mapped files are currently not closed with the repository!]

In the object package CommitNode and CommitNodeIndex interfaces are added. The CommitNode interface is implemented by existing Commit structure and also a new lightweight graphCommitNode. The CommitNodeIndex interface provides methods for looking up commit parents and getting a full Commit object from CommitNode. Two implementations of CommitNodeIndex exist. The first one is objectCommitNodeIndex, which only uses Commit objects and implements the interfaces to behave exactly as if no serialized commit graph existed. Second one, graphCommitNodeIndex, takes the additional commitgraph.Index object and implements the lookup methods by trying the commit graph first and falling back to loading full Commit objects if the commit is not present in the commit graph file.

I added NewCommitNodeIterCTime iterator as a counterpart to NewCommitIterCTime, which operates on top of the CommitNode and CommitNodeIndex interfaces. Similar thing could be done for other NewCommit*Iter* methods. In fact, it is easily possible to reimplement NewCommit*Iter* on top of NewCommitNode*Iter* and switch between the lookup implementations (graphCommitNodeIndex / objectCommitNodeIndex) based on the paricular workload at hand. When full commit information, such as Message or Author is needed, then it's more useful to load the objects directly. When only summary information is needed (eg. counting distance between two commits) then the commit graph implementation can be used.

Lastly, I added the method CommitNodeIndex to git.Repository, which returns either graphCommitNodeIndex or objectCommitNodeIndex depending on the presence of the serialized commit graph. [Note to myself: It should probably consult the gc.commitGraph config option.]

The text was updated successfully, but these errors were encountered:

filipnavara · 2018-09-26T15:10:01Z

Example of walking the commit graph with the above APIs:

func getLastCommitForPaths(r *git.Repository, c *object.Commit, paths []string) ([]*object.Commit, error) {
	commitIndex := r.CommitNodeIndex()
	cIter := object.NewCommitNodeIterCTime(c, commitIndex, nil, nil)
	result := make([]*object.Commit, len(paths))
	remainingResults := len(paths)

	cTree, err := c.Tree()
	if err != nil {
		return nil, err
	}

	currentEntryHashes := make([]plumbing.Hash, len(paths))
	for i, path := range paths {
		cEntry, err := cTree.FindEntry(path)
		if err != nil {
			return nil, err
		}
		currentEntryHashes[i] = cEntry.Hash
	}

	cIter.ForEach(func(current object.CommitNode) error {
		newEntryHashes := make([]plumbing.Hash, len(paths))

		err := commitIndex.ParentNodes(current).ForEach(func(parent object.CommitNode) error {
			parentTree, err := parent.Tree()
			if err != nil {
				return err
			}

			for i, path := range paths {
				// skip path if we already found it
				if currentEntryHashes[i] != plumbing.ZeroHash {
					// find parents that contain the path
					if parentEntry, err := parentTree.FindEntry(path); err == nil {
						// if the hash for the path differs in the parent then the current commit changed it
						if parentEntry.Hash == currentEntryHashes[i] {
							newEntryHashes[i] = currentEntryHashes[i]
						} else {
							// mark for saving the result below
							newEntryHashes[i] = plumbing.ZeroHash
							// stop any further processing for this file
							currentEntryHashes[i] = plumbing.ZeroHash
						}
					}
				}
			}

			return nil
		})
		if err != nil {
			return err
		}

		// if a file didn't exist in any parent commit then it must have been created in the
		// current one. also we mark changed files in the loop above as not present in the
		// parent to simplify processing
		var currentCommit *object.Commit
		for i, newEntryHash := range newEntryHashes {
			if newEntryHash == plumbing.ZeroHash && result[i] == nil {
				if currentCommit == nil {
					var err error
					currentCommit, err = commitIndex.Commit(current)
					if err != nil {
						return err
					}
				}
				result[i] = currentCommit
				remainingResults--
			}
		}

		if remainingResults == 0 {
			return storer.ErrStop
		}

		currentEntryHashes = newEntryHashes
		return nil
	})

	return result, nil
}

filipnavara · 2018-09-26T15:14:07Z

I'm sharing my proof of concept implementation in hope that at some point a proper API could be implemented. It could be useful as a base for code reading/writing the commit-graph files, or perhaps it could influence design decisions for future go-git versions (such as moving over to interfaces instead of structs).

filipnavara · 2018-09-27T10:34:05Z

I added few more things to my branch:

Experimental implementation of bloom filters for filtering paths (idea taken from https://blogs.msdn.microsoft.com/devops/2018/07/16/super-charging-the-git-commit-graph-iv-bloom-filters/)
Example of listing a directory tree from a specific revision along with last commit for each entry
Example of building the commit graph using go-git, the output is verifiable using git commit-graph verify

smola added the enhancement label Sep 27, 2018

filipnavara mentioned this issue Sep 27, 2018

Slow page loads with a large repo go-gitea/gitea#491

Closed

This was referenced Apr 21, 2019

Proof of concept: Experimental support for git commit graph files go-gitea/gitea#6701

Closed

plumbing: format/commitgraph, add APIs for reading and writing commit-graph files #1128

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement APIs for Git serialized commit graphs #965

Implement APIs for Git serialized commit graphs #965

filipnavara commented Sep 26, 2018

filipnavara commented Sep 26, 2018

filipnavara commented Sep 26, 2018

filipnavara commented Sep 27, 2018

Implement APIs for Git serialized commit graphs #965

Implement APIs for Git serialized commit graphs #965

Comments

filipnavara commented Sep 26, 2018

filipnavara commented Sep 26, 2018

filipnavara commented Sep 26, 2018

filipnavara commented Sep 27, 2018