Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse-checkout builtin: upstreamable version #180

Conversation

derrickstolee
Copy link
Collaborator

@derrickstolee derrickstolee commented Aug 20, 2019

I'm creating this PR for a fourth time (see #163, #171, and #178 for earlier versions). This version is tracking my progress to create something that can be sent as an RFC upstream, but also can be used to start the sparse feature branch. This is based on v2.23.0.

Note: I currently have conflicts with the virtual filesystem feature, and I'll resolve those with a merge commit when I'm ready. I'm just creating this for tracking progress at the moment, but can also be a place for early feedback.

TODOs:

  • git sparse-checkout disable to disable the sparse-checkout feature and return to a full checkout.
  • git sparse-checkout init --cone, to initialize in cone mode.
  • git sparse-checkout add when core.sparseCheckout=cone.

This includes performance improvements with the hashset-based matching algorithm, but I'll leave further enhancements as smaller steps on top. Here are a few things I want to try:

  1. Track the maximum depth of a prefix pattern, so we know to not run hashes for deeper paths.
  2. Use the "known exclude" bit in cone mode to stop hashing paths we know will not be included.
  3. Use the "known include" bit in cone mode to stop hashing paths we know will be included. This is more difficult than "known exclude" because we need to distinguish directories from files when doing path matches so we don't give a directory a "known include" when it isn't a recursive pattern match.

The sparse-checkout feature is mostly hidden to users, as its
only documentation is supplementary information in the docs for
'git read-tree'. In addition, users need to know how to edit the
.git/info/sparse-checkout file with the right patterns, then run
the appropriate 'git read-tree -mu HEAD' command. Keeping the
working directory in sync with the sparse-checkout file requires
care.

Begin an effort to make the sparse-checkout feature a porcelain
feature by creating a new 'git sparse-checkout' builtin. This
builtin will be the preferred mechanism for manipulating the
sparse-checkout file and syncing the working directory.

For now, create the basics of the builtin. Includes a single
subcommand, "git sparse-checkout list", that lists the patterns
currently in the sparse-checkout file. Test that these patterns
are parsed and written correctly to the output.

The documentation provided is adapted from the "git read-tree"
documentation with a few edits for clarity in the new context.
Extra sections are added to hint toward a future change to
a moer restricted pattern set.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Getting started with a sparse-checkout file can be daunting. Help
users start their sparse enlistment using 'git sparse-checkout init'.
This will set 'core.sparseCheckout=true' in their config, write
an initial set of patterns to the sparse-checkout file, and update
their working directory.

Using 'git read-tree' to clear directories does not work cleanly
on Windows, so manually delete directories that are tracked by Git
before running read-tree.

The use of running another process for 'git read-tree' is likely
suboptimal, but that can be improved in a later change, if valuable.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When someone wants to clone a large repository, but plans to work
using a sparse-checkout file, they either need to do a full
checkout first and then reduce the patterns they included, or
clone with --no-checkout, set up their patterns, and then run
a checkout manually. This requires knowing a lot about the repo
shape and how sparse-checkout works.

Add a new '--sparse' option to 'git clone' that initializes the
sparse-checkout file to include the following patterns:

	/*
	!/*/*

These patterns include every file in the root directory, but
no directories. This allows a repo to include files like a
README or a bootstrapping script to grow enlistments from that
point.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The 'git sparse-checkout add' subcommand takes a list of patterns
over stdin and writes them to the sparse-checkout file. Then, it
updates the working directory using 'git read-tree -mu HEAD'.

Note: if a user adds a negative pattern that would lead to the
removal of a non-empty directory, then Git may not delete that
directory (on Windows).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee
Copy link
Collaborator Author

/azp run Microsoft.git

@azure-pipelines
Copy link

Pull request contains merge conflicts.

derrickstolee and others added 5 commits August 20, 2019 09:56
The instructions for disabling a sparse-checkout to a full
working directory are complicated and non-intuitive. Add a
subcommand, 'git sparse-checkout disable', to perform those
steps for the user.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The clear_ce_flags_1 method is used by many types of calls to
unpack_trees(). Add trace2 regions around the method, including
some flag information, so we can get granular performance data
during experiments.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The sparse-checkout feature can have quadratic performance as
the number of patterns and number of entries in the index grow.
If there are 1,000 patterns and 1,000,000 entries, this time can
be very significant.

Create a new 'cone' mode for the core.sparseCheckout config
option, and adjust the parser to set an appropriate enum value.

While adjusting the type of this variable, rename it from
core_apply_sparse_checkout to core_sparse_checkout. This will
help avoid parallel changes from hitting type issues, and we
can guarantee that all uses now consider the enum values instead
of the int value.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The parent and recursive patterns allowed by the "cone mode"
option in sparse-checkout are restrictive enough that we
can avoid using the regex parsing. Everything is based on
prefix matches, so we can use hashsets to store the prefixes
from the sparse-checkout file. When checking a path, we can
strip path entries from the path and check the hashset for
an exact match.

As a test, I created a cone-mode sparse-checkout file for the
Linux repository that actually includes every file. This was
constructed by taking every folder in the Linux repo and creating
the pattern pairs here:

	/$folder/*
	!/$folder/*/*

This resulted in a sparse-checkout file sith 8,296 patterns.
Running 'git read-tree -mu HEAD' on this file had the following
performance:

	core.sparseCheckout=false: 0.21 s (0.00 s)
	 core.sparseCheckout=true: 3.75 s (3.50 s)
	 core.sparseCheckout=cone: 0.23 s (0.01 s)

The times in parentheses above correspond to the time spent
in the first clear_ce_flags() call, according to the trace2
performance traces.

While this example is contrived, it demonstrates how these
patterns can slow the sparse-checkout feature.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
To make the cone pattern set easy to use, update the behavior of
'git sparse-checkout [init|add]'.

Add '--cone' flag to 'git sparse-checkout init' to set the config
option 'core.sparseCheckout=cone'.

When running 'git sparse-checkout add' in cone mode, a user only
needs to supply a list of recursive folder matches. Git will
automatically add the necessary parent matches for the leading
directories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee derrickstolee force-pushed the sparse-checkout/v1 branch 2 times, most recently from 9937fc0 to 876024d Compare August 20, 2019 14:33
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee
Copy link
Collaborator Author

/azp run Microsoft.git

@azure-pipelines
Copy link

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@derrickstolee
Copy link
Collaborator Author

/azp run Microsoft.git

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@derrickstolee derrickstolee changed the title [WIP] Sparse-checkout builtin: upstreamable version Sparse-checkout builtin: upstreamable version Aug 20, 2019
@derrickstolee derrickstolee marked this pull request as ready for review August 20, 2019 16:55
@derrickstolee
Copy link
Collaborator Author

/azp run Microsoft.git

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@jrbriggs jrbriggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few naive questions but overall this looks good to me.

`git sparse-checkout disable` command.

Sparse checkout support in 'git read-tree' and similar commands is
disabled by default. You need to set `core.sparseCheckout` to `true`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to set core.sparseCheckout to true in order to have sparse checkout support

Does this mean when core.sparseCheckout=cone that git read-tree does the wrong thing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! I just didn't update this documentation in the commits adding the cone setting. I should fix that.

{
struct argv_array argv = ARGV_ARRAY_INIT;
int result = 0;
argv_array_pushl(&argv, "read-tree", "-m", "-u", "HEAD", NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this pass through any progress on stdout? Should it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git read-tree does not have progress indicators. I could add some. (It would go through stderr if so.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaulting to progress and maybe having a quiet mode seems good from an experience perspective. I could see a user adding a very large cone and being frustrated if they don't get feedback on what's happening like checkout provides.

We can address this in a followup.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (line.buf[0] == '/')
strbuf_remove(&line, 0, 1);

if (!line.len)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract strbuf_trim_leading_dir_sep so we can be a bit more explicit about our intent?

if (!(x->flags & EXC_FLAG_NEGATIVE)) {
/* Not a cone pattern. */
el->use_cone_patterns = 0;
warning(_("unrecognized pattern: '%s'"), x->pattern);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this user-facing? If it is, should we indicate something like unrecognized sparse-checkout pattern so they know where to look?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe also indicate that it might be valid for sparse=true but not sparse=cone.

return;
}

if (x->patternlen >= 2 &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a couple of similar patterns here of trying to, I think, effectively ignore the recursive wildcard(s) on the end of the entries. Should we extract something to make that easier to parse for a first-time reader?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could likely be split into several methods and be better for it.

config.c Show resolved Hide resolved

Then it compares the new skip-worktree value with the previous one. If
skip-worktree turns from set to unset, it will add the corresponding
file back. If it turns from unset to set, that file will be removed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 paragraphs are a bit muddy. But I don't want to bog us down wordsmith-ing right now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, many of them are copied directory from Documentation/git-read-tree.txt)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was afraid of that....

@@ -15,7 +15,7 @@ SYNOPSIS
[--dissociate] [--separate-git-dir <git dir>]
[--depth <depth>] [--[no-]single-branch] [--no-tags]
[--recurse-submodules[=<pathspec>]] [--[no-]shallow-submodules]
[--[no-]remote-submodules] [--jobs <n>] [--] <repository>
[--[no-]remote-submodules] [--jobs <n>] [--sparse] [--] <repository>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I can't comment on the commit message, I'll park this here.

On the "clone: add --sparse mode" commit:
Where you mention "README" you might also mention files like .gitignore
and friends.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And .gitattributes!

Enable "sparse checkout" feature. If "false", then sparse-checkout
is disabled. If "true", then sparse-checkout is enabled with the full
.gitignore pattern set. If "cone", then sparse-checkout is enabled with
a restricted pattern set. See linkgit:git-sparse-checkout[1] for more

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the term "... pattern set" threw me a bit here. I guess I'm wondering if "set" is a noun or verb here.

Would it be better to say something about the pattern matching mode. As in: If "true", then sparse-checkout is enabled using the rules for .gitignore-style pattern matching. If "cone, then it is enabled using the rules for cone-style matching.

It seems like this config setting is just changing matching algorithm.
Whereas, saying "... pattern set" implies that we have 2 cached "sets" and are deciding which to install.

@@ -86,6 +86,56 @@ negate patterns. For example, to remove the file `unwanted`:
----------------


## CONE PATTERN SET

The full pattern set allows for arbitrary pattern matches and complicated

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thrown by the intro for the "cone" method talking about the "full" set.

Perhaps a L2 header for "Full/.gitignore algorithm" and then a L2 for the "Cone algorithm".

@derrickstolee
Copy link
Collaborator Author

Merging without responding to comments because they may dramatically change based on community feedback. After feedback from the community, I will revert these commits and apply the new version to the feature branch as we go. It is important to unblock work.

@derrickstolee derrickstolee merged commit 8d3da92 into microsoft:features/sparse-checkout-2.23.0 Aug 22, 2019
derrickstolee added a commit that referenced this pull request Aug 22, 2019
In #183, I reverted a change to builtin/checkout.c that slowed
down 'git checkout -b'. That change introduced a new instance of
the core_apply_sparse_checkout global value that was not in the
previous version. I then merged #180 without regenerating a
build. My bad.

This fixes that remaining core_apply_sparse_checkout instance.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee added a commit that referenced this pull request Aug 22, 2019
In #183, I reverted a change to builtin/checkout.c that slowed
down 'git checkout -b'. That change introduced a new instance of
the core_apply_sparse_checkout global value that was not in the
previous version. I then merged #180 without regenerating a
build. My bad.

This fixes that remaining core_apply_sparse_checkout instance.
derrickstolee added a commit to microsoft/scalar that referenced this pull request Aug 22, 2019
Uses the code currently at microsoft/git#180. Covers the basics of #8.

* The default `scalar clone` runs `git sparse-checkout init` so the working directory only has files at root.

* Run `git sparse-checkout add <folders.txt` to pipe in a list of folders, and it will expand those files.

Example workflow:

```sh
$ scalar clone https://dev.azure.com/gvfs/ci/_git/ForTests
Clone parameters:
  Repo URL:     https://dev.azure.com/gvfs/ci/_git/ForTests
  Branch:       Default
  Cache Server: Default
  Local Cache:  C:\.scalarCache
  Destination:  C:\_git\test2\ForTests
Authenticating...Succeeded
Querying remote for config...Succeeded
Using cache server: None (https://dev.azure.com/gvfs/ci/_git/ForTests)

WARNING: Unable to validate your Scalar version
Server not configured to provide supported Scalar versions

Cloning...Succeeded
Fetching commits and trees from origin (no cache server)...Succeeded
Validating repo...Succeeded
Mounting...Succeeded

$ cd ForTests/src/
$ ls
AuthoringTests.md  GvFlt_EULA.md  GVFS.sln  License.md  nuget.config  Protocol.md  Readme.md  Settings.StyleCop

$ echo GVFS/GVFS.Common >>../folders.txt
$ echo GVFS/GVFS.UnitTests >>../folders.txt
$ echo GitHooksLoader >>../folders.txt
$ git sparse-checkout add <../folders.txt

$ ls
AuthoringTests.md  GitHooksLoader/  GvFlt_EULA.md  GVFS/  GVFS.sln  License.md  nuget.config  Protocol.md  Readme.md  Settings.StyleCop

$ echo GVFS/GVFS >>../folders2.txt
$ echo GVFS/GVFS.FunctionalTests >>../folders2.txt
$ git sparse-checkout add <../folders2.txt

$ ls
AuthoringTests.md  GitHooksLoader/  GvFlt_EULA.md  GVFS/  GVFS.sln  License.md  nuget.config  Protocol.md  Readme.md  Settings.StyleCop

$ ls GVFS
GVFS/  GVFS.Common/  GVFS.FunctionalTests/  GVFS.UnitTests/  LibGit2Sharp.NativeBinaries.props  ProjectedFSLib.NativeBinaries.props

$ git sparse-checkout list
/
/GVFS/
/GVFS/GVFS/*
/GVFS/GVFS.Common/*
/GVFS/GVFS.FunctionalTests/*
/GVFS/GVFS.UnitTests/*
/GitHooksLoader/*

```

I spun up a few remaining issues for follow-up work: #76, #77, #78.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants