Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce path walk API and add 'git pack-objects --path-walk' #28

Draft
wants to merge 13 commits into
base: full-name-windows
Choose a base branch
from

Commits on Sep 19, 2024

  1. path-walk: introduce an object walk by path

    In anticipation of a few planned applications, introduce the most basic form
    of a path-walk API. It currently assumes that there are no UNINTERESTING
    objects, and does not include any complicated filters. It calls a function
    pointer on groups of tree and blob objects as grouped by path. This only
    includes objects the first time they are discovered, so an object that
    appears at multiple paths will not be included in two batches.
    
    There are many future adaptations that could be made, but they are left for
    future updates when consumers are ready to take advantage of those features.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    f93fde5 View commit details
    Browse the repository at this point in the history
  2. t6601: add helper for testing path-walk API

    Add some tests based on the current behavior, doing interesting checks
    for different sets of branches, ranges, and the --boundary option. This
    sets a baseline for the behavior and we can extend it as new options are
    introduced.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    9f10275 View commit details
    Browse the repository at this point in the history
  3. path-walk: allow consumer to specify object types

    This adds the ability to ask for the commits as a single list. This will
    also reduce the calls in 'git backfill' to be a BUG() statement if called
    with anything other than blobs.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    d61a764 View commit details
    Browse the repository at this point in the history
  4. path-walk: allow visiting tags

    In anticipation of using the path-walk API to analyze tags or include
    them in a pack-file, add the ability to walk the tags that were included
    in the revision walk.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    8eb29c0 View commit details
    Browse the repository at this point in the history
  5. revision: create mark_trees_uninteresting_dense()

    The sparse tree walk algorithm was created in d5d2e93 (revision:
    implement sparse algorithm, 2019-01-16) and involves using the
    mark_trees_uninteresting_sparse() method. This method takes a repository
    and an oidset of tree IDs, some of which have the UNINTERESTING flag and
    some of which do not.
    
    Create a method that has an equivalent set of preconditions but uses a
    "dense" walk (recursively visits all reachable trees, as long as they
    have not previously been marked UNINTERESTING). This is an important
    difference from mark_tree_uninteresting(), which short-circuits if the
    given tree has the UNINTERESTING flag.
    
    A use of this method will be added in a later change, with a condition
    set whether the sparse or dense approach should be used.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    dfcffb5 View commit details
    Browse the repository at this point in the history
  6. path-walk: add prune_all_uninteresting option

    This option causes the path-walk API to act like the sparse tree-walk
    algorithm implemented by mark_trees_uninteresting_sparse() in
    list-objects.c.
    
    Starting from the commits marked as UNINTERESTING, their root trees and
    all objects reachable from those trees are UNINTERSTING, at least as we
    walk path-by-path. When we reach a path where all objects associated
    with that path are marked UNINTERESTING, then do no continue walking the
    children of that path.
    
    We need to be careful to pass the UNINTERESTING flag in a deep way on
    the UNINTERESTING objects before we start the path-walk, or else the
    depth-first search for the path-walk API may accidentally report some
    objects as interesting.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    50f6e7f View commit details
    Browse the repository at this point in the history
  7. pack-objects: add --path-walk option

    In order to more easily compute delta bases among objects that appear at the
    exact same path, add a --path-walk option to 'git pack-objects'.
    
    This option will use the path-walk API instead of the object walk given by
    the revision machinery. Since objects will be provided in batches
    representing a common path, those objects can be tested for delta bases
    immediately instead of waiting for a sort of the full object list by
    name-hash. This has multiple benefits, including avoiding collisions by
    name-hash.
    
    The objects marked as UNINTERESTING are included in these batches, so we
    are guaranteeing some locality to find good delta bases.
    
    After the individual passes are done on a per-path basis, the default
    name-hash is used to find other opportunistic delta bases that did not
    match exactly by the full path name.
    
    RFC TODO: It is important to note that this option is inherently
    incompatible with using a bitmap index. This walk probably also does not
    work with other advanced features, such as delta islands.
    
    Getting ahead of myself, this option compares well with --full-name-hash
    when the packfile is large enough, but also performs at least as well as
    the default in all cases that I've seen.
    
    RFC TODO: this should probably be recording the batch locations to another
    list so they could be processed in a second phase using threads.
    
    RFC TODO: list some examples of how this outperforms previous pack-objects
    strategies. (This is coming in later commits that include performance
    test changes.)
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    2199a8b View commit details
    Browse the repository at this point in the history
  8. pack-objects: introduce GIT_TEST_PACK_PATH_WALK

    There are many tests that validate whether 'git pack-objects' works as
    expected. Instead of duplicating these tests, add a new test environment
    variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
    when specified.
    
    This was useful in testing the implementation of the --path-walk
    implementation, especially in conjunction with test such as:
    
     - t0411-clone-from-partial.sh : One test fetches from a repo that does
       not have the boundary objects. This causes the path-based walk to
       fail. Disable the variable for this test.
    
     - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
       without a boundary object.
    
     - t5310-pack-bitmaps.sh : One test compares the case when packing with
       bitmaps to the case when packing without them. Since we disable the
       test variable when writing bitmaps, this causes a difference in the
       object list (the --path-walk option adds an extra object). Specify
       --no-path-walk in both processes for the comparison. Another test
       checks for a specific delta base, but when computing dynamically
       without using bitmaps, the base object it too small to be considered
       in the delta calculations so no base is used.
    
     - t5316-pack-delta-depth.sh : This script cares about certain delta
       choices and their chain lengths. The --path-walk option changes how
       these chains are selected, and thus changes the results of this test.
    
     - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
       the --sparse option and how it combines with --path-walk.
    
     - t5332-multi-pack-reuse.sh : This test verifies that the preferred
       pack is used for delta reuse when possible. The --path-walk option is
       not currently aware of the preferred pack at all, so finds a
       different delta base.
    
     - t7406-submodule-update.sh : When using the variable, the --depth
       option collides with the --path-walk feature, resulting in a warning
       message. Disable the variable so this warning does not appear.
    
    I want to call out one specific test change that is only temporary:
    
     - t5530-upload-pack-error.sh : One test cares specifically about an
       "unable to read" error message. Since the current implementation
       performs delta calculations within the path-walk API callback, a
       different "unable to get size" error message appears. When this
       is changed in a future refactoring, this test change can be reverted.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    9f6c681 View commit details
    Browse the repository at this point in the history
  9. repack: add --path-walk option

    Since 'git pack-objects' supports a --path-walk option, allow passing it
    through in 'git repack'. This presents interesting testing opportunities for
    comparing the different repacking strategies against each other.
    
    Add the --path-walk option to the performance tests in p5313.
    
    For the microsoft/fluentui repo [1] checked out at a specific commit [2],
    the results are very interesting:
    
    Test                                           this tree
    ------------------------------------------------------------------
    5313.2: thin pack                              0.40(0.47+0.04)
    5313.3: thin pack size                                    1.2M
    5313.4: thin pack with --full-name-hash        0.09(0.10+0.04)
    5313.5: thin pack size with --full-name-hash             22.8K
    5313.6: thin pack with --path-walk             0.08(0.06+0.02)
    5313.7: thin pack size with --path-walk                  20.8K
    5313.8: big pack                               2.16(8.43+0.23)
    5313.9: big pack size                                    17.7M
    5313.10: big pack with --full-name-hash        1.42(3.06+0.21)
    5313.11: big pack size with --full-name-hash             18.0M
    5313.12: big pack with --path-walk             2.21(8.39+0.24)
    5313.13: big pack size with --path-walk                  17.8M
    5313.14: repack                                98.05(662.37+2.64)
    5313.15: repack size                                    449.1K
    5313.16: repack with --full-name-hash          33.95(129.44+2.63)
    5313.17: repack size with --full-name-hash              182.9K
    5313.18: repack with --path-walk               106.21(121.58+0.82)
    5313.19: repack size with --path-walk                   159.6K
    
    [1] https://github.com/microsoft/fluentui
    [2] e70848ebac1cd720875bccaa3026f4a9ed700e08
    
    This repo suffers from having a lot of paths that collide in the name
    hash, so examining them in groups by path leads to better deltas. Also,
    in this case, the single-threaded implementation is competitive with the
    full repack. This is saving time diffing files that have significant
    differences from each other.
    
    A similar, but private, repo has even more extremes in the thin packs:
    
    Test                                           this tree
    --------------------------------------------------------------
    5313.2: thin pack                              2.39(2.91+0.10)
    5313.3: thin pack size                                    4.5M
    5313.4: thin pack with --full-name-hash        0.29(0.47+0.12)
    5313.5: thin pack size with --full-name-hash             15.5K
    5313.6: thin pack with --path-walk             0.35(0.31+0.04)
    5313.7: thin pack size with --path-walk                  14.2K
    
    Notice, however, that while the --full-name-hash version is working
    quite well in these cases for the thin pack, it does poorly for some
    other standard cases, such as this test on the Linux kernel repository:
    
    Test                                           this tree
    --------------------------------------------------------------
    5313.2: thin pack                              0.01(0.00+0.00)
    5313.3: thin pack size                                     310
    5313.4: thin pack with --full-name-hash        0.00(0.00+0.00)
    5313.5: thin pack size with --full-name-hash              1.4K
    5313.6: thin pack with --path-walk             0.00(0.00+0.00)
    5313.7: thin pack size with --path-walk                    310
    
    Here, the --full-name-hash option does much worse than the default name
    hash, but the path-walk option does exactly as well.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    d1d4ff9 View commit details
    Browse the repository at this point in the history
  10. pack-objects: enable --path-walk via config

    Users may want to enable the --path-walk option for 'git pack-objects' by
    default, especially underneath commands like 'git push' or 'git repack'.
    
    This should be limited to client repositories, since the --path-walk option
    disables bitmap walks, so would be bad to include in Git servers when
    serving fetches and clones. There is potential that it may be helpful to
    consider when repacking the repository, to take advantage of improved deltas
    across historical versions of the same files.
    
    Much like how "pack.useSparse" was introduced and included in
    "feature.experimental" before being enabled by default, use the repository
    settings infrastructure to make the new "pack.usePathWalk" config enabled by
    "feature.experimental" and "feature.manyFiles".
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    8beae74 View commit details
    Browse the repository at this point in the history
  11. scalar: enable path-walk during push via config

    Repositories registered with Scalar are expected to be client-only
    repositories that are rather large. This means that they are more likely to
    be good candidates for using the --path-walk option when running 'git
    pack-objects', especially under the hood of 'git push'. Enable this config
    in Scalar repositories.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    98d3f3f View commit details
    Browse the repository at this point in the history
  12. pack-objects: refactor path-walk delta phase

    Previously, the --path-walk option to 'git pack-objects' would compute
    deltas inline with the path-walk logic. This would make the progress
    indicator look like it is taking a long time to enumerate objects, and
    then very quickly computed deltas.
    
    Instead of computing deltas on each region of objects organized by tree,
    store a list of regions corresponding to these groups. These can later
    be pulled from the list for delta compression before doing the "global"
    delta search.
    
    This presents a new progress indicator that can be used in tests to
    verify that this stage is happening.
    
    The current implementation is not integrated with threads, but could be
    done in a future update.
    
    Since we do not attempt to sort objects by size until after exploring
    all trees, we can remove the previous change to t5530 due to a different
    error message appearing first.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    25cacdc View commit details
    Browse the repository at this point in the history
  13. pack-objects: thread the path-based compression

    Adapting the implementation of ll_find_deltas(), create a threaded
    version of the --path-walk compression step in 'git pack-objects'.
    
    This involves adding a 'regions' member to the thread_params struct,
    allowing each thread to own a section of paths. We can simplify the way
    jobs are split because there is no value in extending the batch based on
    name-hash the way sections of the object entry array are attempted to be
    grouped. We re-use the 'list_size' and 'remaining' items for the purpose
    of borrowing work in progress from other "victim" threads when a thread
    has finished its batch of work more quickly.
    
    Using the Git repository as a test repo, the p5313 performance test
    shows that the resulting size of the repo is the same, but the threaded
    implementation gives gains of varying degrees depending on the number of
    objects being packed. (This was tested on a 16-core machine.)
    
    Test                                    HEAD~1    HEAD
    -------------------------------------------------------------
    5313.6: thin pack with --path-walk        0.01    0.01  +0.0%
    5313.7: thin pack size with --path-walk    475     475  +0.0%
    5313.12: big pack with --path-walk        1.99    1.87  -6.0%
    5313.13: big pack size with --path-walk  14.4M   14.3M  -0.4%
    5313.18: repack with --path-walk         98.14   41.46 -57.8%
    5313.19: repack size with --path-walk   197.2M  197.3M  +0.0%
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 19, 2024
    Configuration menu
    Copy the full SHA
    4059aca View commit details
    Browse the repository at this point in the history