Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Asset Paths #1312

Merged
merged 28 commits into from
Oct 19, 2022
Merged

Optimize Asset Paths #1312

merged 28 commits into from
Oct 19, 2022

Conversation

jjnesbitt
Copy link
Member

@jjnesbitt jjnesbitt commented Oct 4, 2022

Closes #652

This is a performance optimization to how we handle asset paths. This implementation greatly improves read times for the asset paths endpoint, as simplifies the UI FileBrowser component.

Some notes:

  • This PR is messier than I'd like due to us not having a proper asset service layer yet. I expect that will come soon, and that when it does, the asset paths service layer will become entirely internal to that.
  • Since this PR adds restrictions to the Asset path field, any offending assets in production will need to be dealt with (likely manually) before this is merged/deployed.
  • After deploy, the ingest_asset_paths management command will need to be run, which means there will be a small period of time where the file browser is not functional.

Copy link
Member

@mvandenburgh mvandenburgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome!

How will we migrate the existing data to this new model? Would someone run the ingest_asset_paths management command manually?

dandiapi/api/models/asset_paths.py Outdated Show resolved Hide resolved
dandiapi/api/models/asset_paths.py Outdated Show resolved Hide resolved
dandiapi/api/views/asset.py Outdated Show resolved Hide resolved
Comment on lines +160 to +164
models.CheckConstraint(name='asset_path_regex', check=Q(path__regex=ASSET_PATH_REGEX)),
models.CheckConstraint(
name='asset_path_no_leading_slash', check=~Q(path__startswith='/')
),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done as two check constraints as it was difficult to create a working/efficient regex that covered both.

@jjnesbitt
Copy link
Member Author

How will we migrate the existing data to this new model? Would someone run the ingest_asset_paths management command manually?

Yes that's correct.

@mvandenburgh
Copy link
Member

Now that #1006 is merged, the migrations here will need to be recreated.

@jjnesbitt jjnesbitt force-pushed the asset-paths-design branch 3 times, most recently from c55e06b to d7bd079 Compare October 11, 2022 19:33
@yarikoptic
Copy link
Member

FWIW, as I was one among those who "questioned" the efficiency of a "flat list of paths" representation when it was introduced in original DANDI design, and performance improvements are always welcome but I think the cons of such approach should also be considered. It

  • raises the implementation complexity. More models: AssetPath, AssetPathRelation; necessity to make sure to call into the AssetPath datastructure operations for anything which might be changing path of any asset.
  • adds information duplication (having "path" field and then AssetPath* structures to represent any particular path in the tree)
    • would need adding of some fsck across the models to ensure consistency. Is this PR comes with some kind of such check function/end point which could trigger such checking? How long would it take to fsck the entire archive ATM
  • adds performance penalty for some operations, e.g. renaming/moving paths into another folder might require O(depth) operations (adjust all parents in the tree for size, [ref])https://github.com/dandi/dandi-archive/pull/1312/files#diff-8b69af35cb401375428f41693ecb95e9f791080123167e6faef32023e0928b4eR52)) and thus possibly started to ask for locking to avoid race conditions etc. Do we have guestimate of the impact from such RF on dandi move in some typical case?

An alternative, and generic, could be - caching of views based on the last-modified timestamp of the dandiset version (I thought I filed an issue but I failed to find it), on which we rely anyways already in backups and have check introduced which would alert us if there are changes without timestamp boost.

@waxlamp
Copy link
Member

waxlamp commented Oct 17, 2022

  • raises the implementation complexity. More models: AssetPath, AssetPathRelation; necessity to make sure to call into the AssetPath datastructure operations for anything which might be changing path of any asset.

It does indeed cost some complexity to gain better correctness and performance for the file browsing functionality. However, this is a case where we actually failed to model properly this corner of the system. In that sense, the new models are required to capture the domain logic here.

I'm not sure what you mean exactly by the need to call into AssetPath, but we can take steps as needed to make this part a bit easier. For example, if we identify which things have a need to "change path of any asset" (i.e., dandi move?) we can expose high-level API to enable those things without having to recompose complex operations in each place it's needed.

  • adds information duplication (having "path" field and then AssetPath* structures to represent any particular path in the tree)

    • would need adding of some fsck across the models to ensure consistency. Is this PR comes with some kind of such check function/end point which could trigger such checking? How long would it take to fsck the entire archive ATM

We are indeed denormalizing the path data in the name of maintaining performance. However, we are not particularly worried about loss of consistency, since the system doesn't modify the existing path data (assets must be copied first in order to change their paths, or else they need to be deleted and re-added, etc.). If we add an "fsck" type of script, it would be a maintenance script we could run to help us debug in case we see something going wrong, rather than something that would be exposed via the API.

The design is intended to accelerate the common case of reading path information, at a slight expense for deletion and addition operations. But the complexity is not O(depth), but rather O(1). The necessary updates to the database happen in about two or three bulk queries. I expect the impact on something like dandi move should be negligible.

Furthermore, we should talk about exactly what kind of thing dandi move enables. If there is a safer, faster way to support that operation, we should look into it.

An alternative, and generic, could be - caching of views based on the last-modified timestamp of the dandiset version (I thought I filed an issue but I failed to find it), on which we rely anyways already in backups and have check introduced which would alert us if there are changes without timestamp boost.

This approach carries a lot of risk (as you know, cache invalidation is one of the two hard things in computer science). It is hard to guess at the performance boost this approach would give us, measured against the risk of incorrect updates, etc. On top of that, we are able to solve other problems with our current approach with an actual model of file paths (efficient updates of file sizes and file counts; pagination).

Jake's PR brings much more good than harm, and I think we should (after final reviews) deploy it to staging, play around with it, and gain some more confidence that it's working.

@jwodder
Copy link
Member

jwodder commented Oct 19, 2022

@AlmightyYakob Does this resolve #1109?

@jjnesbitt
Copy link
Member Author

@AlmightyYakob Does this resolve #1109?

No, but it would be trivial to address with these new models. Since there's been so much churn on this PR, I'd like to avoid fixing it here and do a follow up afterwards.

@jjnesbitt jjnesbitt merged commit 1ab45a0 into master Oct 19, 2022
@jjnesbitt jjnesbitt deleted the asset-paths-design branch October 19, 2022 17:43
@dandibot
Copy link
Member

dandibot commented Nov 7, 2022

🚀 PR was released in v0.3.0 🚀

@dandibot dandibot added the released This issue/pull request has been released. label Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
released This issue/pull request has been released.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

directory depth error
7 participants