Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: stop reading latest manifest #1365

Merged
merged 3 commits into from
Oct 5, 2023
Merged

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Oct 5, 2023

As a quick fix to address safety/consistency issues, we will find the latest manifest by scanning the _versions directory.

We continue to write the _latest.manifest file, so older readers are still compatible.

Performance considerations: in object stores like S3, we can get 1,000 objects listed with 1 request. So right now this adds 1 IO for those tables. Tables with more versions should probably have cleanup_old_versions applied.

Fixes: #1356

We will optimize this in follow ups, as described in #1362

@wjones127 wjones127 marked this pull request as ready for review October 5, 2023 19:27
// That avoids a HEAD request later.
let mut manifest_files = object_store
.inner
.list(Some(&base.child(VERSIONS_DIR)))
Copy link
Contributor

@chebbyChefNEQ chebbyChefNEQ Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might take a long time if the dataset has a lot of versions.

Maybe we could try something like this?

  1. read _latest.manifest
  2. use the version in _latest.manifest as the start
  3. if we hit a version like xxxx9999, list from 10000000 and keep going until we find the latest manifest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, i realized reading _latest.manifest is a no starter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we can't list from because we aren't zero-padding.

But also see my comment about performance in description.

@wjones127 wjones127 merged commit 299aab1 into main Oct 5, 2023
15 checks passed
@wjones127 wjones127 deleted the wjones127/no-latest-manifest branch October 5, 2023 21:33
wjones127 added a commit that referenced this pull request Aug 26, 2024
BREAKING CHANGE: this makes tables written with this and future versions
unreadable by Lance v0.8.3 and earlier.

We stopped reading this way back in October of last year (#1365).
Continuing to write this can actually cause us to hit per-object
mutation rate limits in GCS, so we have some reason to stop doing this
now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assertion failed with many concurrent writers on GCS
2 participants