-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: stop reading latest manifest #1365
Conversation
rust/lance/src/io/commit.rs
Outdated
// That avoids a HEAD request later. | ||
let mut manifest_files = object_store | ||
.inner | ||
.list(Some(&base.child(VERSIONS_DIR))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might take a long time if the dataset has a lot of versions.
Maybe we could try something like this?
- read
_latest.manifest
- use the version in
_latest.manifest
as the start - if we hit a version like
xxxx9999
, list from10000000
and keep going until we find the latest manifest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm, i realized reading _latest.manifest
is a no starter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we can't list from because we aren't zero-padding.
But also see my comment about performance in description.
BREAKING CHANGE: this makes tables written with this and future versions unreadable by Lance v0.8.3 and earlier. We stopped reading this way back in October of last year (#1365). Continuing to write this can actually cause us to hit per-object mutation rate limits in GCS, so we have some reason to stop doing this now.
As a quick fix to address safety/consistency issues, we will find the latest manifest by scanning the
_versions
directory.We continue to write the
_latest.manifest
file, so older readers are still compatible.Performance considerations: in object stores like S3, we can get 1,000 objects listed with 1 request. So right now this adds 1 IO for those tables. Tables with more versions should probably have
cleanup_old_versions
applied.Fixes: #1356
We will optimize this in follow ups, as described in #1362