Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: create standalone IVF training API in python #2553

Merged
merged 6 commits into from
Jul 3, 2024

Conversation

westonpace
Copy link
Contributor

This is the start of breaking IVF / PQ index training up into four standalone methods with check pointing in between each method.

f"Vector column {c} must be FixedSizeListArray "
f"1-dimensional FixedShapeTensorArray, got {field.type}"
)
if not pa.types.is_floating(field.type.value_type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can support uint8 as well

@codecov-commenter
Copy link

codecov-commenter commented Jul 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.01%. Comparing base (8c69571) to head (f9e7a41).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2553      +/-   ##
==========================================
+ Coverage   79.95%   80.01%   +0.05%     
==========================================
  Files         207      209       +2     
  Lines       59500    59821     +321     
  Branches    59500    59821     +321     
==========================================
+ Hits        47576    47865     +289     
- Misses       9122     9131       +9     
- Partials     2802     2825      +23     
Flag Coverage Δ
unittests 80.01% <ø> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work here :)

self.distance_type = distance_type
"""The distance type used to train the IVF model"""

def num_partitions(self) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should be a property:

Suggested change
def num_partitions(self) -> int:
@property
def num_partitions(self) -> int:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to a property


K-means is an iterative algorithm that can be computationally expensive. The
accelerator argument can be used to offload the computation to a hardware
accelerator such as a GPU or TPU.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a progress bar for this? Should we mention the LANCE_LOG variable here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a tqdm progress bar if an accelerator is used. I started to add one if an accelerator is not used but it ended up being a fair amount of code. Will open a different PR.

from .dependencies import torch


class IvfModel:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary for now, but it would be very cool to be able to get this from an existing index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@westonpace
Copy link
Contributor Author

I'm going to merge this on green. I have a few PRs built on top of this one already.

@westonpace westonpace merged commit bc4d863 into lancedb:main Jul 3, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants