-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] JS Client Refactor Part 1 - Refactoring the Client #2384
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
clients/js/src/types.ts
Outdated
nResults?: PositiveInteger; | ||
where?: Where; | ||
queryTexts?: string | string[]; | ||
query: DocQuery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having queryTexts and queryEmbeddings separately, I'm thinking we can just have this single type DocQuery
which can be a string, which will be converted to an embedding, according to the configured embedding function in the collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessarily opposed to this, but wondering how much we want to try to keep Python/JS interfaces roughly in sync
@atroyn wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this PR, we want to keep them in sync. This might be a bit janky for now but I would prefer that we review the interface as a whole between both clients rather than changing it piecemeal in each.
Can we create a new issue to discuss this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atroyn does that apply to the pluralization logic as well? I've added convenience overrides for singular document operations e.g:
const resp = getRecords(collection, {
id: "blah"
});
console.log(resp); // prints "{id: "blah", embedding: [1, 2, 3], document: "whatever" }"
be9c162
to
42feaa1
Compare
clients/js/src/ChromaClient.ts
Outdated
@@ -78,10 +104,12 @@ export class ChromaClient { | |||
database: database, | |||
}); | |||
|
|||
// TODO: Validate tenant and database on client creation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I've approached this in similar situations is to have a shared init promise that all methods await at the beginning. Since all of the methods on this class are async anyway, we just asynchronously validate that it's been initted successfully when the user calls any of the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, wondering if it's worth wrapping the class with a Proxy so we don't have to manually add it to each public method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that is a good callout. I think there's a high likelihood that this will get dropped in the future. One thing that the current approach allows is to put certain validations before the await initPromise;
e.g.
if (typeof params.ids === "string") {
throw new Error(
"To add a single record, use the singular field names (id, embedding, metadata, document) instead of the plural field names (ids, embeddings, metadatas, documents)",
);
}
await this.initPromise;
Also, ideally, this would only apply to async functions. Do you know if there is a way to detect that in a proxy and only wrap the async functions? Obviously we only have async functions today but I would worry about accidentally making a synchronous function async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But yeah not opposed to this. My gut reaction would be to hold off on this, but happy to do this in this PR if you think that's the better trade-off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave as-is for now, easy to adjust later
29bd707
to
32596a4
Compare
Adding @codetheweb for review; will cover this this week! |
clients/js/src/types.ts
Outdated
nResults?: PositiveInteger; | ||
where?: Where; | ||
queryTexts?: string | string[]; | ||
query: DocQuery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessarily opposed to this, but wondering how much we want to try to keep Python/JS interfaces roughly in sync
@atroyn wdyt
clients/js/src/ChromaClient.ts
Outdated
@@ -78,10 +104,12 @@ export class ChromaClient { | |||
database: database, | |||
}); | |||
|
|||
// TODO: Validate tenant and database on client creation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, wondering if it's worth wrapping the class with a Proxy so we don't have to manually add it to each public method?
* ``` | ||
*/ | ||
async addRecords( | ||
collection: Collection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most (all?) methods that deal with documents/records only need the collection's ID, so maybe we could update the typing to reflect that?
collection: Collection, | |
collection: Pick<Collection, "id">, |
then people can avoid an API call to fetch full collection details if they really want to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the one thing that we do regularly need is the embedding function so I think it'll beneficial for us to keep Collection as a kind of opaque object that they have to get from the API. My understanding is that there's work underway for the backend to store some representation of embedding function alongside the collection which I think will make this opaque Collection object make more sense.
@atroyn in case I've misunderstood anything here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
personally, as a JS dev, I would assume any object returned from an idiomatic API client library is trivially serializable/de-serializable (JSON.[stringify|parse]
)
as long as we can hold that assumption that sounds ok to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, if every single method only needs the collection ID, I do feel it would be more idiomatic to have methods accept async addRecords(collection_id: string...)
instead (and store any necessary collection metadata on the client object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will be able to hold that assumption with EF persist in the works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlabasterAxe what do you think about using the collection ID for these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so if I'm understanding what you proposed above, you're saying that the client would maintain a mapping from collection ids to embedding functions and then will just look that up whenever the user supplies a collection id?
One potential issue that could arise is that a user could attempt to call the addRecords
method even if the embedding function had never been configured for a particular collection.
So hypothetically we could have the following situation:
/* Run 1 */
const collection = client.createCollection({name: "test"});
console.log(collection.id) // prints 1234
/* Run 2 */
client.addRecords("1234", {/* record contents */}); // How do we know which embedding function to use
// here if the client has not gotten the collection in this
// run of the script?
I'll defer to @atroyn here: Should we continue accepting collection objects in the record methods? or should we move to accepting collection ids in the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussed with @atroyn and we settled on keeping the methods accepting collection objects for now
* ``` | ||
*/ | ||
async addRecords( | ||
collection: Collection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
personally, as a JS dev, I would assume any object returned from an idiomatic API client library is trivially serializable/de-serializable (JSON.[stringify|parse]
)
as long as we can hold that assumption that sounds ok to me
Initial review pass looks good, @codetheweb is reviewing docs currently. @AlabasterAxe could we please split this up into a stack of PRs for easier in-depth review of each part? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs look pretty good to me 👍
d46e547
to
00eb6be
Compare
@atroyn I don't have access to Chroma's Graphite instance so I did this manually. Let me know if that works for you. Also this PR is in kind of a weird state. I don't have permission to push branches to the chroma repo directly so I started it from my personal chroma fork. That means the subsequent PRs are not considered as being against the chroma repo yet but I think that should be okay because once you all give the go ahead I can merge them in quick succession. Also, it sounds like it may be good to sync about all of the behavior changes this PR includes because it sounds like I may have diverged a bit too much from the Python client. Let me know if you'd like to sync quickly and then I can make the final changes to this PR so it's ready for the final round of reviews. |
I've recorded my thoughts about the various suggested changes to client behavior in these isues: As discussed yesterday these are just for posterity/a jumping off point for larger conversations about client behavior as opposed to proposals that I'm advocating we implement in the near term. I still need to revert the behavior changes as discussed from yesterday though. Will ping on this PR when those changes are in. |
155aaf2
to
b3aac46
Compare
@atroyn Okay, I've made the requested changes so these PRs should be ready for another look. Specifically:
Also, I introduced a some style regressions along the way so I've just added a final formatting PR at the end of the stack. |
* ``` | ||
*/ | ||
async addRecords( | ||
collection: Collection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlabasterAxe what do you think about using the collection ID for these methods?
33da896
to
f3d0df1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validates tenant and database on client startup
let's extract this out into a separate PR, we can merge it before this entire stack is approved since it doesn't change any interfaces
also, a quick PR to export types in package.json
could be merged before this stack is approved as well
otherwise this first PR is looking good! (outside of failing checks)
clients/js/src/ChromaClient.ts
Outdated
collection: Collection, | ||
params: MultiAddRecordsOperationParams, | ||
): Promise<AddResponse>; | ||
async addRecords( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this signature (and others like it below) need to be overloaded since the response shape doesn't change?
05cc9f3
to
257267b
Compare
ff2e326
to
cc2c834
Compare
cc2c834
to
8e93147
Compare
@codetheweb validation PR here: #2537 This PR is rebased on top of that PR but I'm still using the chroma repo main as its base for this PR. |
8e93147
to
35b1372
Compare
ac9cb1c
to
cc56f82
Compare
498f591
to
dc03d67
Compare
I've propagated all of the PR feedback through the stack and put together the "Full stack" PR here: #2542 All PR checks are passing there so I'm thinking that as long as you approve of each of the PRs in the stack, we should be good to merge the Full Stack PR (even though e.g. this PR is failing checks) WDYT? |
collection: Collection, | ||
params: AddRecordsParams, | ||
): Promise<AddResponse> { | ||
if (typeof params.ids === "string") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: This check has been removed in a subsequent stack (here and below). Not changing here to avoid further PR churn.
45d6f37
to
f858eff
Compare
Description of changes
The changes in this PR are described in detail in the "Phase 1" section of this doc: https://docs.google.com/document/d/1DH9Gl5UQeuDr6vyGgoYUw0TkBxITgXIutZSU-ZYdP_E/edit#heading=h.wnaf3x8gqzzl
The TL;DR is that this PR consists of the following changes:
This PR corresponds with this issue: #2333
This is the first PR in a series that comprise the whole client refactor:
Test plan
Documentation Changes