Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] JS Client Refactor Part 1 - Refactoring the Client #2384

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

AlabasterAxe
Copy link
Contributor

@AlabasterAxe AlabasterAxe commented Jun 19, 2024

Description of changes

The changes in this PR are described in detail in the "Phase 1" section of this doc: https://docs.google.com/document/d/1DH9Gl5UQeuDr6vyGgoYUw0TkBxITgXIutZSU-ZYdP_E/edit#heading=h.wnaf3x8gqzzl

The TL;DR is that this PR consists of the following changes:

  • turns Collection into a data object (with the exception of the embedding function) and flattens its methods into the top-level ChromaClient
  • Improves the typing of the methods on the client so that we do a better job of handling single vs multiple cases.
  • validates tenant and database on client startup
  • removes some unnecessary code in the client that was just used to enable certain test assertions

This PR corresponds with this issue: #2333

This is the first PR in a series that comprise the whole client refactor:

Test plan

  • all unit tests are passing

Documentation Changes

  • will need to update docs as this is a breaking change with the previous API

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

nResults?: PositiveInteger;
where?: Where;
queryTexts?: string | string[];
query: DocQuery;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having queryTexts and queryEmbeddings separately, I'm thinking we can just have this single type DocQuery which can be a string, which will be converted to an embedding, according to the configured embedding function in the collection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily opposed to this, but wondering how much we want to try to keep Python/JS interfaces roughly in sync

@atroyn wdyt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for this PR, we want to keep them in sync. This might be a bit janky for now but I would prefer that we review the interface as a whole between both clients rather than changing it piecemeal in each.

Can we create a new issue to discuss this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@AlabasterAxe AlabasterAxe Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atroyn does that apply to the pluralization logic as well? I've added convenience overrides for singular document operations e.g:

const resp = getRecords(collection, {
  id: "blah"
});

console.log(resp); // prints "{id: "blah", embedding: [1, 2, 3], document: "whatever" }"

clients/js/src/types.ts Outdated Show resolved Hide resolved
@AlabasterAxe AlabasterAxe force-pushed the matt/client-refactor branch 3 times, most recently from be9c162 to 42feaa1 Compare June 28, 2024 14:34
@AlabasterAxe AlabasterAxe changed the title WIP Client Refactor [ENH] JavaScript Client Refactor Jun 28, 2024
@AlabasterAxe AlabasterAxe marked this pull request as ready for review June 28, 2024 23:04
@AlabasterAxe AlabasterAxe requested a review from atroyn June 28, 2024 23:04
clients/js/src/AdminClient.ts Show resolved Hide resolved
clients/js/src/AdminClient.ts Show resolved Hide resolved
clients/js/src/ChromaClient.ts Outdated Show resolved Hide resolved
@@ -78,10 +104,12 @@ export class ChromaClient {
database: database,
});

// TODO: Validate tenant and database on client creation
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I've approached this in similar situations is to have a shared init promise that all methods await at the beginning. Since all of the methods on this class are async anyway, we just asynchronously validate that it's been initted successfully when the user calls any of the methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, wondering if it's worth wrapping the class with a Proxy so we don't have to manually add it to each public method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is a good callout. I think there's a high likelihood that this will get dropped in the future. One thing that the current approach allows is to put certain validations before the await initPromise; e.g.

if (typeof params.ids === "string") {
  throw new Error(
    "To add a single record, use the singular field names (id, embedding, metadata, document) instead of the plural field names (ids, embeddings, metadatas, documents)",
  );
}
await this.initPromise;

Also, ideally, this would only apply to async functions. Do you know if there is a way to detect that in a proxy and only wrap the async functions? Obviously we only have async functions today but I would worry about accidentally making a synchronous function async.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah not opposed to this. My gut reaction would be to hold off on this, but happy to do this in this PR if you think that's the better trade-off.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave as-is for now, easy to adjust later

clients/js/test/client.test.ts Outdated Show resolved Hide resolved
clients/js/src/types.ts Outdated Show resolved Hide resolved
clients/js/src/utils.ts Outdated Show resolved Hide resolved
@atroyn
Copy link
Contributor

atroyn commented Jul 7, 2024

Adding @codetheweb for review; will cover this this week!

@codetheweb codetheweb linked an issue Jul 8, 2024 that may be closed by this pull request
clients/js/examples/node/yarn.lock Outdated Show resolved Hide resolved
nResults?: PositiveInteger;
where?: Where;
queryTexts?: string | string[];
query: DocQuery;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily opposed to this, but wondering how much we want to try to keep Python/JS interfaces roughly in sync

@atroyn wdyt

clients/js/src/ChromaClient.ts Outdated Show resolved Hide resolved
@@ -78,10 +104,12 @@ export class ChromaClient {
database: database,
});

// TODO: Validate tenant and database on client creation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, wondering if it's worth wrapping the class with a Proxy so we don't have to manually add it to each public method?

* ```
*/
async addRecords(
collection: Collection,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most (all?) methods that deal with documents/records only need the collection's ID, so maybe we could update the typing to reflect that?

Suggested change
collection: Collection,
collection: Pick<Collection, "id">,

then people can avoid an API call to fetch full collection details if they really want to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the one thing that we do regularly need is the embedding function so I think it'll beneficial for us to keep Collection as a kind of opaque object that they have to get from the API. My understanding is that there's work underway for the backend to store some representation of embedding function alongside the collection which I think will make this opaque Collection object make more sense.

@atroyn in case I've misunderstood anything here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally, as a JS dev, I would assume any object returned from an idiomatic API client library is trivially serializable/de-serializable (JSON.[stringify|parse])
as long as we can hold that assumption that sounds ok to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, if every single method only needs the collection ID, I do feel it would be more idiomatic to have methods accept async addRecords(collection_id: string...) instead (and store any necessary collection metadata on the client object)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will be able to hold that assumption with EF persist in the works

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlabasterAxe what do you think about using the collection ID for these methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so if I'm understanding what you proposed above, you're saying that the client would maintain a mapping from collection ids to embedding functions and then will just look that up whenever the user supplies a collection id?

One potential issue that could arise is that a user could attempt to call the addRecords method even if the embedding function had never been configured for a particular collection.

So hypothetically we could have the following situation:

/* Run 1 */
const collection = client.createCollection({name: "test"});
console.log(collection.id) // prints 1234

/* Run 2 */
client.addRecords("1234", {/* record contents */}); // How do we know which embedding function to use 
                                                    // here if the client has not gotten the collection in this
                                                    // run of the script?

I'll defer to @atroyn here: Should we continue accepting collection objects in the record methods? or should we move to accepting collection ids in the API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed with @atroyn and we settled on keeping the methods accepting collection objects for now

clients/js/src/index.ts Outdated Show resolved Hide resolved
clients/js/src/index.ts Outdated Show resolved Hide resolved
docs/docs.trychroma.com/pages/guides/index.md Outdated Show resolved Hide resolved
docs/docs.trychroma.com/pages/guides/index.md Outdated Show resolved Hide resolved
* ```
*/
async addRecords(
collection: Collection,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally, as a JS dev, I would assume any object returned from an idiomatic API client library is trivially serializable/de-serializable (JSON.[stringify|parse])
as long as we can hold that assumption that sounds ok to me

@atroyn
Copy link
Contributor

atroyn commented Jul 9, 2024

Initial review pass looks good, @codetheweb is reviewing docs currently.

@AlabasterAxe could we please split this up into a stack of PRs for easier in-depth review of each part?

Copy link
Contributor

@codetheweb codetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs look pretty good to me 👍

@AlabasterAxe
Copy link
Contributor Author

Initial review pass looks good, @codetheweb is reviewing docs currently.

@AlabasterAxe could we please split this up into a stack of PRs for easier in-depth review of each part?

@atroyn I don't have access to Chroma's Graphite instance so I did this manually. Let me know if that works for you. Also this PR is in kind of a weird state. I don't have permission to push branches to the chroma repo directly so I started it from my personal chroma fork. That means the subsequent PRs are not considered as being against the chroma repo yet but I think that should be okay because once you all give the go ahead I can merge them in quick succession.

Also, it sounds like it may be good to sync about all of the behavior changes this PR includes because it sounds like I may have diverged a bit too much from the Python client. Let me know if you'd like to sync quickly and then I can make the final changes to this PR so it's ready for the final round of reviews.

@AlabasterAxe
Copy link
Contributor Author

AlabasterAxe commented Jul 10, 2024

@atroyn

I've recorded my thoughts about the various suggested changes to client behavior in these isues:
#2483 (the queryTexts vs query issue from above)
#2492
#2493
#2494

As discussed yesterday these are just for posterity/a jumping off point for larger conversations about client behavior as opposed to proposals that I'm advocating we implement in the near term. I still need to revert the behavior changes as discussed from yesterday though. Will ping on this PR when those changes are in.

@AlabasterAxe
Copy link
Contributor Author

@atroyn Okay, I've made the requested changes so these PRs should be ready for another look. Specifically:

  • revert the merging of queryTexts and queryEmbedding into a single query param
  • revert the singular field naming
  • revert the parameter-dependent return type (i.e. if user supplies a singular type, unwrap resulting array)

Also, I introduced a some style regressions along the way so I've just added a final formatting PR at the end of the stack.

clients/js/src/utils.ts Outdated Show resolved Hide resolved
clients/js/src/utils.ts Outdated Show resolved Hide resolved
clients/js/src/utils.ts Outdated Show resolved Hide resolved
clients/js/src/embeddings/OpenAIEmbeddingFunction.ts Outdated Show resolved Hide resolved
* ```
*/
async addRecords(
collection: Collection,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlabasterAxe what do you think about using the collection ID for these methods?

Copy link
Contributor

@codetheweb codetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validates tenant and database on client startup

let's extract this out into a separate PR, we can merge it before this entire stack is approved since it doesn't change any interfaces

also, a quick PR to export types in package.jsoncould be merged before this stack is approved as well

otherwise this first PR is looking good! (outside of failing checks)

collection: Collection,
params: MultiAddRecordsOperationParams,
): Promise<AddResponse>;
async addRecords(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this signature (and others like it below) need to be overloaded since the response shape doesn't change?

@AlabasterAxe
Copy link
Contributor Author

@codetheweb validation PR here: #2537

This PR is rebased on top of that PR but I'm still using the chroma repo main as its base for this PR.

@AlabasterAxe
Copy link
Contributor Author

@codetheweb

I've propagated all of the PR feedback through the stack and put together the "Full stack" PR here: #2542

All PR checks are passing there so I'm thinking that as long as you approve of each of the PRs in the stack, we should be good to merge the Full Stack PR (even though e.g. this PR is failing checks) WDYT?

collection: Collection,
params: AddRecordsParams,
): Promise<AddResponse> {
if (typeof params.ids === "string") {
Copy link
Contributor Author

@AlabasterAxe AlabasterAxe Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: This check has been removed in a subsequent stack (here and below). Not changing here to avoid further PR churn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Update][Ease of Use] TypeScript Client Update
3 participants