Feature request: add layer for darts lookup table? #1225

Open
jeongukjae opened this issue Oct 23, 2023 · 2 comments

Comments

@jeongukjae

In some cases, a Darts (double-array trie) lookup table can be faster and more memory-efficient than the hash table in TensorFlow.

It would be great if the Darts lookup table had the following methods:

  • build:
    • Arguments: a string tensor (keys), an int tensor (values), and a default value
  • common prefix search:
    • Arguments: a string tensor
    • Returns: a ragged tensor of the matched prefixes, and a ragged tensor of their values
  • exact match search:
    • Arguments: a string tensor
    • Returns: the matched value, or the default value for unknown strings
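
The three methods listed above could look roughly like the following. This is only an illustrative sketch: the class name `TrieLookupTable` and its method names are hypothetical, it uses a plain dictionary-based trie rather than a real double array, and nested Python lists stand in for ragged tensors.

```python
class TrieLookupTable:
    """Hypothetical stand-in for the requested layer (plain dict trie, not Darts)."""

    _VALUE = object()  # sentinel dict key marking "a stored key ends at this node"

    def __init__(self, keys, values, default_value=-1):
        # "build": insert every (key, value) pair into a nested-dict trie.
        self._default = default_value
        self._root = {}
        for key, value in zip(keys, values):
            node = self._root
            for ch in key:
                node = node.setdefault(ch, {})
            node[self._VALUE] = value

    def common_prefix_search(self, queries):
        # For each query, collect every stored key that is a prefix of it.
        # Empty-string keys are ignored here for brevity.
        all_prefixes, all_values = [], []
        for query in queries:
            node, prefixes, values = self._root, [], []
            for i, ch in enumerate(query):
                node = node.get(ch)
                if node is None:
                    break
                if self._VALUE in node:
                    prefixes.append(query[: i + 1])
                    values.append(node[self._VALUE])
            all_prefixes.append(prefixes)
            all_values.append(values)
        return all_prefixes, all_values  # nested lists play the role of ragged tensors

    def exact_match_search(self, queries):
        # For each query, return its value, or the default for unknown strings.
        results = []
        for query in queries:
            node = self._root
            for ch in query:
                node = node.get(ch)
                if node is None:
                    break
            if node is None:
                results.append(self._default)
            else:
                results.append(node.get(self._VALUE, self._default))
        return results


table = TrieLookupTable(["a", "ab", "abc", "b"], [1, 2, 3, 4], default_value=-1)
print(table.common_prefix_search(["abcd", "bx", "zz"]))
print(table.exact_match_search(["ab", "zz"]))
```

In an actual op the inputs and outputs would be tensors, and the per-query results of the prefix search have different lengths, which is why a ragged tensor is the natural return type.
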
@cantonios
Collaborator

Is this a tf.text-specific request, or should it be filed against tensorflow?

Do you have a link for "darts lookup table"?

@jeongukjae
Author

jeongukjae commented Oct 24, 2023

Ah, sorry, I didn't specify the details. It's a tensorflow-text-specific request.

Darts is a double-array trie, and it can be used like a lookup table. You can check the basic interface here: https://github.com/s-yata/darts-clone/blob/master/doc/en/Interface.md#dictionary-class. Additionally, tensorflow-text already depends on darts-clone (a clone of the Darts repository), which is used in the WordPiece tokenizer:

text/WORKSPACE

Lines 37 to 45 in b32645f

http_archive(
    name = "darts_clone",
    build_file = "//third_party/darts_clone:BUILD.bzl",
    sha256 = "c97f55d05c98da6fcaf7f9ecc6a6dc6bc5b18b8564465f77abff8879d446491c",
    strip_prefix = "darts-clone-e40ce4627526985a7767444b6ed6893ab6ff8983",
    urls = [
        "https://github.com/s-yata/darts-clone/archive/e40ce4627526985a7767444b6ed6893ab6ff8983.zip",
    ],
)

A double-array trie is a fast and memory-efficient data structure for storing many strings with paired values, so it can be useful for training and serving with very large vocabularies (for example, tens of millions of entries in a single model, where a hash table can be hard to use because of its memory footprint).

So I'm suggesting implementing the basic methods of the darts-clone interface.
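
To illustrate why this layout is compact, here is a minimal sketch of the double-array idea in plain Python. It is not darts-clone's implementation: the function names, the byte-plus-one character codes, the terminator code 0, and the negative-`base` value encoding are all assumptions made for this example. The whole trie lives in two flat integer arrays, and a transition from state `s` on code `c` goes to `t = base[s] + c`, which is valid only when `check[t] == s`.

```python
from collections import deque

FREE, ROOT = -1, -2  # markers stored in `check`


def build_double_array(keys, values):
    # Step 1: plain nested-dict trie over byte codes (byte + 1); code 0 ends a key.
    trie = {}
    for key, value in zip(keys, values):
        node = trie
        for byte in key.encode("utf-8"):
            node = node.setdefault(byte + 1, {})
        node[0] = value  # terminator edge carries the value

    # Step 2: pack the trie into two flat arrays, breadth first.
    base, check = [0], [ROOT]  # state 0 is the root
    queue = deque([(0, trie)] if trie else [])
    while queue:
        state, node = queue.popleft()
        codes = sorted(node)
        offset = 1
        while True:  # smallest offset whose target slots are all free
            while len(check) <= offset + codes[-1]:
                base.append(0)
                check.append(FREE)
            if all(check[offset + c] == FREE for c in codes):
                break
            offset += 1
        base[state] = offset
        for c in codes:
            target = offset + c
            check[target] = state
            if c == 0:
                base[target] = -node[0] - 1  # leaf: value encoded in base
            else:
                queue.append((target, node[c]))
    return base, check


def exact_match(base, check, key, default=-1):
    state = 0
    for c in [byte + 1 for byte in key.encode("utf-8")] + [0]:
        target = base[state] + c
        if target >= len(check) or check[target] != state:
            return default
        state = target
    return -base[state] - 1  # decode the value stored at the leaf


base, check = build_double_array(["a", "ab", "abc", "b"], [1, 2, 3, 4])
print(exact_match(base, check, "ab"), exact_match(base, check, "abd"))
```

Each lookup step is one addition and one array comparison, and the structure stores no per-key string objects or hash buckets, which is where the memory savings over a hash table would come from.
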
