
Tokens Chunking to respect Language Word Boundaries #866

Open
loretoparisi opened this issue May 17, 2023 · 0 comments
loretoparisi commented May 17, 2023

Hello, for my LLM input I need to split the prompt tokens into chunks, where each chunk represents a context window (or session):

```js
// Split a flat token array into consecutive chunks of at most `chunkSize` tokens.
let tokensToChunks = function (arr, chunkSize) {
  return arr
    .map(function (e, i) {
      return i % chunkSize === 0 ? arr.slice(i, i + chunkSize) : null;
    })
    .filter(function (e) { return e; });
};

// inputTokens is a Uint32Array
let sessions = tokensToChunks(Array.from(inputTokens), max_tokens);
```
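For concreteness, with hypothetical token values the helper simply groups every `max_tokens` consecutive codes:

```js
// chunk size 2, hypothetical codes
tokensToChunks([10, 20, 30, 40, 50], 2);
// → [[10, 20], [30, 40], [50]]
```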

This keeps each inference session within `max_tokens`. It works fine in most cases, but occasionally the subword tokens of a single word fall into different chunks, e.g.

```js
let codes = encode("Yorkshire");
// [ 100077, 15255 ]
// the two subword codes land in different chunks:
session1.push(codes[0]);
session2.push(codes[1]);
```

thus leaving both context windows with broken or missing semantic content.
Assuming we work at the token level (hence with codes, not words), is there any good practice for in-context chunking that avoids this semantic displacement across contexts (chunks)?
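One possible approach (a minimal sketch, not an established recipe): only cut at tokens that start a new word. The sketch below assumes a `decode` counterpart to the `encode` used above, and assumes a GPT-style BPE vocabulary where a word-initial token decodes to text beginning with whitespace; both assumptions are tokenizer-dependent and should be verified against your vocabulary.

```js
// Sketch: chunk tokens without splitting a word across chunks.
// Assumes `decode(tokens)` returns the decoded string for an array of codes,
// and that word-initial tokens decode with a leading space (GPT-style BPE).
function tokensToChunksAligned(tokens, maxTokens, decode) {
  const chunks = [];
  let current = [];
  let i = 0;
  while (i < tokens.length) {
    // Gather one "word": a boundary token plus its subword continuations.
    const word = [tokens[i++]];
    while (i < tokens.length && !decode([tokens[i]]).startsWith(" ")) {
      word.push(tokens[i++]);
    }
    // If the whole word would overflow the current chunk, start a new one.
    if (current.length > 0 && current.length + word.length > maxTokens) {
      chunks.push(current);
      current = [];
    }
    current = current.concat(word);
  }
  if (current.length > 0) chunks.push(current);
  // Note: a single word longer than maxTokens still yields an oversized chunk.
  return chunks;
}

// let sessions = tokensToChunksAligned(Array.from(inputTokens), max_tokens, decode);
```

An alternative that avoids per-token decoding is to split the raw text on word boundaries first and re-encode each chunk, at the cost of a slightly variable token count per chunk.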
