
Tokens Chunking to respect Language Word Boundaries #866

Open
loretoparisi opened this issue May 17, 2023 · 0 comments
loretoparisi commented May 17, 2023

Hello, for my LLM input I need to split the prompt tokens into chunks, where each chunk represents a context window (or session):

```js
// Split a flat token array into consecutive chunks of at most `chunkSize` tokens.
let tokensToChunks = function (arr, chunkSize) {
  return arr
    .map(function (e, i) {
      return i % chunkSize === 0 ? arr.slice(i, i + chunkSize) : null;
    })
    .filter(function (e) { return e; });
};

// inputTokens is a Uint32Array
let sessions = tokensToChunks(Array.from(inputTokens), max_tokens);
```
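For concreteness, with hypothetical token values the helper simply groups every `max_tokens` consecutive codes:

```js
// chunk size 2, hypothetical codes
tokensToChunks([10, 20, 30, 40, 50], 2);
// → [[10, 20], [30, 40], [50]]
```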

This keeps each inference session within `max_tokens`. It works fine in most cases, but occasionally the subword tokens of a single word fall into different chunks, e.g.

```js
let codes = encode("Yorkshire");
// [ 100077, 15255 ]
// the two subword codes land in different chunks:
session1.push(codes[0]);
session2.push(codes[1]);
```

thus leaving both context windows with broken or missing semantic content.
Assuming we work at the token level (hence with codes, not words), is there any good practice for in-context chunking that avoids this semantic displacement across contexts (chunks)?
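One possible approach (a minimal sketch, not an established recipe): only cut at tokens that start a new word. The sketch below assumes a `decode` counterpart to the `encode` used above, and assumes a GPT-style BPE vocabulary where a word-initial token decodes to text beginning with whitespace; both assumptions are tokenizer-dependent and should be verified against your vocabulary.

```js
// Sketch: chunk tokens without splitting a word across chunks.
// Assumes `decode(tokens)` returns the decoded string for an array of codes,
// and that word-initial tokens decode with a leading space (GPT-style BPE).
function tokensToChunksAligned(tokens, maxTokens, decode) {
  const chunks = [];
  let current = [];
  let i = 0;
  while (i < tokens.length) {
    // Gather one "word": a boundary token plus its subword continuations.
    const word = [tokens[i++]];
    while (i < tokens.length && !decode([tokens[i]]).startsWith(" ")) {
      word.push(tokens[i++]);
    }
    // If the whole word would overflow the current chunk, start a new one.
    if (current.length > 0 && current.length + word.length > maxTokens) {
      chunks.push(current);
      current = [];
    }
    current = current.concat(word);
  }
  if (current.length > 0) chunks.push(current);
  // Note: a single word longer than maxTokens still yields an oversized chunk.
  return chunks;
}

// let sessions = tokensToChunksAligned(Array.from(inputTokens), max_tokens, decode);
```

An alternative that avoids per-token decoding is to split the raw text on word boundaries first and re-encode each chunk, at the cost of a slightly variable token count per chunk.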
