Commit
feat: add json streaming to JSONReader (#1119)
KindOfAScam authored Sep 6, 2024
1 parent 0148354 commit ae1149f
Showing 7 changed files with 271 additions and 164 deletions.
5 changes: 5 additions & 0 deletions .changeset/blue-bears-invite.md
@@ -0,0 +1,5 @@
---
"llamaindex": patch
---

feat: add JSON streaming to JSONReader
7 changes: 6 additions & 1 deletion apps/docs/docs/modules/data_loaders/json.md
@@ -2,6 +2,7 @@

A simple JSON data loader with various options.
It either parses the entire string, cleaning it and treating each line as an embedding, or performs a recursive depth-first traversal that yields JSON paths.
Supports streaming of large JSON data using [@discoveryjs/json-ext](https://github.com/discoveryjs/json-ext).

## Usage

@@ -20,12 +21,16 @@ const docsFromContent = reader.loadDataAsContent(content);

Basic:

- `streamingThreshold?`: The threshold, in MB of JSON data, at which streaming mode is used. Estimates the character count from bytes as `(streamingThreshold * 1024 * 1024) / 2` and compares it against the `.length` of the JSON string. Set to `undefined` to disable streaming or `0` to always use streaming. Default is `50` MB.

- `ensureAscii?`: Whether to ensure that only ASCII characters are present in the output by converting non-ASCII characters to their Unicode escape sequences. Default is `false`.

- `isJsonLines?`: Whether the JSON is in JSON Lines format. If true, will split into lines, remove empty ones, and parse each line as JSON. Default is `false`
- `isJsonLines?`: Whether the JSON is in JSON Lines format. If true, will split into lines, remove empty ones, and parse each line as JSON. Note: Uses a custom streaming parser, most likely less robust than json-ext. Default is `false`

- `cleanJson?`: Whether to clean the JSON by filtering out structural characters (`{}, [], and ,`). If set to false, it will just parse the JSON, not removing structural characters. Default is `true`.

- `logger?`: A placeholder for a custom logger function.
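
The `streamingThreshold` estimate described above can be sketched as follows. This is only an illustration of the documented formula, not the reader's actual code: the helper name `shouldStream` is hypothetical, and the use of `>=` for the comparison is an assumption chosen so that `0` always selects streaming.

```typescript
// Hypothetical helper illustrating the documented threshold check.
// It is NOT part of the llamaindex API.
function shouldStream(json: string, streamingThreshold?: number): boolean {
  // `undefined` disables streaming entirely.
  if (streamingThreshold === undefined) return false;

  // Convert MB to bytes, then halve to approximate a character count
  // (JavaScript strings use 2 bytes per UTF-16 code unit).
  const maxChars = (streamingThreshold * 1024 * 1024) / 2;

  // Assumption: inclusive comparison, so a threshold of 0 always streams.
  return json.length >= maxChars;
}

// With the default of 50 MB, the cut-off is 26,214,400 characters.
const defaultCutoff = (50 * 1024 * 1024) / 2;
console.log(defaultCutoff, shouldStream('{"a": 1}', 50));
```

Dividing by two reflects that `.length` counts UTF-16 code units rather than bytes, so the threshold is an estimate and not an exact size on disk.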

Depth-First-Traversal:

- `levelsBack?`: Specifies how many levels up the JSON structure to include in the output. `cleanJson` will be ignored. If set to 0, all levels are included. If undefined, parses the entire JSON, treats each line as an embedding, and creates a document per top-level array. Default is `undefined`
3 changes: 2 additions & 1 deletion examples/readers/package.json
@@ -14,7 +14,8 @@
"start:assemblyai": "node --import tsx ./src/assemblyai.ts",
"start:llamaparse-dir": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts",
"start:llamaparse-json": "node --import tsx ./src/llamaparse-json.ts",
"start:discord": "node --import tsx ./src/discord.ts"
"start:discord": "node --import tsx ./src/discord.ts",
"start:json": "node --import tsx ./src/json.ts"
},
"dependencies": {
"llamaindex": "*"
1 change: 1 addition & 0 deletions packages/llamaindex/package.json
@@ -25,6 +25,7 @@
"@azure/identity": "^4.4.1",
"@datastax/astra-db-ts": "^1.4.1",
"@discordjs/rest": "^2.3.0",
"@discoveryjs/json-ext": "^0.6.1",
"@google-cloud/vertexai": "1.2.0",
"@google/generative-ai": "0.12.0",
"@grpc/grpc-js": "^1.11.1",