feat: add json streaming to JSONReader #1119

Merged
merged 19 commits into from
Sep 6, 2024
5 changes: 5 additions & 0 deletions .changeset/blue-bears-invite.md
@@ -0,0 +1,5 @@
---
"llamaindex": patch
---

feat: add JSON streaming to JSONReader
7 changes: 6 additions & 1 deletion apps/docs/docs/modules/data_loaders/json.md
@@ -2,6 +2,7 @@

A simple JSON data loader with various options.
It either parses the entire string, cleans it, and treats each line as an embedding, or performs a recursive depth-first traversal that yields JSON paths.
Supports streaming of large JSON data using [@discoveryjs/json-ext](https://github.com/discoveryjs/json-ext).

## Usage

@@ -20,12 +21,16 @@ const docsFromContent = reader.loadDataAsContent(content);

Basic:

- `streamingThreshold?`: The threshold, in MB of JSON data, for using streaming mode. The threshold is converted to an estimated character count, `(streamingThreshold * 1024 * 1024) / 2`, and compared against the `.length` of the JSON string. Set to `undefined` to disable streaming or to `0` to always use streaming. Default is `50` MB.

- `ensureAscii?`: Whether to ensure that only ASCII characters are present in the output by converting non-ASCII characters to their Unicode escape sequences. Default is `false`.

- `isJsonLines?`: Whether the JSON is in JSON Lines format. If true, will split into lines, remove empty ones, and parse each line as JSON. Default is `false`.
- `isJsonLines?`: Whether the JSON is in JSON Lines format. If true, will split into lines, remove empty ones, and parse each line as JSON. Note: Uses a custom streaming parser, most likely less robust than json-ext. Default is `false`.

- `cleanJson?`: Whether to clean the JSON by filtering out structural characters (`{}, [], and ,`). If set to false, it will just parse the JSON, not removing structural characters. Default is `true`.

- `logger?`: A placeholder for a custom logger function.
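
The `streamingThreshold` arithmetic described above can be sketched as follows. This is a minimal illustration, not the reader's actual code: the helper name `shouldStream` is hypothetical, and the use of `>=` for the comparison is an assumption chosen so that a threshold of `0` always selects streaming.

```typescript
// Hypothetical helper illustrating the documented threshold check.
// Assumption: ">=" is used so that a threshold of 0 always streams.
function shouldStream(json: string, streamingThresholdMb?: number): boolean {
  // `undefined` disables streaming entirely.
  if (streamingThresholdMb === undefined) return false;

  // MB -> bytes, then divide by 2 to estimate a character count
  // (roughly two bytes per character in a JavaScript string).
  const maxChars = (streamingThresholdMb * 1024 * 1024) / 2;

  // Compare the estimate against the string's length.
  return json.length >= maxChars;
}

// With the default of 50 MB, the estimated limit is 26,214,400 characters.
const defaultLimit = (50 * 1024 * 1024) / 2;
```

Under these assumptions a small payload stays on the non-streaming path with the default threshold, while passing `0` forces streaming for any input.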

Depth-First-Traversal:

- `levelsBack?`: Specifies how many levels up the JSON structure to include in the output. `cleanJson` will be ignored. If set to 0, all levels are included. If undefined, parses the entire JSON, treats each line as an embedding, and creates a document per top-level array. Default is `undefined`.
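
A depth-first traversal that yields JSON paths, with a `levelsBack`-style cutoff, might look roughly like the sketch below. It is only an illustration of the idea: the function name `jsonPaths`, the space-separated output format, and the choice to skip array indices in the path are all assumptions, not the reader's actual behavior.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical generator: walks the value depth-first and yields one
// "key key ... value" string per leaf. levelsBack = 0 keeps the full path;
// otherwise only the last `levelsBack` keys are kept.
function* jsonPaths(
  node: Json,
  path: string[] = [],
  levelsBack = 0,
): Generator<string> {
  if (Array.isArray(node)) {
    // Assumption: array items inherit the parent's path (no index added).
    for (const item of node) {
      yield* jsonPaths(item, path, levelsBack);
    }
  } else if (node !== null && typeof node === "object") {
    for (const [key, value] of Object.entries(node)) {
      yield* jsonPaths(value, [...path, key], levelsBack);
    }
  } else {
    // Leaf: trim the path to the requested number of ancestor levels.
    const kept = levelsBack === 0 ? path : path.slice(-levelsBack);
    yield [...kept, String(node)].join(" ");
  }
}

const allLevels = Array.from(jsonPaths({ a: { b: 1, c: [2, 3] } }));
const oneLevel = Array.from(jsonPaths({ a: { b: 1, c: [2, 3] } }, [], 1));
```

Here `allLevels` holds the full paths for each leaf, while `oneLevel` keeps only the nearest key for each leaf value.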
3 changes: 2 additions & 1 deletion examples/readers/package.json
@@ -14,7 +14,8 @@
"start:assemblyai": "node --import tsx ./src/assemblyai.ts",
"start:llamaparse-dir": "node --import tsx ./src/simple-directory-reader-with-llamaparse.ts",
"start:llamaparse-json": "node --import tsx ./src/llamaparse-json.ts",
"start:discord": "node --import tsx ./src/discord.ts"
"start:discord": "node --import tsx ./src/discord.ts",
"start:json": "node --import tsx ./src/json.ts"
},
"dependencies": {
"llamaindex": "*"
1 change: 1 addition & 0 deletions packages/llamaindex/package.json
@@ -25,6 +25,7 @@
"@azure/identity": "^4.2.1",
"@datastax/astra-db-ts": "^1.2.1",
"@discordjs/rest": "^2.3.0",
"@discoveryjs/json-ext": "^0.6.1",
"@google-cloud/vertexai": "^1.2.0",
"@google/generative-ai": "0.12.0",
"@grpc/grpc-js": "^1.10.11",