Merge pull request #120 from kuzudb/import-export
Import/export docs
Showing 8 changed files with 423 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
--- | ||
title: Export CSV | ||
--- | ||
|
||
The `COPY TO` clause can export query results to a CSV file, and is used as follows: | ||
|
||
```cypher | ||
COPY (MATCH (u:User) RETURN u.*) TO 'user.csv' (header=true); | ||
``` | ||
|
||
The CSV file consists of the following fields: | ||
|
||
```csv | ||
u.name,u.age | ||
Adam,30 | ||
Karissa,40 | ||
Zhang,50 | ||
Noura,25 | ||
``` | ||
|
||
Nested data types like lists and structs will be represented as strings within their respective columns. | ||
|
||
Available options are: | ||
|
||
<div class="scroll-table"> | ||
|
||
| Option | Default Value | Description | | ||
|:------------------------:|:-----------------------:|---------------------------------------------------------------------------| | ||
| `ESCAPE` | `\` | Character used to escape special characters in CSV | | ||
| `DELIM` | `,` | Character that separates fields in the CSV | | ||
| `QUOTE` | `"` | Character used to enclose fields containing special characters or spaces | | ||
| `HEADER` | `false` | Indicates whether to output a header row | | ||
|
||
</div> | ||
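
As a hedged sketch of how the options in the table combine, the statement below sets three of them at once. The output file name `user_semicolon.csv` is hypothetical, and the lower-case option spelling simply mirrors the `header=true` form used in the example above; treat it as an illustration rather than an exhaustive reference.

```cypher
// Export all User properties with a header row, using ';' as the field
// separator and '"' as the quote character (the documented default,
// set explicitly here for illustration).
COPY (MATCH (u:User) RETURN u.*) TO 'user_semicolon.csv' (header=true, delim=';', quote='"');
```
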
|
||
Another example is shown below. | ||
|
||
```cypher | ||
COPY (MATCH (a:User)-[f:Follows]->(b:User) RETURN a.name, f.since, b.name) TO 'follows.csv' (header=false, delim='|'); | ||
``` | ||
|
||
This outputs the following results to `follows.csv`: | ||
```csv | ||
Adam|2020|Karissa | ||
Adam|2020|Zhang | ||
Karissa|2021|Zhang | ||
Zhang|2022|Noura | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
--- | ||
title: Overview | ||
--- | ||
|
||
import { LinkCard } from '@astrojs/starlight/components'; | ||
|
||
The `COPY TO` command allows you to export query results directly to the specified file format. This | ||
is useful when you want to persist the results of a query to be used in other systems, or for | ||
archiving purposes. | ||
|
||
## `COPY TO` CSV | ||
|
||
<LinkCard | ||
title="Export CSV" | ||
href="/export/csv" | ||
/> | ||
|
||
## `COPY TO` Parquet | ||
|
||
<LinkCard | ||
title="Export Parquet" | ||
href="/export/parquet" | ||
/> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
--- | ||
title: Export Parquet | ||
--- | ||
|
||
The `COPY TO` clause can export query results to a Parquet file. It can be combined with a subquery | ||
and used as shown below. | ||
|
||
```cypher | ||
COPY (MATCH (u:User) RETURN u.*) TO 'user.parquet'; | ||
``` | ||
|
||
The `LOAD FROM` clause can be used to scan the Parquet file and verify that the export worked: | ||
|
||
```cypher | ||
> LOAD FROM 'user.parquet' RETURN *; | ||
------------------- | ||
| u.name | u.age | | ||
------------------- | ||
| Adam | 30 | | ||
------------------- | ||
| Karissa | 40 | | ||
------------------- | ||
| Zhang | 50 | | ||
------------------- | ||
| Noura | 25 | | ||
------------------- | ||
``` | ||
|
||
:::caution[Notes] | ||
- Exporting [fixed list](../cypher/data-types#list) or [variant](../../cypher/data-types/variant) data types to Parquet is not yet supported. | ||
- [UNION](../../cypher/data-types/union) is exported as a [STRUCT](../../cypher/data-types/struct), which is the internal representation of the `Union` data type. | ||
- Currently, only Snappy compression is supported for exports. | ||
::: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
--- | ||
title: Import data from CSV files | ||
--- | ||
|
||
You can bulk import data to node and relationship tables from CSV files | ||
using the `COPY FROM` command. It is **highly recommended** to use `COPY FROM` if you are creating large | ||
databases. | ||
|
||
The CSV import configuration can be manually set by specifying the parameters inside `( )` at the | ||
end of the `COPY FROM` clause. The following table shows the supported configuration parameters: | ||
|
||
| Parameter | Description | Default Value | | ||
|:-----|:-----|:-----| | ||
| `HEADER` | Whether the first line of the CSV file is the header. Can be true or false. | false | | ||
| `DELIM` | Character that separates columns within a line. | `,`| | ||
| `QUOTE` | Character to start a string quote. | `"` | | ||
| `ESCAPE` | Character within string quotes to escape QUOTE and other characters, e.g., a line break. <br/> See the important note below about line breaks.| `\` | | ||
| `LIST_BEGIN`/`LIST_END` | For the [list data type](../cypher/data-types/list.md), the characters that mark <br/> the beginning and end of a list | `[`, `]`| | ||
| `PARALLEL` | Whether to read CSV files in parallel. Can be true or false. | true | | ||
|
||
The example below specifies that the CSV delimiter is `|` and that the file has a header row. | ||
|
||
```cypher | ||
COPY User FROM "user.csv" (HEADER=true, DELIM="|"); | ||
``` | ||
|
||
:::caution[Guidelines] | ||
- **Start with empty tables:** `COPY FROM` commands can only be used when your tables are completely empty, so you should use `COPY FROM` immediately after you define the schemas of your tables. | ||
- **Copy nodes before relationships:** In order to copy a relationship table `R` from a CSV file `RFile`, the nodes that appear in `RFile` need to | ||
already exist in the database (either imported in bulk or inserted through Cypher data manipulation commands). | ||
- **Wrap strings inside quotes:** Kùzu will accept strings in string columns both with and without quotes, though it's recommended to wrap strings in quotes to avoid any ambiguity with delimiters. | ||
- **Avoid leading and trailing spaces**: As per the CSV standard, Kùzu does not ignore leading and trailing spaces (e.g., if you input ` 213 ` for | ||
an integer value, it will be read as a malformed integer and the corresponding node/rel property will be set to NULL). | ||
::: | ||
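
The guidelines above imply an order of operations, which can be sketched as follows. This assumes the `User` and `Follows` tables and the `user.csv` and `follows.csv` files described in the sections below; it is an outline of the sequence rather than a complete script.

```cypher
// 1. Define the schema first, so that the tables exist and are still empty.
CREATE NODE TABLE User(name STRING, age INT64, reg_date DATE, PRIMARY KEY (name));
CREATE REL TABLE Follows(FROM User TO User, since DATE);

// 2. Load nodes before relationships (user.csv is assumed to have a header row).
COPY User FROM "user.csv" (header=true);

// 3. Load relationships last: every name in the first two columns of
//    follows.csv must already exist as a User node.
COPY Follows FROM "follows.csv";
```
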
|
||
## Import to node table | ||
|
||
Create a node table `User` as follows: | ||
|
||
```cypher | ||
CREATE NODE TABLE User(name STRING, age INT64, reg_date DATE, PRIMARY KEY (name)) | ||
``` | ||
|
||
The CSV file `user.csv` contains the following fields: | ||
```csv | ||
name,age,reg_date | ||
Adam,30,2020-06-22 | ||
Karissa,40,2019-05-12 | ||
... | ||
``` | ||
|
||
The following statement loads `user.csv` into the `User` table. | ||
|
||
```cypher | ||
COPY User FROM "user.csv" (header=true); | ||
``` | ||
|
||
## Import to relationship table | ||
|
||
When loading into a relationship table, Kùzu assumes the first two columns in the file are: | ||
|
||
- `FROM` Node Column: The primary key of the `FROM` nodes. | ||
- `TO` Node Column: The primary key of the `TO` nodes. | ||
|
||
The rest of the columns correspond to relationship properties. | ||
|
||
Create a relationship table `Follows` using the following Cypher query: | ||
|
||
```cypher | ||
CREATE REL TABLE Follows(FROM User TO User, since DATE) | ||
``` | ||
|
||
The CSV file `follows.csv` contains the following data: | ||
```csv | ||
Adam,Karissa,2010-01-30 | ||
Karissa,Michelle,2014-01-30 | ||
... | ||
``` | ||
|
||
The following statement loads the `follows.csv` file into the `Follows` table. | ||
|
||
```cypher | ||
COPY Follows FROM "follows.csv"; | ||
``` | ||
|
||
Because the CSV file has no header row, the `HEADER` parameter is not set and keeps its default value of `false`. | ||
|
||
## Import multiple files to a single table | ||
|
||
It is common practice to divide a large CSV file into several smaller files for cleaner data management. | ||
Kùzu can read multiple files with the same structure, consolidating their data into a single node or relationship table. | ||
You can specify that multiple files are loaded in the following ways: | ||
|
||
### Glob pattern | ||
|
||
This is similar to the Unix [glob](https://man7.org/linux/man-pages/man7/glob.7.html) pattern, where you specify | ||
file paths that match a given pattern. The following wildcard characters are supported: | ||
|
||
| Wildcard | Description | | ||
| :-----------: | ----------- | | ||
| `*` | match any number of any characters (including none) | | ||
| `?` | match any single character | | ||
| `[abc]` | match any one of the characters enclosed within the brackets | | ||
| `[a-z]` | match any one of the characters within the range | | ||
|
||
```cypher | ||
COPY User FROM "User*.csv" | ||
``` | ||
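
The other wildcards in the table can be used in the same way. The two statements below are alternatives rather than a sequence, since `COPY FROM` expects an empty table. The file names `User0.csv` through `User2.csv` are assumed for illustration, as is the digit range `[0-2]` behaving like the `[a-z]` range shown in the table.

```cypher
// Alternative 1: exactly one character between "User" and ".csv",
// e.g. User0.csv or UserA.csv, but not User10.csv.
COPY User FROM "User?.csv";

// Alternative 2: only User0.csv, User1.csv and User2.csv.
COPY User FROM "User[0-2].csv";
```
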
|
||
### List of files | ||
|
||
Alternatively, you can just specify a list of files to be loaded. | ||
|
||
```cypher | ||
COPY User FROM ["User0.csv", "User1.csv", "User2.csv"] | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
--- | ||
title: Overview | ||
--- | ||
|
||
import { LinkCard } from '@astrojs/starlight/components'; | ||
|
||
There are multiple ways to import data in Kùzu. The only prerequisite for inserting data | ||
into a database is that you first create a graph schema, i.e., the structure of your node and relationship tables. | ||
|
||
For small graphs (a few thousand nodes), the `CREATE` and `MERGE` [Cypher clauses](../cypher/data-manipulation-clauses) | ||
can be used to insert nodes and | ||
relationships. These are similar to SQL's `INSERT` statements, but bear in mind that they are slower than the bulk import | ||
options shown below. The `CREATE`/`MERGE` clauses are intended to do small additions or updates on a sporadic basis. | ||
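
For illustration, a small insertion of this kind might look like the sketch below. The `User` and `Follows` schemas and the sample values are assumptions made for this example only.

```cypher
// Assumed schema for this sketch.
CREATE NODE TABLE User(name STRING, age INT64, PRIMARY KEY (name));
CREATE REL TABLE Follows(FROM User TO User, since INT64);

// CREATE adds new nodes one at a time.
CREATE (u:User {name: 'Alice', age: 35});
CREATE (u:User {name: 'Bob', age: 41});

// MERGE matches an existing node, or creates it if there is no match.
MERGE (u:User {name: 'Alice'}) ON MATCH SET u.age = 36;

// A relationship is created between two nodes that already exist.
MATCH (a:User), (b:User)
WHERE a.name = 'Alice' AND b.name = 'Bob'
CREATE (a)-[:Follows {since: 2024}]->(b);
```
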
|
||
In general, for larger graphs of millions of nodes and beyond, the recommended approach is to use | ||
`COPY FROM` rather than creating or merging nodes one by one. For now, `COPY FROM` commands can only | ||
be used when tables are empty. | ||
|
||
## `COPY FROM` CSV | ||
|
||
The `COPY FROM` command is used to bulk import data from a CSV file into a node or relationship table. | ||
See the linked card below for more information and examples. | ||
|
||
<LinkCard | ||
title="Import CSV" | ||
href="/import/csv" | ||
/> | ||
|
||
## `COPY FROM` Parquet | ||
|
||
Similar to CSV, the `COPY FROM` command is used to bulk import data from a Parquet file into a node or relationship table. | ||
See the linked card below for more information and examples. | ||
|
||
<LinkCard | ||
title="Import Parquet" | ||
href="/import/parquet" | ||
/> | ||
|
||
## `COPY FROM` NumPy | ||
|
||
Importing from NumPy is a specific use case that allows you to import numeric data from a NumPy file into a node table. | ||
|
||
<LinkCard | ||
title="Import NumPy" | ||
href="/import/npy" | ||
/> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
--- | ||
title: Import NumPy | ||
--- | ||
|
||
The `.npy` format is the standard binary file format in [NumPy](https://numpy.org/) for persisting a | ||
single arbitrary NumPy array on disk. | ||
|
||
The primary use case for bulk loading NumPy files is to load | ||
large node features or vectors that are stored in `.npy` format. You can use the `COPY FROM` statement | ||
to import a set of `*.npy` files into a node table. | ||
|
||
:::caution[Notes] | ||
This feature is experimental and will evolve. It currently has the following constraints: | ||
- **Import to node table only**: For now, Kùzu supports loading `.npy` files into **node tables** only. | ||
- **Start with empty tables**: `COPY FROM` commands can only be used when your tables are completely empty, | ||
so you should use `COPY FROM` immediately after you define the schemas of your tables. | ||
- **NPY file mapped to column**: Each `.npy` file will be loaded as a node table column. So, in the `COPY FROM` statement, the | ||
number of `.npy` files must be equal to the number of columns defined in the DDL. | ||
- **Numerical types only**: A `.npy` file can only contain numerical values. | ||
::: | ||
|
||
## Import to node table | ||
Consider a `Paper` table with an `id` column, a feature column that is an embedding (vector) with 768 dimensions, | ||
a `year` column and a `label` column as ground truth. We first define the schema with the following statement: | ||
|
||
```cypher | ||
CREATE NODE TABLE Paper(id INT64, feat FLOAT[768], year INT64, label DOUBLE, PRIMARY KEY(id)); | ||
``` | ||
|
||
The raw data is stored in `.npy` format where each column is represented as a NumPy array on disk. The files are | ||
specified below: | ||
|
||
``` | ||
"node_id.npy", "node_feat_f32.npy", "node_year.npy", "node_label.npy" | ||
``` | ||
|
||
We can copy the files with the following statement: | ||
|
||
```cypher | ||
COPY Paper FROM ("node_id.npy", "node_feat_f32.npy", "node_year.npy", "node_label.npy") BY COLUMN; | ||
``` | ||
|
||
As noted above, the number of `*.npy` files must equal the number of columns. The files must also be | ||
listed in the same order as the columns are defined in the DDL. |