Merge pull request #120 from kuzudb/import-export
Import/export docs
prrao87 authored Apr 9, 2024
2 parents 171bb8f + 1d523b1 commit e013e3c
Showing 8 changed files with 423 additions and 1 deletion.
21 changes: 20 additions & 1 deletion astro.config.mjs
@@ -61,6 +61,25 @@ export default defineConfig({
{ label: 'Run graph algorithms', link: '/get-started/graph-algorithms' },
]
},
{
label: 'Import data',
collapsed: true,
items: [
{ label: 'Overview', link: '/import' },
{ label: 'Copy from CSV', link: '/import/csv' },
{ label: 'Copy from Parquet', link: '/import/parquet' },
{ label: 'Copy from NumPy', link: '/import/npy', badge: { text: 'Experimental', variant: 'danger'}},
]
},
{
label: 'Export data',
collapsed: true,
items: [
{ label: 'Overview', link: '/export' },
{ label: 'Copy to CSV', link: '/export/csv' },
{ label: 'Copy to Parquet', link: '/export/parquet' },
]
},
{
label: 'Visualize graphs',
link: '/visualization',
@@ -75,7 +94,7 @@ export default defineConfig({
{ label: 'Create your first RDF graph', link: '/rdf-graphs/example-rdfgraph' },
{ label: 'Query an RDF graph in Cypher', link: '/rdf-graphs/rdfgraphs-overview' },
{ label: 'RDF bulk data import', link: '/rdf-graphs/rdf-import' },
{ label: 'Example RDFGraphs', link: '/rdf-graphs/rdfgraphs-repo' },
{ label: 'Preloaded RDFGraphs', link: '/rdf-graphs/rdfgraphs-repo' },
],
autogenerate: { directory: 'reference' },
},
48 changes: 48 additions & 0 deletions src/content/docs/export/csv.md
@@ -0,0 +1,48 @@
---
title: Export CSV
---

The `COPY TO` clause can export query results to a CSV file, and is used as follows:

```cypher
COPY (MATCH (u:User) RETURN u.*) TO 'user.csv' (header=true);
```

The exported CSV file contains the following:

```csv
u.name,u.age
Adam,30
Karissa,40
Zhang,50
Noura,25
```
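Since the output is plain CSV, it can be consumed by any standard reader. A minimal sketch using Python's standard library (the sample content is inlined here rather than assuming `user.csv` exists on disk):

```python
import csv
import io

# Sample content matching the exported user.csv shown above.
data = "u.name,u.age\nAdam,30\nKarissa,40\nZhang,50\nNoura,25\n"

# DictReader treats the first row as the header, mirroring header=true.
rows = list(csv.DictReader(io.StringIO(data)))
```

Note that all values come back as strings; downstream code must cast `u.age` to an integer itself.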

Nested data types like lists and structs will be represented as strings within their respective columns.

Available options are:

<div class="scroll-table">

| Option | Default Value | Description |
|:------------------------:|:-----------------------:|---------------------------------------------------------------------------|
| `ESCAPE` | `\` | Character used to escape special characters in CSV |
| `DELIM` | `,` | Character that separates fields in the CSV |
| `QUOTE` | `"` | Character used to enclose fields containing special characters or spaces |
| `HEADER`                 | `false`                 | Indicates whether to output a header row                                   |

</div>

Another example is shown below.

```cypher
COPY (MATCH (a:User)-[f:Follows]->(b:User) RETURN a.name, f.since, b.name) TO 'follows.csv' (header=false, delim='|');
```

This outputs the following results to `follows.csv`:
```csv
Adam|2020|Karissa
Adam|2020|Zhang
Karissa|2021|Zhang
Zhang|2022|Noura
```
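A reader consuming this file must be told about the non-default delimiter and the missing header. A sketch in Python's standard library (content inlined from the example above):

```python
import csv
import io

# Content matching follows.csv above (header=false, delim='|').
data = "Adam|2020|Karissa\nAdam|2020|Zhang\nKarissa|2021|Zhang\nZhang|2022|Noura\n"

# With no header row, csv.reader yields plain lists of fields.
rows = list(csv.reader(io.StringIO(data), delimiter='|'))
```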
23 changes: 23 additions & 0 deletions src/content/docs/export/index.mdx
@@ -0,0 +1,23 @@
---
title: Overview
---

import { LinkCard } from '@astrojs/starlight/components';

The `COPY TO` command allows you to export query results directly to the specified file format. This
is useful when you want to persist the results of a query to be used in other systems, or for
archiving purposes.

## `COPY TO` CSV

<LinkCard
title="Export CSV"
href="/export/csv"
/>

## `COPY TO` Parquet

<LinkCard
title="Export Parquet"
href="/export/parquet"
/>
33 changes: 33 additions & 0 deletions src/content/docs/export/parquet.md
@@ -0,0 +1,33 @@
---
title: Export Parquet
---

The `COPY TO` clause can export query results to a Parquet file. It can be combined with a subquery
and used as shown below.

```cypher
COPY (MATCH (u:User) RETURN u.*) TO 'user.parquet';
```

The `LOAD FROM` clause can be used to scan the Parquet file and verify that the export worked:

```cypher
> LOAD FROM 'user.parquet' RETURN *;
-------------------
| u.name | u.age |
-------------------
| Adam | 30 |
-------------------
| Karissa | 40 |
-------------------
| Zhang | 50 |
-------------------
| Noura | 25 |
-------------------
```

:::caution[Notes]
- Exporting [fixed list](../../cypher/data-types/list) or [variant](../../cypher/data-types/variant) data types to Parquet is not yet supported.
- [UNION](../../cypher/data-types/union) is exported as a [STRUCT](../../cypher/data-types/struct), which is the internal representation of the `Union` data type.
- Currently, only Snappy compression is supported for exports.
:::
116 changes: 116 additions & 0 deletions src/content/docs/import/csv.md
@@ -0,0 +1,116 @@
---
title: Import data from CSV files
---

You can bulk import data to node and relationship tables from CSV files
using the `COPY FROM` command. It is **highly recommended** to use `COPY FROM` if you are creating large
databases.

The CSV import configuration can be set manually by specifying the parameters inside `( )` at the
end of the `COPY FROM` clause. The following table shows the supported configuration parameters:

| Parameter | Description | Default Value |
|:-----|:-----|:-----|
| `HEADER` | Whether the first line of the CSV file is the header. Can be true or false. | false |
| `DELIM` | Character that separates different columns in a line. | `,`|
| `QUOTE` | Character to start a string quote. | `"` |
| `ESCAPE` | Character within string quotes to escape QUOTE and other characters, e.g., a line break. <br/> See the note about line breaks below.| `\` |
| `LIST_BEGIN`/`LIST_END` | For the [list data type](../cypher/data-types/list), the characters that mark <br/> the beginning and end of a list | `[`, `]`|
| `PARALLEL` | Whether to read the CSV file in parallel. Can be true or false. | true |

The example below specifies that the CSV delimiter is `|` and that the file has a header row.

```cypher
COPY User FROM "user.csv" (HEADER=true, DELIM="|");
```

:::caution[Guidelines]
- **Start with empty tables:** `COPY FROM` commands can be used when your tables are completely empty. So you should use `COPY FROM` immediately after you define the schemas of your tables.
- **Copy nodes before relationships:** In order to copy a relationship table `R` from a csv file `RFile`, the nodes that appear in `RFile` need to
already exist in the database (either imported in bulk or inserted through Cypher data manipulation commands).
- **Wrap strings inside quotes:** Kùzu will accept strings in string columns both with and without quotes, though it's recommended to wrap strings in quotes to avoid any ambiguity with delimiters.
- **Avoid leading and trailing spaces**: As per the CSV standard, Kùzu does not ignore leading and trailing spaces (e.g., an input of ` 213 ` for
an integer value will be read as a malformed integer and the corresponding node/rel property will be set to `NULL`).
:::
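The last two guidelines can be enforced when generating the CSV in the first place. A hypothetical pre-processing sketch (the helper name and sample data are illustrative, not part of Kùzu):

```python
import csv
import io

def clean_row(row):
    # Strip leading/trailing spaces so numeric fields parse cleanly.
    return [field.strip() for field in row]

raw = " Adam , 30 \nKarissa,40\n"
cleaned = [clean_row(r) for r in csv.reader(io.StringIO(raw))]

out = io.StringIO()
# QUOTE_ALL wraps every field in quotes, avoiding delimiter ambiguity.
csv.writer(out, quoting=csv.QUOTE_ALL).writerows(cleaned)
```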

## Import to node table

Create a node table `User` as follows:

```cypher
CREATE NODE TABLE User(name STRING, age INT64, reg_date DATE, PRIMARY KEY (name))
```

The CSV file `user.csv` contains the following fields:
```csv
name,age,reg_date
Adam,30,2020-06-22
Karissa,40,2019-05-12
...
```

The following statement will load `user.csv` into the `User` table.

```cypher
COPY User FROM "user.csv" (header=true);
```

## Import to relationship table

When loading into a relationship table, Kùzu assumes the first two columns in the file are:

- `FROM` Node Column: The primary key of the `FROM` nodes.
- `TO` Node Column: The primary key of the `TO` nodes.

The rest of the columns correspond to relationship properties.

Create a relationship table `Follows` using the following Cypher query:

```cypher
CREATE REL TABLE Follows(FROM User TO User, since DATE)
```

This reads data from the below CSV file `follows.csv`:
```csv
Adam,Karissa,2010-01-30
Karissa,Michelle,2014-01-30
...
```

The following statement loads the `follows.csv` file into the `Follows` table.

```cypher
COPY Follows FROM "follows.csv";
```

Note that the CSV file has no header row, hence the `header` parameter is left unset (it defaults to `false`).
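When generating such a file programmatically, the main thing to preserve is the column order: FROM key, TO key, then properties. A sketch using Python's standard library (the edge data is hypothetical):

```python
import csv
import io

# First two columns are the FROM and TO primary keys (User.name here);
# any remaining columns are relationship properties (since).
edges = [
    ("Adam", "Karissa", "2010-01-30"),
    ("Karissa", "Michelle", "2014-01-30"),
]

buf = io.StringIO()
csv.writer(buf).writerows(edges)  # no header row, matching the example
```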

## Import multiple files to a single table

It is common practice to divide a large CSV file into several smaller files for cleaner data management.
Kùzu can read multiple files with the same structure, consolidating their data into a single node or relationship table.
You can specify that multiple files are loaded in the following ways:

### Glob pattern

This is similar to the Unix [glob](https://man7.org/linux/man-pages/man7/glob.7.html) pattern, where you specify
file paths that match a given pattern. The following wildcard characters are supported:

| Wildcard | Description |
| :-----------: | ----------- |
| `*` | match any number of any characters (including none) |
| `?` | match any single character |
| `[abc]` | match any one of the characters enclosed within the brackets |
| `[a-z]` | match any one of the characters within the range |

```cypher
COPY User FROM "User*.csv"
```
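These wildcards behave like the shell's. Python's `fnmatch` implements the same rules, which makes it handy for previewing which files a pattern would pick up (file names here are hypothetical):

```python
from fnmatch import fnmatch

files = ["User0.csv", "User1.csv", "Users_old.csv", "Item0.csv"]

# '*' matches any run of characters, so Users_old.csv also matches.
star = [f for f in files if fnmatch(f, "User*.csv")]

# '?' matches exactly one character, which excludes Users_old.csv.
single = [f for f in files if fnmatch(f, "User?.csv")]
```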

### List of files

Alternatively, you can just specify a list of files to be loaded.

```cypher
COPY User FROM ["User0.csv", "User1.csv", "User2.csv"]
```
46 changes: 46 additions & 0 deletions src/content/docs/import/index.mdx
@@ -0,0 +1,46 @@
---
title: Overview
---

import { LinkCard } from '@astrojs/starlight/components';

There are multiple ways to import data in Kùzu. The only prerequisite for inserting data
into a database is that you first create a graph schema, i.e., the structure of your node and relationship tables.

For small graphs (a few thousand nodes), the `CREATE` and `MERGE` [Cypher clauses](../cypher/data-manipulation-clauses)
can be used to insert nodes and
relationships. These are similar to SQL's `INSERT` statements, but bear in mind that they are slower than the bulk import
options shown below. The `CREATE`/`MERGE` clauses are intended for small additions or updates on a sporadic basis.

For larger graphs of millions of nodes and beyond, the recommended approach is to use `COPY FROM`
rather than creating or merging nodes one by one. For now, the `COPY FROM`
commands can only be used when tables are empty.

## `COPY FROM` CSV

The `COPY FROM` command is used to bulk import data from a CSV file into a node or relationship table.
See the linked card below for more information and examples.

<LinkCard
title="Import CSV"
href="/import/csv"
/>

## `COPY FROM` Parquet

Similar to CSV, the `COPY FROM` command is used to bulk import data from a Parquet file into a node or relationship table.
See the linked card below for more information and examples.

<LinkCard
title="Import Parquet"
href="/import/parquet"
/>

## `COPY FROM` NumPy

Importing from NumPy is a specific use case that allows you to import numeric data from a NumPy file into a node table.

<LinkCard
title="Import NumPy"
href="/import/npy"
/>
44 changes: 44 additions & 0 deletions src/content/docs/import/npy.md
@@ -0,0 +1,44 @@
---
title: Import NumPy
---

The `.npy` format is the standard binary file format in [NumPy](https://numpy.org/) for persisting a
single arbitrary NumPy array on disk.

The primary use case for bulk loading NumPy files is to load
large node features or vectors that are stored in `.npy` format. You can use the `COPY FROM` statement
to import a set of `*.npy` files into a node table.

:::caution[Notes]
This is an experimental feature that will evolve. Currently, it has the following constraints:
- **Import to node table only**: For now, Kùzu supports loading `.npy` files into **node tables** only.
- **Start with empty tables**: `COPY FROM` commands can be used when your tables are completely empty.
So you should use `COPY FROM` immediately after you define the schemas of your tables.
- **NPY file mapped to column**: Each `.npy` file will be loaded as a node table column. So, in the `COPY FROM` statement, the
number of `.npy` files must be equal to the number of columns defined in the DDL.
- **Numerical types only**: A `.npy` file can only contain numerical values.
:::

## Import to node table
Consider a `Paper` table with an `id` column, a `feat` column storing a 768-dimensional embedding (vector),
a `year` column, and a `label` column as ground truth. We first define the schema with the following statement:

```cypher
CREATE NODE TABLE Paper(id INT64, feat FLOAT[768], year INT64, label DOUBLE, PRIMARY KEY(id));
```

The raw data is stored in `.npy` format where each column is represented as a NumPy array on disk. The files are
specified below:

```
"node_id.npy", "node_feat_f32.npy", "node_year.npy", "node_label.npy"
```
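As a sketch, files of this shape could be produced with NumPy (a tiny row count and placeholder values for illustration; the temporary directory and the data are assumptions of this example, not part of Kùzu):

```python
import os
import tempfile

import numpy as np

num_papers, dim = 3, 768
tmp = tempfile.mkdtemp()

# One .npy file per column, matching the Paper DDL above.
np.save(os.path.join(tmp, "node_id.npy"), np.arange(num_papers, dtype=np.int64))
np.save(os.path.join(tmp, "node_feat_f32.npy"),
        np.zeros((num_papers, dim), dtype=np.float32))
np.save(os.path.join(tmp, "node_year.npy"),
        np.full(num_papers, 2020, dtype=np.int64))
np.save(os.path.join(tmp, "node_label.npy"),
        np.zeros(num_papers, dtype=np.float64))
```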

We can copy the files with the following statement:

```cypher
COPY Paper FROM ("node_id.npy", "node_feat_f32.npy", "node_year.npy", "node_label.npy") BY COLUMN;
```

As stated before, the number of `*.npy` files must equal the number of columns, and must also be
specified in the same order as they are defined in the DDL.
