Merge pull request #13 from mrpowers-io/update-readme
update falsa readme
MrPowers authored Sep 6, 2024
2 parents 3d6a06b + 7c48239 commit 553c517
Showing 2 changed files with 87 additions and 15 deletions.
102 changes: 87 additions & 15 deletions README.md
@@ -1,35 +1,107 @@
# Falsa
# falsa

Falsa is a tool for generating H2O db-like-benchmark.
This implementation is unofficial! For the official implementation, please check the [DuckDB fork of the original H2O project](https://github.com/duckdblabs/db-benchmark/tree/main/_data).
falsa makes it easy to generate sample datasets.

## Quick start
For example, here is how to generate a Parquet file with 100 million rows and 9 columns of data:

Falsa is built via maturin and pyo3. It works with Python 3.9+. For maturin installation, please follow [the official documentation](https://www.maturin.rs/installation).
```
falsa groupby --path-prefix=~/data --size MEDIUM
```

### Maturin build
![falsa example](https://github.com/mrpowers-io/falsa/blob/main/images/falsa_example.png)

Here are the first three rows of data in the file:

```
┌───────┬──────────┬──────────────┬─────┬─────┬────────┬─────┬─────┬───────────┐
│ id1 ┆ id2 ┆ id3 ┆ id4 ┆ id5 ┆ id6 ┆ v1 ┆ v2 ┆ v3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 │
╞═══════╪══════════╪══════════════╪═════╪═════╪════════╪═════╪═════╪═══════════╡
│ id038 ┆ id850817 ┆ id0000837021 ┆ 90 ┆ 8 ┆ 898164 ┆ 4 ┆ 15 ┆ 28.133477 │
│ id095 ┆ id73309 ┆ id0000312443 ┆ 3 ┆ 75 ┆ 177193 ┆ 1 ┆ 12 ┆ 91.555302 │
│ id055 ┆ id248099 ┆ id0000141631 ┆ 12 ┆ 94 ┆ 132406 ┆ 1 ┆ 3 ┆ 64.543029 │
└───────┴──────────┴──────────────┴─────┴─────┴────────┴─────┴─────┴───────────┘
```

With falsa, you can generate many sample datasets.

## Installation

### Pip install

In virtualenv with python 3.9+:

```sh
maturin develop --release
pip install git+https://github.com/mrpowers-io/falsa.git@main
falsa --help
```

### Pip install
### Maturin build

In virtualenv with python 3.9+:

```sh
pip install git+https://github.com/mrpowers-io/falsa.git@main
maturin develop --release
falsa --help
```

## Supported output formats
## h2o datasets

The h2o datasets are used to benchmark query engines on a single machine, [see here](https://duckdblabs.github.io/db-benchmark/).

Here are [the original R scripts](https://github.com/duckdblabs/db-benchmark/tree/main/_data) to generate the sample datasets. These still work if you know how to run R (the large dataset generation can error out if your machine doesn't have sufficient memory).

At the moment the following output formats are supported:
falsa is useful if you want to generate these datasets through a Python interface or if you are facing memory issues with the R scripts.

- CSV
- Parquet
- Delta*
### h2o groupby dataset

The h2o groupby dataset has 9 columns and 10 million/100 million/1 billion rows of data.

Here are three representative rows of data:

```
┌───────┬──────────┬──────────────┬─────┬─────┬────────┬─────┬─────┬───────────┐
│ id1 ┆ id2 ┆ id3 ┆ id4 ┆ id5 ┆ id6 ┆ v1 ┆ v2 ┆ v3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 │
╞═══════╪══════════╪══════════════╪═════╪═════╪════════╪═════╪═════╪═══════════╡
│ id038 ┆ id850817 ┆ id0000837021 ┆ 90 ┆ 8 ┆ 898164 ┆ 4 ┆ 15 ┆ 28.133477 │
│ id095 ┆ id73309 ┆ id0000312443 ┆ 3 ┆ 75 ┆ 177193 ┆ 1 ┆ 12 ┆ 91.555302 │
│ id055 ┆ id248099 ┆ id0000141631 ┆ 12 ┆ 94 ┆ 132406 ┆ 1 ┆ 3 ┆ 64.543029 │
└───────┴──────────┴──────────────┴─────┴─────┴────────┴─────┴─────┴───────────┘
```

Here's a short description of the columns:

* id1: 100 distinct values between id001 and id100
* id2: 100 distinct values between id001 and id100
* id3: 1_000_000 distinct values
* id4: random integer values between zero and 100
* id5: random integer values between zero and 100
* id6: random integer values between 1 and 1_000_000
* v1: integer values between 1 and 5
* v2: integer values between 1 and 15
* v3: floating values between zero and 100
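
To make the column descriptions concrete, here is a minimal Python sketch that builds a few rows of the same shape using only the standard library. It is not how falsa generates data: the function name `make_groupby_rows`, the seed, and the zero-padding widths (copied from the sample rows) are assumptions for illustration. id4 and id5 are drawn as integers starting at 1, which matches the `i64` dtype in the sample table but is otherwise a guess.

```python
import random


def make_groupby_rows(n_rows, seed=0):
    """Return n_rows dicts shaped like the h2o groupby dataset (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same rows
    rows = []
    for _ in range(n_rows):
        rows.append({
            "id1": f"id{rng.randint(1, 100):03d}",         # low-cardinality string key
            "id2": f"id{rng.randint(1, 100):03d}",         # follows the list above as written
            "id3": f"id{rng.randint(1, 1_000_000):010d}",  # high-cardinality string key
            "id4": rng.randint(1, 100),                    # assumption: integers from 1 to 100
            "id5": rng.randint(1, 100),                    # assumption: integers from 1 to 100
            "id6": rng.randint(1, 1_000_000),
            "v1": rng.randint(1, 5),
            "v2": rng.randint(1, 15),
            "v3": rng.uniform(0, 100),
        })
    return rows


rows = make_groupby_rows(1000)
print(rows[0])
```

A row-by-row Python loop like this is only practical for small samples; it is meant to show the shape of the data, not to replace the generator.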

Here's the detailed description of the table:

```
┌────────────┬───────────┬───────────┬──────────────┬───────────┬───┬───────────────┬──────────┬───────────┬───────────┐
│ statistic ┆ id1 ┆ id2 ┆ id3 ┆ id4 ┆ … ┆ id6 ┆ v1 ┆ v2 ┆ v3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪═══════════╪═══════════╪══════════════╪═══════════╪═══╪═══════════════╪══════════╪═══════════╪═══════════╡
│ count ┆ 100000000 ┆ 100000000 ┆ 100000000 ┆ 1e8 ┆ … ┆ 1e8 ┆ 1e8 ┆ 1e8 ┆ 1e8 │
│ null_count ┆ 0 ┆ 0 ┆ 0 ┆ 0.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ null ┆ null ┆ null ┆ 50.500471 ┆ … ┆ 499977.133559 ┆ 3.000173 ┆ 8.0002679 ┆ 50.000731 │
│ std ┆ null ┆ null ┆ null ┆ 28.864911 ┆ … ┆ 288668.423121 ┆ 1.414225 ┆ 4.320694 ┆ 28.868118 │
│ min ┆ id001 ┆ id001 ┆ id0000000001 ┆ 1.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 1.0 ┆ 0.000002 │
│ 25% ┆ null ┆ null ┆ null ┆ 26.0 ┆ … ┆ 249956.0 ┆ 2.0 ┆ 4.0 ┆ 24.999205 │
│ 50% ┆ null ┆ null ┆ null ┆ 51.0 ┆ … ┆ 499949.0 ┆ 3.0 ┆ 8.0 ┆ 50.002307 │
│ 75% ┆ null ┆ null ┆ null ┆ 75.0 ┆ … ┆ 749987.0 ┆ 4.0 ┆ 12.0 ┆ 75.002693 │
│ max ┆ id100 ┆ id999999 ┆ id0001000000 ┆ 100.0 ┆ … ┆ 1e6 ┆ 5.0 ┆ 15.0 ┆ 100.0 │
└────────────┴───────────┴───────────┴──────────────┴───────────┴───┴───────────────┴──────────┴───────────┴───────────┘
```

_*There is a problem with Delta at the moment: writing to Delta requires materializing all `pyarrow` batches first, which may be slow and is prone to OOM-like errors. We are working on it now and will provide a patched version soon._
The h2o dataset is useful for group by benchmarks. For example, you can use id1 to do an aggregation on a low cardinality column and id3 to do an aggregation on a high cardinality column.
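
As a rough illustration of that difference, the sketch below sums `v1` grouped by `id1` and by `id3` over a small synthetic sample, using only the Python standard library. The helper `sum_v1_by` and the in-memory sample are hypothetical stand-ins; a real benchmark would run the same aggregation in a query engine against the generated files.

```python
import random
from collections import defaultdict


def sum_v1_by(rows, key):
    """Group rows by `key` and sum the v1 column within each group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["v1"]
    return dict(totals)


# Small synthetic sample with only the columns this example needs.
rng = random.Random(42)
rows = [
    {
        "id1": f"id{rng.randint(1, 100):03d}",         # at most 100 distinct keys
        "id3": f"id{rng.randint(1, 1_000_000):010d}",  # up to 1_000_000 distinct keys
        "v1": rng.randint(1, 5),
    }
    for _ in range(10_000)
]

low = sum_v1_by(rows, "id1")   # few groups, many rows per group
high = sum_v1_by(rows, "id3")  # many groups, few rows per group
print(len(low), len(high))
```

Both aggregations scan the same rows, but the high cardinality one has to track far more groups, which is the behavior these benchmarks are designed to stress.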
Binary file added images/falsa_example.png