This repository has been archived by the owner on Aug 13, 2024. It is now read-only.

Merge branch 'release/0.3.0'
alexanderdean committed Jul 29, 2015
2 parents 1a9280b + bd4efd8 commit c1b4d09
Showing 13 changed files with 749 additions and 202 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG
@@ -1,3 +1,16 @@
0.3.0 (2015-07-29)
------------------
Swapped all occurrences of "igluutils" with "schemaddl" to reflect renaming (#97)
Fixed ordering for JSONPaths file (#96)
Updated README to reflect new 0.3.0 (#93)
Optional self-desc JSON with --raw (#92)
Now correctly handling dir of JSONs (#91)
No longer checking for .ndjson extension when --ndjson set (#74)
Changed default SchemaVer to 1-0-0 (#80)
Unified CLI options (#90)
Added `ddl` command which generates JSON Paths files and Redshift DDL (#84)
Moved existing functionality into `derive` command (#83)

0.2.0 (2015-07-01)
------------------
Updated vagrant push to also build and publish webui artifact (#72)
123 changes: 98 additions & 25 deletions README.md
@@ -2,63 +2,115 @@

[ ![Build Status] [travis-image] ] [travis] [ ![Release] [release-image] ] [releases] [ ![License] [license-image] ] [license]

Schema Guru is a tool (CLI and web) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances.
Schema Guru is a tool (CLI and web) that lets you derive **[JSON Schemas] [json-schema]** from a set of JSON instances, then process and transform them into different data definition formats.

Current primary features include:

- derivation of JSON Schema from a set of JSON instances (``schema`` command)
- generation of **[Redshift] [redshift]** table DDL and JSONPaths file (``ddl`` command)

Unlike other tools for deriving JSON Schemas, Schema Guru allows you to derive schema from an unlimited set of instances (making schemas much more precise), and supports many more JSON Schema validation properties.

Schema Guru is used heavily in association with Snowplow's own **[Snowplow] [snowplow]** and **[Iglu] [iglu]** projects.
Schema Guru is used heavily in association with Snowplow's own **[Snowplow] [snowplow]**, **[Iglu] [iglu]** and **[Schema DDL] [schema-ddl]** projects.

## User Quickstart

### CLI

Download the latest Schema Guru from Bintray:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.2.0.zip
$ unzip schema_guru_0.2.0.zip
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.3.0.zip
$ unzip schema_guru_0.3.0.zip
```

Assuming you have a recent JVM installed:
The examples below assume you have a recent JVM installed.

### CLI

#### Schema derivation

As input you can use either a single JSON file or a directory of JSON instances (directories are processed recursively).

The following command prints the JSON Schema to stdout:

```bash
$ ./schema-guru-0.2.0 --dir {{jsons_directory}}
$ ./schema-guru-0.3.0 schema {{input}}
```

You can also specify an output file for your schema:

```bash
$ ./schema-guru-0.2.0 --dir {{jsons_directory}} --output {{json_schema_file}}
$ ./schema-guru-0.3.0 schema --output {{json_schema_file}} {{input}}
```

Or you can analyze a single JSON instance:
You can also switch Schema Guru into **[NDJSON] [ndjson]** mode, where it will look for newline delimited JSONs:

```bash
$ ./schema-guru-0.2.0 --file {{json_instance}}
$ ./schema-guru-0.3.0 schema --ndjson {{input}}
```

You can also switch Schema Guru into ndjson mode, where it will look for newline delimited JSONs.
You can specify the enum cardinality tolerance for your fields. It means that *all* fields which are found to have less than the specified cardinality will be specified in the JSON Schema using the `enum` property.

```bash
$ ./schema-guru-0.3.0 schema --enum 5 {{input}}
```
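
To make the cardinality rule concrete, here is a minimal Python sketch of the idea. It is not Schema Guru's implementation: the function name `enum_candidates` is hypothetical, only top-level scalar fields are inspected, and the strict "less than" comparison is taken from the wording above.

```python
from collections import defaultdict


def enum_candidates(instances, tolerance):
    """Return {field: distinct values} for fields with fewer than
    `tolerance` distinct values across all instances (hypothetical helper)."""
    seen = defaultdict(set)
    for instance in instances:
        for field, value in instance.items():
            # This sketch only tracks scalar values of top-level fields.
            if value is None or isinstance(value, (str, int, float, bool)):
                seen[field].add(value)
    return {
        field: sorted(values, key=repr)
        for field, values in seen.items()
        if len(values) < tolerance
    }


events = [
    {"status": "open", "id": 1},
    {"status": "closed", "id": 2},
    {"status": "open", "id": 3},
]

# "status" has 2 distinct values (below 3); "id" has 3, so it is left out.
print(enum_candidates(events, tolerance=3))  # {'status': ['closed', 'open']}
```

A larger tolerance turns more fields into enums, so it is worth choosing a value that is well below the number of instances you feed in.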

#### DDL derivation

As with schema derivation, the input for DDL derivation can be either a single JSON Schema file or a directory containing JSON Schemas.

Currently we support DDL only for **[Amazon Redshift] [redshift]**, but in future releases you'll be able to specify another database with the ``--db`` option.

The following command saves Redshift DDL (the default ``--db`` value) to the current directory:

```bash
$ ./schema-guru-0.3.0 ddl {{input}}
```

In this case all your files need to have `.ndjson` extension (as the **[specifications][ndjson-spec]** says); all `.json` files will be skipped.
You can also specify an output directory:

```bash
$ ./schema-guru-0.2.0 --ndjson --dir {{ndjsons_directory}}
$ ./schema-guru-0.3.0 ddl --output {{ddl_dir}} {{input}}
```

You can specify the enum cardinality tolerance for for your fields. It means that *all* fields which are found to have less than the specified cardinality will be specified in the JSON Schema using the `enum` property.
If you're not a Snowplow Platform user, don't use **[Self-describing Schema] [self-describing]**, or simply don't want anything Snowplow-specific, you can produce a raw schema:

```bash
$ ./schema-guru-0.2.0 --enum 5 --dir {{jsons_directory}}
$ ./schema-guru-0.3.0 ddl --raw {{input}}
```

You may also want a JSONPaths file for Redshift's **[COPY] [redshift-copy]** command. The following places a ``jsonpaths`` directory alongside ``sql``:

```bash
$ ./schema-guru-0.3.0 ddl --with-json-paths {{input}}
```

The most awkward part of moving from a dynamically typed world to a statically typed one is product types (or union types), which look like this in JSON Schema: ``["integer", "string"]``.
How should they be represented in SQL DDL? It's a tough question, and we think there's no ideal solution.
So we provide two options. By default, product types are transformed into the most general type, ``VARCHAR(4096)``.
Alternatively, you can split a column with a product type into separate columns, each with its type as a postfix. For example, a property ``model`` with type ``["string", "integer"]`` will be transformed into two columns, ``model_string`` and ``model_integer``.
This behaviour is enabled with ``--split-product-types``.
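
The two strategies can be sketched in a few lines of Python. This is an illustration only, not the tool's code: `columns_for` is a hypothetical helper, and the per-type SQL mapping (`VARCHAR(255)`, `BIGINT`) is an assumption chosen for the example. Only the ``VARCHAR(4096)`` fallback and the ``_<type>`` postfix come from the description above.

```python
def columns_for(name, json_types, split=False, general="VARCHAR(4096)"):
    """Return (column_name, sql_type) pairs for one JSON Schema property."""
    if isinstance(json_types, str):
        json_types = [json_types]
    # Assumed mapping, for illustration only.
    sql_types = {"string": "VARCHAR(255)", "integer": "BIGINT"}
    if len(json_types) == 1:
        return [(name, sql_types.get(json_types[0], general))]
    if not split:
        # Default behaviour: one column using the most general type.
        return [(name, general)]
    # Split behaviour: one column per type, with the type as a postfix.
    return [(f"{name}_{t}", sql_types.get(t, general)) for t in json_types]


print(columns_for("model", ["string", "integer"]))
# [('model', 'VARCHAR(4096)')]
print(columns_for("model", ["string", "integer"], split=True))
# [('model_string', 'VARCHAR(255)'), ('model_integer', 'BIGINT')]
```

Splitting keeps each value in a natively typed column at the cost of extra, mostly empty columns; the default keeps the table narrow but pushes type handling into your queries.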

Another thing to consider is the default VARCHAR size. If the schema gives no clues about it (such as ``maxLength``), 255 is used.
You can also override this default:

```bash
$ ./schema-guru-0.3.0 ddl --size 32 {{input}}
```

You can also specify the Redshift schema for your table. In non-raw mode, ``atomic`` is used as the default.

```bash
$ ./schema-guru-0.3.0 ddl --raw --schema business {{input}}
```

### Web UI

You can access our hosted demo of the Schema Guru web UI at [schemaguru.snplowanalytics.com] [webui-hosted]. To run it locally:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.2.0.zip
$ unzip schema_guru_webui_0.2.0.zip
$ ./schema-guru-webui-0.2.0
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.3.0.zip
$ unzip schema_guru_webui_0.3.0.zip
$ ./schema-guru-webui-0.3.0
```

The above will run a Spray web server containing Schema Guru on [0.0.0.0:8000] [webui-local]. Interface and port can be specified by `--interface` and `--port` respectively.
@@ -88,6 +140,8 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk

### Functionality

#### Schema derivation

* Takes a directory as an argument and will print out the resulting JsonSchema:
- Processes each JSON sequentially
- Merges all results into one master Json Schema
@@ -104,20 +158,35 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
* Allows producing JSON Schemas with different names based on a given JSON Path
* Supports **[Newline Delimited JSON] [ndjson]**

#### DDL derivation

* Correctly transforms some string formats
- uuid becomes ``CHAR(36)``
- ipv4 becomes ``VARCHAR(14)``
- ipv6 becomes ``VARCHAR(39)``
- date-time becomes ``TIMESTAMP``
* Handles properties with only enums
* Property with ``maxLength(n)`` and ``minLength(n)`` becomes ``CHAR(n)``
* Can output JSONPaths file
* Can split product types
* Number with ``multipleOf`` 0.01 becomes ``DECIMAL``
* Handles Self-describing JSON and can produce raw DDL
* Recognizes integer size by ``minimum`` and ``maximum`` values
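
The rules listed above can be read as a small decision procedure. The Python sketch below is one way to express them; it is not the Schema DDL implementation. `redshift_type` is a hypothetical name, the format sizes are copied from the list as written, the 255 default comes from the ``--size`` description earlier, and the integer thresholds (standard signed 16-bit and 32-bit ranges) are an assumption.

```python
def redshift_type(prop, default_size=255):
    """Map one JSON Schema property (a dict) to a Redshift column type."""
    json_type = prop.get("type")
    if json_type == "string":
        by_format = {
            "uuid": "CHAR(36)",
            "ipv4": "VARCHAR(14)",  # size as listed in the README
            "ipv6": "VARCHAR(39)",
            "date-time": "TIMESTAMP",
        }
        if prop.get("format") in by_format:
            return by_format[prop["format"]]
        max_len, min_len = prop.get("maxLength"), prop.get("minLength")
        if max_len is not None and max_len == min_len:
            return f"CHAR({max_len})"  # fixed-length string
        size = max_len if max_len is not None else default_size
        return f"VARCHAR({size})"
    if json_type == "number" and prop.get("multipleOf") == 0.01:
        return "DECIMAL"
    if json_type == "integer":
        low, high = prop.get("minimum"), prop.get("maximum")
        if low is not None and high is not None:
            # Assumed thresholds: signed 16-bit and 32-bit ranges.
            if -2**15 <= low and high < 2**15:
                return "SMALLINT"
            if -2**31 <= low and high < 2**31:
                return "INT"
        return "BIGINT"
    # Fallback for anything this sketch does not cover (assumption).
    return f"VARCHAR({default_size})"


print(redshift_type({"type": "string", "format": "uuid"}))
print(redshift_type({"type": "string", "maxLength": 2, "minLength": 2}))
print(redshift_type({"type": "integer", "minimum": 0, "maximum": 100}))
```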


### Assumptions

* All JSONs in the directory are assumed to be of the same event type and will be merged together
* All JSONs are assumed to start with either `{ ... }` or `[ ... ]`
- If they do not they are discarded
* Schema should be as strict as possible - e.g. no `additionalProperties` are allowed currently
* When using Schema Guru to derive schema from newline delimited JSONs they need to have .ndjson extension

### Self-describing JSON
Schema Guru allows you to produce **[Self-describing JSON Schema] [self-describing]**.
To produce it you need to specify vendor, name (if segmentation isn't using, see below), and version (optional, default value is 0-1-0).
The ``schema`` command allows you to produce **[Self-describing JSON Schema] [self-describing]**.
To produce it you need to specify vendor, name (if segmentation isn't used, see below), and version (optional, default value is 1-0-0).

```bash
$ ./schema-guru-0.2.0 --dir {{jsons_directory}} --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}}
$ ./schema-guru-0.3.0 schema --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}} {{input}}
```
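
As a rough picture of what these options add, the Python sketch below wraps a derived schema with a ``self`` block. The block layout (``vendor``, ``name``, ``format``, ``version``) follows the self-describing JSON Schema format described in the linked blog post as we understand it; the helper name `make_self_describing` and the sample vendor are hypothetical, and the exact output of Schema Guru may differ.

```python
import json


def make_self_describing(schema, vendor, name, version="1-0-0"):
    """Return a copy of `schema` with a `self` description block on top."""
    described = {
        "self": {
            "vendor": vendor,
            "name": name,
            "format": "jsonschema",
            "version": version,  # SchemaVer: MODEL-REVISION-ADDITION
        }
    }
    described.update(schema)
    return described


derived = {"type": "object", "properties": {"id": {"type": "integer"}}}
print(json.dumps(make_self_describing(derived, "com.acme", "ad_click"), indent=2))
```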

### Schema Segmentation
@@ -150,7 +219,7 @@ and

You can run it as follows:
```bash
$ ./schema-guru-0.2.0 --dir {{mixed_jsons_directory}} --output-dir {{output_dir}} --schema-by $.event
$ ./schema-guru-0.3.0 schema --output {{output_dir}} --schema-by $.event {{mixed_jsons_directory}}
```

It will put two (or maybe more) JSON Schemas into the output dir: Purchased_an_Item.json and Posted_a_comment.json.
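
Conceptually, segmentation groups instances by the value found at the given path before deriving one schema per group. The Python sketch below illustrates only that grouping step, under stated assumptions: `segment_by_key` is a hypothetical helper, it looks up a plain top-level key instead of evaluating a full JSON Path, and the rule of replacing spaces with underscores is inferred from the file names above.

```python
from collections import defaultdict


def segment_by_key(instances, key):
    """Group instances by the string value of a top-level key.

    Returns {output_file_name: [instances]}; instances without a
    string value under `key` are skipped in this sketch.
    """
    groups = defaultdict(list)
    for instance in instances:
        value = instance.get(key)
        if isinstance(value, str):
            groups[value.replace(" ", "_") + ".json"].append(instance)
    return dict(groups)


mixed = [
    {"event": "Purchased an Item", "sku": "A1"},
    {"event": "Posted a comment", "text": "Nice"},
    {"event": "Purchased an Item", "sku": "B2"},
]

for file_name, members in sorted(segment_by_key(mixed, "event").items()):
    print(file_name, len(members))
```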
@@ -253,7 +322,7 @@ limitations under the License.
[license-image]: http://img.shields.io/badge/license-Apache--2-blue.svg?style=flat
[license]: http://www.apache.org/licenses/LICENSE-2.0

[release-image]: http://img.shields.io/badge/release-0.2.0-blue.svg?style=flat
[release-image]: http://img.shields.io/badge/release-0.3.0-blue.svg?style=flat
[releases]: https://github.com/snowplow/schema-guru/releases

[json-schema]: http://json-schema.org/
@@ -266,8 +335,12 @@ limitations under the License.

[snowplow]: https://github.com/snowplow/snowplow
[iglu]: https://github.com/snowplow/iglu
[schema-ddl]: https://github.com/snowplow/schema-ddl
[self-describing]: http://snowplowanalytics.com/blog/2014/05/15/introducing-self-describing-jsons/

[redshift]: http://aws.amazon.com/redshift/
[redshift-copy]: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

[vagrant-install]: http://docs.vagrantup.com/v2/installation/index.html
[virtualbox-install]: https://www.virtualbox.org/wiki/Downloads

2 changes: 1 addition & 1 deletion project/BuildSettings.scala
@@ -20,7 +20,7 @@ object BuildSettings {
// Common settings for all our projects
lazy val commonSettings = Seq[Setting[_]](
organization := "com.snowplowanalytics",
version := "0.2.0",
version := "0.3.0",
scalaVersion := "2.10.5",
crossScalaVersions := Seq("2.10.5", "2.11.6"),
scalacOptions := Seq("-deprecation", "-encoding", "utf8",
2 changes: 2 additions & 0 deletions project/Dependencies.scala
@@ -39,6 +39,7 @@ object Dependencies {
val specs2 = "2.3.13"
val scalazSpecs2 = "0.2"
val scalaCheck = "1.12.2"
val schemaddl = "0.1.0"
}

object Libraries {
@@ -55,6 +56,7 @@ object Dependencies {
val json4sJackson = "org.json4s" %% "json4s-jackson" % V.json4s
val json4sScalaz = "org.json4s" %% "json4s-scalaz" % V.json4s
val jsonpath = "io.gatling" %% "jsonpath" % V.jsonpath
val schemaddl = "com.snowplowanalytics" %% "schema-ddl" % V.schemaddl
// Spray
val akka = "com.typesafe.akka" %% "akka-actor" % V.akka
val sprayCan = "io.spray" %% "spray-can" % V.spray
1 change: 1 addition & 0 deletions project/SchemaGuruBuild.scala
@@ -45,6 +45,7 @@ object SchemaGuruBuild extends Build {
Libraries.json4sJackson,
Libraries.json4sScalaz,
Libraries.jsonpath,
Libraries.schemaddl,
// Scala (test only)
Libraries.specs2,
Libraries.scalazSpecs2,
10 changes: 10 additions & 0 deletions src/main/scala/com.snowplowanalytics/package.scala
@@ -30,4 +30,14 @@ package object schemaguru {
* Type Alias for a Valid list of JSONs
*/
type ValidJsonList = List[Validation[String, JValue]]

/**
* Class holding JSON with file name
*/
case class JsonFile(fileName: String, content: JValue)

/**
* Type Alias for a Valid list of JSON files
*/
type ValidJsonFileList = List[Validation[String, JsonFile]]
}