Commit ca05d7f: Prepared for release

chuwy authored and alexanderdean committed Apr 7, 2016
1 parent 59be358 commit ca05d7f

Showing 4 changed files with 36 additions and 25 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG
@@ -1,3 +1,12 @@
+Version 0.6.0 (2016-04-07)
+--------------------------
+Added force flag (#141)
+Fixed column ordering for Redshift tables across ADDITIONs (#135)
+Added SQL migrations between schema ADDITIONs (#134)
+Replaced argot with scopt (#124)
+Reimplemented --schema-by with proper Jackson converting (#125)
+Removed AWS-related tools from up.playbooks (#146)
+
Version 0.5.0 (2016-02-11)
--------------------------
Bumped schema-ddl to 0.3.0 (#130)
46 changes: 24 additions & 22 deletions README.md
@@ -18,8 +18,8 @@ Schema Guru is used heavily in association with Snowplow's own **[Snowplow] [sno
Download the latest Schema Guru from Bintray:

```bash
-$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.5.0.zip
-$ unzip schema_guru_0.5.0.zip
+$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.6.0.zip
+$ unzip schema_guru_0.6.0.zip
```

This assumes you have a recent JVM installed.
@@ -33,31 +33,31 @@ You can use as input either a single JSON file or a directory with JSON instances (i
The following command will print the JSON Schema to stdout:

```bash
-$ ./schema-guru-0.5.0 schema {{input}}
+$ ./schema-guru-0.6.0 schema {{input}}
```

You can also specify an output file for your schema:

```bash
-$ ./schema-guru-0.5.0 schema --output {{json_schema_file}} {{input}}
+$ ./schema-guru-0.6.0 schema --output {{json_schema_file}} {{input}}
```

You can also switch Schema Guru into **[NDJSON] [ndjson]** mode, where it will look for newline-delimited JSONs:

```bash
-$ ./schema-guru-0.5.0 schema --ndjson {{input}}
+$ ./schema-guru-0.6.0 schema --ndjson {{input}}
```

You can specify an enum cardinality tolerance for your fields: *all* fields found to have fewer distinct values than the specified cardinality will be described in the JSON Schema using the `enum` property.

```bash
-$ ./schema-guru-0.5.0 schema --enum 5 {{input}}
+$ ./schema-guru-0.6.0 schema --enum 5 {{input}}
```
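
With `--enum 5`, for example, a field observed with only a handful of distinct values would be derived with an `enum` along these lines (a sketch with made-up values, not actual tool output):

```json
{
  "type": "string",
  "enum": ["pending", "shipped", "delivered"]
}
```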

If you know that a particular set of values can appear, but don't want to set a large enum cardinality, you can specify a predefined enum set with the ``--enum-sets`` multioption, like this:

```bash
-$ ./schema-guru-0.5.0 schema --enum-sets iso_4217 --enum-sets iso_3166-1_aplha-3 /path/to/instances
+$ ./schema-guru-0.6.0 schema --enum-sets iso_4217 --enum-sets iso_3166-1_aplha-3 /path/to/instances
```

Schema Guru currently includes the following built-in enum sets (written as they should appear in the CLI):
@@ -76,15 +76,15 @@ If you need to include a very specific enum set, you can define it yourself in
And pass the path to this file instead of an enum name:

```bash
-$ ./schema-guru-0.5.0 schema --enum-sets all --enum-sets /path/to/browsers.json /path/to/instances
+$ ./schema-guru-0.6.0 schema --enum-sets all --enum-sets /path/to/browsers.json /path/to/instances
```
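
The enum set file presumably just lists the allowed values; assuming a plain JSON array, a hypothetical `browsers.json` might look like:

```json
["Chrome", "Firefox", "Safari", "Opera"]
```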

Schema Guru will derive `minLength` and `maxLength` properties for strings, based on the shortest and longest strings observed.
This may be a problem if you process only a small number of instances.
To avoid an overly strict Schema, you can use the `--no-length` option.

```bash
-$ ./schema-guru-0.5.0 schema --no-length /path/to/few-instances
+$ ./schema-guru-0.6.0 schema --no-length /path/to/few-instances
```
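
To see why: if the shortest value observed across a few instances happened to be `"GB"` and the longest `"USA"` (hypothetical data), the derived string schema would be pinned to those lengths, roughly:

```json
{
  "type": "string",
  "minLength": 2,
  "maxLength": 3
}
```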

#### DDL derivation
@@ -96,7 +96,7 @@ Currently we support DDL only for **[Amazon Redshift] [redshift]**, but in futur
The following command will simply save Redshift DDL (Redshift being the default ``--db`` value) to the current directory.

```bash
-$ ./schema-guru-0.5.0 ddl {{input}}
+$ ./schema-guru-0.6.0 ddl {{input}}
```

If you specify as input a directory with several Self-describing JSON Schemas belonging to a single REVISION, Schema Guru will also generate migrations.
@@ -119,13 +119,13 @@ so you can safely alter your tables while they belong to a single REVISION.
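
For illustration only, a migration between two ADDITIONs might boil down to simple `ALTER TABLE` statements like the following (table and column names are hypothetical):

```sql
-- Hypothetical migration from 1-0-0 to 1-0-1
ALTER TABLE atomic.com_acme_click_event_1
    ADD COLUMN "page_title" VARCHAR(4096);
```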
You can also specify a directory for output:

```bash
-$ ./schema-guru-0.5.0 ddl --output {{ddl_dir}} {{input}}
+$ ./schema-guru-0.6.0 ddl --output {{ddl_dir}} {{input}}
```

If you're not a Snowplow Platform user, don't use **[Self-describing Schema] [self-describing]**, or just don't want anything specific to it, you can produce a raw schema:

```bash
-$ ./schema-guru-0.5.0 ddl --raw {{input}}
+$ ./schema-guru-0.6.0 ddl --raw {{input}}
```

But bear in mind that Self-describing Schemas bring many benefits.
@@ -134,7 +134,7 @@ For example, raw Schemas will not preserve the order of your columns (it just im
You may also want to get a JSONPaths file for Redshift's **[COPY] [redshift-copy]** command. It will place a ``jsonpaths`` dir alongside ``sql``:

```bash
-$ ./schema-guru-0.5.0 ddl --with-json-paths {{input}}
+$ ./schema-guru-0.6.0 ddl --with-json-paths {{input}}
```
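
A Redshift JSONPaths file maps each table column, in order, to a path in the source JSON. A minimal sketch with made-up field names:

```json
{
  "jsonpaths": [
    "$.data.user_id",
    "$.data.page_title"
  ]
}
```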

The most awkward part of shifting from the dynamically typed world to the statically typed one is product types (or union types), such as this in JSON Schema: ``["integer", "string"]``.
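
For example, a field observed both as an integer and as a string ends up with a product type like the sketch below; in Redshift DDL such a column can only be stored as a type wide enough for both representations, typically a VARCHAR (our summary of the trade-off, not exact tool output):

```json
{
  "type": ["integer", "string"]
}
```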
@@ -147,13 +147,13 @@ Another thing everyone needs to consider is the default VARCHAR size. If there's no c
You can also specify this default value:

```bash
-$ ./schema-guru-0.5.0 ddl --varchar-size 32 {{input}}
+$ ./schema-guru-0.6.0 ddl --varchar-size 32 {{input}}
```

You can also specify a Redshift schema for your table. In non-raw mode, ``atomic`` is used as the default.

```bash
-$ ./schema-guru-0.5.0 ddl --raw --schema business {{input}}
+$ ./schema-guru-0.6.0 ddl --raw --schema business {{input}}
```
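
With the options above, the generated DDL would target the `business` schema rather than `atomic`; a hypothetical fragment:

```sql
-- Hypothetical output fragment
CREATE SCHEMA IF NOT EXISTS business;

CREATE TABLE business.my_table (
    "id"   VARCHAR(32) NOT NULL,
    "name" VARCHAR(4096)
);
```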

Some users do not fully rely on Schema Guru's JSON Schema derivation or DDL generation, and edit their DDLs manually.
@@ -171,9 +171,9 @@ $ ./schema-guru-0.6.0 ddl --force {{input}}
You can access our hosted demo of the Schema Guru web UI at [schemaguru.snplowanalytics.com] [webui-hosted]. To run it locally:

```bash
-$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.4.0.zip
-$ unzip schema_guru_webui_0.4.0.zip
-$ ./schema-guru-webui-0.4.0
+$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.6.0.zip
+$ unzip schema_guru_webui_0.6.0.zip
+$ ./schema-guru-webui-0.6.0
```

The above will run a Spray web server containing Schema Guru on [0.0.0.0:8000] [webui-local]. The interface and port can be specified with `--interface` and `--port` respectively.
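
For example, to bind the UI to a different interface and port (values here are arbitrary):

```bash
$ ./schema-guru-webui-0.6.0 --interface 127.0.0.1 --port 8080
```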
@@ -198,7 +198,9 @@ $ cd sparkjob
$ inv run_emr my-profile my-bucket/input/ my-bucket/output/ my-bucket/errors/ my-bucket/logs my-ec2-keypair
```

-If you need some specific options for Spark job, you can specify these in `tasks.py`. The Spark job accepts the same options as the CLI application, but note that `--output` isn't optional and we have a new optional `--errors-path`.
+If you need some specific options for the Spark job, you can specify these in `tasks.py`.
+The Spark job accepts the same options as the CLI application, but note that `--output` isn't optional and we have a new optional `--errors-path`.
+Also, instead of specifying particular predefined enum sets, you can just enable them with the `--enum-sets` flag, which has the same behaviour as `--enum-sets all`.

## Developer Quickstart

@@ -279,7 +281,7 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
To produce it, you need to specify a vendor, a name (unless segmentation is used; see below), and a version (optional; the default value is 1-0-0).

```bash
-$ ./schema-guru-0.5.0 schema --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}} {{input}}
+$ ./schema-guru-0.6.0 schema --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}} {{input}}
```
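
The derived schema is then wrapped in the Self-describing format, whose `self` section carries these coordinates; a sketch with placeholder values:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "self": {
    "vendor": "com.acme",
    "name": "click_event",
    "format": "jsonschema",
    "version": "1-0-0"
  }
}
```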

### Schema Segmentation
@@ -312,7 +314,7 @@ and

You can run it as follows:
```bash
-$ ./schema-guru-0.5.0 schema --output {{output_dir}} --schema-by $.event {{mixed_jsons_directory}}
+$ ./schema-guru-0.6.0 schema --output {{output_dir}} --schema-by $.event {{mixed_jsons_directory}}
```

It will put two (or maybe more) JSON Schemas into the output dir: Purchased_an_Item.json and Posted_a_comment.json.
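
For instance, given mixed instances like these (hypothetical fields), `--schema-by $.event` routes each one to a schema named after its `event` value:

```json
{"event": "Purchased_an_Item", "sku": "0001", "quantity": 2}
{"event": "Posted_a_comment", "body": "Looks great!"}
```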
@@ -415,7 +417,7 @@ limitations under the License.
[license-image]: http://img.shields.io/badge/license-Apache--2-blue.svg?style=flat
[license]: http://www.apache.org/licenses/LICENSE-2.0

-[release-image]: http://img.shields.io/badge/release-0.5.0-blue.svg?style=flat
+[release-image]: http://img.shields.io/badge/release-0.6.0-blue.svg?style=flat
[releases]: https://github.com/snowplow/schema-guru/releases

[json-schema]: http://json-schema.org/
4 changes: 2 additions & 2 deletions project/BuildSettings.scala
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2014 Snowplow Analytics Ltd. All rights reserved.
+ * Copyright (c) 2016 Snowplow Analytics Ltd. All rights reserved.
*
* This program is licensed to you under the Apache License Version 2.0,
* and you may not use this file except in compliance with the
@@ -20,7 +20,7 @@ object BuildSettings {
// Common settings for all our projects
lazy val commonSettings = Seq[Setting[_]](
organization := "com.snowplowanalytics",
-version := "0.6.0-M1",
+version := "0.6.0",
scalaVersion := "2.10.6",
crossScalaVersions := Seq("2.10.6", "2.11.7"),
scalacOptions := Seq("-deprecation", "-encoding", "utf8",
2 changes: 1 addition & 1 deletion sparkjob/tasks.py
@@ -20,7 +20,7 @@
from boto.emr.bootstrap_action import BootstrapAction

DIR_WITH_JAR = "./target/scala-2.10/"
-JAR_FILE = "schema-guru-sparkjob-0.6.0-M1"
+JAR_FILE = "schema-guru-sparkjob-0.6.0-rc1"

S3_REGIONS = { 'us-east-1': Location.DEFAULT,
'us-west-1': Location.USWest,