Generate Scala case class definitions from Avro schemas


avrohugger

Join the chat at https://gitter.im/julianpeeters/avrohugger

Schema-to-case-class code generation for working with Avro in Scala.

  • avrohugger-core: Generate source code at runtime for evaluation at a later step.
  • avrohugger-filesorter: Sort schema files for proper compilation order.
  • avrohugger-tools: Generate source code at the command line with the avrohugger-tools jar.

Alternative Distributions:

  • sbt: sbt-avrohugger - Generate source code at compile time with an sbt plugin.
  • maven: avrohugger-maven-plugin - Generate source code at compile time with a Maven plugin.
  • on the web: avro2caseclass - Generate source code from a web app.

Generates Scala case classes in various formats:
  • Standard: Vanilla case classes (for use with Apache Avro's GenericRecord API, etc.).

  • SpecificRecord: Case classes that implement SpecificRecordBase and therefore have mutable var fields (for use with the Avro Specific API: Scalding, Spark, Avro, etc.).

  • Scavro: Case classes with immutable fields, intended to wrap Java-generated Avro classes (for use with the Scavro runtime; the Java classes are provided separately, see the Scavro Plugin or sbt-avro).

Supports generating case classes with arbitrary fields of the following datatypes:
Avro     | Standard                                                            | SpecificRecord                          | Scavro                                                              | Notes
INT      | Int                                                                 | Int                                     | Int                                                                 |
LONG     | Long                                                                | Long                                    | Long                                                                |
FLOAT    | Float                                                               | Float                                   | Float                                                               |
DOUBLE   | Double                                                              | Double                                  | Double                                                              |
STRING   | String                                                              | String                                  | String                                                              |
BOOLEAN  | Boolean                                                             | Boolean                                 | Boolean                                                             |
NULL     | Null                                                                | Null                                    | Null                                                                |
MAP      | Map                                                                 | Map                                     | Map                                                                 |
ENUM     | scala.Enumeration, Scala case object, Java Enum, EnumAsScalaString  | Java Enum, EnumAsScalaString            | scala.Enumeration, Scala case object, Java Enum, EnumAsScalaString  | See Customizable Type Mapping
BYTES    | Array[Byte]                                                         | Array[Byte]                             | Array[Byte]                                                         |
FIXED    | //TODO                                                              | //TODO                                  | //TODO                                                              |
ARRAY    | List, Array, Vector                                                 | List, Array, Vector                     | Array, List, Vector                                                 | See Customizable Type Mapping
UNION    | Option, Either, Shapeless Coproduct                                 | Option                                  | Option                                                              | See Customizable Type Mapping
RECORD   | case class                                                          | case class extending SpecificRecordBase | case class extending AvroSerializeable                              |
PROTOCOL | N/A, Scala ADT                                                      | RPC trait, Scala ADT                    | N/A, Scala ADT                                                      | See Customizable Type Mapping

Protocol Support:
  • The records defined in .avdl, .avpr, and JSON protocol strings can be generated as ADTs if the protocols define more than one Scala definition (note: message definitions are ignored when this setting is used). See Customizable Type Mapping.

  • For SpecificRecord, if the protocol contains messages then an RPC trait is generated (instead of generating an ADT, or ignoring the message definitions).

Doc Support:
  • .avdl: Comments that begin with /** are used as the documentation string for the type or field definition that follows the comment.

  • .avsc, .avpr, and .avro: Docs in Avro schemas are used to define a case class's ScalaDoc.

  • .scala: ScalaDocs of case class definitions are used to define record and field docs.

Note: Currently Treehugger appears to generate Javadoc style docs (thus compatible with ScalaDoc style).

Usage

  • For Scala 2.10, 2.11, and 2.12
  • Generates Code Compatible with Scala 2.10, 2.11, and 2.12

avrohugger-core

Get the dependency with:
"com.julianpeeters" %% "avrohugger-core" % "1.0.0-RC3"
Description:

Instantiate a Generator with Standard, Scavro, or SpecificRecord source formats. Then use

tToFile(input: T, outputDir: String): Unit

or

tToStrings(input: T): List[String]

where T can be File, Schema, or String.

Example
import avrohugger.Generator
import avrohugger.format.SpecificRecord
import java.io.File

// path to an .avro, .avsc, .avpr, or .avdl file
val schemaFile = new File("path/to/schema")
val generator = new Generator(SpecificRecord)
generator.fileToFile(schemaFile, "optional/path/to/output") // default output path = "target/generated-sources"

where an input File can be .avro, .avsc, .avpr, or .avdl,

and where an input String can be the string representation of an Avro schema, protocol, IDL, or a set of case classes that you'd like to have implement SpecificRecordBase.
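
The String variant can be tried without any files on disk. The following is a minimal sketch, assuming the Standard format and a stringToStrings method that follows the tToStrings naming pattern described above; the User schema and the object name StringToStringsSketch are made up for illustration.

```scala
import avrohugger.Generator
import avrohugger.format.Standard

object StringToStringsSketch {
  // Hypothetical Avro schema, passed in as a plain JSON string.
  val schemaStr: String =
    """{"type":"record","name":"User","namespace":"example",
      |"fields":[{"name":"name","type":"string"},{"name":"age","type":"int"}]}""".stripMargin

  val generator = new Generator(Standard)

  // One String of Scala source per generated compilation unit.
  val sources: List[String] = generator.stringToStrings(schemaStr)

  def main(args: Array[String]): Unit = sources.foreach(println)
}
```

Returning the sources as strings is handy when the generated code should be inspected or compiled in a later step rather than written straight to an output directory.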

Customizable Type Mapping:

To reassign Scala types to Avro types, use the following:

import avrohugger.Generator
import avrohugger.format.SpecificRecord
import avrohugger.types.ScalaVector

// start from the format's default types and override only the array mapping
val myScalaTypes = Some(SpecificRecord.defaultTypes.copy(array = ScalaVector))
val generator = new Generator(SpecificRecord, avroScalaCustomTypes = myScalaTypes)
  • array can be assigned to ScalaArray, ScalaList, and ScalaVector
  • enum can be assigned to JavaEnum, ScalaCaseObjectEnum, EnumAsScalaString, and ScalaEnumeration
  • union can be assigned to OptionEitherShapelessCoproduct and OptionShapelessCoproduct
  • int, long, float, double can be assigned to ScalaInt, ScalaLong, ScalaFloat, ScalaDouble
  • protocol can be assigned to ScalaADT and NoTypeGenerated
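
Several reassignments can be combined in a single copy call. The sketch below assumes that the Standard format exposes defaultTypes in the same way SpecificRecord does, and that the type objects listed above live in avrohugger.types alongside ScalaVector; the Hand schema and the object name are hypothetical.

```scala
import avrohugger.Generator
import avrohugger.format.Standard
import avrohugger.types.{ScalaCaseObjectEnum, ScalaVector}

object CustomTypesSketch {
  // Hypothetical schema with one enum field and one array field.
  val schemaStr: String =
    """{"type":"record","name":"Hand","namespace":"example","fields":[
      |{"name":"suit","type":{"type":"enum","name":"Suit","symbols":["SPADES","HEARTS"]}},
      |{"name":"cards","type":{"type":"array","items":"string"}}]}""".stripMargin

  // Assumption: Standard.defaultTypes mirrors SpecificRecord.defaultTypes.
  // Override only the enum and array mappings; everything else keeps its default.
  val myScalaTypes = Some(
    Standard.defaultTypes.copy(enum = ScalaCaseObjectEnum, array = ScalaVector)
  )

  val generator = new Generator(Standard, avroScalaCustomTypes = myScalaTypes)
  val sources: List[String] = generator.stringToStrings(schemaStr)
}
```

The Standard format is used here because, as noted under Warnings, SpecificRecord requires enums to be represented as JavaEnum.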
Customizable Namespace Mapping:

Namespaces can be reassigned by instantiating a Generator with a custom namespace map (please see warnings below):

val generator = new Generator(SpecificRecord, avroScalaCustomNamespace = Map("oldnamespace"->"newnamespace"))

Scavro: by default, a "model" package is appended to the namespace to create a Scala namespace that does not conflict with Scavro's generated Java. To override, either customize each package namespace separately (preempting the use of the default package name), or override the package name like so:

val generator = new Generator(Scavro, avroScalaCustomNamespace = Map("SCAVRO_DEFAULT_PACKAGE$"->"scavro"))
Generate Classes Instead of Case Classes

Generate simple classes instead of case classes when fields.size > 22. This is useful when generating code for Scala 2.10, where case classes are limited to 22 fields, from large schemas.

val generator = new Generator(SpecificRecord, restrictedFieldNumber = true)

avrohugger-filesorter

Get the dependency with:
"com.julianpeeters" %% "avrohugger-filesorter" % "1.0.0-RC3"
Description:

To ensure dependent schemas are compiled in the proper order (thus avoiding org.apache.avro.SchemaParseException: Undefined name: "com.example.MyRecord" parser errors), sort avsc and avdl files with the sortSchemaFiles method on AvscFileSorter and AvdlFileSorter respectively.

Example:
import avrohugger.filesorter.AvscFileSorter
import java.io.File

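// srcDir is an sbt File; `**` is sbt's PathFinder glob and `.get` yields the matched files
val sorted: List[File] = AvscFileSorter.sortSchemaFiles((srcDir ** "*.avsc").get).toList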
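
To illustrate why ordering matters, here is a toy dependency sort over hypothetical schema names. It is a sketch of the general idea only, not AvscFileSorter's implementation: a schema is emitted once every schema it references has already been emitted, so the parser never meets an undefined name.

```scala
import scala.annotation.tailrec

object SchemaOrderSketch {
  // Hypothetical dependency map: schema name -> names of the schemas it references.
  val deps: Map[String, Set[String]] = Map(
    "com.example.Order"    -> Set("com.example.Customer", "com.example.Item"),
    "com.example.Customer" -> Set("com.example.Address"),
    "com.example.Item"     -> Set.empty[String],
    "com.example.Address"  -> Set.empty[String]
  )

  // Repeatedly emit every schema whose dependencies have all been emitted already.
  def sortByDependencies(graph: Map[String, Set[String]]): List[String] = {
    @tailrec
    def loop(remaining: Map[String, Set[String]], done: List[String]): List[String] =
      if (remaining.isEmpty) done
      else {
        val ready = remaining.collect {
          case (name, ds) if ds.forall(d => done.contains(d)) => name
        }.toList.sorted
        // Nothing ready means a cycle or a reference to an unknown schema.
        require(ready.nonEmpty, "cyclic or unresolved schema reference")
        loop(remaining -- ready, done ++ ready)
      }
    loop(graph, Nil)
  }
}
```

With the map above, Address and Item come first, then Customer, then Order, which is the order in which a parser could resolve every reference.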

avrohugger-tools

Download the avrohugger-tools jar for Scala 2.10, Scala 2.11, or Scala 2.12 (>30MB!) and use it like the avro-tools jar.

Usage: [-string] (schema|protocol|datafile) input... outputdir

  • generate generates Scala case class definitions:

java -jar /path/to/avrohugger-tools_2.12-1.0.0-RC3-assembly.jar generate schema user.avsc .

  • generate-specific generates definitions that extend Avro's SpecificRecordBase:

java -jar /path/to/avrohugger-tools_2.12-1.0.0-RC3-assembly.jar generate-specific schema user.avsc .

  • generate-scavro generates definitions that extend Scavro's AvroSerializable:

java -jar /path/to/avrohugger-tools_2.12-1.0.0-RC3-assembly.jar generate-scavro schema user.avsc .

Warnings

  1. If your framework is one that relies on reflection to get the Schema, it will fail since Scala fields are private. Therefore preempt it by passing in a Schema to DatumReaders and DatumWriters (as in the Avro example above).

  2. For the SpecificRecord format, generated case class fields must be mutable (var) in order to be compatible with the SpecificRecord API. Note: If your framework allows GenericRecord, avro4s provides a type class that converts to and from immutable case classes cleanly.

  3. When the input is a case class definition String, import statements are not supported; please use fully qualified type names if using records/classes from multiple namespaces.

  4. By default, a schema's namespace is used as a package name. In the case of the Scavro output format, the default is the namespace with model appended.

  5. While Scavro format uses custom namespaces in a way that leaves it unaffected, most formats fail on schemas with records within unions (see the Avro forum thread: http://apache-avro.679487.n3.nabble.com/Deserialize-with-different-schema-td4032782.html).

  6. SpecificRecord requires that enum be represented as JavaEnum.
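
As a sketch of the first warning, the following round trip hands the Schema to the DatumWriter and DatumReader explicitly instead of letting Avro discover it by reflection. It uses Avro's generic API with a hypothetical User schema so that it stays self-contained; for a generated SpecificRecord class the same idea would presumably apply with SpecificDatumWriter and SpecificDatumReader and the class's own schema, which is an assumption and is not shown here.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object ExplicitSchemaSketch {
  // Hypothetical schema; in practice this would be the schema of the generated class.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"User","namespace":"example",
      |"fields":[{"name":"name","type":"string"}]}""".stripMargin)

  def roundTrip(name: String): String = {
    val record = new GenericData.Record(schema)
    record.put("name", name)

    // The writer is constructed with the schema, so no reflection is needed.
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()

    // Likewise, the reader is constructed with the schema.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
    decoded.get("name").toString
  }
}
```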

Best Practices

  1. Avoid recursive schemas since they can cause compatibility issues if trying to flow data into a system that doesn't support them (e.g., Hive).

  2. Use namespaces to ensure compatibility when importing into Java/Scala.

  3. Use default field values in case of future schema evolution (further reading).
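
The third practice can be sketched with Avro's schema resolution: data written with an older schema is read with a newer one, and the added field is filled from its default. The two User schemas and the object name below are hypothetical; the calls are from Avro's generic API.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object DefaultValueSketch {
  // A fresh parser per schema: a single parser refuses to define the same name twice.
  private def parse(json: String): Schema = new Schema.Parser().parse(json)

  // Version 1 of the hypothetical schema.
  val v1: Schema = parse(
    """{"type":"record","name":"User","namespace":"example",
      |"fields":[{"name":"name","type":"string"}]}""".stripMargin)

  // Version 2 adds a field and gives it a default value.
  val v2: Schema = parse(
    """{"type":"record","name":"User","namespace":"example",
      |"fields":[{"name":"name","type":"string"},
      |{"name":"age","type":"int","default":0}]}""".stripMargin)

  def readOldWithNew(name: String): (String, Int) = {
    // Write a record with the old schema.
    val old = new GenericData.Record(v1)
    old.put("name", name)
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](v1).write(old, encoder)
    encoder.flush()

    // Read it back with v1 as the writer schema and v2 as the reader schema.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val rec = new GenericDatumReader[GenericRecord](v1, v2).read(null, decoder)
    (rec.get("name").toString, rec.get("age").asInstanceOf[Int])
  }
}
```

Without the default on age, resolving v1 data against v2 would fail, which is why defaults make later schema changes much less disruptive.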

Testing

To test for regressions, please run sbt:avrohugger> + test.

To test that generated code can be de/serialized as expected, please run:

  1. sbt:avrohugger> + publishLocal
  2. then clone sbt-avrohugger and update its avrohugger dependency to the locally published version
  3. finally run sbt:sbt-avrohugger> scripted avrohugger/*, or, e.g., scripted avrohugger/GenericSerializationTests

Credits

Depends on Avro and Treehugger. avrohugger-tools is based on avro-tools.

Criticism is appreciated.
Fork away, just make sure the tests pass before sending a pull request.
