Release RumbleDB 1.21.0 "Hawthorn blossom" beta · RumbleDB/rumble

NEW! The jar for Spark 3.5 was added and is available for download.

Use RumbleDB to query data with JSONiq, even data that does not fit in DataFrames.

Try-it-out sandbox: https://colab.research.google.com/github/RumbleDB/rumble/blob/master/RumbleSandbox.ipynb

Instructions to get started: https://rumble.readthedocs.io/en/latest/Getting%20started/

Spark 3.0 and 3.1 are no longer supported as of RumbleDB 1.21, because the Spark team no longer officially supports them. Spark 3.4 is newly supported.

RumbleDB comes in 4 jars that you can pick from depending on your needs:

rumbledb-1.21.0-standalone.jar already contains Spark and runs out of the box on Java 8 or 11 with java -jar rumbledb-1.21.0-standalone.jar
rumbledb-1.21.0-for-spark-3.X.jar (3.2, 3.3, 3.4) is smaller, does not contain Spark, and runs in a matching, existing Spark environment, either locally (you need to download and install Spark first) or on a cluster (for example EMR, with just a few clicks), with spark-submit rumbledb-1.21.0-for-spark-3.X.jar
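
As a rough sketch of the two launch modes described above, the commands below write a tiny JSONiq query to a file and run it with whichever jar is present. The file name query.jq is made up for illustration, and the "run" subcommand is an assumption not stated in these notes; the getting-started instructions linked above have the exact invocation.

```shell
# Write a small JSONiq query to a file (query.jq is a hypothetical name).
cat > query.jq <<'EOF'
for $i in 1 to 5
return { "n": $i, "square": $i * $i }
EOF

# Standalone jar: Spark is bundled, so plain Java (8 or 11) is enough.
# The "run" subcommand is an assumption; check the getting-started page.
if [ -f rumbledb-1.21.0-standalone.jar ]; then
  java -jar rumbledb-1.21.0-standalone.jar run query.jq
fi

# Smaller jar: needs an existing Spark 3.4 installation that provides spark-submit.
if [ -f rumbledb-1.21.0-for-spark-3.4.jar ] && command -v spark-submit >/dev/null 2>&1; then
  spark-submit rumbledb-1.21.0-for-spark-3.4.jar run query.jq
fi
```

Both branches are guarded so the snippet does nothing harmful when a jar or spark-submit is missing; drop the guards once the environment is set up.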

Improvements

Range expressions with more than a million items are now parallelized automatically, with no need to call parallelize() any more.
Some simple map expressions on homogeneous input are now faster (native SQL behind the scenes).
General comparisons on equality are now considerably faster
reverse() is now more efficient and faster on homogeneous sequences
Fixed a bug in equijoins involving homogeneous sequences
Added two functions, jn:cosh and jn:sinh
General comparisons are now automatically optimized to value comparisons when it is detected that the sequences have at most one item (can be deactivated with --optimize-general-comparison-to-value-comparison no)
Better static type detection
It is now possible to force a sequential execution (without Spark) with --parallel-execution no. This also works with queries containing calls to parallelize() (which will have no effect), json-doc(), and json-file() (which will simply stream-read from the disk). Other I/O functions (such as csv-file()) will still involve Spark for reading, but their results are immediately materialized for the rest of the execution.
It is now possible to deactivate Native Spark SQL execution (forcing a fallback to the use of UDFs by RumbleDB) with --native-execution no.
The annotate expression (with syntax similar to the validate expression) allows annotating an item directly, without checking its validity.
More static types are detected
Non-recursive functions are now automatically inlined for faster execution. This can be deactivated with --function-inlining no (reverting to the behavior of previous versions).
TypeSwitch expressions now support DataFrame execution
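
The switches mentioned in the list above could be tried one at a time along the following lines. This is only a sketch: the query file name big-range.jq and the run_query helper are hypothetical, and the "run" subcommand is an assumption not stated in these notes. The three flags are the ones named above.

```shell
# A query over a large range; per the notes above, ranges with more than a
# million items are parallelized automatically (big-range.jq is a made-up name).
cat > big-range.jq <<'EOF'
count(for $i in 1 to 2000000 where $i mod 2 eq 0 return $i)
EOF

JAR=rumbledb-1.21.0-standalone.jar

# Hypothetical helper: runs the query with any extra flags passed through.
run_query() {
  if [ -f "$JAR" ]; then
    # "run" is an assumed subcommand; check the getting-started page.
    java -jar "$JAR" run big-range.jq "$@"
  else
    echo "skipping (no $JAR here): run big-range.jq $*"
  fi
}

run_query                            # defaults: parallel, native SQL, inlining
run_query --parallel-execution no    # force sequential execution without Spark
run_query --native-execution no      # fall back to UDFs instead of native Spark SQL
run_query --function-inlining no     # keep non-recursive functions as calls
```

Comparing the runs on the same query is a simple way to see what each optimization contributes.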

Bugfixes

Fixed bug when reading longs from DataFrames
Fixed an issue with projection pushdowns in join queries
Fixed a few bugs with queries that navigate JSON in for clauses; they are compiled to native SQL whenever possible, but some chains were throwing errors (e.g., an array unboxing followed by object lookup)
Fixed a bug in which calling count() on a grouping variable did not return 1 when native SQL execution is activated
hexBinary and base64Binary values can now be used in order by clauses with parallel execution
