Parquet

Version 1.14.1

Release Notes - Parquet - Version 1.14.1

Bug

  • PARQUET-2468 - ParquetMetadata.toPrettyJSON throws exception on file read when LOG.isDebugEnabled()
  • PARQUET-2498 - Hadoop vector IO API doesn't handle empty list of ranges

Version 1.14.0

Release Notes - Parquet - Version 1.14.0

Bug

  • PARQUET-2260 - Bloom filter bytes size shouldn't be larger than maxBytes size in the configuration
  • PARQUET-2266 - Fix support for files without ColumnIndexes
  • PARQUET-2276 - ParquetReader reads do not work with Hadoop version 2.8.5
  • PARQUET-2300 - Update jackson-core 2.13.4 to a version without CVE PRISMA-2023-0067
  • PARQUET-2325 - Fix parquet-cli's dictionary subcommand to work with FIXED_LEN_BYTE_ARRAY
  • PARQUET-2329 - Fix wrong help messages of parquet-cli subcommands
  • PARQUET-2330 - Fix convert-csv to show the correct position of the invalid record
  • PARQUET-2332 - Fix unexpectedly disabled tests to be executed
  • PARQUET-2336 - Add caching key to CodecFactory
  • PARQUET-2342 - Parquet writer produced a corrupted file due to page value count overflow
  • PARQUET-2343 - Fixes NPE when rewriting file with multiple rowgroups
  • PARQUET-2348 - Recompression/Re-encrypt should rewrite bloomfilter
  • PARQUET-2354 - Apparent race condition in CharsetValidator
  • PARQUET-2363 - ParquetRewriter should encrypt the V2 page header
  • PARQUET-2365 - Fixes NPE when rewriting column without column index
  • PARQUET-2408 - Fix license header in .gitattributes
  • PARQUET-2420 - ThriftParquetWriter converts thrift byte to int32 without adding logical type
  • PARQUET-2429 - Direct buffer churn in NonBlockedDecompressor
  • PARQUET-2438 - Fixes minMaxSize for BinaryColumnIndexBuilder
  • PARQUET-2442 - Remove Parquet Site from parquet-mr
  • PARQUET-2448 - parquet-avro does not support nested logical-type for avro <= 1.8
  • PARQUET-2449 - Writing using LocalOutputFile creates a large buffer
  • PARQUET-2450 - ParquetAvroReader throws exception projecting a single field of a repeated record type
  • PARQUET-2456 - avro schema conversion may fail with name conflict when using fixed types
  • PARQUET-2457 - Missing maven-scala-plugin version
  • PARQUET-2458 - Java compiler should use release instead of source/target
  • PARQUET-2465 - Fall back to Hadoop Configuration

New Feature

Improvement

Test

  • PARQUET-2361 - Reduce failure rate of unit test testParquetFileWithBloomFilterWithFpp

Task

Version 1.13.1

Release Notes - Parquet - Version 1.13.1

Improvement

Version 1.13.0

Release Notes - Parquet - Version 1.13.0

New Feature

  • PARQUET-1020 - Add support for Dynamic Messages in parquet-protobuf

Task

  • PARQUET-2230 - Add a new rewrite command powered by ParquetRewriter
  • PARQUET-2228 - ParquetRewriter supports more than one input file
  • PARQUET-2229 - ParquetRewriter supports masking and encrypting the same column
  • PARQUET-2227 - Refactor different file rewriters to use single implementation
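
The ParquetRewriter tasks above consolidate column pruning, masking, transcompression, and encryption into a single rewrite path. Below is a minimal sketch of driving the rewriter directly; the Builder signature and the prune/transform/processBlocks method names are assumptions based on the 1.13.0 API and may differ in detail, and the file and column names are illustrative.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class RewriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Rewrite one input file into one output file (assumed Builder signature).
    RewriteOptions options = new RewriteOptions.Builder(
            conf, new Path("input.parquet"), new Path("output.parquet"))
        .prune(Arrays.asList("debug_payload"))   // drop an illustrative column (assumed method name)
        .transform(CompressionCodecName.ZSTD)    // recompress the remaining columns (assumed method name)
        .build();

    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks();  // copy/translate all row groups (assumed method name)
    rewriter.close();
  }
}
```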

Improvement

Bug

  • PARQUET-2202 - Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte
  • PARQUET-2164 - CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written
  • PARQUET-2103 - Fix crypto exception in print toPrettyJSON
  • PARQUET-2251 - Avoid generating Bloomfilter when all pages of a column are encoded by dictionary
  • PARQUET-2243 - Support zstd-jni in DirectCodecFactory
  • PARQUET-2247 - Fail-fast if CapacityByteArrayOutputStream write overflow
  • PARQUET-2241 - Fix ByteStreamSplitValuesReader with nulls
  • PARQUET-2244 - Fix notIn for columns with null values
  • PARQUET-2173 - Fix parquet build against hadoop 3.3.3+
  • PARQUET-2219 - ParquetFileReader skips empty row group
  • PARQUET-2198 - Updating jackson data bind version to fix CVEs
  • PARQUET-2177 - Fix parquet-cli not to fail showing descriptions
  • PARQUET-1711 - Support recursive proto schemas by limiting recursion depth
  • PARQUET-2142 - parquet-cli without hadoop throws java.lang.NoSuchMethodError on any parquet file access command
  • PARQUET-2160 - Close decompression stream to free off-heap memory in time
  • PARQUET-2185 - ParquetReader constructed using builder fails to read encrypted files
  • PARQUET-2167 - CLI show footer command fails if Parquet file contains date fields
  • PARQUET-2134 - Incorrect type checking in HadoopStreams.wrap
  • PARQUET-2161 - Fix row index generation in combination with range filtering
  • PARQUET-2154 - ParquetFileReader should close its input stream when filterRowGroups throw Exception in constructor

Test

Version 1.12.3

Release Notes - Parquet - Version 1.12.3

New Feature

  • PARQUET-2117 - Add rowPosition API in parquet record readers
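
For context on the rowPosition API (PARQUET-2117): it exposes the file-wide index of the record just read, which downstream engines use for row-level indexing. A minimal sketch, assuming the getCurrentRowIndex() accessor on ParquetReader that this change introduces; the file name is illustrative.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowPositionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(HadoopInputFile.fromPath(new Path("data.parquet"), conf))
        .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        // getCurrentRowIndex() is assumed to return the file-wide row position
        // of the record just returned by read().
        long rowIndex = reader.getCurrentRowIndex();
        System.out.println(rowIndex + ": " + record);
      }
    }
  }
}
```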

Task

  • PARQUET-2081 - Encryption translation tool - Parquet-hadoop

Improvement

Bug

Version 1.12.2

Release Notes - Parquet - Version 1.12.2

Bug

Version 1.12.1

Release Notes - Parquet - Version 1.12.1

Bug

  • PARQUET-1633 - Fix integer overflow
  • PARQUET-2022 - ZstdDecompressorStream should close zstdInputStream
  • PARQUET-2027 - Fix calculating directory offset for merge
  • PARQUET-2052 - Integer overflow when writing huge binary using dictionary encoding
  • PARQUET-2054 - fix TCP leaking when calling ParquetFileWriter.appendFile
  • PARQUET-2072 - Do Not Determine Both Min/Max for Binary Stats
  • PARQUET-2073 - Fix estimate remaining row count in ColumnWriteStoreBase.
  • PARQUET-2078 - Failed to read parquet file after writing with the same parquet version

Improvement

Version 1.12.0

Release Notes - Parquet - Version 1.12.0

Sub-task

Bug

  • PARQUET-1438 - [C++] corrupted files produced on 32-bit architecture (i686)
  • PARQUET-1493 - maven protobuf plugin not work properly
  • PARQUET-1455 - [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf
  • PARQUET-1554 - Compilation error when upgrading Scrooge version
  • PARQUET-1599 - Fix to-avro to respect the overwrite option
  • PARQUET-1684 - [parquet-protobuf] default protobuf field values are stored as nulls
  • PARQUET-1699 - Could not resolve org.apache.yetus:audience-annotations:0.11.0
  • PARQUET-1741 - APIs backward compatibility issues cause master branch build failure
  • PARQUET-1765 - Invalid filteredRowCount in InternalParquetRecordReader
  • PARQUET-1794 - Random data generation may cause flaky tests
  • PARQUET-1803 - Could not find FilleInputSplit in ParquetInputSplit
  • PARQUET-1808 - SimpleGroup.toString() uses String += and so has poor performance
  • PARQUET-1818 - Fix collision of encryption and bloom filters in format-structure Util
  • PARQUET-1850 - toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit
  • PARQUET-1851 - ParquetMetadataConveter throws NPE in an Iceberg unit test
  • PARQUET-1868 - Parquet reader options toggle for bloom filter toggles dictionary filtering
  • PARQUET-1879 - Apache Arrow cannot read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
  • PARQUET-1893 - H2SeekableInputStream readFully() doesn't respect start and len
  • PARQUET-1894 - Please fix the related Shaded Jackson Databind CVEs
  • PARQUET-1896 - [Maven] parquet-tools build is broken
  • PARQUET-1910 - Parquet-cli is broken after TransCompressionCommand was added
  • PARQUET-1917 - [parquet-proto] default values are stored in oneOf fields that aren't set
  • PARQUET-1920 - Fix issue with reading parquet files with too large column chunks
  • PARQUET-1923 - parquet-tools 1.11.0: TestSimpleRecordConverter fails with ExceptionInInitializerError on openjdk 15
  • PARQUET-1928 - Interpret Parquet INT96 type as FIXED[12] AVRO Schema
  • PARQUET-1944 - Unable to download transitive dependency hadoop-lzo
  • PARQUET-1947 - DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data
  • PARQUET-1949 - Mark PARQUET-1872 as not supporting bloom filters yet
  • PARQUET-1954 - TCP connection leak in parquet dump
  • PARQUET-1963 - DeprecatedParquetInputFormat in CombineFileInputFormat throw NPE when the first sub-split is empty
  • PARQUET-1966 - Fix build with JDK11 for JDK8
  • PARQUET-1970 - Make minor releases source compatible
  • PARQUET-1971 - Flaky test in github action
  • PARQUET-1975 - Test failure on ARM64 CPU architecture
  • PARQUET-1977 - Invalid data_page_offset
  • PARQUET-1979 - Optional bloom_filter_offset is filled if no bloom filter is present
  • PARQUET-1984 - Some tests fail on windows
  • PARQUET-1992 - Cannot build from tarball because of git submodules
  • PARQUET-1999 - NPE might occur if OutputFile is implemented by the client

New Feature

Improvement

Test

  • PARQUET-1832 - Travis fails with too long output
  • PARQUET-1980 - Build and test Apache Parquet on ARM64 CPU architecture

Wish

  • PARQUET-1717 - parquet-thrift converts Thrift i16 to parquet INT32 instead of INT_16

Task

Version 1.11.0

Release Notes - Parquet - Version 1.11.0

Bug

  • PARQUET-138 - Parquet should allow a merge between required and optional schemas
  • PARQUET-952 - Avro union with single type fails with 'is not a group'
  • PARQUET-1128 - [Java] Upgrade the Apache Arrow version to 0.8.0 for SchemaConverter
  • PARQUET-1281 - Jackson dependency
  • PARQUET-1285 - [Java] SchemaConverter should not convert from TimeUnit.SECOND AND TimeUnit.NANOSECOND of Arrow
  • PARQUET-1293 - Build failure when using Java 8 lambda expressions
  • PARQUET-1296 - Travis kills build after 10 minutes, because "no output was received"
  • PARQUET-1297 - [Java] SchemaConverter should not convert from Timestamp(TimeUnit.SECOND) and Timestamp(TimeUnit.NANOSECOND) of Arrow
  • PARQUET-1303 - Avro reflect @Stringable field write error if field not instanceof CharSequence
  • PARQUET-1304 - Release 1.10 contains breaking changes for Hive
  • PARQUET-1305 - Backward incompatible change introduced in 1.8
  • PARQUET-1309 - Parquet Java uses incorrect stats and dictionary filter properties
  • PARQUET-1311 - Update README.md
  • PARQUET-1317 - ParquetMetadataConverter throw NPE
  • PARQUET-1341 - Null count is suppressed when columns have no min or max and use unsigned sort order
  • PARQUET-1344 - Type builders don't honor new logical types
  • PARQUET-1368 - ParquetFileReader should close its input stream for the failure in constructor
  • PARQUET-1371 - Time/Timestamp UTC normalization parameter doesn't work
  • PARQUET-1407 - Data loss on duplicate values with AvroParquetWriter/Reader
  • PARQUET-1417 - BINARY_AS_SIGNED_INTEGER_COMPARATOR fails with IOBE for the same arrays with the different length
  • PARQUET-1421 - InternalParquetRecordWriter logs debug messages at the INFO level
  • PARQUET-1440 - Parquet-tools: Decimal values stored in an int32 or int64 in the parquet file aren't displayed with their proper scale
  • PARQUET-1441 - SchemaParseException: Can't redefine: list in AvroIndexedRecordConverter
  • PARQUET-1456 - Use page index, ParquetFileReader throw ArrayIndexOutOfBoundsException
  • PARQUET-1460 - Fix javadoc errors and include javadoc checking in Travis checks
  • PARQUET-1461 - Third party code does not compile after parquet-mr minor version update
  • PARQUET-1470 - Inputstream leakage in ParquetFileWriter.appendFile
  • PARQUET-1472 - Dictionary filter fails on FIXED_LEN_BYTE_ARRAY
  • PARQUET-1475 - DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor
  • PARQUET-1478 - Can't read spec compliant, 3-level lists via parquet-proto
  • PARQUET-1480 - INT96 to avro not yet implemented error should mention deprecation
  • PARQUET-1485 - Snappy Decompressor/Compressor may cause direct memory leak
  • PARQUET-1488 - UserDefinedPredicate throw NPE
  • PARQUET-1496 - [Java] Update Scala for JDK 11 compatibility
  • PARQUET-1497 - [Java] javax annotations dependency missing for Java 11
  • PARQUET-1498 - [Java] Add instructions to install thrift via homebrew
  • PARQUET-1510 - Dictionary filter skips null values when evaluating not-equals.
  • PARQUET-1514 - ParquetFileWriter Records Compressed Bytes instead of Uncompressed Bytes
  • PARQUET-1527 - [parquet-tools] cat command throw java.lang.ClassCastException
  • PARQUET-1529 - Shade fastutil in all modules where used
  • PARQUET-1531 - Page row count limit causes empty pages to be written from MessageColumnIO
  • PARQUET-1533 - TestSnappy() throws OOM exception with Parquet-1485 change
  • PARQUET-1534 - [parquet-cli] Argument error: Illegal character in opaque part at index 2 on Windows
  • PARQUET-1544 - Possible over-shading of modules
  • PARQUET-1550 - CleanUtil does not work in Java 11
  • PARQUET-1555 - Bump snappy-java to 1.1.7.3
  • PARQUET-1596 - PARQUET-1375 broke parquet-cli's to-avro command
  • PARQUET-1600 - Fix shebang in parquet-benchmarks/run.sh
  • PARQUET-1615 - getRecordWriter shouldn't hardcode CREAT mode when new ParquetFileWriter
  • PARQUET-1637 - Builds are failing because default jdk changed to openjdk11 on Travis
  • PARQUET-1644 - Clean up some benchmark code and docs.
  • PARQUET-1691 - Build fails due to missing hadoop-lzo

New Feature

Improvement

Test

  • PARQUET-1536 - [parquet-cli] Add simple tests for each command

Wish

Task

Version 1.10.1

Release Notes - Parquet - Version 1.10.1

Bug

  • PARQUET-1510 - Dictionary filter skips null values when evaluating not-equals.
  • PARQUET-1309 - Parquet Java uses incorrect stats and dictionary filter properties

Version 1.10.0

Release Notes - Parquet - Version 1.10.0

Bug

  • PARQUET-196 - parquet-tools command to get rowcount & size
  • PARQUET-357 - Parquet-thrift generates wrong schema for Thrift binary fields
  • PARQUET-765 - Upgrade Avro to 1.8.1
  • PARQUET-783 - H2SeekableInputStream does not close its underlying FSDataInputStream, leading to connection leaks
  • PARQUET-786 - parquet-tools README incorrectly has 'java jar' instead of 'java -jar'
  • PARQUET-791 - Predicate pushing down on missing columns should work on UserDefinedPredicate too
  • PARQUET-1005 - Fix DumpCommand parsing to allow column projection
  • PARQUET-1028 - [JAVA] When reading old Spark-generated files with INT96, stats are reported as valid when they aren't
  • PARQUET-1065 - Deprecate type-defined sort ordering for INT96 type
  • PARQUET-1077 - [MR] Switch to long key ids in KEYs file
  • PARQUET-1141 - IDs are dropped in metadata conversion
  • PARQUET-1152 - Parquet-thrift doesn't compile with Thrift 0.9.3
  • PARQUET-1153 - Parquet-thrift doesn't compile with Thrift 0.10.0
  • PARQUET-1156 - dev/merge_parquet_pr.py problems
  • PARQUET-1185 - TestBinary#testBinary unit test fails after PARQUET-1141
  • PARQUET-1191 - Type.hashCode() takes originalType into account but Type.equals() does not
  • PARQUET-1208 - Occasional endless loop in unit test
  • PARQUET-1217 - Incorrect handling of missing values in Statistics
  • PARQUET-1246 - Ignore float/double statistics in case of NaN
  • PARQUET-1258 - Update scm developer connection to github

New Feature

  • PARQUET-1025 - Support new min-max statistics in parquet-mr

Improvement

  • PARQUET-220 - Unnecessary warning in ParquetRecordReader.initialize
  • PARQUET-321 - Set the HDFS padding default to 8MB
  • PARQUET-386 - Printing out the statistics of metadata in parquet-tools
  • PARQUET-423 - Make writing Avro to Parquet less noisy
  • PARQUET-755 - create parquet-arrow module with schema converter
  • PARQUET-777 - Add new Parquet CLI tools
  • PARQUET-787 - Add a size limit for heap allocations when reading
  • PARQUET-801 - Allow UserDefinedPredicates in DictionaryFilter
  • PARQUET-852 - Slowly ramp up sizes of byte[] in ByteBasedBitPackingEncoder
  • PARQUET-884 - Add support for Decimal datatype to Parquet-Pig record reader
  • PARQUET-969 - Decimal datatype support for parquet-tools output
  • PARQUET-990 - More detailed error messages in footer parsing
  • PARQUET-1024 - allow for case insensitive parquet-xxx prefix in PR title
  • PARQUET-1026 - allow unsigned binary stats when min == max
  • PARQUET-1115 - Warn users when misusing parquet-tools merge
  • PARQUET-1135 - upgrade thrift and protobuf dependencies
  • PARQUET-1142 - Avoid leaking Hadoop API to downstream libraries
  • PARQUET-1149 - Upgrade Avro dependency to 1.8.2
  • PARQUET-1170 - Logical-type-based toString for proper representation in tools/logs
  • PARQUET-1183 - AvroParquetWriter needs OutputFile based Builder (a usage sketch follows after this list)
  • PARQUET-1197 - Log rat failures
  • PARQUET-1198 - Bump java source and target to java8
  • PARQUET-1215 - Add accessor for footer after a file is closed
  • PARQUET-1263 - ParquetReader's builder should use Configuration from the InputFile
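
As referenced from PARQUET-1183 above, the OutputFile-based builder lets AvroParquetWriter write through the org.apache.parquet.io.OutputFile abstraction rather than a raw Hadoop Path. A minimal sketch; the schema, field names, and output file name are illustrative.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

public class OutputFileWriterExample {
  public static void main(String[] args) throws Exception {
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredLong("id")
        .requiredString("name")
        .endRecord();

    Configuration conf = new Configuration();
    // HadoopOutputFile adapts a Hadoop Path to the OutputFile interface.
    HadoopOutputFile out = HadoopOutputFile.fromPath(new Path("events.parquet"), conf);

    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(out)
        .withSchema(schema)
        .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1L);
      record.put("name", "example");
      writer.write(record);
    }
  }
}
```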

Task

Version 1.9.0

Bug

  • PARQUET-182 - FilteredRecordReader skips rows it shouldn't for schema with optional columns
  • PARQUET-212 - Implement nested type read rules in parquet-thrift
  • PARQUET-241 - ParquetInputFormat.getFooters() should return in the same order as what listStatus() returns
  • PARQUET-305 - Logger instantiated for package org.apache.parquet may be GC-ed
  • PARQUET-335 - Avro object model should not require MAP_KEY_VALUE
  • PARQUET-340 - totalMemoryPool is truncated to 32 bits
  • PARQUET-346 - ThriftSchemaConverter throws for unknown struct or union type
  • PARQUET-349 - VersionParser does not handle versions like "parquet-mr 1.6.0rc4"
  • PARQUET-352 - Add tags to "created by" metadata in the file footer
  • PARQUET-353 - Compressors not getting recycled while writing parquet files, causing memory leak
  • PARQUET-360 - parquet-cat json dump is broken for maps
  • PARQUET-363 - Cannot construct empty MessageType for ReadContext.requestedSchema
  • PARQUET-367 - "parquet-cat -j" doesn't show all records
  • PARQUET-372 - Parquet stats can have awkwardly large values
  • PARQUET-373 - MemoryManager tests are flaky
  • PARQUET-379 - PrimitiveType.union erases original type
  • PARQUET-380 - Cascading and scrooge builds fail when using thrift 0.9.0
  • PARQUET-385 - PrimitiveType.union accepts fixed_len_byte_array fields with different lengths when strict mode is on
  • PARQUET-387 - TwoLevelListWriter does not handle null values in array
  • PARQUET-389 - Filter predicates should work with missing columns
  • PARQUET-395 - System.out is used as logger in org.apache.parquet.Log
  • PARQUET-396 - The builder for AvroParquetReader loses the record type
  • PARQUET-400 - Error reading some files after PARQUET-77 bytebuffer read path
  • PARQUET-409 - InternalParquetRecordWriter doesn't use min/max row counts
  • PARQUET-410 - Fix subprocess hang in merge_parquet_pr.py
  • PARQUET-413 - Test failures for Java 8
  • PARQUET-415 - ByteBufferBackedBinary serialization is broken
  • PARQUET-422 - Fix a potential bug in MessageTypeParser where we ignore and overwrite the initial value of a method parameter
  • PARQUET-425 - Fix the bug when predicate contains columns not specified in projection, to prevent filtering out data improperly
  • PARQUET-426 - Throw Exception when predicate contains columns not specified in projection, to prevent filtering out data improperly
  • PARQUET-430 - Change to use Locale parameterized version of String.toUpperCase()/toLowerCase
  • PARQUET-431 - Make ParquetOutputFormat.memoryManager volatile
  • PARQUET-495 - Fix mismatches in Types class comments
  • PARQUET-509 - Incorrect number of args passed to string.format calls
  • PARQUET-511 - Integer overflow on counting values in column
  • PARQUET-528 - Fix flush() for RecordConsumer and implementations
  • PARQUET-529 - Avoid evoking job.toString() in ParquetLoader
  • PARQUET-540 - Cascading3 module doesn't build when using thrift 0.9.0
  • PARQUET-544 - ParquetWriter.close() throws NullPointerException on second call, improper implementation of Closeable contract
  • PARQUET-560 - Incorrect synchronization in SnappyCompressor
  • PARQUET-569 - ParquetMetadataConverter offset filter is broken
  • PARQUET-571 - Fix potential leak in ParquetFileReader.close()
  • PARQUET-580 - Potentially unnecessary creation of large int[] in IntList for columns that aren't used
  • PARQUET-581 - Min/max row count for page size check are conflated in some places
  • PARQUET-584 - show proper command usage when there's no arguments
  • PARQUET-612 - Add compression to FileEncodingIT tests
  • PARQUET-623 - DeltaByteArrayReader has incorrect skip behaviour
  • PARQUET-642 - Improve performance of ByteBuffer based read / write paths
  • PARQUET-645 - DictionaryFilter incorrectly handles null
  • PARQUET-651 - Parquet-avro fails to decode array of record with a single field name "element" correctly
  • PARQUET-660 - Writing Protobuf messages with extensions results in an error or data corruption.
  • PARQUET-663 - Link are Broken in README.md
  • PARQUET-674 - Add an abstraction to get the length of a stream
  • PARQUET-685 - Deprecated ParquetInputSplit constructor passes parameters in the wrong order.
  • PARQUET-726 - TestMemoryManager consistently fails
  • PARQUET-743 - DictionaryFilters can re-use StreamBytesInput when compressed

Improvement

  • PARQUET-77 - Improvements in ByteBuffer read path
  • PARQUET-99 - Large rows cause unnecessary OOM exceptions
  • PARQUET-146 - make Parquet compile with java 7 instead of java 6
  • PARQUET-318 - Remove unnecessary objectmapper from ParquetMetadata
  • PARQUET-327 - Show statistics in the dump output
  • PARQUET-341 - Improve write performance with wide schema sparse data
  • PARQUET-343 - Caching nulls on group node to improve write performance on wide schema sparse data
  • PARQUET-358 - Add support for temporal logical types to AVRO/Parquet conversion
  • PARQUET-361 - Add prerelease logic to semantic versions
  • PARQUET-384 - Add Dictionary Based Filtering to Filter2 API
  • PARQUET-386 - Printing out the statistics of metadata in parquet-tools
  • PARQUET-397 - Pig Predicate Pushdown using Filter2 API
  • PARQUET-421 - Fix mismatch of javadoc names and method parameters in module encoding, column, and hadoop
  • PARQUET-427 - Push predicates into the whole read path
  • PARQUET-432 - Complete a todo for method ColumnDescriptor.compareTo()
  • PARQUET-460 - Parquet files concat tool
  • PARQUET-480 - Update for Cascading 3.0
  • PARQUET-484 - Warn when Decimal is stored as INT64 while could be stored as INT32
  • PARQUET-543 - Remove BoundedInt encodings
  • PARQUET-585 - Slowly ramp up sizes of int[]s in IntList to keep sizes small when data sets are small
  • PARQUET-654 - Make record-level filtering optional
  • PARQUET-668 - Provide option to disable auto crop feature in DumpCommand output
  • PARQUET-727 - Ensure correct version of thrift is used
  • PARQUET-740 - Introduce editorconfig

New Feature

  • PARQUET-225 - INT64 support for Delta Encoding
  • PARQUET-382 - Add a way to append encoded blocks in ParquetFileWriter
  • PARQUET-429 - Enables predicates collecting their referred columns
  • PARQUET-548 - Add Java metadata for PageEncodingStats
  • PARQUET-669 - Allow reading file footers from input streams when writing metadata files

Task

Test

  • PARQUET-355 - Create Integration tests to validate statistics
  • PARQUET-378 - Add thoroughly parquet test encodings

Version 1.8.1

Bug

  • PARQUET-331 - Merge script doesn't surface stderr from failed sub processes
  • PARQUET-336 - ArrayIndexOutOfBounds in checkDeltaByteArrayProblem
  • PARQUET-337 - binary fields inside map/set/list are not handled in parquet-scrooge
  • PARQUET-338 - Readme references wrong format of pull request title

Improvement

  • PARQUET-279 - Check empty struct in the CompatibilityChecker util

Task

Version 1.8.0

Bug

  • PARQUET-151 - Null Pointer exception in parquet.hadoop.ParquetFileWriter.mergeFooters
  • PARQUET-152 - Encoding issue with fixed length byte arrays
  • PARQUET-164 - Warn when parquet memory manager kicks in
  • PARQUET-199 - Add a callback when the MemoryManager adjusts row group size
  • PARQUET-201 - Column with OriginalType INT_8 failed at filtering
  • PARQUET-227 - Parquet thrift can write unions that have 0 or more than 1 set value
  • PARQUET-246 - ArrayIndexOutOfBoundsException with Parquet write version v2
  • PARQUET-251 - Binary column statistics error when reuse byte[] among rows
  • PARQUET-252 - parquet scrooge support should support nested container type
  • PARQUET-254 - Wrong exception message for unsupported INT96 type
  • PARQUET-269 - Restore scrooge-maven-plugin to 3.17.0 or greater
  • PARQUET-284 - Should use ConcurrentHashMap instead of HashMap in ParquetMetadataConverter
  • PARQUET-285 - Implement nested types write rules in parquet-avro
  • PARQUET-287 - Projecting unions in thrift causes TExceptions in deserialization
  • PARQUET-296 - Set master branch version back to 1.8.0-SNAPSHOT
  • PARQUET-297 - created_by in file meta data doesn't contain parquet library version
  • PARQUET-314 - Fix broken equals implementation(s)
  • PARQUET-316 - Run.sh is broken in parquet-benchmarks
  • PARQUET-317 - writeMetaDataFile crashes when a relative root Path is used
  • PARQUET-320 - Restore semver checks
  • PARQUET-324 - row count incorrect if data file has more than 2^31 rows
  • PARQUET-325 - Do not target row group sizes if padding is set to 0
  • PARQUET-329 - ThriftReadSupport#THRIFT_COLUMN_FILTER_KEY was removed (incompatible change)

Improvement

  • PARQUET-175 - Allow setting of a custom protobuf class when reading parquet file using parquet-protobuf.
  • PARQUET-223 - Add Map and List builders
  • PARQUET-245 - Travis CI runs tests even if build fails
  • PARQUET-248 - Simplify ParquetWriters's constructors
  • PARQUET-253 - AvroSchemaConverter has confusing Javadoc
  • PARQUET-259 - Support Travis CI in parquet-cpp
  • PARQUET-264 - Update README docs for graduation
  • PARQUET-266 - Add support for lists of primitives to Pig schema converter
  • PARQUET-272 - Updates docs description to match data model
  • PARQUET-274 - Updates URLs to link against the apache user instead of Parquet on github
  • PARQUET-276 - Updates CONTRIBUTING file with new repo info
  • PARQUET-286 - Avro object model should use Utf8
  • PARQUET-288 - Add dictionary support to Avro converters
  • PARQUET-289 - Allow object models to extend the ParquetReader builders
  • PARQUET-290 - Add Avro data model to the reader builder
  • PARQUET-306 - Improve alignment between row groups and HDFS blocks
  • PARQUET-308 - Add accessor to ParquetWriter to get current data size
  • PARQUET-309 - Remove unnecessary compile dependency on parquet-generator
  • PARQUET-321 - Set the HDFS padding default to 8MB
  • PARQUET-327 - Show statistics in the dump output

New Feature

  • PARQUET-229 - Make an alternate, stricter thrift column projection API
  • PARQUET-243 - Add avro-reflect support

Task

Version 1.7.0

Version 1.6.0

Bug

  • PARQUET-3 - tool to merge pull requests based on Spark
  • PARQUET-4 - Use LRU caching for footers in ParquetInputFormat.
  • PARQUET-8 - [parquet-scrooge] mvn eclipse:eclipse fails on parquet-scrooge
  • PARQUET-9 - InternalParquetRecordReader will not read multiple blocks when filtering
  • PARQUET-18 - Cannot read dictionary-encoded pages with all null values
  • PARQUET-19 - NPE when an empty file is included in a Hive query that uses CombineHiveInputFormat
  • PARQUET-21 - Fix reference to 'github-apache' in dev docs
  • PARQUET-56 - Added an accessor for the Long column type in example Group
  • PARQUET-62 - DictionaryValuesWriter dictionaries are corrupted by user changes.
  • PARQUET-63 - Fixed-length columns cannot be dictionary encoded.
  • PARQUET-66 - InternalParquetRecordWriter int overflow causes unnecessary memory check warning
  • PARQUET-69 - Add committer doc and REVIEWERS files
  • PARQUET-70 - PARQUET #36: Pig Schema Storage to UDFContext
  • PARQUET-75 - String decode using 'new String' is slow
  • PARQUET-80 - upgrade semver plugin version to 0.9.27
  • PARQUET-82 - ColumnChunkPageWriteStore assumes pages are smaller than Integer.MAX_VALUE
  • PARQUET-88 - Fix pre-version enforcement.
  • PARQUET-94 - ParquetScroogeScheme constructor ignores klass argument
  • PARQUET-96 - parquet.example.data.Group is missing some methods
  • PARQUET-97 - ProtoParquetReader builder factory method not static
  • PARQUET-101 - Exception when reading data with parquet.task.side.metadata=false
  • PARQUET-104 - Parquet writes empty Rowgroup at the end of the file
  • PARQUET-106 - Relax InputSplit Protections
  • PARQUET-107 - Add option to disable summary metadata aggregation after MR jobs
  • PARQUET-114 - Sample NanoTime class serializes and deserializes Timestamp incorrectly
  • PARQUET-122 - make parquet.task.side.metadata=true by default
  • PARQUET-124 - parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException
  • PARQUET-132 - AvroParquetInputFormat should use a parameterized type
  • PARQUET-135 - Input location is not getting set for the getStatistics in ParquetLoader when using two different loaders within a Pig script.
  • PARQUET-136 - NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null
  • PARQUET-142 - parquet-tools doesn't filter _SUCCESS file
  • PARQUET-145 - InternalParquetRecordReader.close() should not throw an exception if initialization has failed
  • PARQUET-150 - Merge script requires ':' in PR names
  • PARQUET-157 - Divide by zero in logging code
  • PARQUET-159 - parquet-hadoop tests fail to compile
  • PARQUET-162 - ParquetThrift should throw when unrecognized columns are passed to the column projection API
  • PARQUET-168 - Wrong command line option description in parquet-tools
  • PARQUET-173 - StatisticsFilter doesn't handle And properly
  • PARQUET-174 - Fix Java6 compatibility
  • PARQUET-176 - Parquet fails to parse schema contains '\r'
  • PARQUET-180 - Parquet-thrift compile issue with 0.9.2.
  • PARQUET-184 - Add release scripts and documentation
  • PARQUET-186 - Poor performance in SnappyCodec because of string concat in tight loop
  • PARQUET-187 - parquet-scrooge doesn't compile under 2.11
  • PARQUET-188 - Parquet writes columns out of order (compared to the schema)
  • PARQUET-189 - Support building parquet with thrift 0.9.0
  • PARQUET-196 - parquet-tools command to get rowcount & size
  • PARQUET-197 - parquet-cascading and the mapred API does not create metadata file
  • PARQUET-202 - Typo in the connection info in the pom prevents publishing an RC
  • PARQUET-207 - ParquetInputSplit end calculation bug
  • PARQUET-208 - revert PARQUET-197
  • PARQUET-214 - Avro: Regression caused by schema handling
  • PARQUET-215 - Parquet Thrift should discard records with unrecognized union members
  • PARQUET-216 - Decrease the default page size to 64k
  • PARQUET-217 - Memory Manager's min allocation heuristic is not valid for schemas with many columns
  • PARQUET-232 - minor compilation issue
  • PARQUET-234 - Restore ParquetInputSplit methods from 1.5.0
  • PARQUET-235 - Fix compatibility of parquet.metadata with 1.5.0
  • PARQUET-236 - Check parquet-scrooge compatibility
  • PARQUET-237 - Check ParquetWriter constructor compatibility with 1.5.0
  • PARQUET-239 - Make AvroParquetReader#builder() static
  • PARQUET-242 - AvroReadSupport.setAvroDataSupplier is broken

Improvement

  • PARQUET-2 - Adding Type Persuasion for Primitive Types
  • PARQUET-25 - Pushdown predicates only work with hardcoded arguments
  • PARQUET-52 - Improve the encoding fall back mechanism for Parquet 2.0
  • PARQUET-57 - Make dev commit script easier to use
  • PARQUET-61 - Avoid fixing protocol events when there is no required field missing
  • PARQUET-74 - Use thread local decoder cache in Binary toStringUsingUTF8()
  • PARQUET-79 - Add thrift streaming API to read metadata
  • PARQUET-84 - Add an option to read the rowgroup metadata on the task side.
  • PARQUET-87 - Better and unified API for projection pushdown on cascading scheme
  • PARQUET-89 - All Parquet CI tests should be run against hadoop-2
  • PARQUET-92 - Parallel Footer Read Control
  • PARQUET-105 - Refactor and Document Parquet Tools
  • PARQUET-108 - Parquet Memory Management in Java
  • PARQUET-115 - Pass a filter object to user defined predicate in filter2 api
  • PARQUET-116 - Pass a filter object to user defined predicate in filter2 api
  • PARQUET-117 - implement the new page format for Parquet 2.0
  • PARQUET-119 - add data_encodings to ColumnMetaData to enable dictionary based predicate push down
  • PARQUET-121 - Allow Parquet to build with Java 8
  • PARQUET-128 - Optimize the parquet RecordReader implementation when: A. filterpredicate is pushed down, B. filterpredicate is pushed down on a flat schema
  • PARQUET-133 - Upgrade snappy-java to 1.1.1.6
  • PARQUET-134 - Enhance ParquetWriter with file creation flag
  • PARQUET-140 - Allow clients to control the GenericData object that is used to read Avro records
  • PARQUET-141 - improve parquet scrooge integration
  • PARQUET-160 - Simplify CapacityByteArrayOutputStream
  • PARQUET-165 - A benchmark module for Parquet would be nice
  • PARQUET-177 - MemoryManager ensure minimum Column Chunk size
  • PARQUET-181 - Scrooge Write Support
  • PARQUET-191 - Avro schema conversion incorrectly converts maps with nullable values.
  • PARQUET-192 - Avro maps drop null values
  • PARQUET-193 - Avro: Implement read compatibility rules for nested types
  • PARQUET-203 - Consolidate PathFilter for hidden files
  • PARQUET-204 - Directory support for parquet-schema
  • PARQUET-210 - JSON output for parquet-cat

New Feature

  • PARQUET-22 - Parquet #13: Backport of HIVE-6938
  • PARQUET-49 - Create a new filter API that supports filtering groups of records based on their statistics
  • PARQUET-64 - Add new logical types to parquet-column
  • PARQUET-123 - Add dictionary support to AvroIndexedRecordReader
  • PARQUET-198 - parquet-cascading Add Parquet Avro Scheme

Task

  • PARQUET-50 - Remove items from semver blacklist
  • PARQUET-139 - Avoid reading file footers in parquet-avro InputFormat
  • PARQUET-190 - Fix an inconsistent Javadoc comment of ReadSupport.prepareForRead
  • PARQUET-230 - Add build instructions to the README

Version 1.5.0

  • ISSUE 399: Fixed resetting stats after writePage bug, unit testing of readFooter
  • ISSUE 397: Fixed issue with column pruning when using requested schema
  • ISSUE 389: Added padding for requested columns not found in file schema
  • ISSUE 392: Value stats fixes
  • ISSUE 338: Added statistics to Parquet pages and rowGroups
  • ISSUE 351: Fix bug #350, fixed length argument out of order.
  • ISSUE 378: configure semver to enforce semantic versioning
  • ISSUE 355: Add support for DECIMAL type annotation.
  • ISSUE 336: protobuf dependency version changed from 2.4.1 to 2.5.0
  • ISSUE 337: issue #324, move ParquetStringInspector to org.apache.hadoop.hive.serde...

Version 1.4.3

  • ISSUE 381: fix metadata concurrency problem

Version 1.4.2

  • ISSUE 359: Expose values in SimpleRecord
  • ISSUE 335: issue #290, hive map conversion to parquet schema
  • ISSUE 365: generate splits by min max size, and align to HDFS block when possible
  • ISSUE 353: Fix bug: optional enum field causing ScroogeSchemaConverter to fail
  • ISSUE 362: Fix output bug during parquet-dump command
  • ISSUE 366: do not call schema converter to generate projected schema when projection is not set
  • ISSUE 367: make ParquetFileWriter throw IOException in invalid state case
  • ISSUE 352: Parquet thrift storer
  • ISSUE 349: fix header bug

Version 1.4.1

  • ISSUE 344: select * from parquet hive table containing map columns runs into exception. Issue #341.
  • ISSUE 347: set reading length in ThriftBytesWriteSupport to avoid potential OOM cau...
  • ISSUE 346: stop using strings and b64 for compressed input splits
  • ISSUE 345: set cascading version to 2.5.3
  • ISSUE 342: compress kv pairs in ParquetInputSplits

Version 1.4.0

  • ISSUE 333: Compress schemas in split
  • ISSUE 329: fix filesystem resolution
  • ISSUE 320: Spelling fix
  • ISSUE 319: oauth based authentication; fix grep change
  • ISSUE 310: Merge parquet tools
  • ISSUE 314: Fix avro schema conv for arrays of optional type for #312.
  • ISSUE 311: Avro null default values bug
  • ISSUE 316: Update poms to use thrift.executable property.
  • ISSUE 285: [CASCADING] Provide the sink implementation for ParquetTupleScheme
  • ISSUE 264: Native Protocol Buffer support
  • ISSUE 293: Int96 support
  • ISSUE 313: Add hadoop Configuration to Avro and Thrift writers (#295).
  • ISSUE 262: Scrooge schema converter and projection pushdown in Scrooge
  • ISSUE 297: Ports HIVE-5783 to the parquet-hive module
  • ISSUE 303: Avro read schema aliases
  • ISSUE 299: Fill in default values for new fields in the Avro read schema
  • ISSUE 298: Bugfix: reordered thrift fields causing nulls to be written
  • ISSUE 289: first use current thread's classloader to load a class, if current threa...
  • ISSUE 292: Added ParquetWriter() that takes an instance of Hadoop's Configuration.
  • ISSUE 282: Avro default read schema
  • ISSUE 280: style: junit.framework to org.junit
  • ISSUE 270: Make ParquetInputSplit extend FileSplit

Version 1.3.2

  • ISSUE 271: fix bug: last enum index throws DecodingSchemaMismatchException
  • ISSUE 268: fixes #265: add semver validation checks to non-bundle builds
  • ISSUE 269: Bumps parquet-jackson parent version
  • ISSUE 260: Shade jackson only once for all parquet modules

Version 1.3.1

  • ISSUE 267: handler only handle ignored field, exception during will be thrown as Sk...
  • ISSUE 266: upgrade parquet-mr to elephant-bird 4.4

Version 1.3.0

  • ISSUE 258: Optimize scan
  • ISSUE 259: add delta length byte arrays and delta byte arrays encodings
  • ISSUE 249: make summary files read in parallel; improve memory footprint of metadata; avoid unnecessary seek
  • ISSUE 257: Create parquet-hadoop-bundle which will eventually replace parquet-hive-bundle
  • ISSUE 253: Delta Binary Packing for Int
  • ISSUE 254: Add writer version flag to parquet and make initial changes for supported parquet 2.0 encodings
  • ISSUE 256: Resolves issue #251 by doing additional checks if Hive returns "Unknown" as a version
  • ISSUE 252: refactor error handler for BufferedProtocolReadToWrite to be non-static

Version 1.2.11

  • ISSUE 250: pretty_print_json_for_compatibility_checker
  • ISSUE 243: add parquet cascading integration documentation
  • ISSUE 248: More Hadoop 2 compatibility fixes

Version 1.2.10

  • ISSUE 247: fix bug: when field index is greater than zero
  • ISSUE 244: Feature/error handler
  • ISSUE 187: Plumb OriginalType
  • ISSUE 245: integrate parquet format 2.0

Version 1.2.9

  • ISSUE 242: upgrade elephant-bird version to 4.3
  • ISSUE 240: fix loader cache
  • ISSUE 233: use latest stable release of cascading: 2.5.1
  • ISSUE 241: Update reference to 0.10 in Hive012Binding javadoc
  • ISSUE 239: Fix hive map and array inspectors with null containers
  • ISSUE 234: optimize chunk scan; fix compressed size
  • ISSUE 237: Handle codec not found
  • ISSUE 238: fix pom version caused by bad merge
  • ISSUE 235: Do not write pig metadata when pig is not available
  • ISSUE 227: Breaks parquet-hive up into several submodules, creating infrastructure ...
  • ISSUE 229: add changelog tool
  • ISSUE 236: Make cascading a provided dependency

Version 1.2.8

  • ISSUE 228: enable globing files for parquetTupleScheme, refactor unit tests and rem...
  • ISSUE 224: Changing read and write methods in ParquetInputSplit so that they can de...


Version 1.2.7

  • ISSUE 223: refactor encoded values changes and test that resetDictionary works
  • ISSUE 222: fix bug: set raw data size to 0 after reset

Version 1.2.6

  • ISSUE 221: make pig, hadoop and log4j jars provided
  • ISSUE 220: parquet-hive should ship an uber jar
  • ISSUE 213: group parquet-format version in one property
  • ISSUE 215: Fix Binary.equals().
  • ISSUE 210: ParquetWriter ignores enable dictionary and validating flags.
  • ISSUE 202: Fix requested schema when recreating splits in hive
  • ISSUE 208: Improve dic fall back
  • ISSUE 207: Fix offset
  • ISSUE 206: Create a "Powered by" page

Version 1.2.5

  • ISSUE 204: ParquetLoader.inputFormatCache as WeakHashMap
  • ISSUE 203: add null check for EnumWriteProtocol
  • ISSUE 205: use cascading 2.2.0
  • ISSUE 199: simplify TupleWriteSupport constructor
  • ISSUE 164: Dictionary changes
  • ISSUE 196: Fixes to the Hive SerDe
  • ISSUE 197: RLE decoder reading past the end of the stream
  • ISSUE 188: Added ability to define arbitrary predicate functions
  • ISSUE 194: refactor serde to remove some unnecessary boxing and include dictionary awareness
  • ISSUE 190: NPE in DictionaryValuesWriter.

Version 1.2.4

  • ISSUE 191: Add compatibility checker for ThriftStruct to check for backward compatibility of two thrift structs

Version 1.2.3

  • ISSUE 186: add parquet-pig-bundle
  • ISSUE 184: Update ParquetReader to take Configuration as a constructor argument.
  • ISSUE 183: Disable the time read counter check in DeprecatedInputFormatTest.
  • ISSUE 182: Fix a maven warning about a missing version number.
  • ISSUE 181: FIXED_LEN_BYTE_ARRAY support
  • ISSUE 180: Support writing Avro records with maps with Utf8 keys
  • ISSUE 179: Added Or/Not logical filters for column predicates
  • ISSUE 172: Add sink support for parquet.cascading.ParquetTBaseScheme
  • ISSUE 169: Support avro records with empty maps and arrays
  • ISSUE 162: Avro schema with empty arrays and maps

Version 1.2.2

  • ISSUE 175: fix problem with projection pushdown in parquetloader
  • ISSUE 174: improve readability by renaming variables
  • ISSUE 173: make numbers in log messages easy to read in InternalParquetRecordWriter
  • ISSUE 171: add unit test for parquet-scrooge
  • ISSUE 165: distinguish recoverable exception in BufferedProtocolReadToWrite
  • ISSUE 166: support projection when required fields in thrift class are not projected

Version 1.2.1

  • ISSUE 167: fix OOM error due to bad estimation

Version 1.2.0

  • ISSUE 154: improve thrift error message
  • ISSUE 161: support schema evolution
  • ISSUE 160: Resource leak in parquet.hadoop.ParquetFileReader.readFooter(Configurati...
  • ISSUE 163: remove debugging code from hot path
  • ISSUE 155: Manual pushdown for thrift read support
  • ISSUE 159: Counter for mapred
  • ISSUE 156: Fix site
  • ISSUE 153: Fix projection required field

Version 1.1.1

  • ISSUE 150: add thrift validation on read

Version 1.1.0

  • ISSUE 149: changing default block size to 128mb
  • ISSUE 146: Fix and add unit tests for Hive nested types
  • ISSUE 145: add getStatistics method to parquetloader
  • ISSUE 144: Map key fields should allow other types than strings
  • ISSUE 143: Fix empty encoding col metadata
  • ISSUE 142: Fix total size row group
  • ISSUE 141: add parquet counters for benchmark
  • ISSUE 140: Implemented partial schema for GroupReadSupport
  • ISSUE 138: fix bug of wrong column metadata size
  • ISSUE 137: ParquetMetadataConverter bug
  • ISSUE 133: Update plugin versions for maven aether migration - fixes #125
  • ISSUE 130: Schema validation should not validate the root element's name
  • ISSUE 127: Adding dictionary encoding for non string types.. #99
  • ISSUE 125: Unable to build
  • ISSUE 124: Fix Short and Byte types in Hive SerDe.
  • ISSUE 123: Fix Snappy compressor in parquet-hadoop.
  • ISSUE 120: Fix RLE bug with partial literal groups at end of stream.
  • ISSUE 118: Refactor column reader
  • ISSUE 115: Map key fields should allow other types than strings
  • ISSUE 103: Map key fields should allow other types than strings
  • ISSUE 99: Dictionary encoding for non string types (float double int long boolean)
  • ISSUE 47: Add tests for parquet-scrooge and parquet-cascading

Version 1.0.1

  • ISSUE 126: Unit tests for parquet cascading
  • ISSUE 121: fix wrong RecordConverter for ParquetTBaseScheme
  • ISSUE 119: fix compatibility with thrift remove unused dependency

Version 1.0.0