
Added support to skip unparseable records in the csv record reader #11487

Merged: 1 commit merged into apache:master on Sep 5, 2023

Conversation

@rajagopr (Contributor) commented on Sep 1, 2023

Problem

The CSVRecordReader currently uses the Apache commons-csv library under the hood to iterate over the records. The default, and only, iterator provided by the commons-csv library throws an exception from the hasNext() method. This makes it impossible to continue iterating over the records whenever an unparseable record is encountered. There is also no way to override this iterator, as the CSVParser class is declared final and the iterator is internal to the CSVParser class.

There is an open issue with the commons-csv library tracking this.
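To illustrate the failure mode, here is a minimal, self-contained sketch (not code from this PR; the sample data and class name are made up) showing how a single malformed line aborts iteration, because the exception surfaces from hasNext() itself:

import java.io.StringReader;
import java.util.Iterator;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class UnparseableCsvDemo {
  public static void main(String[] args) throws Exception {
    // The third line has a stray character after a quoted field, which commons-csv rejects.
    String csv = "id,name\n1,alice\n2,\"bo\"b\n3,carol\n";
    try (CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(new StringReader(csv))) {
      Iterator<CSVRecord> it = parser.iterator();
      while (it.hasNext()) {      // the exception is thrown here when the bad line is reached...
        System.out.println(it.next());
      }
    } catch (RuntimeException e) {
      // ...so the loop cannot simply skip the bad line and continue to "3,carol".
      // (The exact runtime-exception wrapper varies across commons-csv versions.)
      System.out.println("Iteration aborted: " + e.getMessage());
    }
  }
}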

Solution

To work around the above problem, this change uses an alternate iterator by getting hold of the underlying buffered reader and overriding the next() and hasNext() methods in the CSVRecordReader. With this change, the hasNext() method does not throw an exception, thereby allowing the reader to make progress even when unparseable records are encountered. The drawbacks of this approach are: 1) data loss and 2) reduced ingestion throughput.

However, there are situations where this option is desirable and making progress is more important than completeness. In such scenarios, the flag skipUnParseableLines can be set to make use of the line-based iterator.
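Below is a minimal sketch of the line-based idea, not the actual CSVRecordReader change (the class name and fields are illustrative): each physical line is read from the buffered reader and parsed on its own, and any line that commons-csv cannot parse is skipped instead of aborting iteration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Illustrative line-based iterator: hasNext() never propagates a parse error;
// it silently drops lines that cannot be parsed and moves on.
class SkipUnparseableLineIterator implements Iterator<CSVRecord> {
  private final BufferedReader _reader;
  private final CSVFormat _format;
  private CSVRecord _next;

  SkipUnparseableLineIterator(BufferedReader reader, CSVFormat format) {
    _reader = reader;
    _format = format;
  }

  @Override
  public boolean hasNext() {
    if (_next != null) {
      return true;
    }
    try {
      String line;
      while ((line = _reader.readLine()) != null) {
        try (CSVParser parser = CSVParser.parse(line, _format)) {
          List<CSVRecord> records = parser.getRecords();
          if (!records.isEmpty()) {
            _next = records.get(0);
            return true;
          }
          // Empty or comment-only line: keep scanning.
        } catch (IOException | RuntimeException e) {
          // Unparseable line: drop it and keep going (this is the data-loss trade-off).
        }
      }
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException("Error reading the CSV file", e);
    }
  }

  @Override
  public CSVRecord next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    CSVRecord record = _next;
    _next = null;
    return record;
  }
}

Note that parsing one physical line at a time, as this sketch does, also drops quoted fields that span multiple lines; that is part of the data-loss trade-off mentioned above.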

Alternate Solutions

The following alternate options were considered:

  1. Switch to another library such as OpenCSV, which allows plugging in a custom iterator. [Not chosen due to: 1) maintenance overhead, 2) regression risk, and 3) an open vulnerability in the library.]
  2. Write a new parser. [Not chosen as it would require a significant time investment; this may be revisited later.]

Testing

The change is supplemented with unit tests which ensure that no regression is introduced and that the new code paths are covered. Additionally, performance was tested on a 200MB file: the current parser took 5 seconds on average and the new parser took 7 seconds on average (each run 20 times).

@rajagopr force-pushed the csv-record-reader-enhancement branch from bb65b6d to 97bd3f7 on September 1, 2023, 18:21
@codecov-commenter commented on Sep 1, 2023

Codecov Report

Merging #11487 (330faec) into master (f14700a) will increase coverage by 0.13%.
Report is 14 commits behind head on master.
The diff coverage is 91.56%.

@@             Coverage Diff              @@
##             master   #11487      +/-   ##
============================================
+ Coverage     62.92%   63.06%   +0.13%     
- Complexity     1108     1109       +1     
============================================
  Files          2318     2320       +2     
  Lines        124328   124551     +223     
  Branches      18980    19016      +36     
============================================
+ Hits          78234    78546     +312     
+ Misses        40539    40418     -121     
- Partials       5555     5587      +32     
Flag Coverage Δ
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 63.04% <91.56%> (+0.13%) ⬆️
java-17 62.92% <91.56%> (+0.14%) ⬆️
java-20 62.93% <91.56%> (+0.15%) ⬆️
temurin 63.06% <91.56%> (+0.13%) ⬆️
unittests 63.05% <91.56%> (+0.13%) ⬆️
unittests1 67.48% <0.00%> (+0.05%) ⬆️
unittests2 14.50% <91.56%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
.../pinot/plugin/inputformat/csv/CSVRecordReader.java 83.09% <91.13%> (+6.70%) ⬆️
.../plugin/inputformat/csv/CSVRecordReaderConfig.java 75.92% <100.00%> (+13.92%) ⬆️

... and 82 files with indirect coverage changes


@Jackie-Jiang added the enhancement, documentation, and Configuration (Config changes: addition/deletion/change in behavior) labels on Sep 1, 2023


/**
* Record reader for CSV file.
*/
@NotThreadSafe
Contributor (reviewer):
Do other record readers have this annotation?

Contributor Author (@rajagopr):

No, they do not. I added it as a good practice; it is purely for documentation.

//validate header for the delimiter before splitting
validateHeaderForDelimiter(delimiter, csvHeader, format);
format = format.withHeader(StringUtils.split(csvHeader, delimiter));
// do not validate header if using the line iterator
Contributor (reviewer):

I think there's no harm in running validateHeaderForDelimiter? It reads the first line and checks for the delimiter.

Contributor Author (@rajagopr):

The following problems exist with the current validation:

  1. It calls iterator.hasNext(), which is exactly what we are trying to avoid in the first place.
  2. It checks whether the record has multiple values. The user can pass in a header while a record legitimately contains a single value, so this check is not valid.
  3. It checks whether the delimiter is present in the header. This is also not valid for a single-column file such as:
id
100

As all these checks are problematic, I have not made use of the validation.
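For illustration only (this is not the actual validateHeaderForDelimiter implementation), a check of roughly this shape is what point 3 objects to, since a legitimate single-column file has no delimiter in its header line:

// Hypothetical delimiter-presence check, shown only to illustrate the objection.
static void checkHeaderContainsDelimiter(String csvHeader, char delimiter) {
  if (csvHeader.indexOf(delimiter) < 0) {
    // A single-column file (header "id", data rows like "100") trips this branch
    // even though it is perfectly valid CSV.
    throw new IllegalStateException("Header does not contain the delimiter: " + delimiter);
  }
}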

Contributor (reviewer):

In that case, we should change the validation code. My concern is more about feature parity: running validation in the existing approach while not running it on the line iterator gives us inconsistent behavior.

// do not validate header if using the line iterator
if (_useLineIterator) {
String[] header = StringUtils.split(csvHeader, delimiter);
setHeaderMap(header);
Contributor (reviewer):

This assumes that the header is well formed. Why are we not doing the validation in this case?

Contributor Author (@rajagopr):

I have yet to determine what the correct validation for a header line would be. The existing validation does not look correct.

} else {
// Read the first line and set the header
String headerLine = _bufferedReader.readLine();
String[] header = StringUtils.split(headerLine, _format.getDelimiter());
Contributor (reviewer):

Why don't we use the same approach as the regular CSV path for reading the header? I don't think we need the custom handling here. (The header is the first line of the file anyway, so we can first consume the header using the shared code and then diverge.)

Otherwise, we would see different behavior when parsing the header in some edge cases.

Contributor Author (@rajagopr):

I doubt the regular CSV parser respects this config option at all; I see it marked as a TODO in the library version that we are using (screenshot of the library source omitted).

The line iterator respects this property, and the behavior is as follows (a hedged config sketch follows the list):

  1. If a header is supplied by the client and skip header is not set, the input file is assumed to contain only data records.
  2. If a header is supplied by the client and skip header is set, the input file is assumed to contain a header record, but the user wishes to override it.
  3. If a header is not provided, the first line is treated as the header.
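A hedged config sketch of case 2 (the setter names below are assumed from the config fields discussed in this PR and are not verified against the final code):

import java.io.File;

import org.apache.pinot.plugin.inputformat.csv.CSVRecordReader;
import org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig;

public class CsvHeaderOverrideExample {
  public static void main(String[] args) throws Exception {
    CSVRecordReaderConfig config = new CSVRecordReaderConfig();
    // Case 2: the file has its own header line, but the client supplies one and
    // asks for the file's first line to be skipped.
    config.setHeader("id,name");           // assumed setter name
    config.setSkipHeader(true);            // assumed setter name
    config.setSkipUnParseableLines(true);  // assumed setter for the new flag

    CSVRecordReader reader = new CSVRecordReader();
    // init(dataFile, fieldsToRead, config); passing null reads all fields.
    reader.init(new File("/path/to/data.csv"), null, config);
    while (reader.hasNext()) {
      System.out.println(reader.next());
    }
    reader.close();
  }
}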

Contributor (@snleee), Sep 5, 2023:

Let's investigate the following and address them in a follow-up PR:

  • check whether a newer library version has the proper wiring for these options
  • keep the same behavior across the two approaches
  • do the same investigation on validateHeaderForDelimiter
  • add unit tests for these configs (skip header)

@rajagopr force-pushed the csv-record-reader-enhancement branch from 97bd3f7 to c445cbd on September 2, 2023, 05:55
@rajagopr force-pushed the csv-record-reader-enhancement branch from c445cbd to 330faec on September 2, 2023, 05:57
@snleee merged commit 3b2fd7d into apache:master on Sep 5, 2023
21 checks passed