
Added support to skip unparseable records in the csv record reader #11487

Merged: 1 commit merged into apache:master on Sep 5, 2023

Conversation

@rajagopr (Contributor) commented on Sep 1, 2023

Problem

The CSVRecordReader currently uses the Apache commons-csv library under the hood to iterate over the records. The default, and only, iterator provided by the commons-csv library throws an exception from the hasNext() method. This makes it impossible to continue iterating over the records whenever an unparseable record is encountered. There is also no way to override this iterator, as the CSVParser class is declared final and the iterator is internal to the CSVParser class.

There is an open issue with the commons-csv library tracking this.
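To illustrate the failure mode, here is a minimal, self-contained sketch (not code from this PR; the sample data and class name are made up) showing how a single malformed line aborts iteration, because the exception surfaces from hasNext() itself:

import java.io.StringReader;
import java.util.Iterator;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class UnparseableCsvDemo {
  public static void main(String[] args) throws Exception {
    // The third line has a stray character after a quoted field, which commons-csv rejects.
    String csv = "id,name\n1,alice\n2,\"bo\"b\n3,carol\n";
    try (CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(new StringReader(csv))) {
      Iterator<CSVRecord> it = parser.iterator();
      while (it.hasNext()) {      // the exception is thrown here when the bad line is reached...
        System.out.println(it.next());
      }
    } catch (RuntimeException e) {
      // ...so the loop cannot simply skip the bad line and continue to "3,carol".
      // (The exact runtime-exception wrapper varies across commons-csv versions.)
      System.out.println("Iteration aborted: " + e.getMessage());
    }
  }
}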

Solution

To work around the above problem, this change uses an alternate iterator by getting hold of the underlying buffered reader and overriding the next() and hasNext() methods in the CSVRecordReader. With this change, the hasNext() method does not throw an exception, thereby allowing the reader to make progress even when unparseable records are encountered. The drawbacks of this approach are: 1) data loss and 2) reduced ingestion throughput.

However, there are situations where this option is desirable and making progress is more important than completeness. In such scenarios, the flag skipUnParseableLines can be set to make use of the line-based iterator.
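Below is a minimal sketch of the line-based idea, not the actual CSVRecordReader change (the class name and fields are illustrative): each physical line is read from the buffered reader and parsed on its own, and any line that commons-csv cannot parse is skipped instead of aborting iteration.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Illustrative line-based iterator: hasNext() never propagates a parse error;
// it silently drops lines that cannot be parsed and moves on.
class SkipUnparseableLineIterator implements Iterator<CSVRecord> {
  private final BufferedReader _reader;
  private final CSVFormat _format;
  private CSVRecord _next;

  SkipUnparseableLineIterator(BufferedReader reader, CSVFormat format) {
    _reader = reader;
    _format = format;
  }

  @Override
  public boolean hasNext() {
    if (_next != null) {
      return true;
    }
    try {
      String line;
      while ((line = _reader.readLine()) != null) {
        try (CSVParser parser = CSVParser.parse(line, _format)) {
          List<CSVRecord> records = parser.getRecords();
          if (!records.isEmpty()) {
            _next = records.get(0);
            return true;
          }
          // Empty or comment-only line: keep scanning.
        } catch (IOException | RuntimeException e) {
          // Unparseable line: drop it and keep going (this is the data-loss trade-off).
        }
      }
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException("Error reading the CSV file", e);
    }
  }

  @Override
  public CSVRecord next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    CSVRecord record = _next;
    _next = null;
    return record;
  }
}

Note that parsing one physical line at a time, as this sketch does, also drops quoted fields that span multiple lines; that is part of the data-loss trade-off mentioned above.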

Alternate Solutions

The following alternate options were considered:

  1. Switch to another library such as OpenCSV, which allows plugging in a custom iterator. [Not chosen due to: 1) maintenance overhead, 2) regression risk, and 3) an open vulnerability in the library.]
  2. Write a new parser. [Not chosen as it would require a significant time investment; this may be revisited later.]

Testing

The change is supplemented with unit tests which ensure that no regression is introduced and that the new code paths are covered. Additionally, performance was tested on a 200MB file: the current parser took 5 seconds on average and the new parser took 7 seconds on average (each run 20 times).

@rajagopr force-pushed the csv-record-reader-enhancement branch from bb65b6d to 97bd3f7 on September 1, 2023, 18:21
@codecov-commenter commented on Sep 1, 2023

Codecov Report

Merging #11487 (330faec) into master (f14700a) will increase coverage by 0.13%.
Report is 14 commits behind head on master.
The diff coverage is 91.56%.

@@             Coverage Diff              @@
##             master   #11487      +/-   ##
============================================
+ Coverage     62.92%   63.06%   +0.13%     
- Complexity     1108     1109       +1     
============================================
  Files          2318     2320       +2     
  Lines        124328   124551     +223     
  Branches      18980    19016      +36     
============================================
+ Hits          78234    78546     +312     
+ Misses        40539    40418     -121     
- Partials       5555     5587      +32     
Flag Coverage Δ
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 63.04% <91.56%> (+0.13%) ⬆️
java-17 62.92% <91.56%> (+0.14%) ⬆️
java-20 62.93% <91.56%> (+0.15%) ⬆️
temurin 63.06% <91.56%> (+0.13%) ⬆️
unittests 63.05% <91.56%> (+0.13%) ⬆️
unittests1 67.48% <0.00%> (+0.05%) ⬆️
unittests2 14.50% <91.56%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
.../pinot/plugin/inputformat/csv/CSVRecordReader.java 83.09% <91.13%> (+6.70%) ⬆️
.../plugin/inputformat/csv/CSVRecordReaderConfig.java 75.92% <100.00%> (+13.92%) ⬆️

... and 82 files with indirect coverage changes


@Jackie-Jiang added the enhancement, documentation, and Configuration (Config changes: addition/deletion/change in behavior) labels on Sep 1, 2023


/**
* Record reader for CSV file.
*/
@NotThreadSafe
Contributor (reviewer):
Do other record readers have this annotation?

Contributor Author (@rajagopr):

No, they do not. I added it as a good practice; it is purely for documentation.

//validate header for the delimiter before splitting
validateHeaderForDelimiter(delimiter, csvHeader, format);
format = format.withHeader(StringUtils.split(csvHeader, delimiter));
// do not validate header if using the line iterator
Contributor (reviewer):

I think there's no harm in running validateHeaderForDelimiter? It reads the first line and checks for the delimiter.

Contributor Author (@rajagopr):

The following problems exist with the current validation:

  1. It calls iterator.hasNext(), which is exactly what we are trying to avoid in the first place.
  2. It checks whether the record has multiple values. The user can pass in a header while a record legitimately contains a single value, so this check is not valid.
  3. It checks whether the delimiter is present in the header. This is also not valid for a single-column file such as:
id
100

As all these checks are problematic, I have not made use of the validation.
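For illustration only (this is not the actual validateHeaderForDelimiter implementation), a check of roughly this shape is what point 3 objects to, since a legitimate single-column file has no delimiter in its header line:

// Hypothetical delimiter-presence check, shown only to illustrate the objection.
static void checkHeaderContainsDelimiter(String csvHeader, char delimiter) {
  if (csvHeader.indexOf(delimiter) < 0) {
    // A single-column file (header "id", data rows like "100") trips this branch
    // even though it is perfectly valid CSV.
    throw new IllegalStateException("Header does not contain the delimiter: " + delimiter);
  }
}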

Contributor (reviewer):

In that case, we should change the validation code. My concern is more about feature parity: running validation in the existing approach while not running it on the line iterator gives us inconsistent behavior.

// do not validate header if using the line iterator
if (_useLineIterator) {
String[] header = StringUtils.split(csvHeader, delimiter);
setHeaderMap(header);
Contributor (reviewer):

This assumes that the header is well formed. Why are we not doing the validation in this case?

Contributor Author (@rajagopr):

I have yet to determine what the correct validation for a header line would be. The existing validation does not look correct.

} else {
// Read the first line and set the header
String headerLine = _bufferedReader.readLine();
String[] header = StringUtils.split(headerLine, _format.getDelimiter());
Contributor (reviewer):

Why don't we use the same approach as the regular CSV path for reading the header? I don't think we need the custom handling here. (The header is the first line of the file anyway, so we can first consume the header using the shared code and then diverge.)

Otherwise, we would see different behavior when parsing the header in some edge cases.

Contributor Author (@rajagopr):

I doubt the regular CSV parser respects this config option at all; I see it marked as a TODO in the library version that we are using (screenshot of the library source omitted).

The line iterator respects this property, and the behavior is as follows (a hedged config sketch follows the list):

  1. If a header is supplied by the client and skip header is not set, the input file is assumed to contain only data records.
  2. If a header is supplied by the client and skip header is set, the input file is assumed to contain a header record, but the user wishes to override it.
  3. If a header is not provided, the first line is treated as the header.
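A hedged config sketch of case 2 (the setter names below are assumed from the config fields discussed in this PR and are not verified against the final code):

import java.io.File;

import org.apache.pinot.plugin.inputformat.csv.CSVRecordReader;
import org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig;

public class CsvHeaderOverrideExample {
  public static void main(String[] args) throws Exception {
    CSVRecordReaderConfig config = new CSVRecordReaderConfig();
    // Case 2: the file has its own header line, but the client supplies one and
    // asks for the file's first line to be skipped.
    config.setHeader("id,name");           // assumed setter name
    config.setSkipHeader(true);            // assumed setter name
    config.setSkipUnParseableLines(true);  // assumed setter for the new flag

    CSVRecordReader reader = new CSVRecordReader();
    // init(dataFile, fieldsToRead, config); passing null reads all fields.
    reader.init(new File("/path/to/data.csv"), null, config);
    while (reader.hasNext()) {
      System.out.println(reader.next());
    }
    reader.close();
  }
}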

Contributor (@snleee), Sep 5, 2023:

Let's investigate the following and address them in a follow-up PR:

  • check whether a newer library version has the proper wiring for these options
  • keep the same behavior across the two approaches
  • do the same investigation on validateHeaderForDelimiter
  • add unit tests for these configs (skip header)

@rajagopr force-pushed the csv-record-reader-enhancement branch from 97bd3f7 to c445cbd on September 2, 2023, 05:55
@rajagopr force-pushed the csv-record-reader-enhancement branch from c445cbd to 330faec on September 2, 2023, 05:57
@snleee merged commit 3b2fd7d into apache:master on Sep 5, 2023
21 checks passed