Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modify TimeValidationTransformer to mark rows as invalid in case primary time column is out of range #11907

Conversation

gortiz
Copy link
Contributor

@gortiz gortiz commented Oct 30, 2023

As said in the title, this PR modifies TimeValidationTransformer to mark rows as invalid in case primary time column is out of range.

TimeValidationTransformer has always verified that the received row contains a value for the primary time column that is between 1971 (inclusive) and 2071 (exclusive). In case the value is outside this range, TimeValidationTransformer does:

  1. set the field to null, which will later be set by NullValueTransformer to the millis since epoch at ingestion time
  2. if tableConfig.ingestionConfig.continueOnError is false, aborts the execution
  3. otherwise log a message in debug level

The log in point 3 is not useful. In case it is disabled, no log is present. In case it is enabled, a log is printed for each invalid row. In cases where the error is present in most rows (like for example when the row contains seconds from epoch but schema is defined as millis from epoch) this log is very spammy.

We already supported a way to mark a row as incorrect which is used by most transformers but not TimeValidationTransformer. This PR modifies TimeValidationTransformer to do so.

Important to know

This PR changes the semantics of the ingestion. Before this PR, if the primary time column contains an invalid value, the field was marked as null and the value of NOW() at ingestion time was stored as default value.

After this PR, if the primary time column contains invalid value, the value is replaced with the default value. If it is real-time consumed, ServerMeter.INCOMPLETE_REALTIME_ROWS_CONSUMED will be increased.

cc @Jackie-Jiang @swaminathanmanish @snleee

@codecov-commenter
Copy link

codecov-commenter commented Oct 30, 2023

Codecov Report

Merging #11907 (e11947f) into master (55f29fe) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master   #11907   +/-   ##
=========================================
  Coverage     61.42%   61.43%           
+ Complexity     1147     1146    -1     
=========================================
  Files          2375     2375           
  Lines        128530   128531    +1     
  Branches      19853    19853           
=========================================
+ Hits          78944    78957   +13     
+ Misses        43884    43867   -17     
- Partials       5702     5707    +5     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (ø)
integration <0.01% <0.00%> (ø)
integration1 <0.01% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 61.40% <100.00%> (+0.04%) ⬆️
java-21 34.73% <0.00%> (-26.56%) ⬇️
skip-bytebuffers-false 61.42% <100.00%> (+0.01%) ⬆️
skip-bytebuffers-true 34.71% <0.00%> (-26.55%) ⬇️
temurin 61.43% <100.00%> (+<0.01%) ⬆️
unittests 61.42% <100.00%> (+<0.01%) ⬆️
unittests1 46.65% <0.00%> (+0.01%) ⬆️
unittests2 27.58% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...l/recordtransformer/TimeValidationTransformer.java 77.77% <100.00%> (+0.50%) ⬆️

... and 16 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@@ -103,6 +103,7 @@ public GenericRow transform(GenericRow record) {
if (_continueOnError) {
LOGGER.debug(errorMessage);
record.putValue(_timeColumnName, null);
record.putValue(GenericRow.INCOMPLETE_RECORD_KEY, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im fine with this change, but its better to check with the authors on why they did it this way.

So with this change, we will skip the record (in NullValueTransformer), instead of assigning currentTime right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, this is the correct fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually causing the even to be skipped? I tried tracing this, and I can't find anywhere this causes skipping

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't cause the row being skipped, but only log warnings on incomplete records. Time column will be filled with default value because the ingested value is invalid

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's what I thought. Might be worth updating the PR description.

This part seems misleading
After this PR, if the primary time column contains and invalid value, the row will be skipped and ServerMeter.INCOMPLETE_REALTIME_ROWS_CONSUMED will be increased.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the description. We do emit metric for real-time consumed rows

Copy link
Contributor

@swaminathanmanish swaminathanmanish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please check with @snleee or @Jackie-Jiang as well

@Jackie-Jiang Jackie-Jiang merged commit c97cff3 into apache:master Oct 30, 2023
19 checks passed
@gortiz gortiz deleted the TimeValidationTransformer-marks-as-incomplete branch October 31, 2023 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants