-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
modify TimeValidationTransformer to mark rows as invalid in case primary time column is out of range #11907
modify TimeValidationTransformer to mark rows as invalid in case primary time column is out of range #11907
Conversation
…ary time column is invalid
Codecov Report
@@ Coverage Diff @@
## master #11907 +/- ##
=========================================
Coverage 61.42% 61.43%
+ Complexity 1147 1146 -1
=========================================
Files 2375 2375
Lines 128530 128531 +1
Branches 19853 19853
=========================================
+ Hits 78944 78957 +13
+ Misses 43884 43867 -17
- Partials 5702 5707 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 16 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -103,6 +103,7 @@ public GenericRow transform(GenericRow record) { | |||
if (_continueOnError) { | |||
LOGGER.debug(errorMessage); | |||
record.putValue(_timeColumnName, null); | |||
record.putValue(GenericRow.INCOMPLETE_RECORD_KEY, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im fine with this change, but its better to check with the authors on why they did it this way.
So with this change, we will skip the record (in NullValueTransformer), instead of assigning currentTime right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, this is the correct fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this actually causing the even to be skipped? I tried tracing this, and I can't find anywhere this causes skipping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't cause the row being skipped, but only log warnings on incomplete records. Time column will be filled with default value because the ingested value is invalid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's what I thought. Might be worth updating the PR description.
This part seems misleading
After this PR, if the primary time column contains and invalid value, the row will be skipped and ServerMeter.INCOMPLETE_REALTIME_ROWS_CONSUMED will be increased.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the description. We do emit metric for real-time consumed rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please check with @snleee or @Jackie-Jiang as well
As said in the title, this PR modifies
TimeValidationTransformer
to mark rows as invalid in case primary time column is out of range.TimeValidationTransformer
has always verified that the received row contains a value for the primary time column that is between 1971 (inclusive) and 2071 (exclusive). In case the value is outside this range,TimeValidationTransformer
does:NullValueTransformer
to the millis since epoch at ingestion timetableConfig.ingestionConfig.continueOnError
is false, aborts the executionThe log in point 3 is not useful. In case it is disabled, no log is present. In case it is enabled, a log is printed for each invalid row. In cases where the error is present in most rows (like for example when the row contains seconds from epoch but schema is defined as millis from epoch) this log is very spammy.
We already supported a way to mark a row as incorrect which is used by most transformers but not
TimeValidationTransformer
. This PR modifiesTimeValidationTransformer
to do so.Important to know
This PR changes the semantics of the ingestion. Before this PR, if the primary time column contains an invalid value, the field was marked as null and the value of
NOW()
at ingestion time was stored as default value.After this PR, if the primary time column contains invalid value, the value is replaced with the default value. If it is real-time consumed,
ServerMeter.INCOMPLETE_REALTIME_ROWS_CONSUMED
will be increased.cc @Jackie-Jiang @swaminathanmanish @snleee