New API added to referential integrity to allow for row level annotation #466
- The provided primary data frame will have a column added to it to indicate a true/false value.
- The value will be true if the values of the provided columns for that row exist in the reference dataframe. Otherwise, the value will be false.
- Added various tests, including tests for nested columns.
- Refactored the parameter validation from the regular row level check, so that the same validation can be done in the new function as well.
- Updated scaladoc of the two APIs.
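The annotation semantics described above can be sketched without Spark, using plain Scala collections. This is only an illustration under stated assumptions: `RowLevelSketch`, `annotateRows`, the `Row` alias, and the validation message are hypothetical names, not the library's API, and the real implementation operates on DataFrames and may validate parameters differently.

```scala
object RowLevelSketch {
  // Hypothetical stand-in for a DataFrame row: column name -> value.
  type Row = Map[String, Any]

  // Adds a boolean outcome column to every primary row. The value is true
  // when the row's values for primaryCols appear, as a tuple, among the
  // reference rows' values for referenceCols.
  def annotateRows(
      primary: Seq[Row],
      primaryCols: Seq[String],
      reference: Seq[Row],
      referenceCols: Seq[String],
      outcomeCol: String = "row_level_outcome"): Either[String, Seq[Row]] = {
    // Assumed validation: both column lists non-empty and of equal length.
    if (primaryCols.isEmpty || primaryCols.size != referenceCols.size) {
      Left("Column lists must be non-empty and of equal length")
    } else {
      // Distinct value tuples present in the reference data.
      val known: Set[Seq[Any]] = reference.map(r => referenceCols.map(r)).toSet
      Right(primary.map(r => r + (outcomeCol -> known.contains(primaryCols.map(r)))))
    }
  }
}
```

Because each primary row is only extended with one extra key, row order is preserved and dropping the outcome column gives back the original rows, which is the property the review comments below ask the tests to assert.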
val result = ReferentialIntegrity.subsetCheckRowLevel(ds1, cols, ds2, cols, Some(outcomeCol))
assert(result.isRight)
val outcomes = result.right.get.select(outcomeCol).collect().toSeq.map { r => r.get(0) }
assert(outcomes == Seq(true, true, false, true))
I'm a little confused as to why it's true, true, false, true when the last row is the incorrect one.
Fixed.
val ds2 = rdd2.toDF("id", "state name", "state")

val cols = Seq("state name", "state")
val outcomeCol = "row_level_outcome"
Can you add an assertion for the columns returned? It should be exactly ("id", "state name", "state", "row_level_outcome")
Done.
val result = ReferentialIntegrity.subsetCheckRowLevel(ds1, cols, ds2, cols, Some(outcomeCol))
assert(result.isRight)
val outcomes = result.right.get.select(outcomeCol).collect().toSeq.map { r => r.get(0) }
assert(outcomes == Seq(true, true, false, true))
Maybe also assert that if I drop "row_level_outcome", I get a dataframe with exactly what's in rdd1?
Done.
Would it be possible to also add tests in the
- The order of the outcomes needed to be updated to reflect the order of the data that was set up.
- Added more assertions based on feedback.
This is a standalone utility, and we are not using it in the VerificationSuite. Once the verification suite supports multiple dataframes, we will add this check and integrate the row level results.
LGTM
…ion (#466)

* New API added to referential integrity to allow for row level annotation
- The provided primary data frame will have a column added to it to indicate a true/false value.
- The value will be true if the values of the provided columns for that row exist in the reference dataframe. Otherwise, the value will be false.
- Added various tests, including tests for nested columns.
- Refactored the parameter validation from the regular row level check, so that the same validation can be done in the new function as well.
- Updated scaladoc of the two APIs.

* Updated tests
- The order of the outcomes needed to be updated to reflect the order of the data that was set up.
- Added more assertions based on feedback.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.