New API added to referential integrity to allow for row level annotation #466

rdsharma26 · 2023-04-12T14:43:52Z

Issue #, if available:

No associated issue. This is a WIP experimental utility.

Description of changes:

The new API adds a boolean column to the provided primary dataframe.
The value is true if the row's values for the provided columns exist in the reference dataframe; otherwise it is false.
Added tests, including tests for nested columns.
Refactored the parameter validation out of the regular row level check so that the new function can reuse it.
Updated the scaladoc of the two APIs.
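
The annotation semantics described above can be sketched with plain Scala collections. This is a minimal model under stated assumptions, not the Deequ implementation: `RowLevelSketch`, `annotate`, and the sample rows are all hypothetical names and data invented for illustration, and rows are modeled as `Map[String, Any]` rather than Spark rows.

```scala
// Plain-collections model of row-level referential integrity annotation.
// Hypothetical names and data throughout; this is not Deequ's implementation.
object RowLevelSketch {
  type Row = Map[String, Any]

  // Appends `outcomeCol` to every primary row: true when the row's values for
  // `primaryCols` appear, as a tuple, under `referenceCols` in some reference row.
  def annotate(
      primary: Seq[Row],
      primaryCols: Seq[String],
      reference: Seq[Row],
      referenceCols: Seq[String],
      outcomeCol: String): Seq[Row] = {
    require(primaryCols.nonEmpty && primaryCols.size == referenceCols.size,
      "column lists must be non-empty and of equal length")
    // Build the set of reference key tuples once, so each primary row is one lookup.
    val referenceKeys: Set[Seq[Any]] =
      reference.map(row => referenceCols.map(row)).toSet
    // `map` preserves input order, so outcome i describes primary row i.
    primary.map { row =>
      row + (outcomeCol -> referenceKeys.contains(primaryCols.map(row)))
    }
  }

  // Made-up sample data: the last primary row has no match in the reference.
  val primary: Seq[Row] = Seq(
    Map("id" -> 1, "state name" -> "New Jersey", "state" -> "NJ"),
    Map("id" -> 2, "state name" -> "New York", "state" -> "NY"),
    Map("id" -> 3, "state name" -> "Washington", "state" -> "WA"),
    Map("id" -> 4, "state name" -> "California", "state" -> "ZZ"))

  val reference: Seq[Row] = Seq(
    Map("id" -> 10, "state name" -> "New Jersey", "state" -> "NJ"),
    Map("id" -> 11, "state name" -> "New York", "state" -> "NY"),
    Map("id" -> 12, "state name" -> "Washington", "state" -> "WA"),
    Map("id" -> 13, "state name" -> "California", "state" -> "CA"))

  val cols: Seq[String] = Seq("state name", "state")
  val annotated: Seq[Row] = annotate(primary, cols, reference, cols, "row_level_outcome")
}
```

In this model the outcome is positional: each value lines up with the primary row at the same index, so an expected-outcome sequence in a test has to follow the order in which the primary data was set up.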

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

eycho-am · 2023-04-13T19:46:01Z

src/test/scala/com/amazon/deequ/comparison/ReferentialIntegrityTest.scala

+      val result = ReferentialIntegrity.subsetCheckRowLevel(ds1, cols, ds2, cols, Some(outcomeCol))
+      assert(result.isRight)
+      val outcomes = result.right.get.select(outcomeCol).collect().toSeq.map { r => r.get(0) }
+      assert(outcomes == Seq(true, true, false, true))


I'm a little confused as to why it's true, true, false, true when the last row is the incorrect one.

mentekid · 2023-04-13T19:44:49Z

src/test/scala/com/amazon/deequ/comparison/ReferentialIntegrityTest.scala

+      val ds2 = rdd2.toDF("id", "state name", "state")
+
+      val cols = Seq("state name", "state")
+      val outcomeCol = "row_level_outcome"


Can you add an assertion for the columns returned? It should be exactly ("id", "state name", "state", "row_level_outcome")

mentekid · 2023-04-13T19:46:38Z

src/test/scala/com/amazon/deequ/comparison/ReferentialIntegrityTest.scala

+      val result = ReferentialIntegrity.subsetCheckRowLevel(ds1, cols, ds2, cols, Some(outcomeCol))
+      assert(result.isRight)
+      val outcomes = result.right.get.select(outcomeCol).collect().toSeq.map { r => r.get(0) }
+      assert(outcomes == Seq(true, true, false, true))


Maybe also assert that if I drop "row_level_outcome", I get a dataframe with exactly what's in rdd1?
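
The two assertions suggested in this review thread (exact columns, and round-tripping after dropping the outcome column) can be sketched against a toy table type. This is only an illustration under stated assumptions: `AssertionSketch` and `Table` are hypothetical stand-ins for a DataFrame, the data and outcome values are made up, and the actual test would run against Spark DataFrames instead.

```scala
// Toy stand-in for a DataFrame: ordered column names plus rows of values.
// Hypothetical model for illustration; the real test would use Spark DataFrames.
object AssertionSketch {
  final case class Table(columns: Seq[String], rows: Seq[Seq[Any]]) {
    // Appends one column; `values` supplies one entry per existing row.
    def withColumn(name: String, values: Seq[Any]): Table = {
      require(values.size == rows.size, "one value per row is required")
      Table(columns :+ name, rows.zip(values).map { case (row, v) => row :+ v })
    }

    // Removes a column by name; returns the table unchanged if it is absent.
    def drop(name: String): Table = {
      val idx = columns.indexOf(name)
      if (idx < 0) this
      else Table(columns.patch(idx, Nil, 1), rows.map(_.patch(idx, Nil, 1)))
    }
  }

  val outcomeCol = "row_level_outcome"

  val original: Table = Table(
    Seq("id", "state name", "state"),
    Seq(Seq(1, "New Jersey", "NJ"), Seq(2, "California", "ZZ")))

  // Outcome values are made up for this example.
  val annotated: Table = original.withColumn(outcomeCol, Seq(true, false))

  // Check 1: the schema is exactly the original columns followed by the outcome column.
  val columnsAsExpected: Boolean =
    annotated.columns == (original.columns :+ outcomeCol)

  // Check 2: dropping the outcome column gives back exactly the original table.
  val roundTrips: Boolean = annotated.drop(outcomeCol) == original
}
```

Together the two checks pin down that the annotation only appends a column and leaves the primary data untouched.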

eycho-am · 2023-04-13T19:48:09Z

Would it be possible to also add tests in the VerificationSuiteTests to see how the row level results for ref integrity are added with other row level results?

- The order of the outcomes needed to be updated to reflect the order of the data that was set up.
- Added more assertions based on feedback.

rdsharma26 · 2023-04-14T20:35:37Z

> Would it be possible to also add tests in the VerificationSuiteTests to see how the row level results for ref integrity are added with other row level results?

This is a standalone utility and we are not using it in the VerificationSuite. Once the verification suite supports multiple dataframes, we will add this check and integrate the row level results.

eycho-am

LGTM

…ion (#466)

* New API added to referential integrity to allow for row level annotation
  - The provided primary data frame will have a column added to it to indicate a true/false value.
  - The value will be true if the values of the provided columns for that row exist in the reference dataframe. Otherwise, the value will be false.
  - Added various tests, including tests for nested columns.
  - Refactored the parameter validation from the regular row level check, so that the same validation can be done in the new function as well.
  - Updated scaladoc of the two APIs.
* Updated tests
  - The order of the outcomes needed to be updated to reflect the order of the data that was set up.
  - Added more assertions based on feedback.

eycho-am reviewed Apr 13, 2023

mentekid reviewed Apr 13, 2023

Updated tests

ed5575d

eycho-am approved these changes Apr 18, 2023

rdsharma26 merged commit 5aef696 into awslabs:master Apr 18, 2023

rdsharma26 deleted the ref-integrity-row-level branch April 18, 2023 18:36
