New API added to referential integrity to allow for row level annotation #466

Merged: 2 commits merged into awslabs:master from ref-integrity-row-level on Apr 18, 2023

rdsharma26 (Contributor) commented Apr 12, 2023

Issue #, if available:

  • No associated issue. This is a WIP experimental utility.

Description of changes:

  • Adds a new column to the provided primary DataFrame that holds a true/false outcome for each row.
  • The value is true if that row's values for the provided columns exist in the reference DataFrame; otherwise it is false.
  • Added various tests, including tests for nested columns.
  • Refactored the parameter validation from the regular row level check, so that the same validation can be done in the new function as well.
  • Updated scaladoc of the two APIs.
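
The row-level semantics described above can be illustrated without Spark. The following is a minimal sketch on plain Scala collections, not the implementation in this PR: `RowLevelSketch`, `Row`, and `annotate` are hypothetical names, rows are modelled as maps from column name to value, and Spark-specific behaviour such as null handling and nested columns is not modelled.

```scala
object RowLevelSketch {
  // Hypothetical stand-in for a DataFrame row: column name -> value.
  type Row = Map[String, Any]

  // Returns the primary rows with `outcomeCol` added. The outcome is true when
  // the row's values for `primaryCols`, taken together as one composite key,
  // appear among the reference rows' values for `referenceCols`.
  // A column name missing from a row throws NoSuchElementException.
  def annotate(
      primary: Seq[Row],
      primaryCols: Seq[String],
      reference: Seq[Row],
      referenceCols: Seq[String],
      outcomeCol: String): Seq[Row] = {
    // Build the set of composite keys once, so each lookup is a set membership test.
    val referenceKeys: Set[Seq[Any]] =
      reference.map(r => referenceCols.map(r)).toSet

    primary.map { row =>
      val key: Seq[Any] = primaryCols.map(row)
      row + (outcomeCol -> referenceKeys.contains(key))
    }
  }
}
```

In this sketch the values are compared as a composite key, so a row passes only when the whole combination exists in the reference data, not when each value exists somewhere independently.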

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

val result = ReferentialIntegrity.subsetCheckRowLevel(ds1, cols, ds2, cols, Some(outcomeCol))
assert(result.isRight)
val outcomes = result.right.get.select(outcomeCol).collect().toSeq.map { r => r.get(0) }
assert(outcomes == Seq(true, true, false, true))
Contributor

I'm a little confused as to why the expected outcomes are true, true, false, true when the last row is the incorrect one.

Contributor Author

Fixed.

val ds2 = rdd2.toDF("id", "state name", "state")

val cols = Seq("state name", "state")
val outcomeCol = "row_level_outcome"
Contributor

Can you add an assertion for the columns returned? It should be exactly ("id", "state name", "state", "row_level_outcome")

Contributor Author

Done.

val result = ReferentialIntegrity.subsetCheckRowLevel(ds1, cols, ds2, cols, Some(outcomeCol))
assert(result.isRight)
val outcomes = result.right.get.select(outcomeCol).collect().toSeq.map { r => r.get(0) }
assert(outcomes == Seq(true, true, false, true))
Contributor

Maybe also assert that dropping "row_level_outcome" yields a DataFrame with exactly what's in rdd1?

Contributor Author

Done.
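
The two assertions requested in this review thread (exact output columns, and the original data recovered after dropping the outcome column) amount to one invariant: the check only appends a column. Below is a rough sketch of that invariant on a toy model, not the test code added in this PR. `Frame`, `withColumn`, `drop`, and `preservesPrimary` are hypothetical stand-ins written for illustration and are not Spark or Deequ APIs.

```scala
object OutcomeColumnInvariants {
  // Hypothetical stand-in for a DataFrame: ordered column names plus row values.
  final case class Frame(columns: Seq[String], rows: Seq[Seq[Any]]) {
    // Removes the named column and its values; a no-op if the column is absent.
    def drop(col: String): Frame = {
      val i = columns.indexOf(col)
      if (i < 0) this
      else Frame(columns.patch(i, Nil, 1), rows.map(_.patch(i, Nil, 1)))
    }

    // Appends a column at the end, one value per existing row.
    def withColumn(col: String, values: Seq[Any]): Frame = {
      require(values.length == rows.length, "one value per row")
      Frame(columns :+ col, rows.zip(values).map { case (r, v) => r :+ v })
    }
  }

  // True when `annotated` is exactly `primary` plus a trailing `outcomeCol`:
  // same column order, and identical data once the outcome column is dropped.
  def preservesPrimary(primary: Frame, annotated: Frame, outcomeCol: String): Boolean =
    (annotated.columns == (primary.columns :+ outcomeCol)) &&
      (annotated.drop(outcomeCol) == primary)
}
```

Checking both conditions guards against an implementation that reorders columns or loses or duplicates rows while computing the outcome.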

@eycho-am
Contributor

Would it be possible to also add tests in the VerificationSuiteTests to see how the row level results for ref integrity are added with other row level results?

- The order of the outcomes needed to be updated, to reflect the order of the data that was setup.
- Added more assertions based on feedback.
@rdsharma26
Contributor Author

Would it be possible to also add tests in the VerificationSuiteTests to see how the row level results for ref integrity are added with other row level results?

This is a standalone utility and we are not using it in the VerificationSuite. Once the verification suite supports multiple dataframes, we will add this check and integrate the row level results.

Contributor

@eycho-am eycho-am left a comment

LGTM

@rdsharma26 rdsharma26 merged commit 5aef696 into awslabs:master Apr 18, 2023
@rdsharma26 rdsharma26 deleted the ref-integrity-row-level branch April 18, 2023 18:36
rdsharma26 added a commit that referenced this pull request Apr 27, 2023
…ion (#466)

* New API added to referential integrity to allow for row level annotation

- The provided primary data frame will have a column added to it to indicate a true/false value.
- The value will be true if the values of the provided columns for that row exist in the reference dataframe. Otherwise, the value will be false.
- Added various tests, including tests for nested columns.
- Refactored the parameter validation from the regular row level check, so that the same validation can be done in the new function as well.
- Updated scaladoc of the two APIs.

* Updated tests

- The order of the outcomes needed to be updated, to reflect the order of the data that was setup.
- Added more assertions based on feedback.
rdsharma26 added a commit that referenced this pull request Apr 16, 2024