Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Min/Max] Apply filtered row behavior at the row level evaluation #543

Merged
merged 4 commits into from
Mar 8, 2024

Conversation

rdsharma26
Copy link
Contributor

Description of changes:

  • This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.
case FilteredRowOutcome.TRUE => true
case FilteredRowOutcome.NULL => null
}
case None => null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had discussed that the default analyzerOptions behavior should be FilteredRowOutcome.TRUE, should we modify 933 to be true? (By default filtered rows are true instead of null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was testing out another scenario that may be problematic.
Given the following dataframe:

+------+----+----+----+----+-----+-----+
|item  |att1|att2|val1|val2|rule4|rule5|
+------+----+----+----+----+-----+-----+
|1     |a   |f   |1   |1   |true |true |
|22    |b   |d   |2   |NULL|true |true |
|333   |a   |NULL|3   |3   |true |true |
|4444  |a   |f   |4   |4   |true |true |
|55555 |b   |NULL|5   |NULL|true |true |
|666666|a   |f   |6   |6   |true |true |
+------+----+----+----+----+-----+-----+

where

val analyzerOptions = Option(AnalyzerOptions(filteredRow = FilteredRowOutcome.TRUE))

val min = new Check(CheckLevel.Error, "rule4")
  .hasMin("val2", _ > 0, None, analyzerOptions)
  .where("val1 < 4")
val max = new Check(CheckLevel.Error, "rule5")
  .hasMax("val2", _ < 4, None, analyzerOptions)
  .where("val1 < 4")

You'll see that rows 1,2,3 should be skipped -> True
Row 5 should be null as val2 is a null value there.
However, with the above method we convert all nulls to true/null - this doesn't distinguish between null values due to being filtered or null values due to null column values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eycho-am for the valuable feedback. The latest PR revision contains a new structure for the column that helps maintain the "source" of a row, whether it is in scope and filtered out. That will help in evaluating the correct outcome for each row.

@@ -374,11 +371,11 @@ class VerificationSuiteTest extends WordSpec with Matchers with SparkContextSpec

// filtered rows 1, 2, 3 (where item > 3)
val minRowLevel = resultData.select(expectedColumn4).collect().map(r => r.getAs[Any](0))
assert(Seq(true, true, true, true, true, true).sameElements(minRowLevel))
assert(Seq(null, null, null, true, true, true).sameElements(minRowLevel))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these test were written with the intention that without specifying analyzer options, the default behavior would be filtered rows are true - related to above comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the change.

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.
case FilteredRowOutcome.TRUE => true
case FilteredRowOutcome.NULL => null
}
case None => null
Copy link
Contributor

@eycho-am eycho-am Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: similar comment for the below test-case, should the default behavior for filtered rows be true? Or are we making a decision here that they'll be null by default?

Looks good to me otherwise, I think we just need to make final a decision on this point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are keeping them as is, unless the analyzer option says otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on offline discussion, updated the logic. Thanks @eycho-am for the feedback.

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.
Copy link
Contributor

@eycho-am eycho-am left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Not having it can cause match error.
@rdsharma26 rdsharma26 merged commit a6c218c into awslabs:master Mar 8, 2024
1 check passed
@rdsharma26 rdsharma26 deleted the min-max-row-level-results branch March 8, 2024 18:05
rdsharma26 added a commit that referenced this pull request Mar 8, 2024
* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
rdsharma26 added a commit that referenced this pull request Mar 8, 2024
* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
rdsharma26 added a commit that referenced this pull request Apr 16, 2024
* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
rdsharma26 added a commit that referenced this pull request Apr 16, 2024
* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
rdsharma26 added a commit that referenced this pull request Apr 16, 2024
* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
rdsharma26 added a commit that referenced this pull request Apr 17, 2024
* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
rdsharma26 added a commit that referenced this pull request Apr 17, 2024
* [Min/Max] Apply filtered row behavior at the row level evaluation

- This changes from applying the behavior at the analyzer level. It allows us to prevent the usage of MinValue/MaxValue as placeholder values for filtered rows.

* Improved the separation of null rows, based on their source

- Whether the outcome for a row is null because of being filtered out or due to the target column being null, is now stored in the outcome column itself.
- We could have reused the placeholder value to find out if a row was originally filtered out, but that would not work if the actual value in the row was the same originally.

* Mark filtered rows as true

We recently fixed the outcome of filtered rows and made them default to true instead of false, which was a bug earlier. This change maintains that behavior.

* Added null behavior - empty string to match block

Not having it can cause match error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants