Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Schema Diff Error Message #160

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

zeotuan
Copy link
Collaborator

@zeotuan zeotuan commented Sep 28, 2024

What changed:

  • Improve Schema Diff Error message with color on different element
  • Fix Dataset content color diff to properly display for Dataset[Array[_]] cases
  • Display diff for each element of Dataset[Iterable[_]] cases

Schema Diff before:

image

Schema Diff now:

image

Dispaly for Dataframe diff

image

Display for Dataset diff

image

@zeotuan zeotuan changed the title Add Table support for StructField Diff Improve Schema Diff Error Message Sep 28, 2024
@zeotuan zeotuan marked this pull request as ready for review October 1, 2024 08:45
@zeotuan zeotuan requested a review from alfonsorr October 1, 2024 08:45
else if (expectedSeq.isEmpty)
List(Red(actualSeq.mkString("[", ",", "]")), Green("[]"))
List(Red(prodToString(actualSeq)), Green(emptyProd))
else {
val withEquals = actualSeq
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest skipping this part if the Type is not Row, if it´s T the schema will be compared in compile time and if the type T is equal to the DataSet, it will fail when we execute .collect() in runtime.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought by this point we only work with in memory data - Seq[T] ?
or you mean in case of showProductDiff[Dataset[...]]

Comment on lines 70 to 73
val a = actualDS.collect().toSeq
val e = expectedDS.collect().toSeq
if (!a.approximateSameElements(e, equals)) {
val arr = ("Actual Content", "Expected Content")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont see the improvement, any Dataset[T] can be transformed into a DataFrame, the only change here is that it will work with the original Class T` that was provided, but the result will be the same, only with the Class name prefixing each line. If the only thing was to prevent the .asRows method we could have done this and not any other change will be needed.

Suggested change
val a = actualDS.collect().toSeq
val e = expectedDS.collect().toSeq
if (!a.approximateSameElements(e, equals)) {
val arr = ("Actual Content", "Expected Content")
val a = actualDS.toDF.collect() //now its a Row
val e = expectedDS.toDF.collect()
if (!a.approximateSameElements(e, equals)) {
val arr = ("Actual Content", "Expected Content")
val msg = "Diffs\n" ++ DataframeUtil.showDataframeDiff(arr, a, e, truncate)

I see this as more unified and we wont have to check if its a Product or not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShowProductDiff is an update/extension to the original showDataframeDiff so we can have a nicely format table display for any Case Class and additionally Row. This was done so we have nicely format table for StructField

And since showProductDiff can now also display differences for any Case Class I figured we no longer need to convert a Dataset[T] into a Dataframe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these changes could be simplified by only adding a single parameter in the header because at the end the processing doesn't care if it's a row or a case class, it only requires the list of elements:

private[mrpowers] def showDataframeDiff(
      header: (String, String),
      actual: Seq[Row],
      expected: Seq[Row],
      truncate: Int = 20,
      minColWidth: Int = 3,
     className = Option[String]
  ): String = {
   val (className, lBracket, rBracket)  = className match { 
     case None => ("", "[", "]") 
     case Some(cn) (cn, "(", ")")
    }

Use these 3 elements to make the representation, and when you call this function, pass the class name if it's not a Row.
I don't see any other benefit of getting the element as T because the comparison is not done here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we also use it for Seq[StructField] for displaying schema fields won't we have to make actual and expected Seq[Any] ?

private[mrpowers] def showDataframeDiff(
      header: (String, String),
      actual: Seq[Any],
      expected: Seq[Any],
      truncate: Int = 20,
      minColWidth: Int = 3,
     className = Option[String]
  ): String = {
   val (className, lBracket, rBracket)  = className match { 
     case None => ("", "[", "]") 
     case Some(cn) (cn, "(", ")")
    }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def assertSmallDatasetContentEquality[T: ClassTag](actualDS: Dataset[T], expectedDS: Dataset[T], truncate: Int, equals: (T, T) => Boolean): Unit = {
    val a = actualDS.collect().toSeq
    val e = expectedDS.collect().toSeq
    if (!a.approximateSameElements(e, equals)) {
      val arr = ("Actual Content", "Expected Content")
      val runTimeClass                     = implicitly[ClassTag[T]].runtimeClass
      val prefix = if (runTimeClass == classOf[Row]) None else Some(runTimeClass.getSimpleName)
      val msg = "Diffs\n" ++ DataframeUtil.showDataframeDiff(arr, a.asRows, e.asRows, truncate, prefix)
      throw DatasetContentMismatch(msg)
    }
  }

If you don't want to do this here, move it inside showDataframeDiff, you can put the type parameter but you ar not forced to use it. But I see it weird to have the signature of your function as:

private[mrpowers] def showDataframeDiff[T:ClassTag](
      header: (String, String),
      actual: Seq[Any],
      expected: Seq[Any],
      truncate: Int = 20,
      minColWidth: Int = 3
  ): String

Copy link
Collaborator Author

@zeotuan zeotuan Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer putting it in showProductDiff since it will also need to be on betterSchemaMismatchMessage. Do you think we should still use Seq[T]? I think that would make it at least safer for the input Type. ofcourse unless someone decided to use Any

Copy link
Collaborator

@alfonsorr alfonsorr Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok to change the name, it's called from a dataset method. To keep the [T], this is not a public method, it carries the type and the compiler will validate the schema of the actual and expected dataframe. The only problem is if the conversion to DataSet[T] is not correct, e.g. :

assertSmallDataset(spark.table("aTable").as[MyType], expected)

If the table doesn't match, the error will be raised with an exception in the .as[T] method in runtime.
Twhy reason I say if we detect a non-Row DataSet, to skschema validation schema because it's already done.

I see it simpler if you know the exact type of the data as a Row only, and you only have one way to work with it. If you still think this could be good I would suggest first making more tests, not only with products but also with datasets of only one type like DataSet[String] DataSet[Int] DataSet[Timestamp] DataSet[Array[Int]]. These types are not products and in some cases, I don't know how will it work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. I forgot that people can total do Dataset[Array[Int]]. I did some test, and it support single value type correctly but not Array[Int].
Actually, currently our main branch also has issue with Dataset[Seq[Int]]. let's me try fixing that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alfonsorr I fixed issue with Dataset[Array[Int]] and also add test cases for String, Int, Array, Seq cases

@zeotuan zeotuan requested a review from alfonsorr October 1, 2024 22:11
@zeotuan zeotuan self-assigned this Oct 1, 2024
@zeotuan zeotuan requested a review from alfonsorr October 6, 2024 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants