hasPattern I think is broken #152

Closed
gerileka opened this issue Sep 6, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@gerileka

gerileka commented Sep 6, 2023

I am trying to follow this tutorial using the master version of the package.

"# Announcing the `hasPattern` Rule feature! \n",

Running the following line raises the error below:

        check.hasPattern(column='email',
                         pattern=r".*@baz.com",
                         assertion=lambda x: x == 1/3)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[37], line 9
      1 check = Check(spark, CheckLevel.Error, "Integrity checks")
      3 checkResult = VerificationSuite(spark) \
      4     .onData(df) \
      5     .addCheck(
      6         check.hasPattern(column='email',
      7                          pattern=r".*@baz.com",
      8                          assertion=lambda x: x == 1/3) \
----> 9         .hasPattern(column='a',
     10                          pattern=r"ba(r|z)",
     11                          assertion=lambda x: x == 0/3) \
     12         .hasPattern(column='email',
     13                      pattern=r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""",
     14                      assertion=lambda x: x == 1.0)) \
     15     .run()

AttributeError: 'NoneType' object has no attribute 'hasPattern'

This happens because hasPattern currently has an empty body: only the signature is present, so the call returns None and the next chained call fails. Is this function still supported?

def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
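
To illustrate why an empty method body breaks the chain: a Python method that contains only a docstring returns None, so the next call in a fluent chain is made on None. The two classes below are hypothetical stand-ins, not PyDeequ's actual Check class, and only sketch that failure mode.

```python
class EmptyBodyCheck:
    """Hypothetical stand-in for a Check whose hasPattern has no body."""

    def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
        """Docstring only: with no return statement, the call returns None."""


class ChainableCheck:
    """Hypothetical stand-in whose hasPattern records the rule and returns self."""

    def __init__(self):
        self.rules = []

    def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
        self.rules.append((column, pattern))
        return self  # returning self is what makes chaining possible


def chain_two_patterns(check):
    """Chain two hasPattern calls, as the tutorial snippet does."""
    return check.hasPattern("email", r".*@baz.com").hasPattern("a", r"ba(r|z)")


try:
    chain_two_patterns(EmptyBodyCheck())
    error_message = None
except AttributeError as exc:
    # The second hasPattern call was attempted on None.
    error_message = str(exc)

chained = chain_two_patterns(ChainableCheck())
```

With the empty body, the second call fails with the same kind of AttributeError as in the traceback above; once the method returns self, both rules are recorded.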

@gerileka
Author

gerileka commented Sep 6, 2023

I think a fix would look like the following, which existed in previous versions:

    def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
        """
        Checks for pattern compliance. Given a column name and a regular expression, defines a
        Check on the average compliance of the column's values to the regular expression.

        :param str column: Column in DataFrame to be checked
        :param Regex pattern: The regular expression that the column's values are
                matched against.
        :param lambda assertion: A function with an int or float parameter.
        :param str name: A name for the pattern constraint.
        :param str hint: A hint that states why a constraint could have failed.
        :return: hasPattern self: A Check object that runs the condition on the column.
        """
        assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion if assertion else lambda x: x == 1)
        name = self._jvm.scala.Option.apply(name)
        hint = self._jvm.scala.Option.apply(hint)
        pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
        self._Check = self._Check.hasPattern(column, pattern_regex, assertion_func, name, hint)
        return self
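
As a rough illustration of what "average compliance" means in the docstring above, here is a pure-Python sketch that computes the fraction of values matching a regular expression and then applies an assertion to that fraction. It is not how Deequ computes the metric on Spark: the helper name `pattern_compliance`, the sample data, the use of `re.search`, and the treatment of None values as non-matching are all assumptions made for the example.

```python
import re


def pattern_compliance(values, pattern):
    """Fraction of values in which the pattern is found at least once."""
    if not values:
        raise ValueError("values must not be empty")
    regex = re.compile(pattern)
    # None values are counted as non-matching (an assumption for this sketch).
    matches = sum(1 for value in values if value is not None and regex.search(value))
    return matches / len(values)


# Hypothetical sample column; the tutorial's actual data is not shown in this issue.
emails = ["foo@baz.com", "bar@example.org", "qux@example.net"]

fraction = pattern_compliance(emails, r".*@baz.com")

# The assertion receives the compliance fraction, as in the tutorial snippet.
passes = (lambda x: x == 1 / 3)(fraction)
```

Under these assumptions, one of the three sample values matches, so an assertion of `x == 1/3` holds, while the default of full compliance would not.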

@chenliu0831
Contributor

I just merged #66, which should address this. Once CI passes on master, feel free to test again.

chenliu0831 added the bug label on Sep 7, 2023
@gerileka
Author

gerileka commented Sep 7, 2023

> I just merged #66, which should address this. Once CI passes on master, feel free to test again.

Hello, thanks for your quick response.

I now get this error with the new implementation, @chenliu0831:

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
Cell In[17], line 6
      1 check = Check(spark, CheckLevel.Warning, "Review Check")
      3 checkResult = (VerificationSuite(spark) 
      4     .onData(orders_reference_mock) 
      5     .addCheck(
----> 6         check 
      7         .hasPattern(column = "concept_id", pattern="[0-9a-fA-F]")
      8         .isUnique("id")
      9         .hasPattern(column = "id", pattern=r"[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}")
     10         .hasMin("gtv", lambda x: x == 30.0) 
     11         .hasMax("gtv", lambda x: x == 50.0) 
     12     )
     13     .run())
     15 checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)

File ~/.local/lib/python3.10/site-packages/pydeequ/checks.py:568, in Check.hasPattern(self, column, pattern, assertion, name, hint)
    554 def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    555     """
    556     Checks for pattern compliance. Given a column name and a regular expression, defines a
    557     Check on the average compliance of the column's values to the regular expression.
   (...)
    565     :return: hasPattern self: A Check object that runs the condition on the column.
    566     """
    567     assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) if assertion \
--> 568         else getattr(self._Check, "hasPattern$default$2")()
    569     name = self._jvm.scala.Option.apply(name)
    570     hint = self._jvm.scala.Option.apply(hint)

File /pyenv/versions/3.10.11/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File /pyenv/versions/3.10.11/lib/python3.10/site-packages/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File /pyenv/versions/3.10.11/lib/python3.10/site-packages/py4j/protocol.py:330, in get_return_value(answer, gateway_client, target_id, name)
    326         raise Py4JJavaError(
    327             "An error occurred while calling {0}{1}{2}.\n".
    328             format(target_id, ".", name), value)
    329     else:
--> 330         raise Py4JError(
    331             "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332             format(target_id, ".", name, value))
    333 else:
    334     raise Py4JError(
    335         "An error occurred while calling {0}{1}{2}".
    336         format(target_id, ".", name))

Py4JError: An error occurred while calling o122.hasPattern$default$2. Trace:
py4j.Py4JException: Method hasPattern$default$2([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Unknown Source)

FYI: AnalysisRunner works well though, thank you.

@gerileka
Author

gerileka commented Sep 7, 2023

Oh, never mind, apparently assertion needs to be set explicitly:

    .hasPattern(column = "concept_id",  
          pattern=r"[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}",
          assertion=lambda x: x == 1/1)
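
The traceback shows the fallback looking up `hasPattern$default$2` on the JVM object. Scala names default-argument getters by 1-based parameter position, so if assertion is the third parameter of the Scala method, as the Python signature suggests, a getter numbered 2 would not exist. One way to avoid depending on JVM defaults at all is to resolve the default on the Python side, as the earlier implementation quoted in this thread did. The helper below is a hypothetical sketch, not part of PyDeequ.

```python
def resolve_assertion(assertion=None):
    """Return the caller's assertion, or a Python-side default requiring full compliance."""
    if assertion is not None:
        return assertion
    # Same default as the earlier implementation quoted in this thread (x == 1).
    return lambda x: x == 1.0


# No assertion given: fall back to "every value must match".
default_check = resolve_assertion()

# An explicit assertion is passed through unchanged.
custom_check = resolve_assertion(lambda x: x >= 0.9)
```

Until a fix is released, passing `assertion` explicitly in every `hasPattern` call, as in the snippet above, has the same effect.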

@chenliu0831
Contributor

@gerileka Nice! The error message seems obscure in that case, like a red herring. I will start planning the next release this weekend.

@mouadhelfekih

@chenliu0831 The pull request has been merged. Do you think a new tag will be created soon to generate a new version on PyPI?

@chenliu0831
Contributor

Yes, this seems like an important bug fix. Doing a release now.

#155

@chenliu0831
Contributor

Released to PyPI - https://pypi.org/project/pydeequ/1.1.1/. Closing.
