
[MAINTENANCE] Add force_reuse_spark_context to DatasourceConfigSchema #3126

Conversation

@pasmavie (Contributor) commented Jul 28, 2021

PR #2733 added a parameter to SparkDFDatasource that lets GE reuse an existing Spark context.

This was extremely useful.

However, when using a dynamic Data Context configuration (e.g. on EMR) like the following:

data_context_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_data_source": {
            "class_name": "SparkDFDatasource",
            "spark_config": dict(spark.sparkContext.getConf().getAll()),
            "force_reuse_spark_context": True,
            "module_name": "great_expectations.datasource",
            "batch_kwargs_generators": {},
        }
    },
    ...

I've found that the spark_config and force_reuse_spark_context parameters weren't actually passed to great_expectations.core.util.get_or_create_spark_application().

In fact, the parameters are lost when the DataContextConfig object is dumped to a dictionary, because the DatasourceConfigSchema (an element of the DataContextConfigSchema) doesn't declare these fields.
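
To illustrate the mechanism, here is a minimal, self-contained sketch (not GE's actual schema; the class and config names are illustrative): marshmallow schemas only serialize fields they declare, so undeclared keys are silently dropped on dump.

from marshmallow import Schema, fields

# Toy stand-in for DatasourceConfigSchema; only class_name is declared.
class ToyDatasourceConfigSchema(Schema):
    class_name = fields.String(required=True)

config = {
    "class_name": "SparkDFDatasource",
    "force_reuse_spark_context": True,  # not declared on the schema
}
print(ToyDatasourceConfigSchema().dump(config))
# {'class_name': 'SparkDFDatasource'}  <- the flag was silently dropped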

Changes proposed in this pull request:

  • Add spark_config and force_reuse_spark_context to the DatasourceConfigSchema (a sketch of this change follows below)
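
A hedged sketch of what this change might look like (the two new field declarations mirror those used elsewhere in the schemas; the surrounding class body is abbreviated):

class DatasourceConfigSchema(Schema):
    class_name = fields.String(required=True)
    module_name = fields.String(missing="great_expectations.datasource")
    # Newly declared so they survive the DataContextConfig dump/load round-trip:
    spark_config = fields.Raw(required=False, allow_none=True)
    force_reuse_spark_context = fields.Bool(required=False, missing=False)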

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

@pasmavie (Contributor, Author) commented:

@talagluck here's the new attempt, following the closed #2968.

@talagluck (Contributor) commented:

Thanks for re-opening this, @gipaetusb! We will review over the next week and be in touch.

@mbakunze (Contributor) commented Aug 5, 2021:

This would be very useful to have!

@talagluck added the devrel label Aug 5, 2021
@pasmavie (Contributor, Author) commented Aug 6, 2021:

Thanks @mbakunze.

@talagluck:

  • I've added a description of the changes to docs_rtd/changelog.rst
  • I took another look at the tests this morning but I can't figure out why they're failing. I'm afraid I'll really need some help 🙏

@mbakunze (Contributor) commented:

I mentioned this issue in the Slack channel - it currently blocks us from using GE with Spark on k8s (at least in the way we wanted to use it :) ).

@talagluck (Contributor) commented:

Thanks so much for the prompt, @mbakunze! By any chance would you have time to take a look and see why the tests are failing here? Otherwise I can prioritize this for next week.

@mbakunze (Contributor) commented:

I didn't understand why some tests were failing when I glanced at it, but I'll try to take another look.

Merge commit: …spark-context-emr (conflicts resolved in docs_rtd/changelog.rst)
@mbakunze (Contributor) commented Aug 14, 2021:

@gipaetusb I started working on fixing the tests in #3245

@pasmavie (Contributor, Author) commented:

@mbakunze thank you very very much!

@pasmavie (Contributor, Author) commented:

Hi @talagluck, thanks to @mbakunze this is finally ready to be merged :) !

@talagluck (Contributor) left a review comment:

Thank you so much for this contribution and for your patience, @gipaetusb and @mbakunze! Great work - LGTM!

@talagluck merged commit 5d818ee into great-expectations:develop Aug 20, 2021
@fep2 (Contributor) commented Aug 24, 2021:

I've looked into this fix in my current setup. Unfortunately, if I change my config from Datasource to SparkDFDatasource, I get the following error:

    return datasource.get_batch_list_from_batch_request(
    AttributeError: 'SparkDFDatasource' object has no attribute 'get_batch_list_from_batch_request'

Next, if I change it back to Datasource with the fix, I get the following error:

    datasource: Datasource = cast(Datasource, self.datasources[datasource_name])
    KeyError: 'my_spark_datasource'

This is due to the fact that the Datasource class, when instantiated, doesn't know what to do with the force_reuse_spark_context flag; the error gets hidden (this needs to be fixed), and my_spark_datasource is never instantiated, which causes the KeyError.

For reference, here's what my data_source config looks like:

{
    "my_spark_datasource": {
        "class_name": "Datasource",
        "force_reuse_spark_context": True,
        "execution_engine": {
            "class_name": "SparkDFExecutionEngine"
        },
        "data_connectors": {
            "my_runtime_data_connector": {
                "module_name": "great_expectations.datasource.data_connector",
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": [
                    "some_key"
                ]
            }
        }
    }
}

In this case I want a runtime batch, following the directions laid out here: https://discuss.greatexpectations.io/t/how-to-validate-spark-dataframes-in-0-13/582

I think the solution is to pass force_reuse_spark_context not only to SparkDFDatasource but also to SparkDFExecutionEngine. I was able to get a working solution by adding the following to ExecutionEngineConfigSchema:

from marshmallow import INCLUDE, Schema, fields

class ExecutionEngineConfigSchema(Schema):
    class Meta:
        unknown = INCLUDE

    class_name = fields.String(required=True)
    module_name = fields.String(missing="great_expectations.execution_engine")
    connection_string = fields.String(required=False, allow_none=True)
    credentials = fields.Raw(required=False, allow_none=True)
    spark_config = fields.Raw(required=False, allow_none=True)
    boto3_options = fields.Dict(
        keys=fields.Str(), values=fields.Str(), required=False, allow_none=True
    )
    caching = fields.Boolean(required=False, allow_none=True)
    batch_spec_defaults = fields.Dict(required=False, allow_none=True)
    # Added: declare the flag so it survives schema load/dump.
    force_reuse_spark_context = fields.Bool(required=False, missing=False)

With that change, this is what my data_source config looks like:

{
        "my_spark_datasource": {
            "class_name": "Datasource",
            "execution_engine": {
                "class_name": "SparkDFExecutionEngine",
                "force_reuse_spark_context": True,
            },
            "data_connectors": {
                "my_runtime_data_connector": {
                    "module_name": "great_expectations.datasource.data_connector",
                    "class_name": "RuntimeDataConnector",
                    "batch_identifiers": [
                        "some_key"
                    ]
                }
            }
        }
    }
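
For completeness, here is a hedged usage sketch of how that v3 config can be exercised with a runtime batch, per the discuss post linked above (df, "my_asset", and "some_value" are placeholders, not part of the original comment):

from great_expectations.core.batch import RuntimeBatchRequest

# Validate an in-memory Spark DataFrame through the RuntimeDataConnector
# configured above. df is an existing Spark DataFrame; the asset name and
# identifier value are arbitrary placeholders.
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="my_asset",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"some_key": "some_value"},
)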

@mbakunze (Contributor) commented:

Nice - I guess @gipaetusb and I were still using the v2 approach. It would be great to get this working in v3 as well; I did not test it.

@talagluck (Contributor) commented:

Thanks for the feedback, @fep2! @mbakunze is exactly right here. This fix was just for V2. SparkDFDatasource is a V2 abstraction, and so mixing and matching with V3 abstractions like ExecutionEngines will cause errors. If you have any interest in making the fix for V3, we would welcome the contribution!

@fep2 (Contributor) commented Aug 25, 2021:

As much as I would love to contribute, I have other commitments I must prioritize. That said, I'll open a bug or feature ticket and reference this thread, so the need is addressed; hopefully it's something users will get sooner rather than later.

@talagluck (Contributor) commented:

That sounds great, thank you @fep2!
