Dataproc Spark Connect Client

A wrapper of the Apache Spark Connect client with additional functionality that lets applications communicate with a remote Dataproc Spark cluster over the Spark Connect protocol, without requiring additional setup steps.

Install

pip install dataproc_spark_connect

Uninstall

pip uninstall dataproc_spark_connect

Setup

This client requires permissions to manage Dataproc sessions and session templates. If you are running the client outside of Google Cloud, you must set the following environment variables (see the example after this list):

  • GOOGLE_CLOUD_PROJECT - The Google Cloud project you use to run Spark workloads.
  • GOOGLE_CLOUD_REGION - The Compute Engine region where you run the Spark workload.
  • GOOGLE_APPLICATION_CREDENTIALS - The path to your application credentials file (for example, a downloaded service account key).
  • DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG (Optional) - The location of the default session config, such as tests/integration/resources/session.textproto.
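
For example, when running the client from a local machine or notebook, you can set these variables in Python before creating a session. A minimal sketch, where the project ID, region, and key-file path are placeholder values you would replace with your own:

    import os

    # Placeholder values - substitute your own project, region, and key path.
    os.environ["GOOGLE_CLOUD_PROJECT"] = "my-project-id"
    os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"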

Usage

  1. Install the latest versions of the Dataproc Python client and Dataproc Spark Connect modules:
pip install google_cloud_dataproc --force-reinstall
pip install dataproc_spark_connect --force-reinstall
  2. Add the required import into your PySpark application or notebook:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
  3. There are two ways to create a Spark session:
    1. Start a Spark session using properties defined in DATAPROC_SPARK_CONNECT_SESSION_DEFAULT_CONFIG:

      spark = DataprocSparkSession.builder.getOrCreate()
    2. Start a Spark session with the following code instead of using a config file:

      from google.cloud.dataproc_v1 import SparkConnectConfig
      from google.cloud.dataproc_v1 import Session

      dataproc_config = Session()
      dataproc_config.spark_connect_session = SparkConnectConfig()
      # Replace <subnet> with the subnetwork URI the session should run in.
      dataproc_config.environment_config.execution_config.subnetwork_uri = "<subnet>"
      # Dataproc Serverless runtime version for the session.
      dataproc_config.runtime_config.version = '3.0'
      spark = DataprocSparkSession.builder.dataprocConfig(dataproc_config).getOrCreate()
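
Once created, the session behaves like any other Spark Connect session. A quick smoke test, assuming only the standard PySpark DataFrame API:

    # Run a trivial query to verify the remote session works end to end.
    df = spark.range(5)
    df.show()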

Billing

As this client runs the Spark workload on Dataproc, your project will be billed according to Dataproc Serverless pricing. This applies even if you run the client from outside Google Cloud (for example, from a machine that is not a Compute Engine instance).
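
Because an active session continues to accrue charges, stop it when you are done. A minimal sketch, assuming the standard SparkSession.stop() call also ends the managed Dataproc session:

    # Stop the session so the Dataproc Serverless session does not keep billing.
    spark.stop()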

Contributing

Building and Deploying SDK

  1. Install the requirements in a virtual environment:
pip install -r requirements.txt
  2. Build the code:
python setup.py sdist bdist_wheel
  3. Copy the generated .whl file to Cloud Storage, using the version specified in the setup.py file:
VERSION=<version> gsutil cp dist/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl gs://<your_bucket_name>
  4. On Vertex, download the new SDK, uninstall the old version, and install the new one:
%%bash
export VERSION=<version>
gsutil cp gs://<your_bucket_name>/dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl .
yes | pip uninstall dataproc_spark_connect
pip install dataproc_spark_connect-${VERSION}-py2.py3-none-any.whl
