From 6b7db37208c5df73983819322a3c35156198c4f1 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Wed, 16 Oct 2024 10:03:51 -0700 Subject: [PATCH 01/13] Update databricks-configs.md for Python model config, work in progress Need to pause this and clean up code, but don't want to lose progress. --- .../reference/resource-configs/databricks-configs.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 5823fe7d9a4..4735a41bf09 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -60,6 +60,13 @@ We do not yet have a PySpark API to set tblproperties at table creation, so this \+ `databricks_tags` are currently only supported at the table level, and applied via `ALTER` statements. + + + + +### Python Model Config + + ## Incremental models @@ -553,7 +560,12 @@ Databricks adapter ... using compute resource . Materializing a python model requires execution of SQL as well as python. Specifically, if your python model is incremental, the current execution pattern involves executing python to create a staging table that is then merged into your target table using SQL. + The python code needs to run on an all purpose cluster, while the SQL code can run on an all purpose cluster or a SQL Warehouse. + + +The python code needs to run on an all purpose cluster (or serverless cluster, see [Python Model Config](#python-model-config)), while the SQL code can run on an all purpose cluster or a SQL Warehouse. + When you specify your `databricks_compute` for a python model, you are currently only specifying which compute to use when running the model-specific SQL. If you wish to use a different compute for executing the python itself, you must specify an alternate `http_path` in the config for the model. Please note that declaring a separate SQL compute and a python compute for your python dbt models is optional. 
If you wish to do this: From 7b7119ba4a5f426ee5d5b45ff31e34e714314213 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Wed, 23 Oct 2024 13:10:43 -0700 Subject: [PATCH 02/13] Update databricks-configs.md Add config matrix for python support --- .../resource-configs/databricks-configs.md | 99 ++++++++++++++++++- 1 file changed, 95 insertions(+), 4 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 1f1b0706cb4..2084e98110d 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -67,9 +67,99 @@ We do not yet have a PySpark API to set tblproperties at table creation, so this -### Python Model Config +### Python Submission Method + +As of 1.9, there are four options for `submission_method`: + +* `all_purpose_cluster`: execute the python model either directly using the [command api](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run +* `job_cluster`: creates a new job cluster to execute an uploaded notebook as a one-off job run +* `serverless_cluster`: uses a [serverless cluster](https://docs.databricks.com/en/jobs/run-serverless-jobs.html) to execute an uploaded notebook as a one-off job run +* `workflow_job`: creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. :::caution This approach gives you maximum flexibility, but will create persistent artifacts (i.e. the workflow) in Databricks that users could run outside of dbt. + +We are currently in a transitionary period where there is a disconnect between old submission methods (which were grouped by compute), and the logically distinct submission methods (command, job run, workflow). 
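To make the four options concrete before the config matrix, here is a minimal sketch of a model config that runs python over the Command API; the model name, cluster id, and timeout value are illustrative placeholders rather than values from this PR:

```yaml
models:
  - name: my_python_model # hypothetical model name
    config:
      # Run the python directly via the Command API (no notebook upload)
      submission_method: all_purpose_cluster
      create_notebook: false
      cluster_id: "1234-567890-abcde123" # placeholder all-purpose cluster id
      timeout: 3600 # give up after an hour; the default 0 means no timeout
```

Switching `create_notebook` to `true` would move this same model onto the notebook-upload, one-off job run path.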
+As such, the supported config matrix is somewhat complicated: + +| Config | Use | Default | `all_purpose_cluster`* | `job_cluster` | `serverless_cluster` | `workflow_job` | +| --------------------- | -------------------------------------------------------------------- | ---------------- | ---------------------- | ------------- | -------------------- | -------------- | +| `create_notebook` | if false, use Command API, otherwise upload notebook and use job run | false | ✅ | ❌ | ❌ | ❌ | +| `timeout` | maximum time to wait for command/job to run | 0 (No timeout) | ✅ | ✅ | ✅ | ✅ | +| `job_cluster_config` | configures a [new cluster](https://docs.databricks.com/api/workspace/jobs/submit#tasks-new_cluster) for running the model | {} | ❌ | ✅ | ❌ | ✅ | +| `access_control_list` | directly configures [access control](https://docs.databricks.com/api/workspace/jobs/submit#access_control_list) for the job | {} | ✅ | ✅ | ✅ | ✅ | +| `packages` | list of packages to install on the executing cluster | [] | ✅ | ✅ | ✅ | ✅ | +| `index_url` | url to install `packages` from | None (uses pypi) | ✅ | ✅ | ✅ | ✅ | +| `additional_libs` | directly configures [libraries](https://docs.databricks.com/api/workspace/jobs/submit#tasks-libraries) | [] | ✅ | ✅ | ✅ | ✅ | +| `python_job_config` | additional configuration for jobs/workflows (see table below) | {} | ✅ | ✅ | ✅ | ✅ | +| `cluster_id` | id of existing all purpose cluster to execute against | None | ✅ | ❌ | ❌ | ✅ | +| `http_path` | path to existing all purpose cluster to execute against | None | ✅ | ❌ | ❌ | ❌ | + +\* Only `timeout` and `cluster_id`/`http_path` are supported when `create_notebook` is false + +With the 1.9's introduction of the `workflow_job` submission method we chose to segregate further configuration of the python model submission under a top level configuration named `python_job_config`. +This keeps configuration options for jobs and workflows namespaced in such a way that they do not interfere with other model config, allowing us to be much more flexible with what is supported for job execution. +The support matrix for this feature is divided into `workflow_job` and all others (assuming `all_purpose_cluster` with `create_notebook`==true). 
+Each config option listed must be nested under `python_job_config`: + +| Config | Use | Default | `workflow_job` | All others | +| -------------------------- | ----------------------------------------------------------------------------------------------------------------------- | ------- | -------------- | ---------- | +| `name` | The name to give (or used to look up) the created workflow | None | ✅ | ❌ | +| `grants` | A simplified way to specify access control for the workflow | {} | ✅ | ✅ | +| `existing_job_id` | Id to use to look up the created workflow (in place of `name`) | None | ✅ | ❌ | +| `post_hook_tasks` | [Tasks](https://docs.databricks.com/api/workspace/jobs/create#tasks) to include after the model notebook execution | [] | ✅ | ❌ | +| `additional_task_settings` | Additional [task config])(https://docs.databricks.com/api/workspace/jobs/create#tasks) to include in the model task | {} | ✅ | ❌ | +| [Other job run settings](https://docs.databricks.com/api/workspace/jobs/submit) | Config will be copied into the request, outside of the model task | None | ❌ | ✅ | +| [Other workflow settings](https://docs.databricks.com/api/workspace/jobs/create) | Config will be copied into the request, outside of the model task | None | ✅ | ❌ | + +Here is an example using these new configuration options: + + + +```yaml +models: + - name: my_model + config: + submission_method: workflow_job + + # Define a job cluster to create for running this workflow + # Alternately, could specify cluster_id to use an existing cluster, or provide neither to use a serverless cluster + job_cluster_config: + spark_version: "15.3.x-scala2.12" + node_type_id: "rd-fleet.2xlarge" + runtime_engine: "{{ var('job_cluster_defaults.runtime_engine') }}" + data_security_mode: "{{ var('job_cluster_defaults.data_security_mode') }}" + autoscale: { "min_workers": 1, "max_workers": 4 } + + python_job_config: + # These settings are passed in, as is, to the request + email_notifications: { on_failure: ["me@example.com"] } + max_retries: 2 + + name: my_workflow_name + + # Override settings for your model's dbt task. For instance, you can + # change the task key + additional_task_settings: { "task_key": "my_dbt_task" } + + # Define tasks to run before/after the model + # This example assumes you have already uploaded a notebook to /my_notebook_path to perform optimize and vacuum + post_hook_tasks: + [ + { + "depends_on": [{ "task_key": "my_dbt_task" }], + "task_key": "OPTIMIZE_AND_VACUUM", + "notebook_task": + { "notebook_path": "/my_notebook_path", "source": "WORKSPACE" }, + }, + ] + + # Simplified structure, rather than having to specify permission separately for each user + grants: + view: [{ "group_name": "marketing-team" }] + run: [{ "user_name": "other_user@example.com" }] + manage: [] +``` + + - ## Incremental models @@ -567,10 +657,11 @@ Specifically, if your python model is incremental, the current execution pattern The python code needs to run on an all purpose cluster, while the SQL code can run on an all purpose cluster or a SQL Warehouse. -The python code needs to run on an all purpose cluster (or serverless cluster, see [Python Model Config](#python-model-config)), while the SQL code can run on an all purpose cluster or a SQL Warehouse. +The python code needs to run on an all purpose cluster (or serverless cluster, see [Python Submission Methods](#python-submission-methods)), while the SQL code can run on an all purpose cluster or a SQL Warehouse. 
When you specify your `databricks_compute` for a python model, you are currently only specifying which compute to use when running the model-specific SQL. -If you wish to use a different compute for executing the python itself, you must specify an alternate `http_path` in the config for the model. Please note that declaring a separate SQL compute and a python compute for your python dbt models is optional. If you wish to do this: +If you wish to use a different compute for executing the python itself, you must specify an alternate compute in the config for the model. +For example: From 796d9ffeff16bdd9efc7e53971eb0c84f48de842 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Wed, 23 Oct 2024 13:12:54 -0700 Subject: [PATCH 03/13] Update databricks-configs.md Wrap in quotes --- .../resource-configs/databricks-configs.md | 38 +++++++++---------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 2084e98110d..16ab7784cc0 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -79,18 +79,18 @@ As of 1.9, there are four options for `submission_method`: We are currently in a transitionary period where there is a disconnect between old submission methods (which were grouped by compute), and the logically distinct submission methods (command, job run, workflow). As such, the supported config matrix is somewhat complicated: -| Config | Use | Default | `all_purpose_cluster`* | `job_cluster` | `serverless_cluster` | `workflow_job` | -| --------------------- | -------------------------------------------------------------------- | ---------------- | ---------------------- | ------------- | -------------------- | -------------- | -| `create_notebook` | if false, use Command API, otherwise upload notebook and use job run | false | ✅ | ❌ | ❌ | ❌ | -| `timeout` | maximum time to wait for command/job to run | 0 (No timeout) | ✅ | ✅ | ✅ | ✅ | -| `job_cluster_config` | configures a [new cluster](https://docs.databricks.com/api/workspace/jobs/submit#tasks-new_cluster) for running the model | {} | ❌ | ✅ | ❌ | ✅ | -| `access_control_list` | directly configures [access control](https://docs.databricks.com/api/workspace/jobs/submit#access_control_list) for the job | {} | ✅ | ✅ | ✅ | ✅ | -| `packages` | list of packages to install on the executing cluster | [] | ✅ | ✅ | ✅ | ✅ | -| `index_url` | url to install `packages` from | None (uses pypi) | ✅ | ✅ | ✅ | ✅ | -| `additional_libs` | directly configures [libraries](https://docs.databricks.com/api/workspace/jobs/submit#tasks-libraries) | [] | ✅ | ✅ | ✅ | ✅ | -| `python_job_config` | additional configuration for jobs/workflows (see table below) | {} | ✅ | ✅ | ✅ | ✅ | -| `cluster_id` | id of existing all purpose cluster to execute against | None | ✅ | ❌ | ❌ | ✅ | -| `http_path` | path to existing all purpose cluster to execute against | None | ✅ | ❌ | ❌ | ❌ | +| Config | Use | Default | `all_purpose_cluster`* | `job_cluster` | `serverless_cluster` | `workflow_job` | +| --------------------- | -------------------------------------------------------------------- | ------------------ | ---------------------- | ------------- | -------------------- | -------------- | +| `create_notebook` | if false, use Command API, otherwise upload notebook and use job run | `false` | ✅ | ❌ | ❌ | ❌ | +| `timeout` | 
maximum time to wait for command/job to run | `0` (No timeout) | ✅ | ✅ | ✅ | ✅ | +| `job_cluster_config` | configures a [new cluster](https://docs.databricks.com/api/workspace/jobs/submit#tasks-new_cluster) for running the model | `{}` | ❌ | ✅ | ❌ | ✅ | +| `access_control_list` | directly configures [access control](https://docs.databricks.com/api/workspace/jobs/submit#access_control_list) for the job | `{}` | ✅ | ✅ | ✅ | ✅ | +| `packages` | list of packages to install on the executing cluster | `[]` | ✅ | ✅ | ✅ | ✅ | +| `index_url` | url to install `packages` from | `None` (uses pypi) | ✅ | ✅ | ✅ | ✅ | +| `additional_libs` | directly configures [libraries](https://docs.databricks.com/api/workspace/jobs/submit#tasks-libraries) | `[]` | ✅ | ✅ | ✅ | ✅ | +| `python_job_config` | additional configuration for jobs/workflows (see table below) | `{}` | ✅ | ✅ | ✅ | ✅ | +| `cluster_id` | id of existing all purpose cluster to execute against | `None` | ✅ | ❌ | ❌ | ✅ | +| `http_path` | path to existing all purpose cluster to execute against | `None` | ✅ | ❌ | ❌ | ❌ | \* Only `timeout` and `cluster_id`/`http_path` are supported when `create_notebook` is false @@ -101,13 +101,13 @@ Each config option listed must be nested under `python_job_config`: | Config | Use | Default | `workflow_job` | All others | | -------------------------- | ----------------------------------------------------------------------------------------------------------------------- | ------- | -------------- | ---------- | -| `name` | The name to give (or used to look up) the created workflow | None | ✅ | ❌ | -| `grants` | A simplified way to specify access control for the workflow | {} | ✅ | ✅ | -| `existing_job_id` | Id to use to look up the created workflow (in place of `name`) | None | ✅ | ❌ | -| `post_hook_tasks` | [Tasks](https://docs.databricks.com/api/workspace/jobs/create#tasks) to include after the model notebook execution | [] | ✅ | ❌ | -| `additional_task_settings` | Additional [task config])(https://docs.databricks.com/api/workspace/jobs/create#tasks) to include in the model task | {} | ✅ | ❌ | -| [Other job run settings](https://docs.databricks.com/api/workspace/jobs/submit) | Config will be copied into the request, outside of the model task | None | ❌ | ✅ | -| [Other workflow settings](https://docs.databricks.com/api/workspace/jobs/create) | Config will be copied into the request, outside of the model task | None | ✅ | ❌ | +| `name` | The name to give (or used to look up) the created workflow | `None` | ✅ | ❌ | +| `grants` | A simplified way to specify access control for the workflow | `{}` | ✅ | ✅ | +| `existing_job_id` | Id to use to look up the created workflow (in place of `name`) | `None` | ✅ | ❌ | +| `post_hook_tasks` | [Tasks](https://docs.databricks.com/api/workspace/jobs/create#tasks) to include after the model notebook execution | `[]` | ✅ | ❌ | +| `additional_task_settings` | Additional [task config])(https://docs.databricks.com/api/workspace/jobs/create#tasks) to include in the model task | `{}` | ✅ | ❌ | +| [Other job run settings](https://docs.databricks.com/api/workspace/jobs/submit) | Config will be copied into the request, outside of the model task | `None` | ❌ | ✅ | +| [Other workflow settings](https://docs.databricks.com/api/workspace/jobs/create) | Config will be copied into the request, outside of the model task | `None` | ✅ | ❌ | Here is an example using these new configuration options: From 1a6932a923b2083ebe255cdd2c7e71b2eeb9b77c Mon Sep 17 00:00:00 2001 From: Ben Cassell 
<98852248+benc-db@users.noreply.github.com> Date: Wed, 23 Oct 2024 13:19:17 -0700 Subject: [PATCH 04/13] Update databricks-configs.md Changing capitalization --- website/docs/reference/resource-configs/databricks-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 16ab7784cc0..fb0de24724a 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -67,7 +67,7 @@ We do not yet have a PySpark API to set tblproperties at table creation, so this -### Python Submission Method +### Python submission methods As of 1.9, there are four options for `submission_method`: From 5877b54187ee857b1e938c8594ac5c9ae74d9372 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Wed, 23 Oct 2024 13:24:00 -0700 Subject: [PATCH 05/13] Update website/docs/reference/resource-configs/databricks-configs.md Co-authored-by: Amy Chen <46451573+amychen1776@users.noreply.github.com> --- website/docs/reference/resource-configs/databricks-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index fb0de24724a..fbc3d3df324 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -69,7 +69,7 @@ We do not yet have a PySpark API to set tblproperties at table creation, so this ### Python submission methods -As of 1.9, there are four options for `submission_method`: +As of 1.9 (or versionless if you're on dbt Cloud), there are four options for `submission_method`: * `all_purpose_cluster`: execute the python model either directly using the [command api](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run * `job_cluster`: creates a new job cluster to execute an uploaded notebook as a one-off job run From 82ffbe20ab1e3a743526ae64c94e3eadb400f7d8 Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Wed, 23 Oct 2024 14:58:40 -0700 Subject: [PATCH 06/13] Update databricks-configs.md Address i.e. --- website/docs/reference/resource-configs/databricks-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index fbc3d3df324..0fff22e814e 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -74,7 +74,7 @@ As of 1.9 (or versionless if you're on dbt Cloud), there are four options for `s * `all_purpose_cluster`: execute the python model either directly using the [command api](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run * `job_cluster`: creates a new job cluster to execute an uploaded notebook as a one-off job run * `serverless_cluster`: uses a [serverless cluster](https://docs.databricks.com/en/jobs/run-serverless-jobs.html) to execute an uploaded notebook as a one-off job run -* `workflow_job`: creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. 
:::caution This approach gives you maximum flexibility, but will create persistent artifacts (i.e. the workflow) in Databricks that users could run outside of dbt. +* `workflow_job`: creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. :::caution This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt. We are currently in a transitionary period where there is a disconnect between old submission methods (which were grouped by compute), and the logically distinct submission methods (command, job run, workflow). As such, the supported config matrix is somewhat complicated: From 7f146cbb2dc22422466d3f09c9ae516bb8c5d8ca Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Wed, 23 Oct 2024 14:59:24 -0700 Subject: [PATCH 07/13] Update databricks-configs.md Address another lint issue --- website/docs/reference/resource-configs/databricks-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 0fff22e814e..443cb9140cc 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -94,7 +94,7 @@ As such, the supported config matrix is somewhat complicated: \* Only `timeout` and `cluster_id`/`http_path` are supported when `create_notebook` is false -With the 1.9's introduction of the `workflow_job` submission method we chose to segregate further configuration of the python model submission under a top level configuration named `python_job_config`. +With the introduction of the `workflow_job` submission method we chose to segregate further configuration of the python model submission under a top level configuration named `python_job_config`. This keeps configuration options for jobs and workflows namespaced in such a way that they do not interfere with other model config, allowing us to be much more flexible with what is supported for job execution. The support matrix for this feature is divided into `workflow_job` and all others (assuming `all_purpose_cluster` with `create_notebook`==true). Each config option listed must be nested under `python_job_config`: From 8fb85150c6671d077e93aa909810b07e86bf8b9e Mon Sep 17 00:00:00 2001 From: Ben Cassell <98852248+benc-db@users.noreply.github.com> Date: Fri, 25 Oct 2024 09:56:18 -0700 Subject: [PATCH 08/13] Update website/docs/reference/resource-configs/databricks-configs.md Co-authored-by: Amy Chen <46451573+amychen1776@users.noreply.github.com> --- website/docs/reference/resource-configs/databricks-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 443cb9140cc..1360a605189 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -94,7 +94,7 @@ As such, the supported config matrix is somewhat complicated: \* Only `timeout` and `cluster_id`/`http_path` are supported when `create_notebook` is false -With the introduction of the `workflow_job` submission method we chose to segregate further configuration of the python model submission under a top level configuration named `python_job_config`. 
+With the introduction of the `workflow_job` submission method, we chose to segregate further configuration of the python model submission under a top level configuration named `python_job_config`. This keeps configuration options for jobs and workflows namespaced in such a way that they do not interfere with other model config, allowing us to be much more flexible with what is supported for job execution. The support matrix for this feature is divided into `workflow_job` and all others (assuming `all_purpose_cluster` with `create_notebook`==true). Each config option listed must be nested under `python_job_config`: From 1d6cb8759d066a6e83d34acebb1aefb7d74695a6 Mon Sep 17 00:00:00 2001 From: "Leona B. Campbell" <3880403+runleonarun@users.noreply.github.com> Date: Fri, 25 Oct 2024 12:06:20 -0700 Subject: [PATCH 09/13] Update website/docs/reference/resource-configs/databricks-configs.md --- website/docs/reference/resource-configs/databricks-configs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 1360a605189..df97017a6da 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -69,7 +69,7 @@ We do not yet have a PySpark API to set tblproperties at table creation, so this ### Python submission methods -As of 1.9 (or versionless if you're on dbt Cloud), there are four options for `submission_method`: +In dbt v1.9 and higher, or in [Versionless](/docs/dbt-versions/versionless-cloud) dbt Cloud, you can use these four options for `submission_method`: * `all_purpose_cluster`: execute the python model either directly using the [command api](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run * `job_cluster`: creates a new job cluster to execute an uploaded notebook as a one-off job run From 3885467350da87a0e95a1b8a9bd9e9fc2e03026b Mon Sep 17 00:00:00 2001 From: "Leona B. Campbell" <3880403+runleonarun@users.noreply.github.com> Date: Fri, 25 Oct 2024 12:08:29 -0700 Subject: [PATCH 10/13] Update website/docs/reference/resource-configs/databricks-configs.md --- .../docs/reference/resource-configs/databricks-configs.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index df97017a6da..cd0be4aa14b 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -74,7 +74,10 @@ In dbt v1.9 and higher, or in [Versionless](/docs/dbt-versions/versionless-cloud * `all_purpose_cluster`: execute the python model either directly using the [command api](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run * `job_cluster`: creates a new job cluster to execute an uploaded notebook as a one-off job run * `serverless_cluster`: uses a [serverless cluster](https://docs.databricks.com/en/jobs/run-serverless-jobs.html) to execute an uploaded notebook as a one-off job run -* `workflow_job`: creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. 
:::caution This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt. +* `workflow_job`: creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. : +::caution +This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt. +::: We are currently in a transitionary period where there is a disconnect between old submission methods (which were grouped by compute), and the logically distinct submission methods (command, job run, workflow). As such, the supported config matrix is somewhat complicated: From 31412141fdd0906e03855e04666cb2c550062367 Mon Sep 17 00:00:00 2001 From: "Leona B. Campbell" <3880403+runleonarun@users.noreply.github.com> Date: Fri, 25 Oct 2024 12:48:25 -0700 Subject: [PATCH 11/13] Update databricks-configs.md --- .../resource-configs/databricks-configs.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index cd0be4aa14b..14058b80511 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -71,15 +71,17 @@ We do not yet have a PySpark API to set tblproperties at table creation, so this In dbt v1.9 and higher, or in [Versionless](/docs/dbt-versions/versionless-cloud) dbt Cloud, you can use these four options for `submission_method`: -* `all_purpose_cluster`: execute the python model either directly using the [command api](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run -* `job_cluster`: creates a new job cluster to execute an uploaded notebook as a one-off job run -* `serverless_cluster`: uses a [serverless cluster](https://docs.databricks.com/en/jobs/run-serverless-jobs.html) to execute an uploaded notebook as a one-off job run -* `workflow_job`: creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. : -::caution -This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt. -::: +* `all_purpose_cluster`: Executes the python model either directly using the [command api](https://docs.databricks.com/api/workspace/commandexecution) or by uploading a notebook and creating a one-off job run +* `job_cluster`: Creates a new job cluster to execute an uploaded notebook as a one-off job run +* `serverless_cluster`: Uses a [serverless cluster](https://docs.databricks.com/en/jobs/run-serverless-jobs.html) to execute an uploaded notebook as a one-off job run +* `workflow_job`: Creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. + + :::caution + This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt. + ::: We are currently in a transitionary period where there is a disconnect between old submission methods (which were grouped by compute), and the logically distinct submission methods (command, job run, workflow). 
+ As such, the supported config matrix is somewhat complicated: | Config | Use | Default | `all_purpose_cluster`* | `job_cluster` | `serverless_cluster` | `workflow_job` | From 94e106261f10c39e8c322c29ea7dd2191d1b88bb Mon Sep 17 00:00:00 2001 From: "Leona B. Campbell" <3880403+runleonarun@users.noreply.github.com> Date: Fri, 25 Oct 2024 12:52:52 -0700 Subject: [PATCH 12/13] Apply suggestions from code review --- website/docs/reference/resource-configs/databricks-configs.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 14058b80511..93a64ab22e3 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -110,11 +110,11 @@ Each config option listed must be nested under `python_job_config`: | `grants` | A simplified way to specify access control for the workflow | `{}` | ✅ | ✅ | | `existing_job_id` | Id to use to look up the created workflow (in place of `name`) | `None` | ✅ | ❌ | | `post_hook_tasks` | [Tasks](https://docs.databricks.com/api/workspace/jobs/create#tasks) to include after the model notebook execution | `[]` | ✅ | ❌ | -| `additional_task_settings` | Additional [task config])(https://docs.databricks.com/api/workspace/jobs/create#tasks) to include in the model task | `{}` | ✅ | ❌ | +| `additional_task_settings` | Additional [task config](https://docs.databricks.com/api/workspace/jobs/create#tasks) to include in the model task | `{}` | ✅ | ❌ | | [Other job run settings](https://docs.databricks.com/api/workspace/jobs/submit) | Config will be copied into the request, outside of the model task | `None` | ❌ | ✅ | | [Other workflow settings](https://docs.databricks.com/api/workspace/jobs/create) | Config will be copied into the request, outside of the model task | `None` | ✅ | ❌ | -Here is an example using these new configuration options: +This example uses the new configuration options in the previous table: From 8632a1603618b95acbcdfa2d046eb72778478258 Mon Sep 17 00:00:00 2001 From: "Leona B. Campbell" <3880403+runleonarun@users.noreply.github.com> Date: Fri, 25 Oct 2024 13:20:47 -0700 Subject: [PATCH 13/13] Update databricks-configs.md --- .../reference/resource-configs/databricks-configs.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/website/docs/reference/resource-configs/databricks-configs.md b/website/docs/reference/resource-configs/databricks-configs.md index 93a64ab22e3..88014d0ac4d 100644 --- a/website/docs/reference/resource-configs/databricks-configs.md +++ b/website/docs/reference/resource-configs/databricks-configs.md @@ -75,10 +75,9 @@ In dbt v1.9 and higher, or in [Versionless](/docs/dbt-versions/versionless-cloud * `job_cluster`: Creates a new job cluster to execute an uploaded notebook as a one-off job run * `serverless_cluster`: Uses a [serverless cluster](https://docs.databricks.com/en/jobs/run-serverless-jobs.html) to execute an uploaded notebook as a one-off job run * `workflow_job`: Creates/updates a reusable workflow and uploaded notebook, for execution on all-purpose, job, or serverless clusters. - - :::caution - This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt. 
- ::: + :::caution + This approach gives you maximum flexibility, but will create persistent artifacts in Databricks (the workflow) that users could run outside of dbt. + ::: We are currently in a transitionary period where there is a disconnect between old submission methods (which were grouped by compute), and the logically distinct submission methods (command, job run, workflow). @@ -99,8 +98,8 @@ As such, the supported config matrix is somewhat complicated: \* Only `timeout` and `cluster_id`/`http_path` are supported when `create_notebook` is false -With the introduction of the `workflow_job` submission method, we chose to segregate further configuration of the python model submission under a top level configuration named `python_job_config`. -This keeps configuration options for jobs and workflows namespaced in such a way that they do not interfere with other model config, allowing us to be much more flexible with what is supported for job execution. +With the introduction of the `workflow_job` submission method, we chose to segregate further configuration of the python model submission under a top level configuration named `python_job_config`. This keeps configuration options for jobs and workflows namespaced in such a way that they do not interfere with other model config, allowing us to be much more flexible with what is supported for job execution. + The support matrix for this feature is divided into `workflow_job` and all others (assuming `all_purpose_cluster` with `create_notebook`==true). Each config option listed must be nested under `python_job_config`: