
refactor(ingest/unity): Use databricks_sdk for all requests #8237

Closed

Conversation

asikowitz
Collaborator

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide for it has been added as well.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jun 14, 2023
"Use if you only want to ingest one metastore and "
"do not want to grant your ingestion service account the admin role."
),
_only_ingest_assigned_metastore_removed = pydantic_removed_field(
Collaborator Author

This is actually the only mode supported, because workspace_client queries via the workspace, and there is only one metastore per workspace

I haven't been able to fully test this because I don't have the AWS permissions to create a second metastore and workspace :| but based on the Databricks API docs, it doesn't seem like you can get catalogs from multiple metastores the way we were trying to do.

Collaborator Author

Have confirmed this now. I think that for us to support multiple metastores, we'll want the config to take in a map from workspace URL -> token.
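A minimal sketch of what that config shape could look like. The class and field names here are hypothetical (nothing like this exists in the connector yet); it only illustrates the "map from workspace URL -> token" idea, with one client per workspace since each workspace sees exactly one metastore.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple


@dataclass
class MultiWorkspaceConfig:
    """Hypothetical config sketch: one token per workspace, keyed by URL."""

    # e.g. {"https://abc.cloud.databricks.com": "dapi..."}
    workspace_tokens: Dict[str, str] = field(default_factory=dict)

    def workspaces(self) -> Iterator[Tuple[str, str]]:
        # Real code would build a WorkspaceClient(host=url, token=token)
        # for each entry; iterating the map covers every metastore.
        for url, token in self.workspace_tokens.items():
            yield url, token
```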

Contributor

I think you are right, this is what I remember as well.

@@ -39,7 +39,6 @@ def get_workunits(
for future in as_completed(futures):
wu: Optional[MetadataWorkUnit] = future.result()
if wu:
self.report.num_profile_workunits_emitted += 1
Collaborator Author

Removing this, as it's handled by workunit reporting.

if response.get("tables") is None:
logger.info(
f"Tables not found for schema {schema.catalog.name}.{schema.name}"
with patch("databricks.sdk.service.catalog.TableInfo", TableInfoWithGeneration):
Collaborator Author

Not sure if I went too far with this. This was my solution to (a) wanting to use the workspace client method and (b) still wanting to keep the generation field, in case we want it. That field isn't documented anywhere, but it seems potentially useful.
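The pattern being discussed can be sketched in isolation: subclass the SDK's dataclass to add the extra field, then use `unittest.mock.patch` so code that constructs the original name gets the subclass. The `TableInfo` here is a local stand-in, not the real `databricks.sdk.service.catalog.TableInfo`.

```python
from dataclasses import dataclass
from typing import Optional
from unittest.mock import patch


@dataclass
class TableInfo:
    # Stand-in for databricks.sdk.service.catalog.TableInfo.
    name: str


@dataclass
class TableInfoWithGeneration(TableInfo):
    # Extra, undocumented field we want to keep around.
    generation: Optional[int] = None


# While the patch is active, any lookup of TableInfo in this module
# resolves to the subclass, so deserialized objects carry `generation`.
with patch(f"{__name__}.TableInfo", TableInfoWithGeneration):
    table = TableInfo(name="my_table")
```

The upside is no fork of the SDK's parsing code; the downside (the "too far" concern) is that the patch silently affects every consumer of the patched name while active.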

Contributor

Maybe worth asking Serge about this.

total_results = response["totalResults"]
for principal in response["Resources"]:
yield self._create_service_principal(principal)
for principal in self._workspace_client.service_principals.list():
Collaborator Author

I believe from our call that pagination is just not supported for this endpoint (even though it looks like it is). Ideally, I'd pull the list of service principals during ingestion and then call this endpoint to figure out what they are, but that would require a big refactor, and I'm not sure how we'd generate workunits in real time, rather than all at the end, if we did this. So for now, I think it's fine to just get them all at once.
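The "all at the end" versus "in real time" distinction above is essentially list-building versus a generator. A toy sketch (the `make_workunit` helper is a hypothetical stand-in, not the connector's real code):

```python
from typing import Dict, Iterable, Iterator, List


def make_workunit(principal: Dict) -> Dict:
    # Hypothetical stand-in for building a MetadataWorkUnit from a principal.
    return {"urn": f"urn:li:corpuser:{principal['displayName']}"}


def emit_all_at_once(principals: Iterable[Dict]) -> List[Dict]:
    # Current shape: fetch everything, then emit all workunits at the end.
    return [make_workunit(p) for p in principals]


def emit_streaming(principals: Iterable[Dict]) -> Iterator[Dict]:
    # Generator shape: each workunit is yielded as soon as its principal
    # is seen, so consumers receive results incrementally.
    for p in principals:
        yield make_workunit(p)
```

Both produce the same workunits; the generator just changes when the consumer sees them, which is the refactor the comment is weighing.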

@@ -64,12 +70,12 @@
class CommonProperty:
id: str
name: str
type: str
Collaborator Author

I thought this got pretty confusing with all the different type values floating around, and it was barely used.

@@ -115,7 +122,7 @@ class ServicePrincipal:

@dataclass(frozen=True, order=True)
class TableReference:
metastore_id: str
Collaborator Author

I thought this was confusing too, because the metastore (name) and metastore_id are different things.

@asikowitz asikowitz changed the title refactor(ingest/unity): Use databricks_sdk for all requests; set product in client feat(ingest/unity): Use databricks_sdk for all requests; add external url to datasets Jun 14, 2023
@asikowitz asikowitz changed the title feat(ingest/unity): Use databricks_sdk for all requests; add external url to datasets feat(ingest/unity): Use databricks_sdk for all requests; set external url Jun 14, 2023
@asikowitz asikowitz changed the title feat(ingest/unity): Use databricks_sdk for all requests; set external url feat(ingest/unity): Use databricks_sdk for all requests Jun 14, 2023
@asikowitz asikowitz changed the title feat(ingest/unity): Use databricks_sdk for all requests refactor(ingest/unity): Use databricks_sdk for all requests Jun 14, 2023
@asikowitz
Collaborator Author

Note to self: remove databricks_cli dependency

Contributor

@treff7es treff7es left a comment


I just left a few small comments, but overall this looks good.

@@ -22,7 +22,7 @@ class DatasetContainerSubTypes(str, Enum):
DATABASE = "Database"
SCHEMA = "Schema"
# System-Specific SubTypes
PRESTO_CATALOG = "Catalog"
CATALOG = "Catalog" # Presto or Unity Catalog
Contributor

Nice



self._escape_sequence(obj["name"]),
),
name=obj.name,
id="{}.{}".format(metastore.id, self._escape_sequence(obj.name)),
Contributor

Suggested change
id="{}.{}".format(metastore.id, self._escape_sequence(obj.name)),
id=f"{metastore.id}.{self._escape_sequence(obj.name)}",

def _create_table(self, schema: Schema, obj: Any) -> Table:
table_id: str = "{}.{}".format(schema.id, self._escape_sequence(obj["name"]))
def _create_table(self, schema: Schema, obj: TableInfoWithGeneration) -> Table:
table_id: str = "{}.{}".format(schema.id, self._escape_sequence(obj.name))
Contributor

Suggested change
table_id: str = "{}.{}".format(schema.id, self._escape_sequence(obj.name))
table_id: str = f"{schema.id}.{self._escape_sequence(obj.name)}"

properties=obj.properties or {},
owner=obj.owner,
generation=obj.generation,
created_at=datetime.utcfromtimestamp(obj.created_at / 1000),
Contributor

Suggested change
created_at=datetime.utcfromtimestamp(obj.created_at / 1000),
created_at=datetime.fromtimestamp(obj.created_at / 1000, tz=timezone.utc),
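The suggested changes swap `utcfromtimestamp` (which returns a naive datetime and is deprecated as of Python 3.12) for a timezone-aware call. A small comparison, using an arbitrary millisecond timestamp like the ones Databricks returns:

```python
from datetime import datetime, timezone

ms = 1686700800000  # milliseconds since epoch, as the API returns

naive = datetime.utcfromtimestamp(ms / 1000)  # naive UTC; deprecated in 3.12+
aware = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)  # tz-aware

# Same instant in time, but only `aware` carries tzinfo, so comparisons
# and conversions against other aware datetimes work without surprises.
```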

generation=obj.generation,
created_at=datetime.utcfromtimestamp(obj.created_at / 1000),
created_by=obj.created_by,
updated_at=datetime.utcfromtimestamp(obj.updated_at / 1000)
Contributor

Suggested change
updated_at=datetime.utcfromtimestamp(obj.updated_at / 1000)
updated_at=datetime.fromtimestamp(obj.updated_at / 1000, tz=timezone.utc)

display_name=display_name,
application_id=obj["applicationId"],
active=obj.get("active"),
id="{}.{}".format(obj.id, self._escape_sequence(obj.display_name)),
Contributor

Suggested change
id="{}.{}".format(obj.id, self._escape_sequence(obj.display_name)),
id=f"{obj.id}.{obj.display_name}",

@asikowitz
Collaborator Author

Closing, as these changes were merged by #8238.

@asikowitz asikowitz closed this Jun 18, 2023
@asikowitz asikowitz deleted the unity-catalog-improvements branch June 18, 2023 16:45