
Dx 32 nested types #17

Draft · david-tempelmann wants to merge 7 commits into master

Conversation

david-tempelmann (Contributor):

A first version supporting complex types (structs at any nesting level, and arrays only up to the first level). Structs are scanned recursively, but as soon as an array (or map) is found we stop descending into that column. A rough sketch of the scanning idea follows the checklist below.

This still requires some code refactoring, but it would be great if you could review the approach.

Steps remaining:

  • fix all unit tests
  • add map-type
  • so far I've only developed and tested this locally, so the next step is to test on Databricks and logfood
  • add more tests
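
A rough, hedged sketch of the scanning idea described above (function and argument names are illustrative, not the final implementation):

```python
from pyspark.sql.types import ArrayType, MapType, StructType

# Sketch only: walk struct fields recursively and emit one entry per leaf
# column; stop descending as soon as an array or map is encountered, per
# the first-version behaviour described above.
def flatten_complex_type(col_name, data_type, column_list):
    if isinstance(data_type, StructType):
        for field in data_type.fields:
            flatten_complex_type(f"{col_name}.{field.name}", field.dataType, column_list)
    elif isinstance(data_type, (ArrayType, MapType)):
        pass  # don't continue for this column
    else:
        column_list.append({"col_name": col_name, "type": "string"})
    return column_list
```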

@david-tempelmann marked this pull request as draft on June 1, 2023, 09:28
@@ -43,6 +43,7 @@ def above_threshold(self):
"database": "table_schema",
"table": "table_name",
"column": "column_name",
"type": "data_type",

Reviewer comment (Contributor):

I renamed the columns from the source SQL query, so you don't need this rename any more

).withColumn("effective_timestamp", func.current_timestamp())
# merge using scd-type2
logger.friendly(f"Update classification table {self.classification_table_name}")

self.classification_table.alias("target").merge(
staged_updates_df.alias("source"),
"target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.tag_name = source.tag_name AND target.current = true",
"target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.data_type = source.data_type AND target.tag_name = source.tag_name AND target.current = true",

Reviewer comment (Contributor):

Why do we need to match on data_type?
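
For context, a hedged sketch of how the SCD Type 2 merge around this condition might be completed. Only the merge condition and the `effective_timestamp`/`current` columns appear in the diff; the `end_timestamp` column and the exact `whenMatched`/`whenNotMatched` clauses are assumptions:

```python
from delta.tables import DeltaTable

def apply_scd2_merge(classification_table: DeltaTable, staged_updates_df):
    # Sketch only: close out the currently-active row when the key (now
    # including data_type) matches, and insert the staged row otherwise.
    (
        classification_table.alias("target")
        .merge(
            staged_updates_df.alias("source"),
            "target.table_catalog <=> source.table_catalog "
            "AND target.table_schema = source.table_schema "
            "AND target.table_name = source.table_name "
            "AND target.column_name = source.column_name "
            "AND target.data_type = source.data_type "
            "AND target.tag_name = source.tag_name "
            "AND target.current = true",
        )
        # "end_timestamp" is an assumed column name for the row's validity end.
        .whenMatchedUpdate(set={"current": "false", "end_timestamp": "source.effective_timestamp"})
        .whenNotMatchedInsertAll()
        .execute()
    )
```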



@dataclass
class TaggedColumn:
name: str
data_type: str

Reviewer comment (Contributor):

We also need the full_data_type, which contains the full definition of composite columns
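
A minimal sketch of what that suggestion could look like on the dataclass shown above (the example value and treating it as a plain string are assumptions):

```python
from dataclasses import dataclass

@dataclass
class TaggedColumn:
    name: str
    data_type: str
    # Assumed addition: the full definition of a composite column as exposed
    # by Unity Catalog, e.g. "struct<street:string,zip:int>".
    full_data_type: str
```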

@@ -66,6 +66,15 @@ def compile_msql(self, table_info: TableInfo) -> list[SQLRow]:
temp_sql = msql
for tagged_col in tagged_cols:
temp_sql = temp_sql.replace(f"[{tagged_col.tag}]", tagged_col.name)
# TODO: Can we avoid "replacing strings" for the different types in the future? This is due to the generation of MSQL. Maybe we should rather generate SQL directly from the search method...

Reviewer comment (Contributor):

Yes, we should do that instead.
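
To illustrate the string-replacement step being discussed, here is a hypothetical example of the substitution performed above (the MSQL text, tag name, and column name are made up for illustration):

```python
# A "[pii]" placeholder in the MSQL template is replaced with the concrete,
# backticked column reference found by the search.
msql = "SELECT [pii] FROM hive_metastore.default.tb_1"
tag, column_name = "pii", "`address`.`street`"

temp_sql = msql.replace(f"[{tag}]", column_name)
assert temp_sql == "SELECT `address`.`street` FROM hive_metastore.default.tb_1"
```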

col_name_splitted = col_name.split(".")
return ".".join(["`" + col + "`" for col in col_name_splitted])

def recursive_flatten_complex_type(self, col_name, schema, column_list):

Reviewer comment (Contributor):

Oh, wow! This was more complex than I expected.

{"col_name": self.backtick_col_name(col_name + "." + field.name), "type": "string"}
)
elif type(field.dataType) in self.COMPLEX_TYPES:
column_list = self.recursive_flatten_complex_type(col_name + "." + field.name, field, column_list)

Reviewer comment (Contributor):

I think you should be appending to column_list instead of replacing column_list. Otherwise you overwrite previously appended string types
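
A hedged sketch of how the method could incorporate that fix, keeping all entries in one list instead of reassigning it (struct handling only; array/map handling and the exact signature are simplifications of the diff above):

```python
from pyspark.sql.types import StructType

def recursive_flatten_complex_type(self, col_name, schema, column_list):
    # Sketch only: always append to (or recurse into) the SAME column_list
    # rather than replacing it, so entries added for earlier string-typed
    # fields are preserved.
    if not isinstance(schema.dataType, StructType):
        return column_list
    for field in schema.dataType.fields:
        if type(field.dataType) in self.COMPLEX_TYPES:
            self.recursive_flatten_complex_type(col_name + "." + field.name, field, column_list)
        else:
            column_list.append(
                {"col_name": self.backtick_col_name(col_name + "." + field.name), "type": "string"}
            )
    return column_list
```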

@@ -23,3 +23,6 @@ hive_metastore,default,tb_all_types,str_part_col,STRING,1
,default,tb_1,ip,STRING,
,default,tb_1,mac,STRING,
,default,tb_1,description,STRING,
,default,tb_2,active,BOOLEAN,
,default,tb_2,categories,"map<string,string>",

Reviewer comment (Contributor):

Here we need to add a "full_data_type" column to reflect the UC structure

@CLAassistant commented Nov 27, 2023

CLA assistant check
All committers have signed the CLA.
