
Dx 32 nested types #17

Draft · david-tempelmann wants to merge 7 commits into master

Conversation

david-tempelmann (Contributor):

A first version supporting complex types (structs at any nesting level, and arrays only up to the first level). Structs are scanned recursively, but as soon as an array (or map) is found we stop descending into that column. A rough sketch of the scanning idea follows the checklist below.

This still requires some code refactoring, but it would be great if you could review the approach.

Steps remaining:

  • fix all unit tests
  • add map-type
  • so far I've only developed and tested this locally, so the next step is to test on Databricks and logfood
  • add more tests
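
A rough, hedged sketch of the scanning idea described above (function and argument names are illustrative, not the final implementation):

```python
from pyspark.sql.types import ArrayType, MapType, StructType

# Sketch only: walk struct fields recursively and emit one entry per leaf
# column; stop descending as soon as an array or map is encountered, per
# the first-version behaviour described above.
def flatten_complex_type(col_name, data_type, column_list):
    if isinstance(data_type, StructType):
        for field in data_type.fields:
            flatten_complex_type(f"{col_name}.{field.name}", field.dataType, column_list)
    elif isinstance(data_type, (ArrayType, MapType)):
        pass  # don't continue for this column
    else:
        column_list.append({"col_name": col_name, "type": "string"})
    return column_list
```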

@david-tempelmann marked this pull request as draft on June 1, 2023, 09:28
@@ -43,6 +43,7 @@ def above_threshold(self):
"database": "table_schema",
"table": "table_name",
"column": "column_name",
"type": "data_type",

Reviewer comment (Contributor):

I renamed the columns from the source SQL query, so you don't need this rename any more

).withColumn("effective_timestamp", func.current_timestamp())
# merge using scd-type2
logger.friendly(f"Update classification table {self.classification_table_name}")

self.classification_table.alias("target").merge(
staged_updates_df.alias("source"),
"target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.tag_name = source.tag_name AND target.current = true",
"target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.data_type = source.data_type AND target.tag_name = source.tag_name AND target.current = true",

Reviewer comment (Contributor):

Why do we need to match on data_type?
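
For context, a hedged sketch of how the SCD Type 2 merge around this condition might be completed. Only the merge condition and the `effective_timestamp`/`current` columns appear in the diff; the `end_timestamp` column and the exact `whenMatched`/`whenNotMatched` clauses are assumptions:

```python
from delta.tables import DeltaTable

def apply_scd2_merge(classification_table: DeltaTable, staged_updates_df):
    # Sketch only: close out the currently-active row when the key (now
    # including data_type) matches, and insert the staged row otherwise.
    (
        classification_table.alias("target")
        .merge(
            staged_updates_df.alias("source"),
            "target.table_catalog <=> source.table_catalog "
            "AND target.table_schema = source.table_schema "
            "AND target.table_name = source.table_name "
            "AND target.column_name = source.column_name "
            "AND target.data_type = source.data_type "
            "AND target.tag_name = source.tag_name "
            "AND target.current = true",
        )
        # "end_timestamp" is an assumed column name for the row's validity end.
        .whenMatchedUpdate(set={"current": "false", "end_timestamp": "source.effective_timestamp"})
        .whenNotMatchedInsertAll()
        .execute()
    )
```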



@dataclass
class TaggedColumn:
name: str
data_type: str

Reviewer comment (Contributor):

We also need the full_data_type, which contains the full definition of composite columns
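
A minimal sketch of what that suggestion could look like on the dataclass shown above (the example value and treating it as a plain string are assumptions):

```python
from dataclasses import dataclass

@dataclass
class TaggedColumn:
    name: str
    data_type: str
    # Assumed addition: the full definition of a composite column as exposed
    # by Unity Catalog, e.g. "struct<street:string,zip:int>".
    full_data_type: str
```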

@@ -66,6 +66,15 @@ def compile_msql(self, table_info: TableInfo) -> list[SQLRow]:
temp_sql = msql
for tagged_col in tagged_cols:
temp_sql = temp_sql.replace(f"[{tagged_col.tag}]", tagged_col.name)
# TODO: Can we avoid "replacing strings" for the different types in the future? This is due to the generation of MSQL. Maybe we should rather generate SQL directly from the search method...

Reviewer comment (Contributor):

Yes, we should do that instead.
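
To illustrate the string-replacement step being discussed, here is a hypothetical example of the substitution performed above (the MSQL text, tag name, and column name are made up for illustration):

```python
# A "[pii]" placeholder in the MSQL template is replaced with the concrete,
# backticked column reference found by the search.
msql = "SELECT [pii] FROM hive_metastore.default.tb_1"
tag, column_name = "pii", "`address`.`street`"

temp_sql = msql.replace(f"[{tag}]", column_name)
assert temp_sql == "SELECT `address`.`street` FROM hive_metastore.default.tb_1"
```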

col_name_splitted = col_name.split(".")
return ".".join(["`" + col + "`" for col in col_name_splitted])

def recursive_flatten_complex_type(self, col_name, schema, column_list):

Reviewer comment (Contributor):

Oh, wow! This was more complex than I expected.

{"col_name": self.backtick_col_name(col_name + "." + field.name), "type": "string"}
)
elif type(field.dataType) in self.COMPLEX_TYPES:
column_list = self.recursive_flatten_complex_type(col_name + "." + field.name, field, column_list)

Reviewer comment (Contributor):

I think you should be appending to column_list instead of replacing column_list. Otherwise you overwrite previously appended string types
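
A hedged sketch of how the method could incorporate that fix, keeping all entries in one list instead of reassigning it (struct handling only; array/map handling and the exact signature are simplifications of the diff above):

```python
from pyspark.sql.types import StructType

def recursive_flatten_complex_type(self, col_name, schema, column_list):
    # Sketch only: always append to (or recurse into) the SAME column_list
    # rather than replacing it, so entries added for earlier string-typed
    # fields are preserved.
    if not isinstance(schema.dataType, StructType):
        return column_list
    for field in schema.dataType.fields:
        if type(field.dataType) in self.COMPLEX_TYPES:
            self.recursive_flatten_complex_type(col_name + "." + field.name, field, column_list)
        else:
            column_list.append(
                {"col_name": self.backtick_col_name(col_name + "." + field.name), "type": "string"}
            )
    return column_list
```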

@@ -23,3 +23,6 @@ hive_metastore,default,tb_all_types,str_part_col,STRING,1
,default,tb_1,ip,STRING,
,default,tb_1,mac,STRING,
,default,tb_1,description,STRING,
,default,tb_2,active,BOOLEAN,
,default,tb_2,categories,"map<string,string>",

Reviewer comment (Contributor):

Here we need to add a "full_data_type" column to reflect the UC structure

@CLAassistant commented Nov 27, 2023

CLA assistant check
All committers have signed the CLA.
