Delta housekeeping notebooks #95
Conversation
@lorenzorubi-db Nice work!
I added a number of comments, they are mostly intended to align this functionality with the rest of the library.
This looks very promising!
+ improved unit tests
@edurdevic thanks for your comments. I'd say that all of them are covered now; please let me know if you find anything missing or if new ideas come up. Note that I've removed the implementation of
Thank you for your contribution @lorenzorubi-db!
```python
# DBTITLE 1,Run the discoverx DeltaHousekeeping operation - generates an output object you can apply operations to
output = (
    dx.from_tables("lorenzorubi.*.*")
```
Uh, I missed this.
Could you please move `"lorenzorubi.*.*"` to a widget, and if you want, replace it with another example catalog name?
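A widget-based version could look like the following minimal sketch. `DEFAULT_CATALOG` and the fallback branch are my own illustration; `dbutils` only exists inside a Databricks notebook:

```python
# Hypothetical sketch: read the catalog from a notebook widget instead
# of hard-coding it. `dbutils` is only defined inside a Databricks
# notebook, so fall back to a placeholder name when it is absent.
DEFAULT_CATALOG = "main"  # assumed example catalog name

try:
    dbutils.widgets.text("catalog", DEFAULT_CATALOG)  # noqa: F821
    catalog = dbutils.widgets.get("catalog")  # noqa: F821
except NameError:
    catalog = DEFAULT_CATALOG

from_tables_pattern = f"{catalog}.*.*"
```

The resulting pattern could then be passed as `dx.from_tables(from_tables_pattern)`.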
> returns a pandas DataFrame, and converts Spark internal dfs to pandas as soon as they are manageable
> the reason being that DESCRIBE HISTORY / DESCRIBE DETAIL cannot be cached
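A rough sketch of that conversion pattern, assuming made-up column names rather than the library's actual schema:

```python
import pandas as pd

# Each per-table DESCRIBE DETAIL result is brought down to pandas as
# soon as it is small enough (its Spark source cannot be cached), and
# the per-table frames are then concatenated into one output DataFrame.
per_table_details = [
    pd.DataFrame([{"catalog": "main", "database": "db", "tableName": "t1", "numFiles": 12}]),
    pd.DataFrame([{"catalog": "main", "database": "db", "tableName": "t2", "numFiles": 3}]),
]
output_df = pd.concat(per_table_details, ignore_index=True)
```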
> TODO reconsider if it is better outside of the class
Please remove this TODO
> Would make sense only if using `map_chunked` from the `DataExplorer` object
> (otherwise tables are written one by one into Delta, with overhead)
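The overhead argument can be illustrated with a plain chunking helper (the names below are hypothetical; `map_chunked` itself belongs to discoverx's `DataExplorer`):

```python
# Writing one Delta table per result incurs a fixed per-write cost;
# grouping results into chunks amortizes that cost across each batch.
def chunked(items, size):
    """Yield successive slices of `items` of length `size`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

tables = [f"table_{i}" for i in range(10)]
batches = list(chunked(tables, 4))  # 3 write operations instead of 10
```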
> TODO create function in `DataExplorer` that uses this for a chunked
Please remove the TODO
```python
    need_optimize_df.reset_index().loc[:, ["catalog", "database", "tableName"]],
    expected_need_optimize.loc[:, ["catalog", "database", "tableName"]],
)
# TODO complete all the tests
```
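The column-wise comparison used in this test can be sketched in isolation with pandas' testing helpers (the frames below are made-up fixtures, not the project's data):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

cols = ["catalog", "database", "tableName"]
actual = pd.DataFrame(
    {"catalog": ["c"], "database": ["d"], "tableName": ["t"], "numFiles": [5]}
)
expected = pd.DataFrame({"catalog": ["c"], "database": ["d"], "tableName": ["t"]})

# Compare only the identifying columns, ignoring anything else the
# housekeeping output may carry (raises AssertionError on mismatch).
assert_frame_equal(actual.loc[:, cols], expected.loc[:, cols])
```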
Please remove the TODO
```python
def test_process_describe_history_empty_history(spark, dd_click_sales, dh_click_sales):
    dh = DeltaHousekeeping(spark)
    describe_detail_df = spark.createDataFrame(dd_click_sales)
    describe_history_df = spark.createDataFrame(dh_click_sales)
```
NIT: I like to define the DFs inline inside the tests, in order to make the tests more readable.
But that's for the next time :)
Closing this PR so that @lorenzorubi-db can re-open from his account.