Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta housekeeping notebooks #95

Conversation

edurdevic
Copy link
Contributor

No description provided.

@CLAassistant
Copy link

CLAassistant commented Dec 22, 2023

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor Author

@edurdevic edurdevic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lorenzorubi-db Nice work!
I added a number of comments, they are mostly intended to align this functionality with the rest of the library.
This looks very promising!

examples/exec_delta_housekeeping.py Show resolved Hide resolved
examples/exec_delta_housekeeping.py Show resolved Hide resolved
examples/exec_delta_housekeeping.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
discoverx/delta_housekeeping.py Outdated Show resolved Hide resolved
examples/exec_delta_housekeeping.py Outdated Show resolved Hide resolved
examples/exec_delta_housekeeping.py Outdated Show resolved Hide resolved
examples/exec_delta_housekeeping.py Outdated Show resolved Hide resolved
examples/exec_delta_housekeeping.py Outdated Show resolved Hide resolved
discoverx/delta_housekeeping.py Outdated Show resolved Hide resolved
@lorenzorubi-db
Copy link

@edurdevic thanks for your comments, I'd say that all of them are covered now, please let me know if you find anything missing or new ideas come up

note that I've removed the implementation of map_chunked and added it into #99

Copy link
Contributor Author

@edurdevic edurdevic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution @lorenzorubi-db!


# DBTITLE 1,Run the discoverx DeltaHousekeeping operation -generates an output object you can apply operations to
output = (
dx.from_tables("lorenzorubi.*.*")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, I missed this.
Could you please move "lorenzorubi.." to a widget, and if you want replace it with another example catalog name?

returns a pandas DataFrame, and converts Spark internal dfs to pandas as soon as they are manageable
the reason being that DESCRIBE HISTORY / DESCRIBE DETAIL cannot be cached

TODO reconsider if it is better outside of the class
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this TODO

Would make sense only if using map_chunked from the `DataExplorer` object
(otherwise tables are writen one by one into Delta with overhead)

TODO create function in `DataExplorer` that uses this for a chunked
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the TODO

need_optimize_df.reset_index().loc[:, ["catalog", "database", "tableName"]],
expected_need_optimize.loc[:, ["catalog", "database", "tableName"]],
)
# TODO complete all the tests
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the TODO

def test_process_describe_history_empty_history(spark, dd_click_sales, dh_click_sales):
dh = DeltaHousekeeping(spark)
describe_detail_df = spark.createDataFrame(dd_click_sales)
describe_history_df = spark.createDataFrame(dh_click_sales)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: I like to define the DFs inline inside the tests, in order to make the tests more readable.
But that's for the next time :)

@edurdevic
Copy link
Contributor Author

Closing this PR so that @lorenzorubi-db can re-open from his account

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants