Delta housekeeping notebooks #95

edurdevic · 2023-12-22T08:54:14Z

No description provided.

CLAassistant · 2023-12-22T08:54:22Z

All committers have signed the CLA.

edurdevic

@lorenzorubi-db Nice work!
I added a number of comments; they are mostly intended to align this functionality with the rest of the library.
This looks very promising!

lorenzorubi-db · 2024-01-03T18:44:48Z

@edurdevic thanks for your comments. I'd say that all of them are covered now; please let me know if you find anything missing or if new ideas come up.

Note that I've removed the implementation of map_chunked from this PR and moved it into #99.

edurdevic

Thank you for your contribution @lorenzorubi-db!

edurdevic · 2024-01-08T20:20:35Z

examples/exec_delta_housekeeping.py

+
+# DBTITLE 1,Run the discoverx DeltaHousekeeping operation -generates an output object you can apply operations to
+output = (
+  dx.from_tables("lorenzorubi.*.*")


Uh, I missed this.
Could you please move "lorenzorubi.*.*" to a widget and, if you want, replace it with another example catalog name?
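
A minimal sketch of what moving the pattern to a widget could look like. It assumes the notebook runs on Databricks, where `dbutils` is provided by the runtime. The widget name `from_tables`, the default `my_catalog.*.*`, and the stand-in classes are illustrative assumptions, not names taken from this PR; the stand-ins exist only so the snippet also runs outside Databricks.

```python
class _WidgetsStandIn:
    """Hypothetical stand-in for dbutils.widgets, used only outside Databricks."""

    def __init__(self):
        self._values = {}

    def text(self, name, defaultValue, label=""):
        # Register the default only if no value is stored for this widget yet.
        self._values.setdefault(name, defaultValue)

    def get(self, name):
        return self._values[name]


class _DbutilsStandIn:
    widgets = _WidgetsStandIn()


try:
    dbutils  # provided by the Databricks notebook runtime
except NameError:
    dbutils = _DbutilsStandIn()

# Declare the widget with an example catalog pattern, then read its current value.
dbutils.widgets.text("from_tables", "my_catalog.*.*", "from tables")
from_tables = dbutils.widgets.get("from_tables")

# In the notebook, the hard-coded pattern would then be replaced by the variable:
# dx.from_tables(from_tables)
```

With this in place the catalog pattern can be changed from the notebook UI without editing the example code.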

edurdevic · 2024-01-08T20:24:51Z

discoverx/delta_housekeeping.py

+        returns a pandas DataFrame, and converts Spark internal dfs to pandas as soon as they are manageable
+        the reason being that DESCRIBE HISTORY / DESCRIBE DETAIL cannot be cached
+
+        TODO reconsider if it is better outside of the class


Please remove this TODO

edurdevic · 2024-01-08T20:25:38Z

discoverx/delta_housekeeping.py

+        Would make sense only if using map_chunked from the `DataExplorer` object
+        (otherwise tables are writen one by one into Delta with overhead)
+
+        TODO create function in `DataExplorer` that uses this for a chunked


Please remove the TODO

edurdevic · 2024-01-08T20:30:14Z

tests/unit/delta_housekeeping_actions_test.py

+        need_optimize_df.reset_index().loc[:, ["catalog", "database", "tableName"]],
+        expected_need_optimize.loc[:, ["catalog", "database", "tableName"]],
+    )
+    # TODO complete all the tests


Please remove the TODO

edurdevic · 2024-01-08T20:32:34Z

tests/unit/delta_housekeeping_test.py

+def test_process_describe_history_empty_history(spark, dd_click_sales, dh_click_sales):
+    dh = DeltaHousekeeping(spark)
+    describe_detail_df = spark.createDataFrame(dd_click_sales)
+    describe_history_df = spark.createDataFrame(dh_click_sales)


NIT: I like to define the DFs inline inside the tests, in order to make the tests more readable.
But that's for the next time :)
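
As a rough illustration of the inline style, the sketch below declares the input rows inside the test body instead of receiving them through fixtures. The rows, the table name, and the helper `tables_without_history` are hypothetical and use plain Python so the example runs without Spark; in the real tests the same inline rows could be passed to `spark.createDataFrame`.

```python
def tables_without_history(detail_rows, history_rows):
    """Hypothetical helper: table names that appear in the detail rows but have no history rows."""
    tables_with_history = {row["name"] for row in history_rows}
    return [row["name"] for row in detail_rows if row["name"] not in tables_with_history]


def test_empty_history_inline():
    # Input data is declared inside the test, so the reader sees it next to the assertion.
    describe_detail_rows = [
        {"name": "my_catalog.default.click_sales", "numFiles": 6, "sizeInBytes": 14000},
    ]
    describe_history_rows = []  # the empty-history case under test

    result = tables_without_history(describe_detail_rows, describe_history_rows)

    assert result == ["my_catalog.default.click_sales"]


test_empty_history_inline()
```

The trade-off is some repetition across tests in exchange for not having to look up fixture definitions while reading a single test.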

edurdevic · 2024-01-08T20:44:38Z

Closing this PR so that @lorenzorubi-db can re-open from his account

lorenzorubi-db and others added 7 commits December 18, 2023 12:05

delta housekeeping initial commit

5d7889e

debugging initial version

90bab27

convert output to pandas

94629e0

debugging -convert output to pandas

543f852

DeltaHousekeepingActions object and tests

567b303

added more insights to housekeeping and refactored tests

bded305

regression and cleanup

cf4ef07

edurdevic requested a review from david-tempelmann December 22, 2023 08:55

edurdevic commented Dec 27, 2023

lorenzorubi-db and others added 4 commits January 3, 2024 14:17

move implementation of map_chunked to a separated branch

+ improved unit tests

bc303cd

readability, cleanup, follow discoverx patterns

feeafaf

debugging on cluster + adding spark session to `DeltaHousekeepingActions`

e8a1b66

simplify scan implementation & remove dependency to BeautifulSoup

e177ef4

lorenzorubi-db added 2 commits January 5, 2024 18:31

faster implementation + unit tests

023b02f

cleanup

c2b028f

edurdevic commented Jan 8, 2024

edurdevic closed this Jan 8, 2024

lorenzorubi-db mentioned this pull request Jan 9, 2024

Delta housekeeping initial version #101

Open
