Delta housekeeping initial version #101

Open: wants to merge 26 commits into base: master
Commits
5d7889e
delta housekeeping initial commit
lorenzorubi-db Dec 18, 2023
90bab27
debugging initial version
Dec 18, 2023
94629e0
convert output to pandas
lorenzorubi-db Dec 18, 2023
543f852
debugging -convert output to pandas
lorenzorubi-db Dec 18, 2023
567b303
DeltaHousekeepingActions object and tests
lorenzorubi-db Dec 19, 2023
bded305
added more insights to housekeeping and refactored tests
lorenzorubi-db Dec 21, 2023
cf4ef07
regression and cleanup
Dec 21, 2023
bc303cd
move implementation of map_chunked to a separated branch
lorenzorubi-db Jan 3, 2024
feeafaf
readability, cleanup, follow discoverx patterns
lorenzorubi-db Jan 3, 2024
e8a1b66
debugging on cluster + adding spark session to `DeltaHousekeepingActi…
Jan 3, 2024
e177ef4
simplify scan implementation & remove dependency to BeautifulSoup
lorenzorubi-db Jan 3, 2024
023b02f
faster implementation + unit tests
lorenzorubi-db Jan 5, 2024
c2b028f
cleanup
lorenzorubi-db Jan 5, 2024
0e4c8e5
cleanup and PR comments
lorenzorubi-db Jan 9, 2024
9a9fe6b
proper use of dbwidgets
lorenzorubi-db Jan 12, 2024
6c5ecf2
refactoring apply to return a single dataframe
lorenzorubi-db Jan 28, 2024
5359876
add test datasets for all housekeeping checks + bug fixes
lorenzorubi-db Feb 4, 2024
613a290
Merge branch 'master' into delta-housekeeping-notebooks
lorenzorubi-db Feb 4, 2024
9758a00
fix explain / apply methods
Feb 10, 2024
59760f9
refactoring to control output column names
Feb 10, 2024
a0d434e
refactoring to spark API -intermediate commit
Feb 11, 2024
0abe9a2
tests with DBR -nan's & timestamps
Feb 11, 2024
16e7ec6
failing test + cleanup
Feb 11, 2024
24edacb
cleanup
Feb 11, 2024
1b1de40
cleanup
Feb 11, 2024
aa671a2
remove 'reason' column from the output dfs
lorenzorubi-db Feb 16, 2024
8 changes: 6 additions & 2 deletions README.md
@@ -59,6 +59,11 @@ The properties available in table_info are
* **Maintenance**
* [VACUUM all tables](docs/Vacuum.md) ([example notebook](examples/vacuum_multiple_tables.py))
* Detect tables having too many small files ([example notebook](examples/detect_small_files.py))
* Delta housekeeping analysis ([example notebook](examples/exec_delta_housekeeping.py)), which provides:
  * stats (table size and number of files, timestamps of the latest OPTIMIZE and VACUUM operations, OPTIMIZE stats)
  * recommendations on tables that need to be OPTIMIZEd or VACUUMed
  * whether tables are OPTIMIZEd/VACUUMed often enough
  * tables that have small files, and tables for which ZORDER is not effective
* Deep clone a catalog ([example notebook](examples/deep_clone_schema.py))
* **Governance**
* PII detection with Presidio ([example notebook](examples/pii_detection_presidio.py))
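
For readers of this diff, the kind of recommendation the new housekeeping bullets describe can be sketched roughly as below. This is only an illustration under stated assumptions: `TableStats`, `recommend`, and both thresholds are hypothetical names and values, not the discoverx API or the logic implemented in this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical thresholds, chosen only for illustration.
SMALL_FILE_BYTES = 32 * 1024 * 1024   # assumed cutoff for a "small" average file
MAX_DAYS_SINCE_OPTIMIZE = 7           # assumed acceptable gap between OPTIMIZE runs


@dataclass
class TableStats:
    """Hypothetical per-table stats, mirroring the kinds of values the README lists."""
    name: str
    size_bytes: int
    num_files: int
    last_optimize: Optional[datetime]  # None if OPTIMIZE was never run


def recommend(stats: TableStats, now: datetime) -> List[str]:
    """Return human-readable housekeeping hints for one table."""
    recs: List[str] = []
    # Flag tables whose average file size is below the assumed cutoff.
    if stats.num_files > 1:
        avg_file_bytes = stats.size_bytes / stats.num_files
        if avg_file_bytes < SMALL_FILE_BYTES:
            recs.append("small files: consider OPTIMIZE")
    # Flag tables that were never optimized, or not optimized recently.
    if stats.last_optimize is None:
        recs.append("never OPTIMIZED")
    elif now - stats.last_optimize > timedelta(days=MAX_DAYS_SINCE_OPTIMIZE):
        recs.append("not OPTIMIZED recently")
    return recs
```

A table with many tiny files and no OPTIMIZE history would get both hints, while a recently optimized table with large files would get none. The actual checks and output columns are defined by the code in this PR, not by this sketch.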
@@ -91,7 +96,7 @@ from discoverx import DX
dx = DX(locale="US")
```

You can now run operations across multiple tables.
You can now run operations across multiple tables.

## Available functionality

@@ -128,4 +133,3 @@ After a `with_sql` or `unpivot_string_columns` command, you can apply the follow
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
