-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Logging performance improvements and GCP logging fix #2755
Logging performance improvements and GCP logging fix #2755
Conversation
@htahir1 , FYI - this is the GCP fix we need |
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe recent updates enhance ZenML's logging capabilities by shifting from file-based to folder-based log handling. Key changes include the introduction of functions and properties to manage log folders, refactoring of existing methods, and new tests to verify logging behavior. This ensures more efficient log retrieval and storage within ZenML pipelines. Changes
Sequence Diagram(s) (Beta)sequenceDiagram
participant User
participant StepLogsStorage
participant ArtifactStore
participant StepLauncher
User->>StepLauncher: launch()
StepLauncher->>StepLogsStorage: prepare_logs_folder_uri()
StepLogsStorage->>ArtifactStore: create_log_folder()
StepLauncher->>StepLogsStorage: write()
StepLogsStorage->>ArtifactStore: save_logs()
User->>StepLogsStorage: fetch_logs()
StepLogsStorage->>ArtifactStore: retrieve_logs()
ArtifactStore->>User: return logs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
@avishniakov Thanks!! This is an interesting fix. Can you maybe try to run a pipeline with some sort of tqdm progress bar (any keras example with .fit() will do)... this will also tell us if this branch fixes that problem? Also, for a standard pipeline, can you make a table of how performance degrades over log size and perhaps we can leave it here for future reference? |
The Keras logs looks like this. Some TQDM events are lost, but I would not care much.
|
@avishniakov Sounds good. Can we do it for more epochs even just to test it? |
Linked to #2211 |
@htahir1 @strickvl added some numbers into the PR description, as requested. I can confirm that TQDM is resolved and overall performance for logging intense tasks is way better. For not logging heavy jobs without TQDM the improvement would be quite minimal though, but the GCP logging is fixed from now on. |
@avishniakov thats fantastic !! Love it |
@coderabbitai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Files selected for processing (5)
- src/zenml/logging/step_logging.py (8 hunks)
- src/zenml/orchestrators/step_launcher.py (1 hunks)
- src/zenml/zen_server/routers/steps_endpoints.py (3 hunks)
- tests/integration/functional/steps/test_logging.py (1 hunks)
- tests/integration/functional/zen_stores/test_zen_store.py (4 hunks)
Additional context used
Path-based instructions (5)
tests/integration/functional/steps/test_logging.py (1)
Pattern
tests/**/*.py
: "Assess the unit test code employing the PyTest testing framework. Confirm that:
- The tests adhere to PyTest's established best practices.
- Test descriptions are sufficiently detailed to clarify the purpose of each test."
src/zenml/zen_server/routers/steps_endpoints.py (1)
Pattern
src/zenml/**/*.py
: Review the Python code for conformity with Python best practices.src/zenml/logging/step_logging.py (1)
Pattern
src/zenml/**/*.py
: Review the Python code for conformity with Python best practices.src/zenml/orchestrators/step_launcher.py (1)
Pattern
src/zenml/**/*.py
: Review the Python code for conformity with Python best practices.tests/integration/functional/zen_stores/test_zen_store.py (1)
Pattern
tests/**/*.py
: "Assess the unit test code employing the PyTest testing framework. Confirm that:
- The tests adhere to PyTest's established best practices.
- Test descriptions are sufficiently detailed to clarify the purpose of each test."
Additional comments not posted (10)
tests/integration/functional/steps/test_logging.py (4)
15-20
: LGTM! The test functionsteps_writing_above_the_count_limit
correctly simulates logging above the count limit.
22-27
: LGTM! The test functionstep_writing_above_the_time_limit
correctly simulates logging above the time limit.
29-60
: LGTM! The functiontest_that_save_to_file_called_multiple_times_on_exceeding_limits
effectively tests thesave_to_file
method's call frequency under specified conditions.
62-94
: LGTM! The functiontest_that_small_files_are_merged_together
correctly tests the merging of small log files into a single file.src/zenml/zen_server/routers/steps_endpoints.py (1)
30-30
: LGTM! The updatedget_step_logs
function correctly utilizes thefetch_logs
method to retrieve logs, aligning with the PR's objectives to improve log handling.Also applies to: 262-262
src/zenml/logging/step_logging.py (4)
Line range hint
62-99
: LGTM! The functionprepare_logs_folder_uri
effectively handles the creation and management of log directories, aligning with the PR's objectives to improve log handling.
102-147
: LGTM! The functionfetch_logs
correctly retrieves logs from the artifact store and handles errors appropriately, improving the robustness of log retrieval.
Line range hint
155-298
: LGTM! TheStepLogsStorage
class has been effectively refactored to enhance log buffering, file saving, and merging functionalities, aligning with the PR's objectives.
Line range hint
298-351
: LGTM! TheStepLogsStorageContext
class correctly manages the logging context, ensuring logs are captured, saved, and merged appropriately.src/zenml/orchestrators/step_launcher.py (1)
163-163
: LGTM! The integration ofprepare_logs_folder_uri
in thelaunch
method correctly handles log directories, enhancing the logging capabilities during step execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing changes! The benchmarks look great!
@schustmi @stefannica I quite heavily revisited fetching logic, used lighter ops, like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested two minor improvements. Good to go either way.
A few tests failed during the Install steps - will run all on release already. |
Describe changes
I fixed how we store log files to overcome the limitations of some filesystems, where files are immutable, so appending to them is not working properly. Specifically, it can be a GCS filesystem.
With the fix, we work with logs as follows:
Performance improvements
There was one issue identified with the previous logging flow: the flush wrapper was calling save to file every time, basically dumping every line. Flush happens on any call of logger, for instance (didn't check for prints).
I changed that to respect the logic of the buffer we have and on the GCP stack, it shows a runtime of 1.959s for a pipeline running just 100 log calls versus 33.216s using the code from the develop branch.
If I run
time python3 run.py --feature-pipeline --training-pipeline --inference-pipeline --no-cache
on local with GCP artifact store it will be 1:02.79 total fordevelop
versus 52.673 total for this branch, so you can see that it is not that dramatic, but it gets worse the more logging you do, as the synthetic example showed.Some more numbers
For the pipeline with just one step running with GCP artifact store as follows:
For the Keras MNIST example (also connected to GCP):
For non-log-heavy payload (also with GCP):
Pre-requisites
Please ensure you have done the following:
develop
and the open PR is targetingdevelop
. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.Types of changes
Summary by CodeRabbit
New Features
Refactor
prepare_logs_uri
toprepare_logs_folder_uri
.Tests