Enable control over MLFlowLogger run_name str to match a pre-existing tag run_name in MLFlow and resume model training #3263

Open
wolliq opened this issue May 7, 2024 · 1 comment
Labels
enhancement New (engineering) enhancements, such as features or API changes.

Comments


wolliq commented May 7, 2024

🚀 Feature Request

We would like to resume a model training run by passing the run_name from the YAML config to the MLFlowLogger.

Motivation

Today MLFlowLogger receives the run_name string from the YAML config, but there is no control over it: a random string is automatically appended at runtime, e.g. my-test => my-test-sgftKr.

In the MLFlowLogger docs:

        run_name: (str, optional): MLflow run name. If not set it will be the same as the Trainer run name

but it always gets overridden by the randomly suffixed value after YAML parsing.
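
For reference, a minimal sketch of constructing the logger directly (the experiment and run names here are illustrative; in our setup the same run_name comes from the YAML config):

    from composer.loggers import MLFlowLogger

    # Illustrative values; in practice these are read from the YAML config.
    logger = MLFlowLogger(
        experiment_name="my-experiment",
        run_name="my-test",   # at runtime this ends up as e.g. "my-test-sgftKr"
    )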

In MLFlowLogger, the filter string uses the randomly suffixed run_name, so it is not possible to match a pre-existing run:

    def _start_mlflow_run(self, state):
        import mlflow

        env_run_id = os.getenv(
            mlflow.environment_variables.MLFLOW_RUN_ID.name,  # pyright: ignore[reportGeneralTypeIssues]
            None,
        )
        if env_run_id is not None:
            self._run_id = env_run_id
        elif self.resume:
            # Search for an existing run tagged with this Composer run if `self.resume=True`.
            assert self._experiment_id is not None
            run_name = self.tags['run_name']
            existing_runs = mlflow.search_runs(
                experiment_ids=[self._experiment_id],
                filter_string=f'tags.run_name = "{run_name}"',    # <<< HERE
                output_format='list',
            )
...

As explained above, {run_name} always has a random string appended to it for each new run.
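
To make the failure concrete, here is a rough sketch (with hypothetical run names) of what the resume lookup effectively does: the tag stored on the original run carries one random suffix, while the resumed run builds a different one, so the exact-match filter never hits:

    import mlflow

    original_run_name = "my-test-sgftKr"  # suffix generated on the first run
    resumed_run_name = "my-test-pQxWvN"   # different suffix generated on resume

    experiment = mlflow.set_experiment("my-experiment")  # illustrative experiment
    existing_runs = mlflow.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=f'tags.run_name = "{resumed_run_name}"',
        output_format='list',
    )
    # existing_runs is empty because the stored tag is "my-test-sgftKr",
    # so the logger falls through and creates a brand-new run.
    print(len(existing_runs))  # 0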

[Optional] Implementation

A possible solution could be to disable the random string generation by defining another environment variable during YAML parsing, such as:

mlflow_tag_run_name=True

so that when the resume action is called, the run name is used to match the tag run_name in MLflow,

or directly

mlflow_tag_run_name="my-run-asdasd"

so that the run_name string is passed as-is to MLFlowLogger to handle the resume. A rough sketch of this second option is given below.
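
As a rough sketch only (not the actual Composer implementation; the environment variable name is the one proposed above), the resume lookup could prefer an explicitly supplied tag name over the suffixed run name:

    import os

    def _resolve_resume_run_name(tags: dict) -> str:
        # Hypothetical: honor an explicit run name for the resume lookup.
        explicit_name = os.getenv("MLFLOW_TAG_RUN_NAME")
        if explicit_name:
            return explicit_name        # use the user-provided name as-is
        return tags['run_name']         # current behavior: suffixed Composer run name

The filter string `tags.run_name = "{run_name}"` would then match the pre-existing run instead of creating a new one.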

Additional context

This is for a use case where we run training on the MosaicML platform and log to MLflow on the Databricks platform.
Checkpointing works fine, but the loss logging is wrong and split across runs because the mismatch caused by the random run_name forces MLflow to create a new run id for the resumed training.

@wolliq wolliq added the enhancement New (engineering) enhancements, such as features or API changes. label May 7, 2024
@wolliq wolliq changed the title Enable control over MLFlowLogger run_name str to use a pre-existing run_name and resume a model training Enable control over MLFlowLogger run_name str to match a pre-existing tag run_name in MLFlow and resume model training May 7, 2024
@mvpatel2000 (Contributor) commented

@wolliq would you mind sharing your YAML? Are you saying you directly pass the run name to the mlflow logger but it is always overridden?
