Allow additional keyword args to be passed to optuna hyperparameter search #31923

Closed
JanetVictorious opened this issue Jul 12, 2024 · 0 comments · Fixed by #31924
Labels
Feature request Request for a new feature

Feature request

The issue of CUDA running out of memory during hyperparameter optimization has been raised before (old issue), but no implementation has been made to remedy it.

The fix would be quite simple: allow the additional argument gc_after_trial to be passed to hyperparameter_search():

# Module-level imports from the transformers source, added here so the snippet
# is self-contained.
import os
import pickle
from dataclasses import asdict

import torch

from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, BestRun
from transformers.training_args import ParallelMode


def run_hp_search_optuna(trainer, n_trials: int, direction: str, **kwargs) -> BestRun:
    import optuna

    if trainer.args.process_index == 0:

        def _objective(trial, checkpoint_dir=None):
            checkpoint = None
            if checkpoint_dir:
                for subdir in os.listdir(checkpoint_dir):
                    if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                        checkpoint = os.path.join(checkpoint_dir, subdir)
            trainer.objective = None
            if trainer.args.world_size > 1:
                if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
                    raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
                trainer._hp_search_setup(trial)
                torch.distributed.broadcast_object_list(pickle.dumps(trainer.args), src=0)
                trainer.train(resume_from_checkpoint=checkpoint)
            else:
                trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
            return trainer.objective

        timeout = kwargs.pop("timeout", None)
        n_jobs = kwargs.pop("n_jobs", 1)
        gc_after_trial = kwargs.pop("gc_after_trial", False)  # <--- Added arg
        directions = direction if isinstance(direction, list) else None
        direction = None if directions is not None else direction
        study = optuna.create_study(direction=direction, directions=directions, **kwargs)
        study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs, gc_after_trial=gc_after_trial)  # <--- Added arg
        if not study._is_multi_objective():
            best_trial = study.best_trial
            return BestRun(str(best_trial.number), best_trial.value, best_trial.params)
        else:
            best_trials = study.best_trials
            return [BestRun(str(best.number), best.values, best.params) for best in best_trials]
    else:
        for i in range(n_trials):
            trainer.objective = None
            args_main_rank = list(pickle.dumps(trainer.args))
            if trainer.args.parallel_mode != ParallelMode.DISTRIBUTED:
                raise RuntimeError("only support DDP optuna HPO for ParallelMode.DISTRIBUTED currently.")
            torch.distributed.broadcast_object_list(args_main_rank, src=0)
            args = pickle.loads(bytes(args_main_rank))
            for key, value in asdict(args).items():
                if key != "local_rank":
                    setattr(trainer.args, key, value)
            trainer.train(resume_from_checkpoint=None)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
        return None

Then, in the training script, trainer.hyperparameter_search() would accept the additional argument gc_after_trial:

best_trial = trainer.hyperparameter_search(
    direction='minimize',
    backend='optuna',
    hp_space=_hp_space,
    n_trials=model_args.hpo_trials,
    gc_after_trial=False,
)
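
For completeness, _hp_space above is a user-defined optuna search-space function passed via hp_space; a minimal sketch (the hyperparameter names and ranges below are only illustrative, not part of this proposal):

def _hp_space(trial):
    # Illustrative search space; parameter names and ranges are placeholders.
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }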

Motivation

I'm experiencing CUDA out-of-memory issues when running trainer.hyperparameter_search() with the optuna backend. After investigating a bit, the recommendation from optuna is to pass gc_after_trial=True as a parameter to study.optimize() (optuna reference).
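
For reference, this is what the optuna-side recommendation looks like when calling study.optimize() directly; a minimal sketch with a placeholder objective:

import optuna

def objective(trial):
    # Placeholder objective; in the Trainer case this would wrap train()/evaluate().
    x = trial.suggest_float("x", -10, 10)
    return x**2

study = optuna.create_study(direction="minimize")
# gc_after_trial=True triggers garbage collection after each trial, which is
# optuna's suggested mitigation for out-of-memory issues across trials.
study.optimize(objective, n_trials=20, gc_after_trial=True)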

My proposal is to allow gc_after_trial to be passed as a kwarg and picked up by the study.optimize() call in the run_hp_search_optuna() method (source code).

I would like to be able to pass this argument as part of trainer.hyperparameter_search() with optuna backend.

Your contribution

I can submit a PR for this change if there is value in this feature.
