Slurm integration #1974

Open
rhc54 opened this issue May 2, 2024 · 2 comments

@rhc54
Contributor

rhc54 commented May 2, 2024

We need to define a better abstraction layer between Slurm and PRRTE for discovering allocated resources and launching PRRTE's daemons. Slurm needs the freedom to innovate, changing `srun` command line options and Slurm environment variables as needed, without being constrained by the need to stay backward compatible with PRRTE. PRRTE, on the other hand, needs a stable interface through which it can discover allocations and reliably launch daemons, regardless of future changes in Slurm.

After discussion, we have agreed that creation of a new library whose APIs can provide the necessary abstraction should resolve the dilemma. This issue is being created so that folks can track progress on the solution.

@rhc54
Contributor Author

rhc54 commented May 2, 2024

See https://bugs.schedmd.com/show_bug.cgi?id=19774 for progress on the Slurm side.

@rhc54
Contributor Author

rhc54 commented Aug 5, 2024

As an interim step towards helping people understand observed problems, we have added a warning to PRRTE. It is emitted when operating under Slurm and the MCA parameter holding the `srun` command line arguments is set in the environment. It looks like this:

The Slurm process starter for PRTE detected the presence of an MCA
parameter in the environment that assigns custom command line arguments
to the `srun` command used to start PRTE's daemons on remote nodes:

  Parameter value: %s

This warning is provided to alert you (the user) to a perhaps
unintentional setting of command line arguments, or the unseen
overriding of your intended arguments by Slurm.

Background: Starting with Slurm version 23.11, a command line argument
(`--external-launcher`) was added to `srun` to indicate that the
command was being initiated from within a third-party launcher (e.g.,
`prte` or `prterun`). This allows Slurm to essentially freely modify
the `srun` command line while retaining a backward compatibility
capability when explicitly told to use it.  Notably, the Slurm
environment does this by automatically setting the
PRTE_MCA_plm_slurm_args environment variable to pass in its own
command line arguments.  This has the side effect of overriding most
user- or system-level settings.  Note that arguments passed on the
PRTE command line will override any Slurm setting of the
PRTE_MCA_plm_slurm_args environment variable, but with potentially
undesirable side effects if newer versions of `srun` misinterpret or
fail to understand the user-specified arguments.

If the setting of the MCA parameter was intentional, or if the
parameter looks acceptable to you, then please set the following
MCA parameter to disable this warning:

  Environment: PRTE_MCA_plm_slurm_disable_warning=true
  Cmd line: --prtemca plm_slurm_disable_warning 1
  Default MCA param file: plm_slurm_disable_warning = true

If you did not intentionally set the identified command line
arguments and do not wish them to be used, then set the
following MCA param to have them ignored:

  Environment: PRTE_MCA_plm_slurm_ignore_args=true
  Cmd line: --prtemca plm_slurm_ignore_args 1
  Default MCA param file: plm_slurm_ignore_args = true

Note that if you wish to provide custom `srun` command line
arguments and are finding them being overridden by Slurm, you
can ensure that your values are used by setting them with the
following param:

  Environment: PRTE_MCA_plm_slurm_force_args=foo
  Cmd line: --prtemca plm_slurm_force_args foo
  Default MCA param file: plm_slurm_force_args = foo

Note that you may need to add the `--external-launcher` option
to your provided args to ensure that `srun` properly functions
if you are using a relatively recent release of Slurm.

This warning will be removed once we have the library integration. Until then, it will hopefully relieve some confusion and provide a path forward for affected organizations.
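
For anyone scripting around this, here is a minimal sketch of preparing the environment for the "force" case described in the warning text. It is not part of PRRTE: the helper name `forced_srun_env` is hypothetical, and it assumes that the parameter value is a whitespace-separated argument list and that `--external-launcher` should be added on Slurm 23.11 and later (the warning only says it may be needed). `--cpu-bind=none` is just an example `srun` option.

```python
EXTERNAL_LAUNCHER = "--external-launcher"


def forced_srun_env(custom_args, slurm_version, silence_warning=False):
    """Build environment entries that force custom srun arguments.

    Hypothetical helper, not a PRRTE API. `slurm_version` is a
    (major, minor) tuple such as (23, 11).
    """
    # Assumption: the MCA value is a whitespace-separated list of arguments.
    args = custom_args.split()

    # Slurm 23.11 introduced --external-launcher. Assumption: add it on
    # that version and later whenever the caller left it out.
    if slurm_version >= (23, 11) and EXTERNAL_LAUNCHER not in args:
        args.insert(0, EXTERNAL_LAUNCHER)

    env = {"PRTE_MCA_plm_slurm_force_args": " ".join(args)}

    # Optionally suppress the warning, using the parameter named in its text.
    if silence_warning:
        env["PRTE_MCA_plm_slurm_disable_warning"] = "true"
    return env


if __name__ == "__main__":
    import os

    # Merge into a copy of the current environment before launching prterun.
    launch_env = dict(os.environ)
    launch_env.update(forced_srun_env("--cpu-bind=none", (23, 11)))
    print(launch_env["PRTE_MCA_plm_slurm_force_args"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when starting `prterun`. The same values can also be given on the command line or in the default MCA param file, as listed in the warning.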
