no qos driving to invalid qos specification #258

Closed
solene-evain opened this issue Jul 22, 2024 · 6 comments

@solene-evain

Hi everyone,
I want to do deduplication, so for now I'm running tests using minhash_deduplication.py. I'm using a server where I need to add account and constraint info, so I added it in the script (modifying slurm.py as well). My problem now is that I cannot specify any qos for that server: it is set automatically...
I tried commenting out everything related to qos in those two scripts, but I still have this error:

```
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh3"
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh2"
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh1"
2024-07-22 21:46:08.591 | INFO | datatrove.executor.slurm:launch_job:270 - Launching Slurm job mh1 (1 tasks) with launch script "/lus/work/CT10/lig3801/sevain/try_datatrove//signatures/launch_script.slurm"
sbatch: error: INFO : As you didn't ask threads_per_core in your request: 2 was taken as default
sbatch: error: INFO : As you didn't ask ntasks or ntasks_per-node in your request, 1 task was taken as default
sbatch: error: Batch job submission failed: Invalid qos specification
Traceback (most recent call last):
  File "/lus/work/CT10/lig3801/sevain/./minhash_deduplication.py", line 116, in <module>
    stage4.run()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 188, in run
    self.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 283, in launch_job
    self.job_id = launch_slurm_job(launch_file_contents, *args)
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 375, in launch_slurm_job
    return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--export=NONE,RUN_OFFSET=0', '/tmp/tmpnif55bvt']' returned non-zero exit status 1.
```
How can I have a qos problem when I'm not supposed to specify one?
It's driving me insane. If anyone could provide any help, I would be grateful!
Thanks
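
For context, here is a minimal sketch of the kind of executor setup I mean, with placeholder values only. It assumes the stages in minhash_deduplication.py use `SlurmPipelineExecutor` from `datatrove.executor.slurm` and that the installed version accepts an `sbatch_args` dict of extra `#SBATCH` options (check the constructor in slurm.py for your version); if so, account/constraint could go there instead of being hard-coded in slurm.py:

```python
# Minimal sketch, placeholder values only. Assumes the installed
# SlurmPipelineExecutor accepts an `sbatch_args` dict of extra
# "#SBATCH --<key>=<value>" options; verify against your slurm.py.
from datatrove.executor.slurm import SlurmPipelineExecutor

stage1 = SlurmPipelineExecutor(
    job_name="mh1",
    pipeline=[...],              # the minhash signature stage from the example script
    tasks=1,
    time="00:20:00",
    logging_dir="./signatures",
    sbatch_args={
        "account": "XXX",        # placeholder account
        "constraint": "XXX",     # placeholder partition/constraint name
    },
)
```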

@guipenedo
Collaborator

Hi,
can you try adding a print here https://github.com/huggingface/datatrove/blob/main/src/datatrove/executor/slurm.py#L367 so that we can see the contents of the generated sbatch script?
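
Something like this (the body below is reconstructed from the `launch_slurm_job` frames in your traceback, so the exact code in your installed version may differ slightly):

```python
import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args):
    # Debug print: dump the generated sbatch script so we can inspect the
    # "#SBATCH" header lines that are actually being submitted.
    print(launch_file_contents)
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
```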

@solene-evain
Author

solene-evain commented Jul 23, 2024

Hi @guipenedo,

here is the content of the generated sbatch script:

```bash
#!/bin/bash

#SBATCH --account=XXX(hiddenAccount)XXX

#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --constraint=XXX(hiddenPartitionName)XXX
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --job-name=mh1
#SBATCH --time=00:20:00
#SBATCH --output=.//signatures/slurm_logs/%A_%a.out
#SBATCH --error=.//signatures/slurm_logs/%A_%a.out
#SBATCH --array=0-0
#SBATCH --mail-type=ALL
#SBATCH --mail-user=XXX(hiddenMail)XXX
echo "Starting data processing job mh1"
conda init bash
conda activate datatrove
source ~/.bashrc
set -xe
export PYTHONUNBUFFERED=TRUE
srun -l launch_pickled_pipeline /lus/work/try_datatrove//signatures/executor.pik
```

@guipenedo
Collaborator

guipenedo commented Jul 23, 2024

It seems that indeed no --qos is being set, but maybe you actually have to set one on your cluster? Also, it seems the actual number of total tasks isn't being set. Did you also comment out that part? Ok, never mind, I just saw --array.

@solene-evain
Author

According to the server documentation, "do not try to specify any qos, it's done automatically". So, just like me, you can't see any other mention of qos beyond what I already commented out in the two scripts I mentioned? I also had a look at the imported libraries just in case, but I couldn't find anything.

(many thanks for the help)
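
For what it's worth, here is a hypothetical sketch (not the actual datatrove code) of the kind of check I was looking for in slurm.py, i.e. only emitting `#SBATCH` lines for options that actually have a value, so an empty qos never produces a --qos line:

```python
def build_sbatch_header(options: dict) -> str:
    # Hypothetical helper, not datatrove's real header-building code: skip any
    # option left empty/None so e.g. an unset qos produces no "--qos" line.
    return "\n".join(
        f"#SBATCH --{name}={value}" for name, value in options.items() if value
    )


print(build_sbatch_header({"job-name": "mh1", "time": "00:20:00", "qos": ""}))
# prints only the job-name and time lines; no --qos line is emitted
```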

@guipenedo
Collaborator

I think your error message can also mean that the specific combination of resources you are requesting is not allowed. I suggest you send your sbatch script to the cluster admins and ask them if they can spot any issues.

@solene-evain
Author

Thank you for the advice. I contacted them yesterday and I'm waiting for an answer!
