Internal error when cancelling jobs that are submitted but not yet queued #2

Closed
lupreCSC opened this issue May 20, 2024 · 2 comments

@lupreCSC

According to the HEAppE documentation:

The number of functional accounts depends on the number of isolated jobs you plan to apply. For instance, if you want to run five jobs in parallel, you need to have one "master functional account" and five "functional accounts".

In practice, if more than one job is submitted while only one HPC "functional account" is available, it appears that the HEAppE API submits them one after the other.
For example, after submitting two jobs (say with IDs 10 and 11) using /heappe/JobManagement/SubmitJob, we observe on the cluster that only one is running, while the HEAppE API logs repeatedly state:

HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic - User <API user> is submitting the job with info Id 11

While in this state, if we try to cancel job 11 using /heappe/JobManagement/CancelJob, we get an internal server error reply (500) with the message "Problem occured! Contact the administrators."

From the API logs it appears that HEAppE tries to cancel the job on the cluster, but because the job was not yet queued there, it has no Slurm job ID assigned. As a result, scancel is invoked without a job ID and fails:

INFO  2024-05-20 12:53:27 HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic - User <API user> is canceling the job with info Id 11
INFO  2024-05-20 12:53:27 HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmSchedulerAdapter - Cancel jobs "", command "bash -c 'scancel ';", message "Job cancelled manually by the client." 
ERROR 2024-05-20 12:53:28 HEAppE.RestApi.ExceptionMiddleware - SSH command error: 'scancel: error: No job identification provided
' Error code: '1' SSH command: 'bash -c 'scancel ';'.
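
To illustrate the failure mode in the log above, here is a minimal Python sketch of a guard that skips the scheduler call when a job has no Slurm job ID yet. This is not HEAppE's code: the function names `build_scancel_command` and `cancel_job`, the `run_remote` callback, and the returned status strings are all hypothetical and only stand in for whatever the middleware does internally.

```python
import shlex
from typing import Callable, Optional, Sequence


def build_scancel_command(scheduler_job_ids: Sequence[Optional[str]]) -> Optional[str]:
    """Build an scancel command line, or return None if no job ID is known.

    Hypothetical helper: empty or missing IDs are dropped so that a bare
    'scancel ' (the command seen in the log) is never produced.
    """
    ids = [str(j).strip() for j in scheduler_job_ids if j is not None and str(j).strip()]
    if not ids:
        return None
    return "scancel " + " ".join(shlex.quote(j) for j in ids)


def cancel_job(scheduler_job_id: Optional[str], run_remote: Callable[[str], None]) -> str:
    """Cancel a job, contacting the cluster only if the job was queued there.

    run_remote is an assumed callback that executes a command on the cluster
    (for example over SSH). The return values are illustrative labels.
    """
    command = build_scancel_command([scheduler_job_id])
    if command is None:
        # Job was submitted to the API but never reached the scheduler:
        # nothing to cancel on the cluster, so only local state would change.
        return "canceled-before-queue"
    run_remote(command)
    return "canceled-on-cluster"
```

With such a guard, cancelling job 11 in the state described above would take the first branch and never run scancel, instead of surfacing the SSH error as a 500 response.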
@jkonvicka
Contributor

Dear @lupreCSC,
Thank you for your feedback regarding the job submission and cancellation process in HEAppE in Exclusive Account pool mode.

I am pleased to inform you that a fix for this problem will be included in the next release, which we are currently working on.

Thank you for your patience and understanding.

Best regards,
Jakub Konvicka

@jkonvicka
Contributor

Hi @lupreCSC,
the fix is included in the new HEAppE release (V4.3.0).

@vsvaton closed this as completed Sep 16, 2024