Internal error when cancelling jobs that are submitted but not yet queued #2

lupreCSC · 2024-05-20T11:06:02Z

According to the HEAppE documentation:

The number of functional accounts depends on the number of isolated jobs you plan to apply. For instance, if you want to run five jobs in parallel, you need to have one "master functional account" and five "functional accounts".

In practice, if more than one job is submitted while only one HPC "functional account" is available, it appears that the HEAppE API submits them one after the other.
E.g., after submitting two jobs (say with IDs 10 and 11) using /heappe/JobManagement/SubmitJob, we observe on the cluster that only one is running, while the HEAppE API logs repeatedly state:

HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic - User <API user> is submitting the job with info Id 11

While in this state, if we try to cancel job 11 using /heappe/JobManagement/CancelJob, we get an internal server error reply (500) with the message "Problem occured! Contact the administrators."

From the API logs it appears that HEAppE tries to cancel the job on the cluster, but because the job was not yet queued there, it has no Slurm job ID assigned, which leads to an error from the scancel command:

INFO  2024-05-20 12:53:27 HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic - User <API user> is canceling the job with info Id 11
INFO  2024-05-20 12:53:27 HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmSchedulerAdapter - Cancel jobs "", command "bash -c 'scancel ';", message "Job cancelled manually by the client." 
ERROR 2024-05-20 12:53:28 HEAppE.RestApi.ExceptionMiddleware - SSH command error: 'scancel: error: No job identification provided
' Error code: '1' SSH command: 'bash -c 'scancel ';'.
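
For illustration, here is a minimal sketch of the kind of guard that would avoid the empty scancel call seen in the log above. This is not HEAppE's actual code or its fix; the function name `build_scancel_command` and its input shape are hypothetical, and the command string only mimics the format shown in the log. The idea is simply to skip the scheduler call when no job has a Slurm job ID yet, so the caller can cancel the job on the API side instead.

```python
from typing import Iterable, Optional


def build_scancel_command(scheduler_job_ids: Iterable[Optional[str]]) -> Optional[str]:
    """Build a scancel command, or return None if there is nothing to cancel.

    Hypothetical helper: a job that was submitted to the API but not yet
    queued on the cluster has no Slurm job ID (None or an empty string).
    """
    # Keep only IDs that are actually set; drop None and blank strings.
    ids = [str(job_id).strip() for job_id in scheduler_job_ids
           if job_id is not None and str(job_id).strip()]
    if not ids:
        # Nothing is known to the scheduler, so do not call scancel at all.
        return None
    # Command format mimics the one in the log above (an assumption).
    return "bash -c 'scancel " + " ".join(ids) + "';"
```

With this guard, a caller would receive None for job 11 in the scenario above and could mark the job as cancelled locally rather than running scancel with no arguments.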


jkonvicka · 2024-05-31T07:48:04Z

Dear @lupreCSC,
Thank you for your feedback regarding the job submission and cancellation process in HEAppE in Exclusive Account pool mode.

I am pleased to inform you that a fix for this problem will be included in the next release, which we are currently working on.

Thank you for your patience and understanding.

Best regards,
Jakub Konvicka

jkonvicka · 2024-09-16T10:28:51Z

Hi @lupreCSC,
the fix was included in the new HEAppE release (V4.3.0).

vsvaton closed this as completed Sep 16, 2024
