
Update DOCKER_IMAGE tag in testslurm.yml #697

Merged · 13 commits merged into nipype:master on Sep 18, 2023

Conversation

adi611
Contributor

@adi611 adi611 commented Sep 8, 2023

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

Update DOCKER_IMAGE tag in testslurm.yml from latest to 21.08.6

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

@codecov

codecov bot commented Sep 8, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.32% ⚠️

Comparison is base (31aea01) 83.31% compared to head (e797007) 82.99%.
Report is 15 commits behind head on master.

❗ Current head e797007 differs from pull request most recent head afeb705. Consider uploading reports for the commit afeb705 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #697      +/-   ##
==========================================
- Coverage   83.31%   82.99%   -0.32%     
==========================================
  Files          22       22              
  Lines        4873     4894      +21     
  Branches     1401        0    -1401     
==========================================
+ Hits         4060     4062       +2     
- Misses        809      832      +23     
+ Partials        4        0       -4     
Flag        Coverage            Δ
unittests   82.99% <100.00%>    (-0.32%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed          Coverage            Δ
pydra/utils/hash.py    95.03% <100.00%>    (+2.17%) ⬆️

... and 5 files with indirect coverage changes


@satra
Contributor

satra commented Sep 8, 2023

btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm

you could drop the python versions lower than 3.8.

@adi611
Contributor Author

adi611 commented Sep 9, 2023

btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm

you could drop the python versions lower than 3.8.

Sure, I'll start working on it.

@adi611
Contributor Author

adi611 commented Sep 11, 2023

I built the Docker image; it can be found here.

  • Removed the python versions 3.6 and 3.7.
  • Getting the error "This cluster linux already exists. Not adding." at the Display previous jobs with sacct step (logs here), so I removed the following:
docker exec slurm bash -c "sacctmgr -i add cluster name=linux \
  && supervisorctl restart slurmdbd \
  && supervisorctl restart slurmctld"
  • Some tests are failing with the following errors:
rc, stdout, stderr = await read_and_display_async(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: coroutine ignored GeneratorExit
RuntimeError: Could not extract job ID
=========================== short test summary info ============================
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-list-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-array-splitter0-state_splitter0-state_rpn0-expected0-expected_ind0] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_inputspec_state_1a[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-array-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-mixed-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_6[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_6a[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter2-a-state_splitter2-state_rpn2-state_combiner2-state_combiner_all2-NA.b-state_rpn_final2-expected2-expected_val2] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter3-b-state_splitter3-state_rpn3-state_combiner3-state_combiner_all3-NA.a-state_rpn_final3-expected3-expected_val3] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter4-combiner4-state_splitter4-state_rpn4-state_combiner4-state_combiner_all4-None-state_rpn_final4-expected4-expected_val4] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_outputspec_8c[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_state_outputspec_1[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_6[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_workflow.py::test_wf_3nd_st_2[slurm] - RuntimeError: Could not find results of 'mult' node in a sub-directory name...
FAILED pydra/pydra/engine/tests/test_workflow.py::test_wfasnd_st_2[slurm] - ValueError: Tasks ['wfnd'] raised an error

= 16 failed, 910 passed, 88 skipped, 7 xfailed, 7 warnings, 1 rerun in 935.96s (0:15:35) =
  • The GA workflow file I used can be found here

@satra
Contributor

satra commented Sep 11, 2023

nice work @adi611 - it may be good to track down those specific tests to see what's going on, perhaps by executing and debugging why the slurm return doesn't have a job id.

at least we have an up to date slurm container now! for this PR perhaps change to your newly built slurm container.

@djarecka
Collaborator

@adi611 - Have you tried to repeat the run? Do you have the errors every time you run? I've run the test with your new image on my laptop and can't reproduce the errors...

But I've run the GA in this PR twice and it seems to work fine.

fyi, I still don't have access to the MIT slurm computers, so I can't compare.

@adi611
Contributor Author

adi611 commented Sep 11, 2023

Should I create a PR to run the tests using all the available python versions (3.8.16, 3.9.16, 3.10.9, 3.11.1) for the container or using just the default version (3.11.1)?

@satra
Contributor

satra commented Sep 11, 2023

let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.

@adi611
Contributor Author

adi611 commented Sep 11, 2023

@adi611 - Have you tried to repeat the run? Do you have the errors every time you run? I've run the test with your new image on my laptop and can't reproduce the errors...

But I've run the GA in this PR twice and it seems to work fine.

fyi, I still don't have access to the MIT slurm computers, so I can't compare.

Yes I did re-run the workflow but I still got the same errors

@adi611
Contributor Author

adi611 commented Sep 11, 2023

let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.

Ok sure.

@satra
Contributor

satra commented Sep 12, 2023

@adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just check that test. also i think @djarecka tested the original slurm container not the new one.

@djarecka
Collaborator

djarecka commented Sep 12, 2023

@satra - I also tested the new one

@adi611 - you can also try to remove -n auto from the pytest command

@djarecka
Collaborator

just want to confirm that with -n auto I also see errors running in the container on my laptop (earlier I missed the fact that GA runs with -n).

I don't understand why -n leads to the error in this case, but I would say that if running all the tests serially doesn't lead to the issue, we could go with that for now

@adi611
Contributor Author

adi611 commented Sep 12, 2023

I think there may be some confusion.

@adi611
Contributor Author

adi611 commented Sep 12, 2023

@adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just check that test. also i think @djarecka tested the original slurm container not the new one.

Yes, the errors occur even when limiting pytest to a single test from the list of failed tests, as can be seen here.

@djarecka
Collaborator

djarecka commented Sep 12, 2023 via email

@djarecka
Collaborator

@adi611 - I've just checked the GA reports and I see that there is -n auto in the pytest command

@adi611
Contributor Author

adi611 commented Sep 12, 2023

I ran it separately here. But I should update the pytest command in the current PR to check if it runs fine without the -n option.

@djarecka
Collaborator

djarecka commented Sep 12, 2023

ok, it looks like for GA, the option doesn't really make any difference...

Do you also see the same error when running on your laptop?

@adi611
Contributor Author

adi611 commented Sep 13, 2023

Yes, I tried it on my laptop and I get the same RuntimeError: Could not extract job ID error for the failed tests.

@adi611
Contributor Author

adi611 commented Sep 13, 2023

The issue seems to be fairly random and involves the two stdout reads at lines 292 and 341 of workers.py. Sometimes the first stdout doesn't return anything, in which case the exception is Could not extract job ID; when the second stdout doesn't return anything, the exception is Job information not found; or both may return nothing. When they both return something, the tests that previously failed actually pass.
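For context, a rough sketch of the kind of logic involved (the function name and regex below are illustrative, not copied from pydra's workers.py): the worker submits the batch script with sbatch and then parses the job ID out of stdout, so an empty stdout is exactly what turns into the Could not extract job ID error.

import re
import subprocess

def submit_and_get_job_id(script_path: str) -> str:
    # sbatch normally prints "Submitted batch job <id>" on success
    proc = subprocess.run(["sbatch", script_path], capture_output=True, text=True)
    match = re.search(r"\d+", proc.stdout)
    if match is None:
        # stdout was empty or had no job number, so there is nothing to track
        raise RuntimeError("Could not extract job ID")
    return match.group()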

@adi611
Contributor Author

adi611 commented Sep 13, 2023

was that the stdout for a task that failed? if so, you could perhaps look at where the exception is being raised and check if there is an intervening check like asking sacct or scontrol that the job has been queued.

I am unable to find such a check

@adi611
Contributor Author

adi611 commented Sep 13, 2023

Logs:

=========================== short test summary info ============================
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] - RuntimeError: Could not extract job ID
========================= 1 failed, 1 warning in 9.04s =========================
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8ea0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 314, in _submit_job
  File "/pydra/pydra/engine/workers.py", line 336, in _poll_job
  File "/pydra/pydra/engine/helpers.py", line 331, in read_and_display_async
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/subprocess.py", line 218, in create_subprocess_exec
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 1688, in subprocess_exec
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8bc0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 293, in _submit_job
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f9bdc8958a0>
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 126, in __del__
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 104, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 558, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 582, in _close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
RuntimeError: Event loop is closed

(The line numbers for workers.py may not exactly match the remote due to my local changes)
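For reference, the asyncio pattern the traceback passes through looks roughly like this (a simplified sketch, not pydra's actual read_and_display_async); the coroutine ignored GeneratorExit and Event loop is closed messages show up when a coroutine like this is cancelled or garbage-collected before the subprocess call finishes.

import asyncio

async def read_and_display(*cmd):
    # launch the command and capture stdout/stderr, then wait for it to finish
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    return proc.returncode, stdout.decode(), stderr.decode()

if __name__ == "__main__":
    print(asyncio.run(read_and_display("echo", "hello")))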

@adi611
Contributor Author

adi611 commented Sep 13, 2023

The same test:

============================= test session starts ==============================
platform linux -- Python 3.11.1, pytest-7.4.2, pluggy-1.3.0 -- /root/.pyenv/versions/3.11.1/bin/python3.11
cachedir: .pytest_cache
rootdir: /pydra
plugins: rerunfailures-12.0, cov-4.1.0, forked-1.6.0, timeout-2.1.0, env-1.0.1, xdist-1.34.0
collecting ... collected 1 item

pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] PASSED

=============================== warnings summary ===============================
pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm]
  /pydra/pydra/engine/helpers.py:469: DeprecationWarning: There is no current event loop
    loop = asyncio.get_event_loop()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

---------- coverage: platform linux, python 3.11.1-final-0 -----------
Coverage XML written to file /pydra/cov.xml

======================== 1 passed, 1 warning in 10.45s =========================

@satra
Contributor

satra commented Sep 13, 2023

instead of just running it, you can run the test with the pdb option; you can also enable debug logging around the relevant parts of the code. that may give you more insight into the state of the system. i suspect this is a resource contention issue with the slurm database, and a couple of retries could help, or looking at stderr and stdout.
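A minimal sketch of the retry idea (illustrative only; submit_once is a hypothetical coroutine standing in for the sbatch submission and job-ID extraction, not an existing pydra function):

import asyncio

async def submit_with_retries(submit_once, max_tries=3, delay=1.0):
    # retry the submission a few times, backing off to ride out transient
    # contention on the slurm/slurmdbd side, and re-raise if it keeps failing
    last_exc = None
    for attempt in range(max_tries):
        try:
            return await submit_once()
        except RuntimeError as exc:
            last_exc = exc
            await asyncio.sleep(delay * (attempt + 1))
    raise last_exc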

@djarecka
Collaborator

@adi611 - I've noticed that some tests do not use pytest's tmp_path fixture as cache_dir (see cache_dir=tmp_path in most of the tests); this can sometimes lead to issues. Could you fix the tests and see if that helps?
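For illustration, the pattern being referred to looks like this (a minimal sketch; the task is made up, only the cache_dir=tmp_path usage is the point):

import pydra

@pydra.mark.task
def double(x: int) -> int:
    return 2 * x

def test_double(tmp_path):
    # tmp_path is pytest's built-in per-test temporary directory fixture;
    # using it as cache_dir keeps each test's cache isolated
    task = double(x=3, cache_dir=tmp_path)
    with pydra.Submitter(plugin="cf") as sub:
        sub(task)
    assert task.result().output.out == 6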

@adi611
Contributor Author

adi611 commented Sep 14, 2023

@adi611 - I've noticed that some tests do not use pytest's tmp_path fixture as cache_dir (see cache_dir=tmp_path in most of the tests); this can sometimes lead to issues. Could you fix the tests and see if that helps?

I checked, and many of the failed tests, like test_node_task.py::test_task_state_2 and test_shelltask.py::test_shell_cmd_7, already use cache_dir=tmp_path.

@adi611
Contributor Author

adi611 commented Sep 14, 2023

Currently I am unable to reproduce the issue for single tests

@adi611
Contributor Author

adi611 commented Sep 14, 2023

This seems to be a Python 3.11.1-specific issue; should I try a newer 3.11 release, like 3.11.5, and see if it works? I have seen discussions on CPython about similar issues, and they may have rolled out a fix for it.

@ghisvail
Collaborator

This seems to be a Python 3.11.1-specific issue; should I try a newer 3.11 release, like 3.11.5, and see if it works? I have seen discussions on CPython about similar issues, and they may have rolled out a fix for it.

If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.

Could you reference the exact issues you think may be of interest on cpython?

@adi611
Contributor Author

adi611 commented Sep 14, 2023

This seems to be a Python 3.11.1-specific issue; should I try a newer 3.11 release, like 3.11.5, and see if it works? I have seen discussions on CPython about similar issues, and they may have rolled out a fix for it.

If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.

Could you reference the exact issues you think may be of interest on cpython?

This is one such issue

@djarecka
Collaborator

thanks for tracking this! yes, please check for 3.11.5!

@djarecka
Collaborator

it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it.
great job!

@satra
Contributor

satra commented Sep 15, 2023

is there a way to exclude specific python versions in pydra's python config? we should add that to the PR so we know why we did this.

@adi611
Contributor Author

adi611 commented Sep 15, 2023

it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it. great job!

Thanks!

@djarecka
Collaborator

@adi611 - could you please exclude the python version in pyproject.toml as @satra suggested

@adi611
Contributor Author

adi611 commented Sep 17, 2023

I added !=3.11.1 to requires-python in pyproject.toml, which excludes 3.11.1 from the list of acceptable Python versions.

@adi611
Contributor Author

adi611 commented Sep 17, 2023

The Slurm workflow for 3.11.5 is failing at the Display previous jobs with sacct step:

Run echo "Allowing ports/daemons time to start" && sleep 10
Allowing ports/daemons time to start
 This cluster 'linux' doesn't exist.
        Contact your admin to add it to accounting.
Error: Process completed with exit code 1.

Is it possible this is because the 10-second sleep is not enough?

@djarecka
Collaborator

I've just restarted it and it seems to work; if we have this issue again, we can increase the time

@djarecka djarecka merged commit fa4d4f9 into nipype:master Sep 18, 2023
34 of 35 checks passed
@djarecka
Collaborator

@adi611 - I've realized that your name is not in the zenodo file, please open a PR if you want your name to be included!

@adi611
Contributor Author

adi611 commented Sep 18, 2023

@adi611 - I've realized that your name is not in the zenodo file, please open a PR if you want your name to be included!

Thanks I'll do that, but I need some help since I've never done it before. Is there a preferred order for including my name? Also, what should I specify as my affiliation?

@djarecka
Collaborator

we should think about the order; for now we don't use any rule except that Satra is at the end, so you can put your name before him.
It's up to you what you want to use as your affiliation; you could use your university or you can leave it empty

@satra
Contributor

satra commented Nov 8, 2023

@adi611 - just a quick thing. can you post the slurm dockerfile somewhere?

@adi611
Contributor Author

adi611 commented Nov 14, 2023

@satra - Sorry for the delay. I have added the dockerfile as a public github gist here.

@djarecka
Collaborator

Thank you, perhaps you can add this to the .github directory for reference
