Legion: collective instance freeze on slingshot-11 #1729
The subranks branch starts to freeze at 8 nodes. The tdb branch starts to freeze at 4 nodes, 4 ranks/node, 2 GPUs/rank. I can try tdb on sapling. It was working on blaze the last time I tried, but blaze is currently down.
Actually, it looks like blaze is back up. Will try it there.
The network drive on blaze is still down, so I couldn't run there, but I built and ran tdb on sapling. Ran on all 4 nodes, 4 ranks/node, 1 GPU/rank. It started up and ran fine. This is probably a slingshot issue.
Sapling has 4 GPUs per node. Could we run 16 ranks, 4 ranks/node, 1 GPU/rank?
Sorry, I meant 4 ranks/node, not 1 rank/node.
I'm going to assume this is a Slingshot issue unless we can reproduce it on an Infiniband machine. |
I believe using collective instances results in a startup freeze on slingshot-11. I have one commit of S3D that uses them (https://gitlab.com/legion_s3d/legion_s3d/-/commit/e797d71367683580933166a0080a3dbf3f98b978) and freezes at startup, and another commit (https://gitlab.com/legion_s3d/legion_s3d/-/commit/5455c8c03e67c32f2fcbee1120d7a40c37486823) where I specifically backed out those changes and it no longer freezes. We will probably need to investigate this with HPE.