EKS GPUs - Disable GSP from userData script #176

Merged (1 commit) on Jan 16, 2024

Conversation

@chiragjn (Member) commented Jan 3, 2024

This is a hacky but working (so far) solution to NVIDIA/open-gpu-kernel-modules#446 and awslabs/amazon-eks-ami#1523

Comment on lines +230 to +233
rmmod nvidia_drm
rmmod nvidia_modeset
rmmod nvidia_uvm
rmmod nvidia
Member Author

This is best-effort, which is why this section is at the end of the script. rmmod fails for a module if any process is actively using it, or if modules that depend on it have not been unloaded first.
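A minimal sketch of what such a best-effort unload could look like, assuming the same four modules and the same order as in the diff (dependents first, `nvidia` last). `RMMOD_CMD` and `SYS_MODULE_DIR` are hypothetical overrides added only so the function can be dry-run; they are not part of the PR.

```shell
# Best-effort unload of the NVIDIA modules, dependents first.
# RMMOD_CMD and SYS_MODULE_DIR are hypothetical overrides for dry runs.
RMMOD_CMD="${RMMOD_CMD:-rmmod}"
SYS_MODULE_DIR="${SYS_MODULE_DIR:-/sys/module}"

unload_nvidia_modules() {
  local failed=0 mod
  for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
    # A directory under /sys/module means the module is loaded (or built in).
    if [ ! -d "$SYS_MODULE_DIR/$mod" ]; then
      echo "$mod is not loaded, skipping"
      continue
    fi
    # Keep going on failure: this is best-effort, so only record it.
    if ! "$RMMOD_CMD" "$mod"; then
      echo "rmmod $mod failed (in use, or a dependent is still loaded)" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `unload_nvidia_modules || echo "modules still loaded"` at the end of the script keeps a failed unload from aborting the rest of the userData run.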

echo "Writing NVreg_EnableGpuFirmware=0 to /etc/modprobe.d/nvidia-gsp.conf"
echo "options nvidia NVreg_EnableGpuFirmware=0" | tee --append /etc/modprobe.d/nvidia-gsp.conf
echo "Running dracut"
dracut -f
Member Author

@chiragjn chiragjn Jan 3, 2024

If the module was not successfully unloaded before we edit the module options, this has no effect until the next reboot.
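One way to check whether the running driver actually picked up the option is to read the value it reports. The sketch below assumes the loaded driver exposes its parameters at `/proc/driver/nvidia/params` with lines of the form `EnableGpuFirmware: <value>`; `PARAMS_FILE` and `gsp_firmware_setting` are hypothetical names introduced here for illustration.

```shell
# PARAMS_FILE is a hypothetical override; the default is where the loaded
# NVIDIA driver is assumed to list its effective parameters.
PARAMS_FILE="${PARAMS_FILE:-/proc/driver/nvidia/params}"

gsp_firmware_setting() {
  # Prints the EnableGpuFirmware value seen by the loaded driver, or
  # "unknown" if the file is unreadable or the key is missing.
  local value
  if [ ! -r "$PARAMS_FILE" ]; then
    echo "unknown"
    return 0
  fi
  # Match the key exactly so EnableGpuFirmwareLogs is not picked up.
  value="$(awk -F': *' '$1 == "EnableGpuFirmware" { print $2; exit }' "$PARAMS_FILE")"
  echo "${value:-unknown}"
}
```

If this prints anything other than `0` after the script has run, the modules were most likely still loaded when the option was written, and the setting will only apply after a reboot.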

Contributor

Is EKS fine with this?
How much latency are we adding here?

Member Author

@chiragjn chiragjn Jan 5, 2024

> Is EKS fine with this?

It does not cause any problems as such. The risk is that it works now, but it would stop working if some config changes under our feet.

> How much latency are we adding here?

Maybe 1-2s. I can measure it.
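A rough way to measure this would be to wrap each step in a small timer. `time_step` below is a hypothetical helper, not something in the PR, and it assumes GNU `date` (for `%N`).

```shell
# time_step is a hypothetical helper: it runs a command, reports elapsed
# wall-clock milliseconds on stderr, and preserves the command's exit status.
# Relies on GNU date's %N (nanoseconds).
time_step() {
  local start end rc
  start="$(date +%s%N)"
  # Capture the status without tripping `set -e` in the calling script.
  "$@" && rc=0 || rc=$?
  end="$(date +%s%N)"
  echo "[timing] $* took $(( (end - start) / 1000000 )) ms" >&2
  return "$rc"
}
```

For example, `time_step dracut -f` would log how long the initramfs rebuild takes on a given instance type.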

@chiragjn chiragjn merged commit a053e92 into main Jan 16, 2024
1 check passed
@chiragjn chiragjn deleted the cj_disable_gsp branch January 16, 2024 13:59