get_config_directory returns read-only paths when using NNI from Singularity container #3924
Comments
Overriding the config path with an environment variable is a nice suggestion. We will implement it sooner or later.
There's no guarantee from conda that the directory is writable! There are many scenarios in which it isn't (this is just one of them). For example, if you install conda as root, install packages system-wide, and then drop to an unprivileged user to run them, you get this behavior. This happens a lot in production environments, where packages should be read-only; there, conda is used not to provide virtual environments but as an alternative to pip as a package manager. What is the reason for detecting a conda environment in the first place? Conda does not provide an API for that, since package behavior should not change depending on it. I'll send a pull request with the desired behavior.
Because there might be multiple NNI instances in different environments, and they must not share a config file. The config file is about installed packages, and each environment must install its own packages separately.
Describe the issue:
When running NNI inside a container built with Singularity, NNI tries to write config files in /usr/local. Unlike Docker, processes inside these containers do not run as root, and the root filesystem is read-only, so /usr/local is not writable. If conda was used to install packages into the container when it was built, /usr/local/conda-meta exists and is not writable, regardless of how NNI itself is installed. Note that Singularity is a popular container tool in academic/HPC environments.
The error occurs here:
`nni/runtime/config.py`, line 16 (at commit 3943239)
In my containers, `sys.prefix == sys.base_prefix == /usr/local`, and `/usr/local/conda-meta` exists.
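To make the situation concrete, here is a small, hypothetical reconstruction of the kind of selection logic described above. It is not NNI's actual code from `config.py`: the function name `guess_config_directory` and the per-user fallback path are assumptions. It only illustrates why, with `sys.prefix == sys.base_prefix == /usr/local` and `/usr/local/conda-meta` present, the chosen directory becomes `/usr/local/nni`, which matches the path in the traceback.

```python
import sys
from pathlib import Path


def guess_config_directory(prefix: str = sys.prefix,
                           base_prefix: str = sys.base_prefix) -> Path:
    """Hypothetical approximation of the selection logic (not NNI's code).

    Assumption: the environment is treated as "isolated" if it is a venv
    (prefix differs from base_prefix) or contains a conda-meta directory.
    """
    in_venv = prefix != base_prefix
    in_conda = (Path(prefix) / "conda-meta").is_dir()
    if in_venv or in_conda:
        # Per-environment config dir, e.g. /usr/local/nni in the container.
        # Nothing here checks whether this location is writable.
        return Path(prefix) / "nni"
    # Assumed fallback for a plain system Python: a per-user directory.
    return Path.home() / ".config" / "nni"
```

The key point is that the mere existence of `conda-meta` decides the location; writability never enters the decision.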
`nni/runtime/config.py` looks for a directory to write configuration files to, but does not check whether it can actually write to that directory. From a user's perspective, it is also impossible to override this behavior without editing NNI's source code.
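A writability check could look roughly like the following sketch. The helper name `is_writable_dir` is hypothetical. It relies on `os.access`, which per POSIX reports write access as denied on a read-only mount (`EROFS`) even for root. Because the config directory may not exist yet, it checks the nearest existing ancestor, which is where the directory would have to be created.

```python
import os
from pathlib import Path


def is_writable_dir(path: Path) -> bool:
    """Return True if `path` exists as a writable directory or could be created.

    If `path` does not exist yet, walk up to the nearest existing ancestor,
    since that is the directory in which it would have to be created.
    """
    path = Path(path).absolute()
    while not path.exists():
        parent = path.parent
        if parent == path:  # reached the filesystem root without a match
            return False
        path = parent
    # W_OK is needed to create entries, X_OK to traverse into the directory.
    # Note: os.access() uses the real (not effective) uid/gid.
    return path.is_dir() and os.access(path, os.W_OK | os.X_OK)
```

Such a check is only advisory (another process could change permissions in between), so code that actually writes should still handle `OSError`.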
A solution would be to check whether the directory is actually writable, or to let the user set an environment variable such as NNI_CONFIG_DIRECTORY=/some/writable/directory to override this behavior.
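Both proposed fixes can be combined in one resolution order: explicit override first, then the default directory if it is writable, then a fallback. This is a minimal sketch under stated assumptions: `NNI_CONFIG_DIRECTORY` is the name proposed in this issue, not an existing option in NNI 2.3, and `resolve_config_directory` plus its `fallback` parameter are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Name proposed in this issue; NOT an existing option in NNI 2.3.
ENV_VAR = "NNI_CONFIG_DIRECTORY"


def _writable(path: Path) -> bool:
    # Check the directory itself, or its parent if it does not exist yet.
    target = path if path.exists() else path.parent
    return target.is_dir() and os.access(target, os.W_OK | os.X_OK)


def resolve_config_directory(default: Path,
                             fallback: Path,
                             environ: Optional[Mapping[str, str]] = None) -> Path:
    """Pick a config directory: env override > writable default > fallback."""
    env = os.environ if environ is None else environ
    override = env.get(ENV_VAR)
    if override:
        # An explicit user choice wins, without any further guessing.
        return Path(override)
    if _writable(default):
        return default
    # e.g. a per-user directory when the environment prefix is read-only
    return fallback
```

Keeping the per-environment default when it is writable preserves the maintainers' goal that separate environments do not share a config file, while the override covers read-only installs such as Singularity images.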
Environment:
- NNI version: 2.3
- Training service (local|remote|pai|aml|etc): local
- Client OS: Ubuntu 20.04
- Server OS (for remote mode only):
- Python version: 3.8
- PyTorch/TensorFlow version: N/A
- Is conda/virtualenv/venv used?: conda
- Is running in Docker?: Singularity is used.
nnictl stdout and stderr:

    Singularity> nnictl create --port 8081 --config config.yml
    INFO: expand codeDir: . to [privatedir]/.
    Traceback (most recent call last):
      File "/usr/local/bin/nnictl", line 8, in <module>
        sys.exit(parse_args())
      File "/usr/local/lib/python3.8/site-packages/nni/tools/nnictl/nnictl.py", line 278, in parse_args
        args.func(args)
      File "/usr/local/lib/python3.8/site-packages/nni/tools/nnictl/launcher.py", line 515, in create_experiment
        config_v2 = convert.to_v2(config_yml).json()
      File "/usr/local/lib/python3.8/site-packages/nni/experiment/config/convert.py", line 20, in to_v2
        v2 = ExperimentConfig(platform)
      File "/usr/local/lib/python3.8/site-packages/nni/experiment/config/common.py", line 85, in __init__
        kwargs['trainingservice'] = util.training_service_config_factory(
      File "/usr/local/lib/python3.8/site-packages/nni/experiment/config/util.py", line 43, in training_service_config_factory
        custom_ts_config_path = nni.runtime.config.get_config_file('training_services.json')
      File "/usr/local/lib/python3.8/site-packages/nni/runtime/config.py", line 33, in get_config_file
        shutil.copyfile(default, config_file)
      File "/usr/local/lib/python3.8/shutil.py", line 261, in copyfile
        with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
    OSError: [Errno 30] Read-only file system: '/usr/local/nni/training_services.json'
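On Linux, `Errno 30` is `EROFS` ("Read-only file system"). As a minimal sketch, assuming a helper layout that is not NNI's actual code, the copy step in `get_config_file` could separate "this location cannot be written" from other I/O failures and report it clearly. The names `copy_default_config`, `is_unwritable_error`, and `UNWRITABLE_ERRNOS` are hypothetical.

```python
import errno
import shutil
from pathlib import Path

# errno values meaning "this location cannot be written to".
# EROFS is the "[Errno 30] Read-only file system" from the traceback above.
UNWRITABLE_ERRNOS = frozenset({errno.EROFS, errno.EACCES, errno.EPERM})


def is_unwritable_error(exc: OSError) -> bool:
    """True if the OSError means the destination is read-only or not permitted."""
    return exc.errno in UNWRITABLE_ERRNOS


def copy_default_config(default: Path, config_file: Path) -> Path:
    """Copy `default` to `config_file`, with a clearer error for unwritable targets."""
    try:
        config_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(default, config_file)
    except OSError as exc:
        if is_unwritable_error(exc):
            raise RuntimeError(
                f"cannot write config file {config_file}: {exc.strerror}; "
                "the config directory must be writable by the current user"
            ) from exc
        raise  # unrelated I/O errors propagate unchanged
    return config_file
```

Catching the error at write time complements an up-front writability check, since it also covers cases the check cannot foresee.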
How to reproduce it?:
See above. Run NNI as an unprivileged user, with NNI installed in a conda installation that was installed as root on a read-only filesystem.