
Possible overwriting scenario with Jinja #8994

Closed
jychoi-hpc opened this issue Feb 29, 2024 · 3 comments · Fixed by #9001

@jychoi-hpc

jychoi-hpc commented Feb 29, 2024

🐛 Describe the bug

I am getting the following error, not always but from time to time:

  File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/conv/cg_conv.py", line 57, in __init__
    super().__init__(aggr=aggr, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/conv/message_passing.py", line 193, in __init__
    self.__class__._jinja_propagate = module.propagate
AttributeError: module 'torch_geometric.nn.conv.cg_conv_CGConv_propagate' has no attribute 'propagate'

I am using PyG in a parallel setting with MPI. I think there is a possibility of overwriting when PyG renders the Jinja template here:

module = module_from_template(

I added the following debug messages around line 186:

print("module:", module)
print("dir(module):", dir(module))

Here is what I got from one process:

module: <module 'torch_geometric.nn.conv.pna_conv_PNAConv_propagate' from '/root/.cache/pyg/message_passing/torch_geometric.nn.conv.pna_conv_PNAConv_propagate.py'>
dir(module): ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__']

And this is the output from the other process:

module: <module 'torch_geometric.nn.conv.pna_conv_PNAConv_propagate' from '/root/.cache/pyg/message_passing/torch_geometric.nn.conv.pna_conv_PNAConv_propagate.py'>
dir(module): ['Adj', 'Any', 'Callable', 'CollectArgs', 'DataLoader', 'DegreeScalerAggregation', 'Dict', 'Linear', 'List', 'MessagePassing', 'ModuleList', 'NamedTuple', 'OptTensor', 'Optional', 'PNAConv', 'Sequential', 'Size', 'SparseTensor', 'Tensor', 'Union', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'activation_resolver', 'collect', 'degree', 'is_compiling', 'is_sparse', 'is_torch_sparse_tensor', 'propagate', 'ptr2index', 'reset', 'torch', 'torch_geometric', 'typing']

It looks to me like this can happen when two processes on the same node generate the same template file: one process reads the Python script while the other process is still overwriting it.

This is just my guess. In any case, I am only seeing this error when running with MPI. Any help would be appreciated.
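The suspected failure mode can be reproduced deterministically by importing a generated module file that a writer has just truncated with mode 'w'. This is a minimal sketch, not PyG code; the file and function names are made up for illustration:

```python
import importlib.util
import os
import tempfile

# Hypothetical generated source; names are illustrative only.
code = "def propagate():\n    return 'ok'\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'generated_propagate.py')

    # Case 1: the file is complete -- importing it yields `propagate`.
    with open(path, 'w') as f:
        f.write(code)
    spec = importlib.util.spec_from_file_location('gen_full', path)
    full = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(full)
    assert hasattr(full, 'propagate')

    # Case 2: a concurrent writer has just reopened the file with mode
    # 'w', truncating it. Importing now "succeeds" but yields an empty
    # module, matching the AttributeError and empty dir() output above.
    open(path, 'w').close()
    spec = importlib.util.spec_from_file_location('gen_torn', path)
    torn = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(torn)
    assert not hasattr(torn, 'propagate')
```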

Versions

PyTorch version: 2.0.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.6.12-linuxkit-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] torch==2.0.1+cpu
[pip3] torch-cluster==1.6.3+pt20cpu
[pip3] torch_geometric==2.5.0
[pip3] torch-scatter==2.1.2+pt20cpu
[pip3] torch-sparse==0.6.18+pt20cpu
[pip3] torch-spline-conv==1.2.2+pt20cpu
[pip3] torchaudio==2.0.2+cpu
[pip3] torchvision==0.15.2+cpu
[conda] Could not collect

@rusty1s
Member

rusty1s commented Mar 1, 2024

Thanks for the issue. I will take a look :)

@rusty1s
Member

rusty1s commented Mar 1, 2024

I have a potential fix in #9001 that makes use of temporary files so that processes don't write to the same file:

with tempfile.NamedTemporaryFile(
    mode='w',
    prefix=f'{module_name}_',
    suffix='.py',
    delete=False,
) as tmp:
    tmp.write(module_repr)

spec = importlib.util.spec_from_file_location(module_name, tmp.name)

Do you mind trying this to see if it resolves your issue?
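For context, here is a complete, runnable sketch of that pattern; the surrounding imports, module-execution steps, and the `module_name`/`module_repr` values are my reconstruction for illustration, not the exact code from #9001:

```python
import importlib.util
import tempfile

# Hypothetical stand-ins for PyG's generated module name and source.
module_name = 'demo_generated_module'
module_repr = "def propagate():\n    return 'generated'\n"

# Each process writes to its own uniquely named file, so no process can
# ever read a file that another process is halfway through writing.
with tempfile.NamedTemporaryFile(
    mode='w',
    prefix=f'{module_name}_',
    suffix='.py',
    delete=False,
) as tmp:
    tmp.write(module_repr)

# Import the generated module from its unique path.
spec = importlib.util.spec_from_file_location(module_name, tmp.name)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
assert module.propagate() == 'generated'
```

The key design choice is the per-process unique filename: the write and the subsequent import always target a path no other process touches.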

@rusty1s
Member

rusty1s commented Mar 1, 2024

If this doesn't help, then the issue is that importlib is not thread-safe, and there's not much we can do about it on our side. You would need to work with locks on your end :(
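As a sketch of that workaround, one could serialize the first model construction per node with an advisory file lock. This uses POSIX `fcntl` from the standard library (Linux/macOS only); the helper name and lock path are illustrative, not part of PyG:

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def file_lock(lock_path):
    # Advisory POSIX lock on a shared lock file; blocks until no other
    # process holds it. Hypothetical helper, not a PyG API.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Only one rank per node runs template generation at a time; once the
# cached module file exists, later ranks reuse it safely.
with file_lock('/tmp/pyg_template.lock'):
    result = 'model constructed inside the critical section'
```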

rusty1s added a commit that referenced this issue Mar 1, 2024