
Possible overwriting scenario with Jinja #8994

Closed
jychoi-hpc opened this issue Feb 29, 2024 · 3 comments · Fixed by #9001

@jychoi-hpc

jychoi-hpc commented Feb 29, 2024

🐛 Describe the bug

I am getting the following error, not always but from time to time:

  File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/conv/cg_conv.py", line 57, in __init__
    super().__init__(aggr=aggr, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/conv/message_passing.py", line 193, in __init__
    self.__class__._jinja_propagate = module.propagate
AttributeError: module 'torch_geometric.nn.conv.cg_conv_CGConv_propagate' has no attribute 'propagate'

I am using PyG in a parallel setting with MPI. I think there is a possibility of overwriting when PyG renders the Jinja template here:

module = module_from_template(

I added the following debug messages around line 186:

print("module:", module)
print("dir(module):", dir(module))

Here is what I got from one process:

module: <module 'torch_geometric.nn.conv.pna_conv_PNAConv_propagate' from '/root/.cache/pyg/message_passing/torch_geometric.nn.conv.pna_conv_PNAConv_propagate.py'>
dir(module): ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__']

And this is the output from the other process:

module: <module 'torch_geometric.nn.conv.pna_conv_PNAConv_propagate' from '/root/.cache/pyg/message_passing/torch_geometric.nn.conv.pna_conv_PNAConv_propagate.py'>
dir(module): ['Adj', 'Any', 'Callable', 'CollectArgs', 'DataLoader', 'DegreeScalerAggregation', 'Dict', 'Linear', 'List', 'MessagePassing', 'ModuleList', 'NamedTuple', 'OptTensor', 'Optional', 'PNAConv', 'Sequential', 'Size', 'SparseTensor', 'Tensor', 'Union', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'activation_resolver', 'collect', 'degree', 'is_compiling', 'is_sparse', 'is_torch_sparse_tensor', 'propagate', 'ptr2index', 'reset', 'torch', 'torch_geometric', 'typing']

It looks to me like this can happen when two processes on the same node generate the same template file: one process reads the Python script while the other process is still overwriting it.

This is just my guess. In any case, I am only seeing this error when running with MPI. Any help would be appreciated.
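The suspected failure mode can be reproduced deterministically by importing a generated module file that a writer has just truncated with mode 'w'. This is a minimal sketch, not PyG code; the file and function names are made up for illustration:

```python
import importlib.util
import os
import tempfile

# Hypothetical generated source; names are illustrative only.
code = "def propagate():\n    return 'ok'\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'generated_propagate.py')

    # Case 1: the file is complete -- importing it yields `propagate`.
    with open(path, 'w') as f:
        f.write(code)
    spec = importlib.util.spec_from_file_location('gen_full', path)
    full = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(full)
    assert hasattr(full, 'propagate')

    # Case 2: a concurrent writer has just reopened the file with mode
    # 'w', truncating it. Importing now "succeeds" but yields an empty
    # module, matching the AttributeError and empty dir() output above.
    open(path, 'w').close()
    spec = importlib.util.spec_from_file_location('gen_torn', path)
    torn = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(torn)
    assert not hasattr(torn, 'propagate')
```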

Versions

PyTorch version: 2.0.1+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.6.12-linuxkit-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] torch==2.0.1+cpu
[pip3] torch-cluster==1.6.3+pt20cpu
[pip3] torch_geometric==2.5.0
[pip3] torch-scatter==2.1.2+pt20cpu
[pip3] torch-sparse==0.6.18+pt20cpu
[pip3] torch-spline-conv==1.2.2+pt20cpu
[pip3] torchaudio==2.0.2+cpu
[pip3] torchvision==0.15.2+cpu
[conda] Could not collect

@rusty1s
Member

rusty1s commented Mar 1, 2024

Thanks for the issue. I will take a look :)

@rusty1s
Member

rusty1s commented Mar 1, 2024

I have a potential fix in #9001 that makes use of temporary files so that processes don't write to the same file:

with tempfile.NamedTemporaryFile(
    mode='w',
    prefix=f'{module_name}_',
    suffix='.py',
    delete=False,
) as tmp:
    tmp.write(module_repr)

spec = importlib.util.spec_from_file_location(module_name, tmp.name)

Do you mind trying this to see if it resolves your issue?
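For context, here is a complete, runnable sketch of that pattern; the surrounding imports, module-execution steps, and the `module_name`/`module_repr` values are my reconstruction for illustration, not the exact code from #9001:

```python
import importlib.util
import tempfile

# Hypothetical stand-ins for PyG's generated module name and source.
module_name = 'demo_generated_module'
module_repr = "def propagate():\n    return 'generated'\n"

# Each process writes to its own uniquely named file, so no process can
# ever read a file that another process is halfway through writing.
with tempfile.NamedTemporaryFile(
    mode='w',
    prefix=f'{module_name}_',
    suffix='.py',
    delete=False,
) as tmp:
    tmp.write(module_repr)

# Import the generated module from its unique path.
spec = importlib.util.spec_from_file_location(module_name, tmp.name)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
assert module.propagate() == 'generated'
```

The key design choice is the per-process unique filename: the write and the subsequent import always target a path no other process touches.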

@rusty1s
Member

rusty1s commented Mar 1, 2024

If this doesn't help, then the issue is that importlib is not thread-safe, and there's not much we can do about it on our side. You would need to work with locks on your end :(
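As a sketch of that workaround, one could serialize the first model construction per node with an advisory file lock. This uses POSIX `fcntl` from the standard library (Linux/macOS only); the helper name and lock path are illustrative, not part of PyG:

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def file_lock(lock_path):
    # Advisory POSIX lock on a shared lock file; blocks until no other
    # process holds it. Hypothetical helper, not a PyG API.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Only one rank per node runs template generation at a time; once the
# cached module file exists, later ranks reuse it safely.
with file_lock('/tmp/pyg_template.lock'):
    result = 'model constructed inside the critical section'
```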

rusty1s added a commit that referenced this issue Mar 1, 2024