
Backward problem with using DDP #375

Open
clearlyzero opened this issue Jan 9, 2024 · 1 comment

@clearlyzero

I encountered this warning while using DDP. How can I locate where it comes from?

/home/ps/anaconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [32, 64, 1, 1, 1], strides() = [64, 1, 64, 64, 64]
bucket_view.sizes() = [32, 64, 1, 1, 1], strides() = [64, 1, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

@Siddharth-Latthe-07

Here are some steps to locate and address the source of this warning:

  1. Check Tensor Creation:
    Ensure that the tensors are created and manipulated in a way that respects their memory layout. Avoid operations that may inadvertently change the tensor strides.

  2. Verify DDP Initialization:
    Make sure that the DDP module is initialized after all model parameters have been correctly set up and that no operations change the parameter strides after DDP initialization.

  3. Consistent Tensor Manipulation:
    Ensure that operations on parameters and activations do not change the underlying memory layout. Operations such as transpose(), permute(), or converting to a different memory format (e.g. channels_last) produce tensors whose strides differ from the default contiguous layout.

  4. Use Contiguous Tensors:
    If you suspect the strides have changed, make the gradients contiguous so they match DDP's bucket views: call .contiguous() on each parameter's gradient after backward() and before the optimizer step.

Sample code snippet to make gradients contiguous (run after loss.backward(), before the optimizer step; the None check skips parameters that received no gradient):

for param in model.parameters():
    if param.grad is not None:
        param.grad = param.grad.contiguous()
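To narrow the warning down to specific parameters before applying the fix, a small diagnostic helper can list every parameter whose gradient is non-contiguous after a backward pass (a sketch; `report_noncontiguous_grads` is a hypothetical name, not part of any PyTorch API):

```python
import torch

def report_noncontiguous_grads(model):
    """Return names of parameters whose gradients are not contiguous.

    Run right after loss.backward(): any name reported here is a
    candidate source of the "Grad strides do not match bucket view
    strides" warning.
    """
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not param.grad.is_contiguous():
            bad.append(name)
            print(f"{name}: grad strides {tuple(param.grad.stride())}")
    return bad
```

Call it once per few hundred steps during debugging; it walks the parameter list on the host and is cheap compared to a training step.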
