Error when using Trainer.compile_config={} in DDP mode #3227

Open
Ghelfi opened this issue Apr 30, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Ghelfi
Contributor

Ghelfi commented Apr 30, 2024

Training a toy example in DDP mode with the composer runtime, using both torch.compile (through Trainer.compile_config={}) and the BlurPool algorithm, raises a dynamo error.

To reproduce

From develop, on a 2-GPU environment.

Code:

from composer import Trainer
from composer.algorithms import ChannelsLast, CutMix, LabelSmoothing, BlurPool
from composer.core import DataSpec
from composer.models import ComposerClassifier
from composer.utils import dist
import torch
import torch.nn as nn
import torchvision 
from torchvision import datasets, transforms

# Define Model
num_classes: int = 10
resnet = torchvision.models.resnet18()
resnet.fc = nn.Linear(512, num_classes)
model = ComposerClassifier(module=resnet, num_classes=num_classes)


# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)
batch_size = 1024
cifar10_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean, std)])

# Download Data
data_directory = "./data"
train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)

# Build DataSpec
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=dist.get_sampler(train_dataset, drop_last=True, shuffle=True)
)
spec = DataSpec(train_dataloader, device_transforms=None, get_num_samples_in_batch=lambda batch: len(batch[0]))

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    algorithms=[BlurPool(), LabelSmoothing(smoothing=0.1), CutMix(alpha=1.0), ChannelsLast()],
    compile_config={},
)
trainer.fit()

Steps to reproduce the behavior:

  1. Install from dev
  2. Run composer -n 2 example.py (see code above)

Dynamo Error:

[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
[rank0]: AttributeError: 'Conv2d' object has no attribute 'requires_grad'

[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

It works if I remove either DDP, BlurPool, or torch.compile.
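The hint at the end of the error message can also be applied from inside example.py rather than on the command line. The lines below are a minimal sketch, assuming both variables are read when torch is first imported, so they would need to sit above every torch and composer import in the script. The variable names and values are the ones printed in the error; nothing else is taken from PyTorch.

```python
import os

# Names and values are copied from the hint in the Dynamo error message.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("TORCH_LOGS", "+dynamo")
os.environ.setdefault("TORCHDYNAMO_VERBOSE", "1")

# Assumption: torch picks these up at import time, so "import torch" and
# "from composer import Trainer" should only appear below this point.
```

Because composer -n 2 runs the script once per rank, each rank would set the variables for its own process.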

Ghelfi added the bug (Something isn't working) label Apr 30, 2024
@mvpatel2000
Contributor

@Skylion007 do you think this is a torch error or something we can do differently?

@mvpatel2000
Contributor

@Ghelfi do you know if this works for you elsewhere, e.g. if you compile outside Composer? That will help us narrow down whether it's a Composer issue or a PyTorch issue, as the trace looks more like a PyTorch issue to me.

@Ghelfi
Contributor Author

Ghelfi commented May 3, 2024

This is not clear to me. The provided example above works if you remove the BlurPool algorithm, which is only on the composer side.

I'll try to redefine some model layers before feeding the model to the trainer, to mimic the behaviour outside of any Composer scope.

@Ghelfi
Contributor Author

Ghelfi commented May 14, 2024

On torch 2.3, adding torch._dynamo.config.optimize_ddp = False at the start of the file seems to fix it.

I am also having issues with DDP and torch.compile on other leads. I'll keep investigating.
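For anyone who wants to try the workaround above, here is a minimal sketch of how it might be placed near the top of example.py. Only torch._dynamo.config.optimize_ddp comes from this thread; the helper name disable_ddp_optimizer is hypothetical, and the guards are an assumption added so the script still runs on a PyTorch build that has no torch._dynamo module or no such flag.

```python
def disable_ddp_optimizer() -> bool:
    """Set torch._dynamo.config.optimize_ddp to False if the flag is available.

    Hypothetical helper: returns True when the flag was set, False otherwise.
    """
    try:
        import torch._dynamo as dynamo
    except ImportError:
        # torch is missing or too old to ship torch._dynamo.
        return False
    if not hasattr(dynamo.config, "optimize_ddp"):
        # This build does not expose the flag; leave everything untouched.
        return False
    dynamo.config.optimize_ddp = False
    return True


# Call this before the Trainer is constructed, so the setting is in place
# before any compilation is triggered.
applied = disable_ddp_optimizer()
print(f"optimize_ddp disabled: {applied}")
```

This only switches off Dynamo's DDP-specific graph handling. It was reported to help on torch 2.3 in this thread and is not a confirmed fix for the underlying error.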
