Error when using `Trainer.compile_config={}` in DDP mode #3227

Ghelfi · 2024-04-30T09:37:39Z

Training a toy example in DDP mode with the composer runtime, using both torch.compile (through Trainer.compile_config={}) and the BlurPool algorithm, raises a dynamo error.

**To reproduce**
From develop on a 2 GPU environment.

Code:

from composer import Trainer
from composer.algorithms import ChannelsLast, CutMix, LabelSmoothing, BlurPool
from composer.core import DataSpec
from composer.models import ComposerClassifier
from composer.utils import dist
import torch
import torch.nn as nn
import torchvision 
from torchvision import datasets, transforms

# Define Model
num_classes: int = 10
resnet = torchvision.models.resnet18()
resnet.fc = nn.Linear(512, num_classes)
model = ComposerClassifier(module=resnet, num_classes=num_classes)


# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)
batch_size = 1024
cifar10_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean, std)])

# Download Data
data_directory = "./data"
train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)

# Build DataSpec
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=dist.get_sampler(train_dataset, drop_last=True, shuffle=True)
)
spec = DataSpec(train_dataloader, device_transforms=None, get_num_samples_in_batch=lambda batch: len(batch[0]))

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    algorithms=[BlurPool(), LabelSmoothing(smoothing=0.1), CutMix(alpha=1.0), ChannelsLast()],
    compile_config={},
)
trainer.fit()

Steps to reproduce the behavior:

1. Install from dev
2. Run `composer -n 2 example.py` (see code above)

Dynamo Error:

[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
[rank0]: AttributeError: 'Conv2d' object has no attribute 'requires_grad'

[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
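The hint in the last log line can be followed from inside the script itself. This is a minimal sketch, assuming both variables are read when torch is first imported, so the lines would need to sit at the very top of example.py, above every torch or composer import. Exporting the same two variables in the shell before running the launcher should be equivalent.

```python
import os

# Assumption: torch reads these when it is first imported, so this must run
# before `import torch` or `import composer` in each rank's process.
os.environ["TORCH_LOGS"] = "+dynamo"       # verbose dynamo logging
os.environ["TORCHDYNAMO_VERBOSE"] = "1"    # fuller dynamo error traces

# ... the imports and the rest of example.py would follow here ...
```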

It works if I remove either DDP, BlurPool, or torch.compile.


mvpatel2000 · 2024-04-30T16:55:37Z

@Skylion007 do you think this is a torch error or something we can do differently?

mvpatel2000 · 2024-05-01T19:29:34Z

@Ghelfi do you know if this works for you elsewhere, e.g. if you compile outside Composer? That will help us narrow down whether it's a Composer issue or a PyTorch issue, as the trace looks more like a PyTorch issue to me.

Ghelfi · 2024-05-03T11:39:26Z

This is not clear to me. The provided example above works if you remove the BlurPool algorithm, which is only on the composer side.

I'll try to redefine some model layer before feeding it to the trainer to mimic the behaviour outside of any composer scope.

Ghelfi · 2024-05-14T06:10:23Z

On torch 2.3, adding torch._dynamo.config.optimize_ddp = False at the start of the file seems to fix it.
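A minimal sketch of how that workaround could be placed at the start of example.py. The helper name `disable_ddp_optimizer` is hypothetical and exists only so the snippet degrades gracefully when torch is not installed; the only real setting involved is `torch._dynamo.config.optimize_ddp`, taken from the comment above. As far as I understand, this flag controls the dynamo pass that splits compiled graphs along DDP bucket boundaries, so turning it off may cost some communication/compute overlap; treat that as an assumption to verify on your own setup.

```python
def disable_ddp_optimizer() -> bool:
    """Set torch._dynamo.config.optimize_ddp to False.

    Hypothetical helper: returns True if the flag was set, False if torch
    (or torch._dynamo) could not be imported.
    """
    try:
        import torch._dynamo
    except ImportError:
        return False
    # Workaround reported on torch 2.3; must run before the Trainer compiles the model.
    torch._dynamo.config.optimize_ddp = False
    return True


applied = disable_ddp_optimizer()
```

In the reproduction script this call would go before `Trainer(...)` is constructed, since that is where `compile_config={}` triggers compilation.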

I am also having issues with DDP and torch.compile elsewhere. I'll keep investigating.

Ghelfi added the bug (Something isn't working) label on Apr 30, 2024
