
Extend docs with multiple dataloader with common cases #1089

Closed
ylsung opened this issue Mar 8, 2020 · 20 comments
Labels
feature (Is an improvement or enhancement), good first issue (Good for newcomers), question (Further information is requested)
Milestone

Comments

@ylsung
Contributor

ylsung commented Mar 8, 2020

I notice that one can evaluate the model on a list of validation/test data loaders. Is it also possible to extract data from multiple train_data_loader in the training step in the current version? This feature might be useful in tasks like transfer learning or semi-supervised learning, which usually maintain multiple datasets in the training stage (e.g., source and target datasets in transfer learning, labeled and unlabeled datasets in semi-supervised learning).

It would be nice if one could obtain a list of batches, as follows:

def training_step(self, batch_list, batch_nb_list):
    # batch_list = [batch_1, batch_2]
    x_1, y_1 = batch_list[0]
    x_2, y_2 = batch_list[1]
    loss = self.compute_some_loss(x_1, x_2, y_1, y_2)     
    tensorboard_logs = {'train_loss': loss}
    return {'loss': loss, 'log': tensorboard_logs}

def train_dataloader(self):
    return [data_loader_1, data_loader_2]
@ylsung ylsung added the question label Mar 8, 2020
@github-actions
Contributor

github-actions bot commented Mar 8, 2020

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Mar 11, 2020

Good point, having support for multiple training dataloaders would also be great. Mind sending a PR?
Just be aware that there is another open PR on dataloaders... #1104
cc: @PyTorchLightning/core-contributors

@Borda Borda added the feature and good first issue labels Mar 11, 2020
@Dref360
Contributor

Dref360 commented Mar 12, 2020

I'm interested in this task, but I have some questions.

1- Do we assume the data loaders are of the same length? What should we do if one runs out of data?
2- How long would an epoch be? The length of the shortest data loader?

3- Would a more sensible design be:

def training_step(self, batch, batch_idx: int, dataloader_idx: int):
    if dataloader_idx == 0:
        # Supervised loss, for example
        ...
    elif dataloader_idx == 1:
        # Unsupervised loss
        ...

@ylsung
Contributor Author

ylsung commented Mar 13, 2020

Thanks for all the replies.

To @Dref360,

  1. I think it is more flexible if the data loaders can have different lengths, and each data loader can have its own batch size. In my opinion, a loader can simply reload its dataset after running out of data, so it does not depend on the other data loaders (a minimal sketch of this is below).

  2. My previous experience is to use the length of the longest data loader (i.e., the epoch length is determined by the largest dataset). But this needs more discussion.
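
A minimal sketch of the reload idea in point 1, assuming an ordinary map-style DataLoader (the helper name cycle is illustrative, not part of Lightning's API):

from torch.utils.data import DataLoader

def cycle(loader: DataLoader):
    # Restart the loader whenever it is exhausted, so its length never
    # constrains how long the other loaders can keep iterating.
    while True:
        for batch in loader:
            yield batch

An iterator built this way can be advanced with next() alongside a normal loop over another loader, so each loader keeps its own batch size.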

@ylsung
Contributor Author

ylsung commented Mar 15, 2020

I found a related discussion here. The first reply provided a solution for multiple datasets using torch.utils.data.Dataset. However, it assumes that the lengths of the data loaders are the same and that the index relationships between the datasets are fixed.

Therefore, I modified the provided code to be more flexible, as follows:

import random

from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, datasets):
        self.datasets = datasets

        self.map_indexes = [[] for _ in self.datasets]

        self.min_length = min(len(d) for d in self.datasets)
        self.max_length = max(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[m[i]] for d, m in zip(self.datasets, self.map_indexes))

    def construct_map_index(self):
        def update_indices(original_indexes, target_len, max_len):
            # map max_len to target_len (large to small)

            # return: a list that maps range(max_len) to valid indices in the dataset

            original_indexes = original_indexes[max_len:]  # remove used indices
            fill_num = max_len - len(original_indexes)
            batch = fill_num // target_len

            if fill_num % target_len != 0:
                # so that fill_num + len(original_indexes) is at least max_len
                batch += 1

            additional_indexes = list(range(target_len)) * batch
            random.shuffle(additional_indexes)

            original_indexes += additional_indexes

            assert len(original_indexes) >= max_len, "the length of the mapping indexes is too small"

            return original_indexes

        self.map_indexes = [update_indices(m, len(d), self.max_length)
                            for m, d in zip(self.map_indexes, self.datasets)]

    def __len__(self):
        # will be called every epoch
        self.construct_map_index()
        return self.max_length

In this case, the length of CustomDataset is set to the length of the largest dataset, so some indexes might not be valid for the smaller datasets. construct_map_index builds lists that map those excess indexes to valid ones, and it is refreshed whenever self.__len__() is called.

Construct a single train loader using CustomDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset_1 = TensorDataset(torch.arange(2))
dataset_2 = TensorDataset(torch.arange(3, 8))

dataset = CustomDataset([dataset_1, dataset_2])

dataloader = DataLoader(dataset, batch_size=3, shuffle=True)

for epoch in range(3):
    for batch in dataloader:
        print(batch)

Outputs

[[tensor([1, 1, 1])], [tensor([4, 7, 6])]]
[[tensor([0, 0])], [tensor([3, 5])]]

[[tensor([0, 0, 1])], [tensor([7, 3, 4])]]
[[tensor([1, 0])], [tensor([5, 6])]]

[[tensor([0, 0, 1])], [tensor([5, 7, 4])]]
[[tensor([1, 1])], [tensor([6, 3])]]

The primary deficiency of this code is that the batch sizes of the datasets have to be the same, and it might be a little hard for users to read. I hope this is helpful for developing the feature!

@Borda
Member

Borda commented Mar 15, 2020

@williamFalcon @tullie pls ^^

@Borda Borda added this to the 0.7.2 milestone Mar 15, 2020
@williamFalcon
Contributor

williamFalcon commented Mar 15, 2020

  1. in this case a custom dataloader that has two datasets in it is probably the best thing.

  2. if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

  3. in the case of your own dataloader, you can just cycle through the smallest dataset multiple times while cycling the large one (see the sketch below).

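A rough sketch of point 3, assuming two plain DataLoaders where large_loader is the longer one (the helper name combined_loader is illustrative):

def combined_loader(large_loader, small_loader):
    # One pass over the larger loader per epoch, restarting the smaller
    # loader whenever it runs out of batches.
    small_iter = iter(small_loader)
    for large_batch in large_loader:
        try:
            small_batch = next(small_iter)
        except StopIteration:
            small_iter = iter(small_loader)
            small_batch = next(small_iter)
        yield large_batch, small_batch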

@tullie
Contributor

tullie commented Mar 15, 2020

Agreed that in this case the custom dataloader with two datasets seems best. PyTorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

@williamFalcon
Contributor

why don’t we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

@williamFalcon williamFalcon changed the title Multiple train_data_loader Extend docs with multiple dataloader with common cases Mar 15, 2020
@Borda
Member

Borda commented Mar 15, 2020

Agreed that in this case the custom dataloader with two datasets seems best. PyTorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? Just feeling a bit puzzled...

@ylsung
Contributor Author

ylsung commented Mar 16, 2020

why don’t we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

I totally agree with the idea of a new doc page for data loaders.

Agreed that in this case the custom dataloader with two datasets seems best. PyTorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? Just feeling a bit puzzled...

Is it because we would like to extract data from multiple datasets simultaneously in the training phase, while we usually loop over datasets sequentially in the validation/testing phase (as in the evaluation step)?

@williamFalcon
Contributor

exactly. i could be wrong, but in training we usually want to use both batches at once. in val/test we use them sequentially

@soupault

if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

In semi-supervised learning, domain adaptation, consistency training, etc., it is typical to use samples from different loaders in the same training step to compute various cross-losses. Thus, the alternating behaviour of the training step does not bring much usability improvement.
I understand that it is possible to shift the issue one step back and implement a custom Dataset and/or Sampler for such cases, but in my experience having multiple dataloaders is just more explicit and convenient.

@williamFalcon
Contributor

maybe the way to go is to support multiple dataloaders and add a way (maybe an arg) to decide whether it should be sequential or simultaneous. if simultaneous, lightning auto loops or truncates to the shorter length?

@M1F1

M1F1 commented Apr 3, 2020

A quick fix to get different batch sizes for the labeled and unlabeled dataloaders during training might be:

def prepare_data(self):
    ...
    self.train_unlabeled_dataloader = torch.utils.data.DataLoader(train_unlabeled_dataset, ...)
    self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
    ...

def training_step(self, batch, batch_idx):
    inputs_x, targets = batch
    try:
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    except StopIteration:
        # the unlabeled loader ran out: restart it and draw again
        self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    unlabeled_x = unlabeled_x.type_as(inputs_x)
    ...

But as @soupault said, it will be much more convenient to have multiple train dataloaders.

@Dref360
Contributor

Dref360 commented Apr 6, 2020

In our active learning library baal, we are currently trying to come up with a solution to the same problem. In our case, one of the DataLoaders will be massively larger than the other. As a consequence, we added some optional features:

  • We set a probability of selecting data loader A vs B.
  • We set a maximum number of steps; otherwise, we stop when the smallest iterator is completed. This assumes that both loaders use random selection.

Those two features are optional; if they are not provided, we simply alternate between the two loaders (a rough sketch of this sampling follows below).

We provide an implementation in this gist: https://gist.github.com/Dref360/2524e524244569ed47428f19c487f264

I would appreciate your feedback! Thank you!
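
Not the gist itself, but a minimal sketch of the behaviour described above, assuming two plain DataLoaders (the name alternating_loader and its arguments are illustrative):

import random

def alternating_loader(loader_a, loader_b, p_a=0.5, max_steps=None):
    # Draw from loader A with probability p_a, otherwise from loader B.
    # Stop after max_steps if given, or as soon as either loader runs out.
    iter_a, iter_b = iter(loader_a), iter(loader_b)
    step = 0
    while max_steps is None or step < max_steps:
        source = iter_a if random.random() < p_a else iter_b
        try:
            yield next(source)
        except StopIteration:
            break
        step += 1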

@Borda Borda modified the milestones: 0.7.2, 0.7.3 Apr 8, 2020
@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 24, 2020
@Dref360
Contributor

Dref360 commented Apr 25, 2020

I see that #1416 has been merged. Should we close this as well?

If we want to make this a new feature, I think we have three cases to support:

  1. Sequentially
  2. Alternate (same behavior as test_dataloader)
  3. Simultaneous (Draw from all dataloader for each batch)

Could we expose those three cases as iterators and let the user pick one? For example (a rough sketch of the simultaneous case is below):

def train_dataloader(self):
    return SimultaneousIterator([dataloader1, dataloader2])

Or we could add an argument:

trainer = Trainer(train_multiple_dataloader_type='alternate')

I would be happy to work on this as soon as we reach a decision :)
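
For the simultaneous case, a minimal sketch of what such an iterator could look like; SimultaneousIterator is only the name proposed above, not an existing Lightning class, and this version restarts shorter loaders so the epoch follows the longest one:

class SimultaneousIterator:
    # Yields a list with one batch from every loader at each step,
    # restarting shorter loaders until the longest one is exhausted.
    # Assumes map-style datasets so that len(loader) is defined.

    def __init__(self, loaders):
        self.loaders = loaders

    def __iter__(self):
        iterators = [iter(loader) for loader in self.loaders]
        for _ in range(len(self)):
            batch = []
            for i, it in enumerate(iterators):
                try:
                    batch.append(next(it))
                except StopIteration:
                    iterators[i] = iter(self.loaders[i])
                    batch.append(next(iterators[i]))
            yield batch

    def __len__(self):
        return max(len(loader) for loader in self.loaders)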

@Borda Borda modified the milestones: 0.7.6, 0.8.0 May 12, 2020
@Borda Borda added this to the 0.7.7 milestone May 15, 2020
@Borda Borda modified the milestones: 0.7.7, 0.8.0 May 26, 2020
@Borda Borda modified the milestones: 0.8.0, 0.9.0 Jun 9, 2020
@edenlightning
Contributor

This was added here. Closing.

@jlehrer1

jlehrer1 commented Apr 6, 2022

Hi, sorry to re-open this, but I'm facing this precise problem currently. I'd like to sample continuously from multiple dataloaders: not have batches contain {'loader1': batch_1, 'loader2': batch_2}, but rather loader1_batch_1, ..., loader1_batch_n, loader2_batch_1, ..., loader2_batch_m. This doesn't seem to be the default behavior when passing multiple DataLoaders in a LightningDataModule, but is it possible?

@jlehrer1

jlehrer1 commented Apr 6, 2022

In fact, ideally this would work for M DataLoaders, not just two.
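
As a rough workaround sketch for that sequential pattern with M ordinary DataLoaders, outside of any built-in Lightning handling (chained_batches is a hypothetical helper):

from itertools import chain

def chained_batches(loaders):
    # Yield every batch of the first loader, then every batch of the
    # second, and so on, keeping each loader's own batch size and sampler.
    return chain.from_iterable(loaders)

Each loader is exhausted in turn, but the batch index restarts at every boundary, so any per-loader bookkeeping would need to be tracked manually.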
