
Extend docs with multiple dataloader with common cases #1089

Closed
ylsung opened this issue Mar 8, 2020 · 20 comments
Labels
feature (Is an improvement or enhancement), good first issue (Good for newcomers), question (Further information is requested)
Milestone

Comments

@ylsung
Contributor

ylsung commented Mar 8, 2020

I notice that one can evaluate the model on a list of validation/test data loaders. Is it also possible to extract data from multiple train_data_loader in the training step in the current version? This feature might be useful in tasks like transfer learning or semi-supervised learning, which usually maintain multiple datasets in the training stage (e.g., source and target datasets in transfer learning, labeled and unlabeled datasets in semi-supervised learning).

It would be nice if one could obtain a list of batches, as follows:

def training_step(self, batch_list, batch_nb_list):
    # batch_list = [batch_1, batch_2]
    x_1, y_1 = batch_list[0]
    x_2, y_2 = batch_list[1]
    loss = self.compute_some_loss(x_1, x_2, y_1, y_2)     
    tensorboard_logs = {'train_loss': loss}
    return {'loss': loss, 'log': tensorboard_logs}

def train_dataloader(self):
    return [data_loader_1, data_loader_2]
@ylsung ylsung added the question label Mar 8, 2020
@github-actions
Contributor

github-actions bot commented Mar 8, 2020

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Mar 11, 2020

Good point, having support for multiple training dataloaders would also be great. Mind sending a PR?
Just be aware that there is another open PR on dataloaders... #1104
cc: @PyTorchLightning/core-contributors

@Borda Borda added the feature and good first issue labels Mar 11, 2020
@Dref360
Contributor

Dref360 commented Mar 12, 2020

I'm interested in this task, but I have some questions.

1- Do we assume the data loaders are of the same length? What should we do if one runs out of data?
2- How long would an epoch be? The length of the shortest data loader?

3- Would a more sensible design be:

def training_step(self, batch, batch_idx: int, dataloader_idx: int):
    if dataloader_idx == 0:
        # Supervised loss, for example
        ...
    elif dataloader_idx == 1:
        # Unsupervised loss
        ...

@ylsung
Contributor Author

ylsung commented Mar 13, 2020

Thanks for all the replies.

To @Dref360,

  1. I think it is more flexible if the data loaders can have different lengths, and each data loader can have its own batch size. In my opinion, a loader can simply reload its dataset after running out of data, so it does not depend on the other data loaders (a minimal sketch of this is below).

  2. My previous experience is to use the length of the longest data loader (i.e., the epoch length is determined by the largest dataset). But this needs more discussion.
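
A minimal sketch of the reload idea in point 1, assuming an ordinary map-style DataLoader (the helper name cycle is illustrative, not part of Lightning's API):

from torch.utils.data import DataLoader

def cycle(loader: DataLoader):
    # Restart the loader whenever it is exhausted, so its length never
    # constrains how long the other loaders can keep iterating.
    while True:
        for batch in loader:
            yield batch

An iterator built this way can be advanced with next() alongside a normal loop over another loader, so each loader keeps its own batch size.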

@ylsung
Contributor Author

ylsung commented Mar 15, 2020

I found a related discussion here. The first reply provided a solution for multiple datasets using torch.utils.data.Dataset. However, it assumes that the lengths of the data loaders are the same and that the index relationships between the datasets are fixed.

Therefore, I modified the provided code to be more flexible, as follows:

import random

from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, datasets):
        self.datasets = datasets

        self.map_indexes = [[] for _ in self.datasets]

        self.min_length = min(len(d) for d in self.datasets)
        self.max_length = max(len(d) for d in self.datasets)

    def __getitem__(self, i):
        return tuple(d[m[i]] for d, m in zip(self.datasets, self.map_indexes))

    def construct_map_index(self):
        def update_indices(original_indexes, target_len, max_len):
            # map max_len to target_len (large to small)

            # return: a list that maps range(max_len) to valid indices in the dataset

            original_indexes = original_indexes[max_len:]  # remove used indices
            fill_num = max_len - len(original_indexes)
            batch = fill_num // target_len

            if fill_num % target_len != 0:
                # so that fill_num + len(original_indexes) is at least max_len
                batch += 1

            additional_indexes = list(range(target_len)) * batch
            random.shuffle(additional_indexes)

            original_indexes += additional_indexes

            assert len(original_indexes) >= max_len, "the length of the mapping indexes is too small"

            return original_indexes

        self.map_indexes = [update_indices(m, len(d), self.max_length)
                            for m, d in zip(self.map_indexes, self.datasets)]

    def __len__(self):
        # will be called every epoch
        self.construct_map_index()
        return self.max_length

In this case, the length of CustomDataset is set to the length of the largest dataset, so some indexes might not be valid for the smaller datasets. construct_map_index builds lists that map those excess indexes to valid ones, and it is refreshed whenever self.__len__() is called.

Construct a single train loader using CustomDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset_1 = TensorDataset(torch.arange(2))
dataset_2 = TensorDataset(torch.arange(3, 8))

dataset = CustomDataset([dataset_1, dataset_2])

dataloader = DataLoader(dataset, batch_size=3, shuffle=True)

for epoch in range(3):
    for batch in dataloader:
        print(batch)

Outputs

[[tensor([1, 1, 1])], [tensor([4, 7, 6])]]
[[tensor([0, 0])], [tensor([3, 5])]]

[[tensor([0, 0, 1])], [tensor([7, 3, 4])]]
[[tensor([1, 0])], [tensor([5, 6])]]

[[tensor([0, 0, 1])], [tensor([5, 7, 4])]]
[[tensor([1, 1])], [tensor([6, 3])]]

The primary deficiency of this code is that the batch sizes of the datasets have to be the same, and it might be a little hard for users to read. I hope this is helpful for developing the feature!

@Borda
Member

Borda commented Mar 15, 2020

@williamFalcon @tullie pls ^^

@Borda Borda added this to the 0.7.2 milestone Mar 15, 2020
@williamFalcon
Contributor

williamFalcon commented Mar 15, 2020

  1. in this case a custom dataloader that has two datasets in it is probably the best thing.

  2. if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

  3. in the case of your own dataloader, you can just cycle through the smallest dataset multiple times while cycling the large one (see the sketch below).

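A rough sketch of point 3, assuming two plain DataLoaders where large_loader is the longer one (the helper name combined_loader is illustrative):

def combined_loader(large_loader, small_loader):
    # One pass over the larger loader per epoch, restarting the smaller
    # loader whenever it runs out of batches.
    small_iter = iter(small_loader)
    for large_batch in large_loader:
        try:
            small_batch = next(small_iter)
        except StopIteration:
            small_iter = iter(small_loader)
            small_batch = next(small_iter)
        yield large_batch, small_batch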

@tullie
Contributor

tullie commented Mar 15, 2020

Agreed that in this case the custom dataloader with two datasets seems best. PyTorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

@williamFalcon
Contributor

why don’t we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

@williamFalcon williamFalcon changed the title Multiple train_data_loader Extend docs with multiple dataloader with common cases Mar 15, 2020
@Borda
Member

Borda commented Mar 15, 2020

Agreed that in this case the custom dataloader with two datasets seems best. PyTorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? Just feeling a bit puzzled...

@ylsung
Contributor Author

ylsung commented Mar 16, 2020

why don’t we make the output of this a common use case page?

Add a new page for multiple dataloaders

  • training (show the example on building two)
  • val, test: describe how it happens in lightning today and add examples with validation_step, test_step

I totally agree with the idea of a new doc page for data loaders.

Agreed that in this case the custom dataloader with two datasets seems best. PyTorch's dataloader/dataset classes are flexible enough that the user can control exactly what is coming out of them at each epoch (including which node they go to) and batch size.

Then why do we have multiple dataloaders for test and valid? Just feeling a bit puzzled...

Is it because we would like to extract data from multiple datasets simultaneously in the training phase, while we usually loop over datasets sequentially in the validation/testing phase (as in the evaluation step)?

@williamFalcon
Contributor

exactly. i could be wrong, but in training we usually want to use both batches at once. in val/test we use them sequentially

@soupault

if we do support multiple dataloaders, the way to keep it consistent with val and test (which already support that), is to call training_step with alternating batches.

In semi-supervised learning, domain adaptation, consistency training, etc., it is typical to use samples from different loaders in the same training step to compute various cross-losses. Thus, the alternating behaviour of the training step does not bring much usability improvement.
I understand that it is possible to shift the issue one step back and implement a custom Dataset and/or Sampler for such cases, but in my experience having multiple dataloaders is just more explicit and convenient.

@williamFalcon
Contributor

maybe the way to go is to support multiple dataloaders and add a way (maybe an arg) to decide whether it should be sequential or simultaneous. if simultaneous, lightning auto loops or truncates to the shorter length?

@M1F1

M1F1 commented Apr 3, 2020

A quick fix to get different batch sizes for the labeled and unlabeled dataloaders during training might be:

def prepare_data(self):
    ...
    self.train_unlabeled_dataloader = torch.utils.data.DataLoader(train_unlabeled_dataset, ...)
    self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
    ...

def training_step(self, batch, batch_idx):
    inputs_x, targets = batch
    try:
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    except StopIteration:
        # the unlabeled loader ran out: restart it and draw again
        self.train_unlabeled_dataloader_iterator = iter(self.train_unlabeled_dataloader)
        unlabeled_x, _ = next(self.train_unlabeled_dataloader_iterator)
    unlabeled_x = unlabeled_x.type_as(inputs_x)
    ...

But as @soupault said, it will be much more convenient to have multiple train dataloaders.

@Dref360
Contributor

Dref360 commented Apr 6, 2020

In our active learning library baal, we are currently trying to come up with a solution to the same problem. In our case, one of the DataLoaders will be massively larger than the other. As a consequence, we added some optional features:

  • We set a probability of selecting data loader A vs B.
  • We set a maximum number of steps; otherwise, we stop when the smallest iterator is completed. This assumes that both loaders use random selection.

Those two features are optional; if they are not provided, we simply alternate between the two loaders (a rough sketch of this sampling follows below).

We provide an implementation in this gist: https://gist.github.com/Dref360/2524e524244569ed47428f19c487f264

I would appreciate your feedback! Thank you!
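
Not the gist itself, but a minimal sketch of the behaviour described above, assuming two plain DataLoaders (the name alternating_loader and its arguments are illustrative):

import random

def alternating_loader(loader_a, loader_b, p_a=0.5, max_steps=None):
    # Draw from loader A with probability p_a, otherwise from loader B.
    # Stop after max_steps if given, or as soon as either loader runs out.
    iter_a, iter_b = iter(loader_a), iter(loader_b)
    step = 0
    while max_steps is None or step < max_steps:
        source = iter_a if random.random() < p_a else iter_b
        try:
            yield next(source)
        except StopIteration:
            break
        step += 1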

@Borda Borda modified the milestones: 0.7.2, 0.7.3 Apr 8, 2020
@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 24, 2020
@Dref360
Contributor

Dref360 commented Apr 25, 2020

I see that #1416 has been merged. Should we close this as well?

If we want to make this a new feature, I think we have three cases to support:

  1. Sequentially
  2. Alternate (same behavior as test_dataloader)
  3. Simultaneous (Draw from all dataloader for each batch)

Could we expose those three cases as iterators and let the user pick one? For example (a rough sketch of the simultaneous case is below):

def train_dataloader(self):
    return SimultaneousIterator([dataloader1, dataloader2])

Or we could add an argument:

trainer = Trainer(train_multiple_dataloader_type='alternate')

I would be happy to work on this as soon as we reach a decision :)
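
For the simultaneous case, a minimal sketch of what such an iterator could look like; SimultaneousIterator is only the name proposed above, not an existing Lightning class, and this version restarts shorter loaders so the epoch follows the longest one:

class SimultaneousIterator:
    # Yields a list with one batch from every loader at each step,
    # restarting shorter loaders until the longest one is exhausted.
    # Assumes map-style datasets so that len(loader) is defined.

    def __init__(self, loaders):
        self.loaders = loaders

    def __iter__(self):
        iterators = [iter(loader) for loader in self.loaders]
        for _ in range(len(self)):
            batch = []
            for i, it in enumerate(iterators):
                try:
                    batch.append(next(it))
                except StopIteration:
                    iterators[i] = iter(self.loaders[i])
                    batch.append(next(iterators[i]))
            yield batch

    def __len__(self):
        return max(len(loader) for loader in self.loaders)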

@Borda Borda modified the milestones: 0.7.6, 0.8.0 May 12, 2020
@Borda Borda added this to the 0.7.7 milestone May 15, 2020
@Borda Borda modified the milestones: 0.7.7, 0.8.0 May 26, 2020
@Borda Borda modified the milestones: 0.8.0, 0.9.0 Jun 9, 2020
@edenlightning
Contributor

This was added here. Closing.

@jlehrer1

jlehrer1 commented Apr 6, 2022

Hi, sorry to re-open this, but I'm facing this precise problem currently. I'd like to sample continuously from multiple dataloaders: not have batches contain {'loader1': batch_1, 'loader2': batch_2}, but rather loader1_batch_1, ..., loader1_batch_n, loader2_batch_1, ..., loader2_batch_m. This doesn't seem to be the default behavior when passing multiple DataLoaders in a LightningDataModule, but is it possible?

@jlehrer1

jlehrer1 commented Apr 6, 2022

In fact, ideally this would work for M DataLoaders, not just two.
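
As a rough workaround sketch for that sequential pattern with M ordinary DataLoaders, outside of any built-in Lightning handling (chained_batches is a hypothetical helper):

from itertools import chain

def chained_batches(loaders):
    # Yield every batch of the first loader, then every batch of the
    # second, and so on, keeping each loader's own batch size and sampler.
    return chain.from_iterable(loaders)

Each loader is exhausted in turn, but the batch index restarts at every boundary, so any per-loader bookkeeping would need to be tracked manually.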
