Clean up data loading and randomization; add scheduler #29

Conversation

@pokey (Contributor) commented Dec 23, 2022

Comment on lines +104 to +105
dataset_rng = torch.Generator().manual_seed(self.data_seed)
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [self.dataset_size - split, split], generator=dataset_rng)
@pokey (Contributor, Author) commented Dec 23, 2022

Note that the dataset randomizer now has its own seed, and we do the split once, rather than doing it once per ensemble member
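As an aside, the reason the split gets its own generator is that the split then depends only on the data seed and not on the global RNG state, so the exact same split can be rebuilt later (e.g. when resuming from a checkpoint). A minimal, self-contained sketch of that property (the toy dataset and helper name are illustrative, not code from this repo):

import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.arange(100))  # stand-in dataset for illustration
split = 20

def split_dataset(seed):
    # A dedicated generator ties the split to the seed alone, so it is
    # unaffected by any other random number use elsewhere in training.
    rng = torch.Generator().manual_seed(seed)
    return random_split(dataset, [len(dataset) - split, split], generator=rng)

train_a, val_a = split_dataset(42)
train_b, val_b = split_dataset(42)
assert train_a.indices == train_b.indices  # identical split every time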

self.optimizers.append(optimizer)
self.schedulers.append( torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min'))

self.train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)
@pokey (Contributor, Author)

We create a shuffling data loader, so we don't need to shuffle manually during training; the DataLoader reshuffles the data for us at the start of each epoch.

self.validation_loaders.append(torch.utils.data.DataLoader(dataset, batch_size=self.batch_size, sampler=valid_sampler))
optimizer = optim.SGD(self.nets[i].parameters(), lr=0.003, momentum=0.9, nesterov=True)
self.optimizers.append(optimizer)
self.schedulers.append( torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min'))
@pokey (Contributor, Author)

Added a scheduler
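One thing worth noting about ReduceLROnPlateau: it only adjusts the learning rate when it is fed a metric, so the training loop needs a scheduler.step(validation_loss) call each epoch. A minimal sketch of that wiring (the tiny model and placeholder loss value are illustrative only, not code from this PR):

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')

for epoch in range(5):
    # ... training and validation passes would run here ...
    validation_loss = 1.0  # placeholder; in practice this is the epoch's validation loss
    # The scheduler lowers the LR once the metric stops improving for `patience` epochs.
    scheduler.step(validation_loss)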

@chaosparrot (Owner)

I initially gave each net a different validation set / train set to make sure each net would see different data (and thus perform better as an ensemble than if they all saw the same data). Otherwise the only variation they would have after many epochs would be the starting position (random initialization), right?

My question would be: since 3 nets using different train / validation sets can use more data for their training, wouldn't they perform better on novel data? Given a test set which none of the models have seen, wouldn't 3 ensembled models with different training data perform better than an ensemble where each model saw exactly the same data?

(From a combined ensemble validation score I can definitely see the benefits of a single split train / validation set, because with an ensemble where the validation set has been seen by some of the models there's obvious data pollution and the results will be skewed.)

@pokey (Contributor, Author) commented Dec 26, 2022

Huh, interesting idea. It makes me a bit nervous, e.g. we'd want to make sure we know which data seed each ensemble member was using in case we want to resume from a checkpoint. But I guess it could work?

@ym-han any thoughts? Is this something you've seen before? Reminds me of k-fold cross validation tbh, tho not exactly the same

@ym-han (Contributor) commented Dec 26, 2022

I haven't looked at the code so I can't be sure, but it sounds like this could be bagging (or something similar). There's some discussion and a link to some references on the sklearn bagging classifier page (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html), which I'm going to quote:

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [1]. If samples are drawn with replacement, then the method is known as Bagging [2]. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [3]. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [4].

Breiman 1996 "Bagging Predictors", the chapter on bagging in Richard Berk's Statistical Learning from a Regression Perspective, and the section on bagging in Elements of Statistical Learning seem useful if you want to read up more.
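To make the "random subsets of the samples" idea concrete, here is a small self-contained sklearn example on toy data (nothing here comes from this repo): each base estimator is fit on its own bootstrap resample of the training data, and only the combined ensemble is scored on the held-out test set.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each of the 3 base estimators sees its own bootstrap resample of X_train;
# the test set is only used to evaluate the combined ensemble.
ensemble = BaggingClassifier(n_estimators=3, max_samples=0.8, bootstrap=True, random_state=0)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))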

@pokey (Contributor, Author) commented Dec 26, 2022

Yeah, that sounds right. But I presume all of those methods assume you're sampling the training data with a fixed held-out validation set, whereas here the validation set of one ensemble member is used as training data for another.

@chaosparrot (Owner) commented Dec 28, 2022

Did some reading on KFold and cross-validation, and pokey is right in the sense that there is a held-out set kept separate from the validation sets for each model. I.e. given a data set of A, B, C ... Z and two models, these are the current splits (note the overlap):

Model    | Training                                  | Validation
Model A  | A, B, C, D, E, F, G, H, I, O, P, Q, R, J  | K, Y, L, M, N, Z
Model B  | N, O, P, Q, R, S, T, U, V, F, G, H, I, W  | X, Y, Z, D, E, L
Ensemble | -                                         | J, K, L, M, N, O, W, X, Y, Z, D, E

Whereas, in my opinion, the best split would be:

Model    | Training                                      | Validation
Model A  | A, B, C, D, E, F, G, H, I, O, P, Q, R, S, T   | J, K, L, M, N, O
Model B  | N, O, P, Q, R, S, T, U, J, K, L, M, G, H, I   | A, B, C, D, E, F
Ensemble | -                                             | V, W, X, Y, Z

Where we keep 10 percent of the total data set held out to test the ensemble on, use the remaining 90 percent for training and validation, and give the models validation sets that do not overlap with one another.

The issue then is still that we need to persist the random seed so the split stays available for checkpointing, but I think this would give the best model results without any data pollution in the validation sets.
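For what it's worth, a rough sketch of how such a split could be generated from a single persisted seed (the helper name and parameters are hypothetical, not code from this repo): a held-out ensemble test set is carved off first, and each model then gets a disjoint validation fold from the remainder, with everything else as its training data.

import torch

def make_splits(dataset_size, num_models, test_fraction, seed):
    # Persisting `seed` alongside checkpoints is enough to rebuild the exact split.
    rng = torch.Generator().manual_seed(seed)
    indices = torch.randperm(dataset_size, generator=rng).tolist()

    # Hold out a test set for the ensemble that no model trains or validates on.
    test_size = int(dataset_size * test_fraction)
    test_indices = indices[:test_size]
    remainder = indices[test_size:]

    # Give each model a disjoint validation fold; the rest is its training data.
    fold_size = len(remainder) // num_models
    splits = []
    for i in range(num_models):
        val = remainder[i * fold_size:(i + 1) * fold_size]
        train = remainder[:i * fold_size] + remainder[(i + 1) * fold_size:]
        splits.append((train, val))
    return splits, test_indices

splits, test_indices = make_splits(dataset_size=26, num_models=2, test_fraction=0.1, seed=7)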

@ym-han (Contributor) commented Dec 28, 2022

This table is helpful. Yes, I agree that it's important that whatever data is in the test set for the ensemble is not data that has been used to either train or tune the hyperparams of any of the models of the ensemble.

@pokey (Contributor, Author) commented Jan 13, 2023

hmm I'd be tempted to either

  • go with a simple / standard ensemble setup like the one proposed in this PR, or
  • stick with the existing approach (with some tweaks for reproducibility, e.g. capturing seeds). I can see the argument for utilising all available data for the ensemble, as the average user isn't publishing a paper or anything, so doesn't really need a proper held-out test set

Happy to hash this one out on Discord tho

@chaosparrot (Owner)

The second option seems fine for now; we can revisit the held-out data set at a later date.

@pokey (Contributor, Author) commented Jan 17, 2023

Ok I'll close this one for now; at some point prob worth cleaning up the split code and pulling the LR schedule stuff out of here, but this PR would need a lot of tweaking to get there. I pointed to this PR from the relevant issues

@pokey closed this Jan 17, 2023
Development

Successfully merging this pull request may close these issues.

  • Implement a learning rate schedule
  • Consistent train / dev split