Clean up data loading and randomization; add scheduler #29

Conversation

@pokey (Contributor) commented Dec 23, 2022

Comment on lines +104 to +105
dataset_rng = torch.Generator().manual_seed(self.data_seed)
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [self.dataset_size - split, split], generator=dataset_rng)
@pokey (Contributor, Author) commented Dec 23, 2022

Note that the dataset randomizer now has its own seed, and we do the split once, rather than doing it once per ensemble member
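As an aside, the reason the split gets its own generator is that the split then depends only on the data seed and not on the global RNG state, so the exact same split can be rebuilt later (e.g. when resuming from a checkpoint). A minimal, self-contained sketch of that property (the toy dataset and helper name are illustrative, not code from this repo):

import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.arange(100))  # stand-in dataset for illustration
split = 20

def split_dataset(seed):
    # A dedicated generator ties the split to the seed alone, so it is
    # unaffected by any other random number use elsewhere in training.
    rng = torch.Generator().manual_seed(seed)
    return random_split(dataset, [len(dataset) - split, split], generator=rng)

train_a, val_a = split_dataset(42)
train_b, val_b = split_dataset(42)
assert train_a.indices == train_b.indices  # identical split every time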

self.optimizers.append(optimizer)
self.schedulers.append( torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min'))

self.train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)
@pokey (Contributor, Author)

We create a shuffling data loader, so we don't need to shuffle manually during training; the DataLoader reshuffles the data for us at the start of each epoch.

self.validation_loaders.append(torch.utils.data.DataLoader(dataset, batch_size=self.batch_size, sampler=valid_sampler))
optimizer = optim.SGD(self.nets[i].parameters(), lr=0.003, momentum=0.9, nesterov=True)
self.optimizers.append(optimizer)
self.schedulers.append( torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min'))
@pokey (Contributor, Author)

Added a scheduler
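One thing worth noting about ReduceLROnPlateau: it only adjusts the learning rate when it is fed a metric, so the training loop needs a scheduler.step(validation_loss) call each epoch. A minimal sketch of that wiring (the tiny model and placeholder loss value are illustrative only, not code from this PR):

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')

for epoch in range(5):
    # ... training and validation passes would run here ...
    validation_loss = 1.0  # placeholder; in practice this is the epoch's validation loss
    # The scheduler lowers the LR once the metric stops improving for `patience` epochs.
    scheduler.step(validation_loss)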

@chaosparrot (Owner)

I initially gave each net a different validation set / train set to make sure each net would see different data (and thus perform better as an ensemble than if they all saw the same data). Otherwise the only variation they would have after many epochs would be the starting position (random initialization), right?

My question would be: since 3 nets using different train / validation sets can use more data for their training, wouldn't they perform better on novel data? Given a test set which none of the models have seen, wouldn't 3 ensembled models with different training data perform better than an ensemble where each model saw exactly the same data?

(From a combined ensemble validation score I can definitely see the benefits of a single split train / validation set, because with an ensemble where the validation set has been seen by some of the models there's obvious data pollution and the results will be skewed.)

@pokey (Contributor, Author) commented Dec 26, 2022

Huh, interesting idea. It makes me a bit nervous, e.g. we'd want to make sure we know which data seed each ensemble member was using in case we want to resume from a checkpoint. But I guess it could work?

@ym-han any thoughts? Is this something you've seen before? Reminds me of k-fold cross validation tbh, tho not exactly the same

@ym-han (Contributor) commented Dec 26, 2022

I haven't looked at the code so I can't be sure, but it sounds like this could be bagging (or something similar). There's some discussion and a link to some references on the sklearn bagging classifier page (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html), which I'm going to quote:

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

This algorithm encompasses several works from the literature. When random subsets of the dataset are drawn as random subsets of the samples, then this algorithm is known as Pasting [1]. If samples are drawn with replacement, then the method is known as Bagging [2]. When random subsets of the dataset are drawn as random subsets of the features, then the method is known as Random Subspaces [3]. Finally, when base estimators are built on subsets of both samples and features, then the method is known as Random Patches [4].

Breiman 1996 "Bagging Predictors", the chapter on bagging in Richard Berk's Statistical Learning from a Regression Perspective, and the section on bagging in Elements of Statistical Learning seem useful if you want to read up more.
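To make the "random subsets of the samples" idea concrete, here is a small self-contained sklearn example on toy data (nothing here comes from this repo): each base estimator is fit on its own bootstrap resample of the training data, and only the combined ensemble is scored on the held-out test set.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each of the 3 base estimators sees its own bootstrap resample of X_train;
# the test set is only used to evaluate the combined ensemble.
ensemble = BaggingClassifier(n_estimators=3, max_samples=0.8, bootstrap=True, random_state=0)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))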

@pokey (Contributor, Author) commented Dec 26, 2022

Yeah, that sounds right. But I presume all of those methods assume you're sampling the training data with a fixed held-out validation set, whereas here the validation set of one ensemble member is used as training data for another.

@chaosparrot (Owner) commented Dec 28, 2022

Did some reading on KFold and cross-validation, and pokey is right in the sense that there is a held-out set kept separate from the validation sets for each model. I.e. given a data set of A, B, C ... Z and two models, these are the current splits (note the overlap):

Model    | Training                                  | Validation
Model A  | A, B, C, D, E, F, G, H, I, O, P, Q, R, J  | K, Y, L, M, N, Z
Model B  | N, O, P, Q, R, S, T, U, V, F, G, H, I, W  | X, Y, Z, D, E, L
Ensemble | -                                         | J, K, L, M, N, O, W, X, Y, Z, D, E

Whereas, in my opinion, the best split would be:

Model    | Training                                      | Validation
Model A  | A, B, C, D, E, F, G, H, I, O, P, Q, R, S, T   | J, K, L, M, N, O
Model B  | N, O, P, Q, R, S, T, U, J, K, L, M, G, H, I   | A, B, C, D, E, F
Ensemble | -                                             | V, W, X, Y, Z

Where we keep 10 percent of the total data set held out to test the ensemble on, use the remaining 90 percent for training and validation, and give the models validation sets that do not overlap with one another.

The issue then is still that we need to persist the random seed so the split stays available for checkpointing, but I think this would give the best model results without any data pollution in the validation sets.
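For what it's worth, a rough sketch of how such a split could be generated from a single persisted seed (the helper name and parameters are hypothetical, not code from this repo): a held-out ensemble test set is carved off first, and each model then gets a disjoint validation fold from the remainder, with everything else as its training data.

import torch

def make_splits(dataset_size, num_models, test_fraction, seed):
    # Persisting `seed` alongside checkpoints is enough to rebuild the exact split.
    rng = torch.Generator().manual_seed(seed)
    indices = torch.randperm(dataset_size, generator=rng).tolist()

    # Hold out a test set for the ensemble that no model trains or validates on.
    test_size = int(dataset_size * test_fraction)
    test_indices = indices[:test_size]
    remainder = indices[test_size:]

    # Give each model a disjoint validation fold; the rest is its training data.
    fold_size = len(remainder) // num_models
    splits = []
    for i in range(num_models):
        val = remainder[i * fold_size:(i + 1) * fold_size]
        train = remainder[:i * fold_size] + remainder[(i + 1) * fold_size:]
        splits.append((train, val))
    return splits, test_indices

splits, test_indices = make_splits(dataset_size=26, num_models=2, test_fraction=0.1, seed=7)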

@ym-han (Contributor) commented Dec 28, 2022

This table is helpful. Yes, I agree that it's important that whatever data is in the test set for the ensemble is not data that has been used to either train or tune the hyperparams of any of the models of the ensemble.

@pokey (Contributor, Author) commented Jan 13, 2023

hmm I'd be tempted to either

  • go with a simple / standard ensemble setup like the one proposed in this PR, or
  • stick with the existing approach (with some tweaks for reproducibility, e.g. capturing seeds). I can see the argument for utilising all available data for the ensemble, as the average user isn't publishing a paper or anything, so doesn't really need a proper held-out test set

Happy to hash this one out on Discord tho

@chaosparrot (Owner)

The second option seems fine for now; we can revisit the held-out data set at a later date.

@pokey (Contributor, Author) commented Jan 17, 2023

Ok I'll close this one for now; at some point prob worth cleaning up the split code and pulling the LR schedule stuff out of here, but this PR would need a lot of tweaking to get there. I pointed to this PR from the relevant issues

@pokey closed this Jan 17, 2023
Development

Successfully merging this pull request may close these issues.

  • Implement a learning rate schedule
  • Consistent train / dev split