
Added a class balanced distributed sampler #1844

Closed
wants to merge 1 commit

Conversation


@NatanBagrov (Contributor) commented Feb 18, 2024

Context: in cases where some classes are less frequent than others, we'd want to sample images containing these classes more often.

Current implementation:

  1. IndexMappingDatasetWrapper - gets a dataset and a list of indices (mapping), and wraps the dataset so that wrapper[i] == dataset[mapping[i]] (see the sketch after this list).
  2. An implementation of the repeat-factor sampling from https://arxiv.org/pdf/1908.03195.pdf, which returns a float per index (image) indicating the scarcity of the classes in that image (larger = scarcer = repeat it more often).
  3. DetectionClassBalancedDistributedSampler, which uses (1) with (2) for detection datasets.
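
To make (1) and (2) concrete, here is a minimal sketch of the two building blocks, assuming the repeat-factor formula r(c) = max(1, sqrt(t / f(c))) from the paper; the names and signatures here are illustrative and may differ from the actual diff:

import math
from collections import Counter
from typing import Dict, List, Sequence

from torch.utils.data import Dataset


class IndexMappingDatasetWrapper(Dataset):
    """Wraps `dataset` so that wrapper[i] == dataset[mapping[i]]."""

    def __init__(self, dataset: Dataset, mapping: Sequence[int]):
        self.dataset = dataset
        self.mapping = mapping

    def __getitem__(self, index: int):
        return self.dataset[self.mapping[index]]

    def __len__(self) -> int:
        return len(self.mapping)


def get_repeat_factors(index_to_classes: List[Sequence[int]], oversample_threshold: float) -> List[float]:
    """Repeat-factor sampling (https://arxiv.org/abs/1908.03195): result[i] is the repeat factor of image i (>= 1.0)."""
    num_images = len(index_to_classes)
    # f(c): fraction of images that contain class c
    images_per_class: Dict[int, int] = Counter(c for classes in index_to_classes for c in set(classes))
    class_repeat = {c: max(1.0, math.sqrt(oversample_threshold / (count / num_images))) for c, count in images_per_class.items()}
    # r(I) = max over the classes present in image I of r(c); images with no labels keep a factor of 1.0
    return [max((class_repeat[c] for c in set(classes)), default=1.0) for classes in index_to_classes]

The sampler then repeats each index math.ceil(repeat_factor) times and hands IndexMappingDatasetWrapper(dataset, repeat_indices) to DistributedSampler, as the diff further down shows.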

Discussion:
The current implementation supports distributed mode. Non-distributed mode could also be supported, but would require more code.
Some alternative implementations I can spot:

  1. Do not wrap the original dataset, but rather override __iter__ from DistributedSampler - it is possible, but it causes coupling and code duplication. The current implementation (super().__init__(dataset=IndexMappingDatasetWrapper(dataset, repeat_indices), *args, **kwargs)) is a bit less verbose.
  2. Eventually, we might use WeightedRandomSampler; however, it is not supported in distributed mode. There are some implementations of DistributedWeightedSampler in the wild. We could try these.
  3. Perhaps we should consider a different approach where the heavy lifting is on the dataset side. For example:
    dataset: class_balanced_wrapper
       inner_dataset: coco
       ...
    
    The issue, however, is that we cannot inherit inner_class, nor inherit DetectionDataset, because we don't want to pass all the __init__ parameters.
  4. Another option is to modify DetectionDataset and add a parameter for class balancing (i.e., oversampling_threshold) and do the heavy lifting inside DetectionDataset. This is less verbose; however, it adds more responsibility to DetectionDataset (compared to composition).

Open for discussion :)



@register_sampler(Samplers.DISTRIBUTED_DETECTION_CLASS_BALANCING)
class DetectionClassBalancedDistributedSampler(DistributedSampler):
Collaborator:

We have a dedicated module for our samplers, please move it there.

return repeat_factors


class IndexMappingDatasetWrapper(Dataset):
Collaborator:

I feel a bit uneasy about having a different dataset passed to the sampler than what we actually have in our data loader.
If anything happens under the hood (which I think does), things could go wrong.
I have a feeling @BloodAxe would be on the same page as me here, but what do you say?

return torch.tensor([idx, 0]) # class 0 appears everywhere, other classes appear only once.


class ClassBalancingTest(unittest.TestCase):
Collaborator:

Please add to suite.

return torch.tensor(idx)


class DatasetIndexMappingTest(unittest.TestCase):
Collaborator:

Please add to suite.

from super_gradients.training.datasets.detection_datasets.detection_dataset_class_balancing_wrapper import DetectionClassBalancedDistributedSampler


class DummyDetectionDataset(Dataset): # NOTE: we implement the needed stuff from DetectionDataset, but we do not inherit it because the ctor is massive
Collaborator:

Please add to suite.


from super_gradients.training.datasets.balancing_classes_utils import get_repeat_factors


Collaborator:

I think a comprehensive test that actually runs in DDP is missing.
Please add one (it can be done through the CI config - see the sanity_tests workflow).

@NatanBagrov (Contributor, Author) commented Feb 18, 2024:

Pff.. I agree. I'll try to implement it once we agree on the approach (i.e., it will be redundant if we decide on option 3 or 4).

2. For each category c, compute the category-level repeat factor: :math:`r(c) = max(1, sqrt(t/f(c)))`
3. For each image I, compute the image-level repeat factor: :math:`r(I) = max_{c in I} r(c)`

Returns a list of repeat factors (length = dataset_length). How to read: result[i] is a float indicating the repeat factor of image i.
Collaborator:

Out of curiosity - I guess this would mean that using Mosaic / mixup would mess things up here?
We use them in most of our recipes, so maybe there's something to take into account here.

@NatanBagrov (Contributor, Author):

So, you are half correct :) I believe things should work OK because the balancing is done (currently) at the sampler level, while mixup/mosaic are at the dataset level. On the other hand, the sampler will "balance" and ask for more indices with scarce classes (this is good), but the dataset does not know that it should sample non-uniformly, so 3 of the 4 mosaic images will be taken at random, regardless of class scarcity. Hope it makes sense...

for dataset_idx, repeat_factor in enumerate(repeat_factors):
    repeat_indices.extend([dataset_idx] * math.ceil(repeat_factor))

super().__init__(dataset=IndexMappingDatasetWrapper(dataset, repeat_indices), *args, **kwargs)
Collaborator:

Following my previous comment regarding the dataset wrapper, I am in favor of moving the logic introduced there (mapping etc.) into this class.
Also the helper methods, which I don't think have much context outside this specific sampler.

@BloodAxe (Collaborator) commented:

First of all - great initiative on adding sampler support! I feel this can be really helpful in many cases.
Regarding the suggested implementation - I do have a few comments:

  1. I will start with the most simple one. I think we should have a generic wrapper around any sampler for the DDP case: DistributedSamplerWrapper

Suppose we have a sampler (any subclass of Sampler); then for the DDP case we would do:
sampler = DistributedSamplerWrapper(sampler), and that would make the sampler compatible with DDP.

The suggested design of the wrapper was proposed on the PyTorch forums, has been tested for years (by myself included), and does not require any special knowledge of the underlying sampler implementation.

from operator import itemgetter
from typing import Optional

from torch.utils.data import Dataset, DistributedSampler, Sampler


class DatasetFromSampler(Dataset):
    """Dataset to create indexes from `Sampler`.

    Args:
        sampler: PyTorch sampler
    """

    def __init__(self, sampler: Sampler):
        """Initialisation for DatasetFromSampler."""
        self.sampler = sampler
        self.sampler_list = None

    def __getitem__(self, index: int):
        """Gets element of the dataset.

        Args:
            index: index of the element in the dataset

        Returns:
            Single element by index
        """
        if self.sampler_list is None:
            self.sampler_list = list(self.sampler)
        return self.sampler_list[index]

    def __len__(self) -> int:
        """
        Returns:
            int: length of the dataset
        """
        return len(self.sampler)


class DistributedSamplerWrapper(DistributedSampler):
    """
    Wrapper over `Sampler` for distributed training.
    Allows you to use any sampler in distributed mode.

    It is especially useful in conjunction with
    `torch.nn.parallel.DistributedDataParallel`. In such case, each
    process can pass a DistributedSamplerWrapper instance as a DataLoader
    sampler, and load a subset of subsampled data of the original dataset
    that is exclusive to it.

    .. note::
        Sampler is assumed to be of constant size.
    """

    def __init__(
        self,
        sampler,
        num_replicas: Optional[int] = None,
        rank: Optional[int] = None,
        shuffle: bool = True,
    ):
        """

        Args:
            sampler: Sampler used for subsampling
            num_replicas (int, optional): Number of processes participating in
              distributed training
            rank (int, optional): Rank of the current process
              within ``num_replicas``
            shuffle (bool, optional): If true (default),
              sampler will shuffle the indices
        """
        super(DistributedSamplerWrapper, self).__init__(
            DatasetFromSampler(sampler),
            num_replicas=num_replicas,
            rank=rank,
            shuffle=shuffle,
        )
        self.sampler = sampler

    def __iter__(self):

        self.dataset = DatasetFromSampler(self.sampler)
        indexes_of_indexes = super().__iter__()
        subsampler_indexes = self.dataset
        return iter(itemgetter(*indexes_of_indexes)(subsampler_indexes))
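
To make the suggestion concrete, a minimal usage sketch assuming the classes above are in scope; the toy dataset, the inner WeightedRandomSampler, and the fixed num_replicas/rank are all illustrative:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy dataset and per-sample weights (in practice the weights would come from class frequencies / repeat factors).
dataset = TensorDataset(torch.arange(100))
weights = torch.rand(100)

# Any single-process sampler can be the inner sampler.
base_sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)

# Wrapping it makes it usable under DDP: each rank iterates over its own shard of the sampled indices.
sampler = DistributedSamplerWrapper(base_sampler, num_replicas=2, rank=0)

loader = DataLoader(dataset, batch_size=16, sampler=sampler)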

@BloodAxe (Collaborator) commented:

Secondly, I think we should introduce some sort of interface:

from abc import ABC, abstractmethod

import numpy as np


class HasSamplingInformation(ABC):
    @abstractmethod
    def get_labels_presence(self) -> np.ndarray:
        """
        :returns: A numpy array of shape [Dataset Length, Num Classes] with values corresponding to
                  the number of objects of each class at the current sample index.
        """

Why do we want to have an interface?

  • We can check whether the dataset we are trying to use with the sampler implements it (if not - we show a nice error).
  • The sampler implementation gets a presence matrix which contains ALL the necessary information to do the sampling - no remapping of indexes and other stuff. A hedged sketch of how a sampler could consume this follows below.
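
A minimal sketch of how a sampler could consume that interface, assuming the HasSamplingInformation ABC above is in scope; the inverse-frequency weighting is illustrative, not part of the proposal:

import numpy as np
from torch.utils.data import Dataset, Sampler, WeightedRandomSampler


def build_balanced_sampler(dataset: Dataset) -> Sampler:
    # Fail loudly if the dataset cannot describe its label distribution.
    if not isinstance(dataset, HasSamplingInformation):
        raise ValueError(f"{type(dataset).__name__} does not implement HasSamplingInformation")

    presence = dataset.get_labels_presence()        # [num_samples, num_classes]
    class_freq = presence.sum(axis=0).clip(min=1)   # objects per class, clipped to avoid division by zero
    # Illustrative weighting: a sample's weight is the largest inverse frequency among the classes it contains.
    sample_weights = ((presence > 0) / class_freq).max(axis=1)
    sample_weights = np.maximum(sample_weights, 1.0 / presence.shape[0])  # keep unlabeled images sampleable
    return WeightedRandomSampler(weights=sample_weights, num_samples=len(sample_weights), replacement=True)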

@shaydeci (Collaborator) commented:

First of all - great initiative on adding sampler support! [...] I think we should have a generic wrapper around any sampler for the DDP case: DistributedSamplerWrapper [...]

I like this ^^

@NatanBagrov (Contributor, Author) commented:

Secondly, I think we should introduce some sort of interface: HasSamplingInformation [...]

This is a nice idea. I wonder if an array is too much compared to a generator. A fixed array, when we take Objects365 as an example, is of size: 1.7M dataset size * 365 classes * 4 bytes ≈ 2.5 GB.
Another possible downside is cases where SG users have their own dataset and will not get this feature. On the other hand, perhaps it is fair to ask them to implement something to get value.
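
For reference, the back-of-the-envelope arithmetic behind the ~2.5 GB figure, plus a sketch of the generator alternative mentioned above (the per-sample accessor name is hypothetical):

import numpy as np

# Dense float32 presence matrix for Objects365-scale data:
num_samples, num_classes = 1_700_000, 365
dense_bytes = num_samples * num_classes * np.dtype(np.float32).itemsize
print(f"dense presence matrix: {dense_bytes / 1e9:.2f} GB")  # ~2.48 GB

# Generator alternative: yield one presence row at a time instead of materializing the full matrix.
def iter_labels_presence(dataset):
    for index in range(len(dataset)):
        yield dataset.get_labels_presence_for_index(index)  # hypothetical per-sample accessor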

@NatanBagrov (Contributor, Author) commented:

Closing because this PR has turned into two: #1865 #1856
