
Make cache annotation optional #1332

Merged: 15 commits from feature/SG-742-make_cache_label_optional into master on Aug 1, 2023

Conversation

@Louis-Dupont (Contributor) commented on Jul 30, 2023

Goal

Add an option to run the dataset without caching annotations.

Notes

Why is it more complex than it may seem at first?

  • Even when we don't want to cache, ignore_empty_annotations requires a full pass over the dataset to find out which indices are empty, because the dataset length depends on it.
  • Caching cannot be done on the fly because it doesn't work out of the box with multiprocessing (PyTorch workers).
  • Some transforms need to know the indices of empty annotations, even when we want to train on all samples. This can create a situation where we iterate over the whole dataset but sample randomly from a subset.
  • We used to call self.annotation in many different places in the code. Adding an option to work without a cache would create a lot of redundancy and be very inefficient, because loading a single sample may involve heavy computation (see COCO _load_annotation). A minimal sketch of how both modes can share one access path follows this list.
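To make the trade-off concrete, here is a minimal sketch (hypothetical, not the actual super-gradients code) of a dataset that routes every access through a single _get_annotation method, so the cached and uncached modes share one code path:

```python
import numpy as np
from torch.utils.data import Dataset


class LazyDetectionDataset(Dataset):
    """Sketch: annotation caching as an option, behind a single access path."""

    def __init__(self, n_samples: int, cache_annotations: bool = True):
        self.n_samples = n_samples
        self._cache = None
        if cache_annotations:
            # One upfront pass; every later access is a cheap list lookup.
            self._cache = [self._load_annotation(i) for i in range(n_samples)]

    def _load_annotation(self, index: int) -> dict:
        # Stands in for the expensive per-sample parsing (e.g. COCO parsing
        # with tight_box_rotation and crowd handling).
        return {"boxes": np.zeros((0, 4), dtype=np.float32), "index": index}

    def _get_annotation(self, index: int) -> dict:
        if self._cache is not None:
            return self._cache[index]
        # Without caching, every call re-runs the expensive loading code,
        # which is why scattered self.annotation calls would be costly.
        return self._load_annotation(index)

    def __len__(self) -> int:
        return self.n_samples

    def __getitem__(self, index: int) -> dict:
        return self._get_annotation(index)
```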

@BloodAxe (Collaborator) commented

So, let's first make really, really clear what cache annotation does. I feel everyone has a different idea of what it does, so we should clarify EXACTLY what it does.

I believe it was initially intended to cache resized images on disk, but AFAIK this is never used nowadays. So that part of the code can safely be deleted as dead code (_get_cached_image).

Secondly, caching annotations is nothing but reading them from the format stored on disk (COCO JSON) into something more usable by the dataset itself. And this is where our problem comes from.

First issue: we never measured (benchmarked) how much time it takes to parse a COCO annotation file and to fetch the sample at a given index. We should check whether growing the dataset from 1e3 to 1e4, 1e5, 1e6, 1e7 samples stays O(n) for parsing and O(1) for accessing a sample. If it doesn't, that should be fixed.

Action point 1: Benchmark where we are now
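A minimal sketch of such a benchmark, assuming pycocotools is the parser under test and that synthetic annotation files of each size already exist (the file paths are illustrative):

```python
import time

from pycocotools.coco import COCO


def benchmark(annotation_file: str, n_probes: int = 1000) -> None:
    # Parsing cost: should grow roughly linearly (O(n)) with dataset size.
    t0 = time.perf_counter()
    coco = COCO(annotation_file)
    parse_s = time.perf_counter() - t0

    # Access cost: should stay roughly constant (O(1)) per sample.
    img_ids = coco.getImgIds()
    t0 = time.perf_counter()
    for img_id in img_ids[:n_probes]:
        coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    access_us = (time.perf_counter() - t0) / min(n_probes, len(img_ids)) * 1e6

    print(f"{len(img_ids):>8} images | parse {parse_s:6.2f}s | access {access_us:6.1f}us/sample")


for size in ("1e3", "1e4", "1e5", "1e6"):
    benchmark(f"synthetic_annotations_{size}.json")  # hypothetical files
```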

Second issue: I feel the whole concept of the base detection dataset class is plainly wrong and should be rewritten from scratch. It must receive numpy arrays of boxes/labels/sample ids.
That would make the base class much simpler and easier to understand.

Action point 2: I believe we should aim to make the base class as simple as possible:

  • drop _get_cached_image/_cache_images entirely as dead code.
  • remove ignore_empty_annotations from the base class. It is the derived classes that should respect this flag; the base class must act as storage for ready-to-use data and not apply any mapping/filtering under the hood (see the sketch after this list).
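A rough sketch of what such a plain-storage base class could look like, assuming derived classes do all parsing and filtering before construction (hypothetical, for illustration only):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DetectionSamples:
    """Plain storage: derived classes parse and filter; this class only indexes."""

    boxes: list       # one (n_i, 4) float array per sample
    labels: list      # one (n_i,) int array per sample
    sample_ids: list  # arbitrary per-sample identifiers

    def __len__(self) -> int:
        return len(self.sample_ids)

    def __getitem__(self, index: int):
        # No ignore_empty_annotations, no mapping, no filtering here:
        # the base class is dumb storage for ready-to-use numpy data.
        return self.boxes[index], self.labels[index], self.sample_ids[index]


samples = DetectionSamples(
    boxes=[np.array([[10.0, 10.0, 50.0, 50.0]])],
    labels=[np.array([3])],
    sample_ids=["000001"],
)
```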

That's my take on how it could be improved.

@BloodAxe (Collaborator) left a review comment:

See my comment in the discussion

@Louis-Dupont (Contributor, Author) commented

> I believe it was initially intended to cache resized images on disk, but AFAIK this is never used nowadays. So that part of the code can safely be deleted as dead code (_get_cached_image).
I believe the original motivation for caching annotations was not only to support image caching, BUT I completely agree on deleting image caching. I also agree that the two are highly coupled, since image caching requires the resizing done in the annotations.

> First issue: we never measured (benchmarked) how much time it takes to parse a COCO annotation file and to fetch the sample at a given index. We should check whether growing the dataset from 1e3 to 1e4, 1e5, 1e6, 1e7 samples stays O(n) for parsing and O(1) for accessing a sample. If it doesn't, that should be fixed.

There are two things here: 1. how the COCO annotations are loaded, and 2. how we process those annotations.

  1. The loading is currently done with pycocotools. I ran some tests to see if we can improve code efficiency.
  • Instantiation: the cost is almost exclusively opening a huge JSON file. We can improve it with libraries such as orjson; for COCO on tzag8 it reduced loading from 15s to 9s. Not that huge, and I'm not sure it is worth adding a new library for this, but maybe for bigger datasets (see the sketch after this list).
  • Loading: this piece of code is definitely slower with pycocotools than with my own custom code (not pushed here). That most likely means it's not O(1), but I did not run a clean benchmark across different dataset sizes to make sure.
  2. For the processing, I also don't have a proper benchmark. What I know is that overall it takes some time, especially with tight_box_rotation=True and with crowd annotations. I did not benchmark it, since the idea was to keep the caching option anyway, but you are right that this can be questioned.
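For reference, a drop-in sketch of the orjson swap for the instantiation step (assuming the JSON parse is the bottleneck; the file path is illustrative):

```python
import orjson  # third-party: pip install orjson


def load_coco_annotations(annotation_file: str) -> dict:
    # orjson parses raw bytes directly and is typically noticeably faster
    # than the stdlib json module on large COCO-style files.
    with open(annotation_file, "rb") as f:
        return orjson.loads(f.read())


annotations = load_coco_annotations("annotations/instances_train2017.json")
print(len(annotations["images"]), "images,", len(annotations["annotations"]), "annotations")
```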

> Second issue: I feel the whole concept of the base detection dataset class is plainly wrong and should be rewritten from scratch. It must receive numpy arrays of boxes/labels/sample ids. That would make the base class much simpler and easier to understand.

I agree that it should be rewritten, but the main issue is backward compatibility.
More precisely, because of the flawed design, any class inheriting from DetectionDataset is tightly coupled to that design, since it is required to implement _load_annotations. Changing this would be a huge breaking change for those classes, as it affects not only the input parameters but also, and mainly, the code implementation.
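To illustrate the coupling, here is a simplified, hypothetical stub of the current contract (names and signatures are approximate, for illustration only): every subclass's loading logic lives inside a hook whose shape the base class dictates, so the base class cannot change without breaking all subclasses.

```python
import numpy as np


class DetectionDataset:
    """Simplified stub of the current base-class contract."""

    def __init__(self, n_samples: int):
        # The base class drives the loading loop, so every subclass MUST
        # implement _load_annotations with exactly this shape.
        self.annotations = [self._load_annotations(i) for i in range(n_samples)]

    def _load_annotations(self, sample_id: int) -> dict:
        raise NotImplementedError


class MyCustomDataset(DetectionDataset):
    # User code is written against the hook, not against plain data inputs,
    # so switching the base class to accept numpy arrays breaks every subclass.
    def _load_annotations(self, sample_id: int) -> dict:
        return {"target": np.zeros((0, 5), dtype=np.float32), "img_path": f"{sample_id}.jpg"}


ds = MyCustomDataset(n_samples=3)  # the entire load path runs through the hook
```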

@shaydeci (Collaborator) commented on Jul 31, 2023

> So, let's first make really, really clear what cache annotation does. […] That's my take on how it could be improved.

I don't see your point about the first issue.

  1. "something more usable by the dataset"- I witnessed myself that for small datasets there is a value for caching them (i.e when the RAM is not full) and for larger ones the swap memory gets full and at that point there is not benefit.

  2. Why does simplifying the base class has to be in this PR ?
    @BloodAxe

Copy link
Collaborator

@BloodAxe BloodAxe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Louis-Dupont merged commit 7afa83d into master on Aug 1, 2023
6 checks passed
@Louis-Dupont deleted the feature/SG-742-make_cache_label_optional branch on August 1, 2023 at 12:08