
Add task type information when importing #1422

Merged
26 commits merged into openvinotoolkit:develop on Apr 15, 2024

Conversation

wonjuleee
Contributor

@wonjuleee wonjuleee commented Apr 9, 2024

Summary

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added the description of my changes into CHANGELOG.
  • I have updated the documentation accordingly.

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2024 Intel Corporation
#
# SPDX-License-Identifier: MIT

@wonjuleee wonjuleee requested review from a team as code owners April 9, 2024 02:46
@wonjuleee wonjuleee requested review from jihyeonyi and removed request for a team April 9, 2024 02:46
@wonjuleee wonjuleee marked this pull request as draft April 9, 2024 02:46
caption = 11
super_resolution = 12
depth_estimation = 13
mixed = 14
Contributor

Is there a way for users to know what tasks are possible when TaskType is mixed?

Contributor Author

A mixed task can be transformed into any task type. We provide mixed because the Datumaro format can contain any AnnotationType when importing.
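To illustrate the idea (a hypothetical sketch, not the PR's actual API — the enum values and the `narrow_to_task` helper below are illustrative assumptions): a dataset whose annotations span several types is tagged mixed, and it can be narrowed to a concrete task by keeping only the annotation types that task uses.

```python
from enum import Enum, auto


class AnnotationType(Enum):
    label = auto()
    bbox = auto()
    mask = auto()


class TaskType(Enum):
    classification = auto()
    detection = auto()
    segmentation = auto()
    mixed = auto()


# Hypothetical mapping from a task to the annotation types it consumes
TASK_ANN_TYPES = {
    TaskType.classification: {AnnotationType.label},
    TaskType.detection: {AnnotationType.bbox},
    TaskType.segmentation: {AnnotationType.mask},
}


def narrow_to_task(ann_types: set, target: TaskType) -> set:
    """Keep only the annotation types relevant to the target task."""
    return ann_types & TASK_ANN_TYPES[target]


# A "mixed" dataset containing labels and boxes can still serve detection:
mixed = {AnnotationType.label, AnnotationType.bbox}
print(narrow_to_task(mixed, TaskType.detection))  # keeps only bbox
```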

Contributor

@vinnamkim vinnamkim Apr 9, 2024

I can see there are many changes in plugins/data_format. However, I'd rather revert them and let the set of annotation types existing in the dataset be managed by DatasetStorage (Dataset's dataset item container) or StreamDatasetStorage (the counterpart for StreamDataset). This is because

  1. This implementation adds a hidden constraint that every dataset extractor (DatasetBase) must implement its own annotation type gatherer.
  2. This implementation is not aligned with our dataset transform logic. It currently computes task_type at DatasetBase. Assume some DatasetBase decides that a given dataset has two annotation types, Label and Bbox. However, if an arbitrary dataset transform is applied on top of it and drops every Bbox, we must re-compute the set of annotation types existing in the dataset after transformation. This should be done by DatasetStorage or StreamDatasetStorage.

Following this idea, it would be

class DatasetStorage:
    def __init__(self):
        ...
        self._set_of_ann_types: set | None = None

    ...

    @property
    def set_of_ann_types(self):
        if self._set_of_ann_types is None:
            self._set_of_ann_types = set()
            # If reset or not computed, run its iterator to compute
            for item in self:
                for ann in item.annotations:
                    self._set_of_ann_types.add(ann.type)
        return self._set_of_ann_types

    @property
    def task_type(self):
        return infer_task_type_from_set_of_ann_types(self.set_of_ann_types)

    ...

    def _iter_init_cache_unchecked(self) -> Iterable[DatasetItem]:
        # Merges the source, source transforms and patch, caches the result
        # and provides an iterator for the resulting item sequence.
        ...

        # Reset if there is a possible change
        self._set_of_ann_types = None
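The lazy-caching pattern proposed above can be exercised with a minimal self-contained sketch (simplified stand-ins for DatasetStorage and DatasetItem; the class and method names below are illustrative, not the actual Datumaro API): the set is computed on first access, reused afterwards, and invalidated whenever the items may have changed.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Item:
    # Each annotation is reduced to just its type name for this sketch
    ann_types: Set[str] = field(default_factory=set)


class Storage:
    def __init__(self, items):
        self._items = list(items)
        self._set_of_ann_types: Optional[Set[str]] = None  # lazy cache

    def __iter__(self):
        return iter(self._items)

    @property
    def set_of_ann_types(self) -> Set[str]:
        # Compute on first access by iterating items, then reuse the cache
        if self._set_of_ann_types is None:
            found: Set[str] = set()
            for item in self:
                found |= item.ann_types
            self._set_of_ann_types = found
        return self._set_of_ann_types

    def drop_ann_type(self, ann_type: str) -> None:
        # Any mutation (a transform, a patch) invalidates the cache,
        # matching the reset in _iter_init_cache_unchecked above
        for item in self._items:
            item.ann_types.discard(ann_type)
        self._set_of_ann_types = None


storage = Storage([Item({"label", "bbox"}), Item({"label"})])
print(storage.set_of_ann_types)  # label and bbox
storage.drop_ann_type("bbox")
print(storage.set_of_ann_types)  # recomputed: only label remains
```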

Contributor Author

Thank you for the good idea. When I tried this approach, it required a full iteration over the whole dataset just to obtain the available task information. So I switched to gathering the available tasks during import instead. What do you think?

Contributor Author

Please see 3d30c1d

@jihyeonyi
Contributor

I have a question. I'm curious why datasets are designed to have a single task_type. For example, if a dataset has both label and bbox annotations, it can be used for both classification and detection tasks. (And even for anomaly cls/det. tasks if the labels are anomalous and normal). However, according to your implementation, it seems like the task_type becomes detection.

@wonjuleee
Contributor Author


Hi @jihyeonyi, thank you for the question. We can identify the mapping between annotation types and tasks in task.py, and there are inclusion/exclusion relationships between them. So it is preferable to provide new transforms for changing or filtering dataset items per desired task type. Let me add more explanation later.
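A hedged sketch of what such a mapping could look like (the real logic lives in src/datumaro/components/task.py; the rules and names below are illustrative assumptions): a single task is inferred from the set of annotation types, and the inclusion relationship — detection datasets typically carry labels alongside boxes — explains why label + bbox resolves to detection rather than classification.

```python
from enum import Enum, auto


class AnnType(Enum):
    label = auto()
    bbox = auto()
    mask = auto()


class Task(Enum):
    classification = auto()
    detection = auto()
    segmentation = auto()
    mixed = auto()


def get_task(ann_types):
    """Infer a single task from a set of annotation types (illustrative rules)."""
    if not ann_types:
        return None
    # Labels alone mean classification, but because detection datasets
    # include labels too, label + bbox still maps to detection.
    if ann_types <= {AnnType.label}:
        return Task.classification
    if ann_types <= {AnnType.label, AnnType.bbox}:
        return Task.detection
    if ann_types <= {AnnType.label, AnnType.mask}:
        return Task.segmentation
    return Task.mixed
```

Under these assumed rules, a dataset with both label and bbox annotations reports detection, matching the behavior @jihyeonyi observed.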

@wonjuleee wonjuleee marked this pull request as ready for review April 11, 2024 07:34

codecov bot commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 91.09589%, with 39 lines in your changes missing coverage. Please review.

Project coverage is 80.98%. Comparing base (44cc56a) to head (0a313ca).
Report is 25 commits behind head on develop.

Files Patch % Lines
src/datumaro/components/dataset_storage.py 80.00% 5 Missing and 3 partials ⚠️
src/datumaro/plugins/data_formats/datumaro/base.py 76.00% 5 Missing and 1 partial ⚠️
...umaro/plugins/data_formats/datumaro/page_mapper.py 42.85% 4 Missing ⚠️
src/datumaro/plugins/data_formats/kitti/base.py 50.00% 1 Missing and 3 partials ⚠️
src/datumaro/plugins/data_formats/roboflow/base.py 70.00% 1 Missing and 2 partials ⚠️
src/datumaro/components/task.py 94.28% 2 Missing ⚠️
src/datumaro/plugins/data_formats/cifar.py 75.00% 1 Missing and 1 partial ⚠️
src/datumaro/plugins/data_formats/imagenet.py 75.00% 1 Missing and 1 partial ⚠️
src/datumaro/plugins/data_formats/mnist.py 75.00% 1 Missing and 1 partial ⚠️
src/datumaro/plugins/data_formats/mnist_csv.py 75.00% 1 Missing and 1 partial ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1422      +/-   ##
===========================================
+ Coverage    80.85%   80.98%   +0.12%     
===========================================
  Files          271      272       +1     
  Lines        30689    31137     +448     
  Branches      6197     6279      +82     
===========================================
+ Hits         24815    25216     +401     
- Misses        4489     4505      +16     
- Partials      1385     1416      +31     
Flag Coverage Δ
ubuntu-20.04_Python-3.10 80.96% <91.09%> (+0.12%) ⬆️
windows-2022_Python-3.10 80.95% <91.05%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown.


Comment on lines 477 to 480
# # when adding a new item, task_type will be updated automatically
# for ann in item.annotations:
# self._set_of_ann_types.add(ann.type)
# self._task_type = TaskAnnotationMapping().get_task(self._set_of_ann_types)
Contributor

I think this can be deleted.

Contributor Author

fixed at 0a313ca

@@ -643,7 +705,15 @@ def stacked_transform(self) -> IDataset:
return transform

def __iter__(self) -> Iterator[DatasetItem]:
yield from self.stacked_transform
# yield from self.stacked_transform
Contributor

This line can be deleted.

Contributor Author

fixed at 0a313ca

@wonjuleee wonjuleee merged commit 3e72044 into openvinotoolkit:develop Apr 15, 2024
8 checks passed