
Add task type information when importing #1422

Merged
26 commits merged into openvinotoolkit:develop on Apr 15, 2024

Conversation

wonjuleee
Contributor

@wonjuleee wonjuleee commented Apr 9, 2024

Summary

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added the description of my changes into CHANGELOG.
  • I have updated the documentation accordingly.

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2024 Intel Corporation
#
# SPDX-License-Identifier: MIT

@wonjuleee wonjuleee requested review from a team as code owners April 9, 2024 02:46
@wonjuleee wonjuleee requested review from jihyeonyi and removed request for a team April 9, 2024 02:46
@wonjuleee wonjuleee marked this pull request as draft April 9, 2024 02:46
caption = 11
super_resolution = 12
depth_estimation = 13
mixed = 14
Contributor

Is there a way for users to know what tasks are possible when TaskType is mixed?

Contributor Author

A mixed task can be transformed into any task type. We provide mixed because the Datumaro format can contain any AnnotationType when importing.
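To illustrate the idea (a hypothetical sketch, not the PR's actual API — the enum values and the `narrow_to_task` helper below are illustrative assumptions): a dataset whose annotations span several types is tagged mixed, and it can be narrowed to a concrete task by keeping only the annotation types that task uses.

```python
from enum import Enum, auto


class AnnotationType(Enum):
    label = auto()
    bbox = auto()
    mask = auto()


class TaskType(Enum):
    classification = auto()
    detection = auto()
    segmentation = auto()
    mixed = auto()


# Hypothetical mapping from a task to the annotation types it consumes
TASK_ANN_TYPES = {
    TaskType.classification: {AnnotationType.label},
    TaskType.detection: {AnnotationType.bbox},
    TaskType.segmentation: {AnnotationType.mask},
}


def narrow_to_task(ann_types: set, target: TaskType) -> set:
    """Keep only the annotation types relevant to the target task."""
    return ann_types & TASK_ANN_TYPES[target]


# A "mixed" dataset containing labels and boxes can still serve detection:
mixed = {AnnotationType.label, AnnotationType.bbox}
print(narrow_to_task(mixed, TaskType.detection))  # keeps only bbox
```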

Contributor

@vinnamkim vinnamkim Apr 9, 2024

I can see there are many changes in plugins/data_format. However, I'd rather revert them and let the set of annotation types existing in the dataset be managed by DatasetStorage (Dataset's dataset item container) or StreamDatasetStorage (the counterpart for StreamDataset). This is because

  1. This implementation adds a hidden constraint that every dataset extractor (DatasetBase) must implement its own annotation type gatherer.
  2. This implementation is not aligned with our dataset transform logic. It currently computes task_type at DatasetBase. Assume some DatasetBase decides that a given dataset has two annotation types, Label and Bbox. However, if an arbitrary dataset transform is applied on top of it and drops every Bbox, we must re-compute the set of annotation types existing in the dataset after transformation. This should be done by DatasetStorage or StreamDatasetStorage.

Following this idea, it would be

class DatasetStorage:
    def __init__(self):
        ...
        self._set_of_ann_types: set | None = None

    ...

    @property
    def set_of_ann_types(self):
        if self._set_of_ann_types is None:
            self._set_of_ann_types = set()
            # If reset or not computed, run its iterator to compute
            for item in self:
                for ann in item.annotations:
                    self._set_of_ann_types.add(ann.type)
        return self._set_of_ann_types

    @property
    def task_type(self):
        return infer_task_type_from_set_of_ann_types(self.set_of_ann_types)

    ...

    def _iter_init_cache_unchecked(self) -> Iterable[DatasetItem]:
        # Merges the source, source transforms and patch, caches the result
        # and provides an iterator for the resulting item sequence.
        ...

        # Reset if there is a possible change
        self._set_of_ann_types = None
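The lazy-caching pattern proposed above can be exercised with a minimal self-contained sketch (simplified stand-ins for DatasetStorage and DatasetItem; the class and method names below are illustrative, not the actual Datumaro API): the set is computed on first access, reused afterwards, and invalidated whenever the items may have changed.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Item:
    # Each annotation is reduced to just its type name for this sketch
    ann_types: Set[str] = field(default_factory=set)


class Storage:
    def __init__(self, items):
        self._items = list(items)
        self._set_of_ann_types: Optional[Set[str]] = None  # lazy cache

    def __iter__(self):
        return iter(self._items)

    @property
    def set_of_ann_types(self) -> Set[str]:
        # Compute on first access by iterating items, then reuse the cache
        if self._set_of_ann_types is None:
            found: Set[str] = set()
            for item in self:
                found |= item.ann_types
            self._set_of_ann_types = found
        return self._set_of_ann_types

    def drop_ann_type(self, ann_type: str) -> None:
        # Any mutation (a transform, a patch) invalidates the cache,
        # matching the reset in _iter_init_cache_unchecked above
        for item in self._items:
            item.ann_types.discard(ann_type)
        self._set_of_ann_types = None


storage = Storage([Item({"label", "bbox"}), Item({"label"})])
print(storage.set_of_ann_types)  # label and bbox
storage.drop_ann_type("bbox")
print(storage.set_of_ann_types)  # recomputed: only label remains
```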

Contributor Author

Thank you for the good idea. When I tried this approach, it required a full iteration over the whole dataset just to obtain the available task information. So I switched to gathering the available tasks during import instead. What do you think?

Contributor Author

Please see 3d30c1d

@jihyeonyi
Contributor

I have a question. I'm curious why datasets are designed to have a single task_type. For example, if a dataset has both label and bbox annotations, it can be used for both classification and detection tasks. (And even for anomaly cls/det. tasks if the labels are anomalous and normal). However, according to your implementation, it seems like the task_type becomes detection.

@wonjuleee
Contributor Author


Hi @jihyeonyi, thank you for the question. We can identify the mapping between annotation types and tasks in task.py, and there are inclusion/exclusion relationships between them. So it is preferable to provide new transforms for changing or filtering dataset items per desired task type. Let me add more explanation later.
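A hedged sketch of what such a mapping could look like (the real logic lives in src/datumaro/components/task.py; the rules and names below are illustrative assumptions): a single task is inferred from the set of annotation types, and the inclusion relationship — detection datasets typically carry labels alongside boxes — explains why label + bbox resolves to detection rather than classification.

```python
from enum import Enum, auto


class AnnType(Enum):
    label = auto()
    bbox = auto()
    mask = auto()


class Task(Enum):
    classification = auto()
    detection = auto()
    segmentation = auto()
    mixed = auto()


def get_task(ann_types):
    """Infer a single task from a set of annotation types (illustrative rules)."""
    if not ann_types:
        return None
    # Labels alone mean classification, but because detection datasets
    # include labels too, label + bbox still maps to detection.
    if ann_types <= {AnnType.label}:
        return Task.classification
    if ann_types <= {AnnType.label, AnnType.bbox}:
        return Task.detection
    if ann_types <= {AnnType.label, AnnType.mask}:
        return Task.segmentation
    return Task.mixed
```

Under these assumed rules, a dataset with both label and bbox annotations reports detection, matching the behavior @jihyeonyi observed.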

@wonjuleee wonjuleee marked this pull request as ready for review April 11, 2024 07:34

codecov bot commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 91.09589%, with 39 lines in your changes missing coverage. Please review.

Project coverage is 80.98%. Comparing base (44cc56a) to head (0a313ca).
Report is 25 commits behind head on develop.

Files Patch % Lines
src/datumaro/components/dataset_storage.py 80.00% 5 Missing and 3 partials ⚠️
src/datumaro/plugins/data_formats/datumaro/base.py 76.00% 5 Missing and 1 partial ⚠️
...umaro/plugins/data_formats/datumaro/page_mapper.py 42.85% 4 Missing ⚠️
src/datumaro/plugins/data_formats/kitti/base.py 50.00% 1 Missing and 3 partials ⚠️
src/datumaro/plugins/data_formats/roboflow/base.py 70.00% 1 Missing and 2 partials ⚠️
src/datumaro/components/task.py 94.28% 2 Missing ⚠️
src/datumaro/plugins/data_formats/cifar.py 75.00% 1 Missing and 1 partial ⚠️
src/datumaro/plugins/data_formats/imagenet.py 75.00% 1 Missing and 1 partial ⚠️
src/datumaro/plugins/data_formats/mnist.py 75.00% 1 Missing and 1 partial ⚠️
src/datumaro/plugins/data_formats/mnist_csv.py 75.00% 1 Missing and 1 partial ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1422      +/-   ##
===========================================
+ Coverage    80.85%   80.98%   +0.12%     
===========================================
  Files          271      272       +1     
  Lines        30689    31137     +448     
  Branches      6197     6279      +82     
===========================================
+ Hits         24815    25216     +401     
- Misses        4489     4505      +16     
- Partials      1385     1416      +31     
Flag Coverage Δ
ubuntu-20.04_Python-3.10 80.96% <91.09%> (+0.12%) ⬆️
windows-2022_Python-3.10 80.95% <91.05%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown.


Comment on lines 477 to 480
# # when adding a new item, task_type will be updated automatically
# for ann in item.annotations:
# self._set_of_ann_types.add(ann.type)
# self._task_type = TaskAnnotationMapping().get_task(self._set_of_ann_types)
Contributor

I think this can be deleted.

Contributor Author

fixed at 0a313ca

@@ -643,7 +705,15 @@ def stacked_transform(self) -> IDataset:
return transform

def __iter__(self) -> Iterator[DatasetItem]:
yield from self.stacked_transform
# yield from self.stacked_transform
Contributor

This line can be deleted.

Contributor Author

fixed at 0a313ca

@wonjuleee wonjuleee merged commit 3e72044 into openvinotoolkit:develop Apr 15, 2024
8 checks passed