Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix misimplements for stream importer found while testing on the real-life COCO2017 object detection dataset #1093

Conversation

vinnamkim
Copy link
Contributor

@vinnamkim vinnamkim commented Jul 12, 2023

Summary

  1. Fix broken is_stream test
  2. Fix StreamDataset.import_from(path, format="coco_instances") actually create Dataset not StreamDataset.
  3. To speed up, let the initial length of _CocoBase(stream=True) can be obtained from COCOPageMapper.
  4. Speed up parsing the "categories" section when it is at the end of JSON file.
  5. To speed up, change caching logic and size slightly.
  6. Fix _CocoBase(stream=False) abruptly raising an error when the progress reporter is given.
  7. Add COCOExtractorMerger to handle "coco" import which should merge extractors across the tasks.

How to test

I manually tested the following code on the real-life COCO2017 object detection dataset.

from datumaro.components.dataset_base import DatasetItem
from datumaro.components.dataset import StreamDataset, Dataset
from time import time
from datumaro.components.progress_reporting import TQDMProgressReporter
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--stream", action="store_true", help="Use stream importer")


def upload_to_geti_db(item: DatasetItem) -> None:
    # Hi, I'm mock!
    pass


if __name__ == "__main__":
    args = parser.parse_args()
    start = time()

    dataset = (
        StreamDataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
        if args.stream
        else Dataset.import_from(
            "coco_json", format="coco_instances", progress_reporter=TQDMProgressReporter()
        )
    )

    for item in dataset:
        upload_to_geti_db(item)

    print(f"Done. Elapsed time: {time() - start:.2f}s")

Results:

  • No stream

no_stream

(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|██████████████████████████████████████████████████| 5000/5000 [00:00<00:00, 77041.13it/s]
Parsing annotations in 'instances_val2017.json': 100%|███████████████████████████████████████████████| 36781/36781 [00:01<00:00, 32391.89it/s]
Parsing image info in 'instances_train2017.json': 100%|████████████████████████████████████████████| 118287/118287 [00:01<00:00, 83969.43it/s]
Parsing annotations in 'instances_train2017.json': 100%|███████████████████████████████████████████| 860001/860001 [00:27<00:00, 31262.75it/s]
Done. Elapsed time: 38.17s
  • Stream

stream

(datumaro-basic) vinnamki@vinnamki:~/datumaro/ws_datum/coco$ mprof run --python python test_perf.py --stream
mprof: Sampling memory every 0.1s
running new process
running as a Python program...
WARNING:root:File 'coco_json/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/person_keypoints_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_train2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File 'coco_json/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Parsing image info in 'instances_val2017.json': 100%|███████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 3650.78it/s]
Parsing image info in 'instances_train2017.json': 100%|█████████████████████████████████████████████| 118287/118287 [00:39<00:00, 3027.10it/s]
Done. Elapsed time: 238.61s

Checklist

  • I have added unit tests to cover my changes.​
  • I have added integration tests to cover my changes.​
  • I have added the description of my changes into CHANGELOG.​
  • I have updated the documentation accordingly

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
…ataset

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
 - Specify encoding type (utf-8)

Signed-off-by: Vinnam Kim <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
…tream-importer-coco

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
…m independence

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
…aset

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
…simplements-found-from-reallife-coco-stream-exp

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
@vinnamkim vinnamkim added this to the 1.4.0 milestone Jul 12, 2023
@vinnamkim vinnamkim added the BUG Something isn't working label Jul 12, 2023
@vinnamkim vinnamkim changed the title Hotfix: Misimplements for stream importer found while testing on the real-life COCO2017 object detection dataset Fix misimplements for stream importer found while testing on the real-life COCO2017 object detection dataset Jul 12, 2023
@vinnamkim vinnamkim marked this pull request as ready for review July 12, 2023 09:44
@vinnamkim vinnamkim requested review from a team as code owners July 12, 2023 09:44
@vinnamkim vinnamkim requested review from wonjuleee and removed request for a team July 12, 2023 09:44
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
@vinnamkim vinnamkim marked this pull request as draft July 13, 2023 04:41
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Copy link
Contributor Author

@vinnamkim vinnamkim Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change all method to classmethod to be used in COCOExtractorMerger

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move some classes here to break the circular dependency

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
@vinnamkim vinnamkim marked this pull request as ready for review July 13, 2023 10:37
@vinnamkim vinnamkim added the FEATURE New feature & functionality label Jul 13, 2023
Copy link
Contributor

@wonjuleee wonjuleee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how COCOExtractorMerger is operated across tasks and what difference from the original ExtractorMerger. Could you clarify this?

@vinnamkim
Copy link
Contributor Author

vinnamkim commented Jul 14, 2023

I am not sure how COCOExtractorMerger is operated across tasks and what difference from the original ExtractorMerger. Could you clarify this?

Let's assume we import tests/assets/coco_dataset/coco as "coco" format.
This will have two extractors for the val subset at least:

  1. annotations/captions_val.json
  2. annotations/instances_val.json

Both extractors have a dataset item (id=40), but different annotations (captions or polygons).
Therefore, COCOExtractorMerger should merge the two dataset item into one by aggregating all annotations.

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
@vinnamkim vinnamkim merged commit bcaca42 into openvinotoolkit:releases/1.4.0 Jul 14, 2023
4 checks passed
@vinnamkim vinnamkim deleted the hotfix/misimplements-found-from-reallife-coco-stream-exp branch July 14, 2023 05:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BUG Something isn't working FEATURE New feature & functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants