Mergeback 1.5.1rc2 #1181

Merged
merged 17 commits into from
Oct 23, 2023

Conversation

yunchu
Contributor

@yunchu yunchu commented Oct 23, 2023

Summary

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have added the description of my changes into CHANGELOG.
  • I have updated the documentation accordingly

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT

yunchu and others added 16 commits September 11, 2023 13:34
openvinotoolkit#1145)

- Add multi-threading option (`num_workers > 0`) to `ModelTransform` and
`SAMBboxToInstanceMask`.
- It is required when the model launcher can take multiple requests at the
same time and has high throughput.
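
The `num_workers > 0` option described above can be pictured with a minimal sketch. This is not the code from `ModelTransform` or `SAMBboxToInstanceMask`; `transform_items` and `fake_infer` are hypothetical names used only for illustration, and the thread pool comes from the standard library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def transform_items(
    items: Iterable[T],
    infer: Callable[[T], R],
    num_workers: int = 0,
) -> Iterator[R]:
    """Apply `infer` to every item, optionally through a thread pool."""
    if num_workers <= 0:
        # Sequential path: only one request is in flight at a time.
        for item in items:
            yield infer(item)
        return

    # Threaded path: up to `num_workers` requests run concurrently.
    # `Executor.map` yields results in the order of the inputs.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        yield from pool.map(infer, items)


def fake_infer(x: int) -> int:
    # Hypothetical stand-in for a call to a model launcher.
    return x * 2
```

Threads only help here if the launcher releases the GIL or waits on I/O while serving a request, which is the situation the commit message describes.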

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
…it#1149)

- One of the tests added in openvinotoolkit#1145 is flaky:
https://github.com/openvinotoolkit/datumaro/actions/runs/6156803415/job/16706221640
```console
=========================== short test summary info ============================
FAILED tests/unit/test_util.py::MultiProcUtilTest::test_raise_exception_in_main_thread
= 1 failed, 1493 passed, 38 skipped, 2 xfailed, 48148 warnings in 407.34s (0:06:47) =
tests-py38-darwin: exit 1 (462.14 seconds) /Users/runner/work/datumaro/datumaro> python -m pytest -v --csv=/Users/runner/work/datumaro/datumaro/.tox/results-tests-py38-darwin.csv tests/unit --cov --cov-report=xml pid=4536
.pkg: _exit> python /Users/runner/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pyproject_api/_backend.py True setuptools.build_meta
  tests-py38-darwin: FAIL code 1 (793.18=setup[331.04]+cmd[462.14] seconds)
  evaluation failed :( (803.78 seconds)
```
- This happens because `join_timeout` is too short, so the main thread
tries to assert on the error logs before they are created.
- To fix it, set `join_timeout=None` so that the main thread waits
indefinitely until the producer thread terminates.
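
The race described above can be reproduced with plain `threading`. The sketch below is an illustration under assumptions, not the test from the repository: `run_producer` is a hypothetical helper, and the 0.5 second sleep stands in for whatever the producer does before it logs its error.

```python
import threading
import time
from typing import List, Optional


def run_producer(join_timeout: Optional[float]) -> List[str]:
    """Return the log lines the main thread can see right after `join`."""
    logs: List[str] = []

    def producer() -> None:
        time.sleep(0.5)  # simulated work before the error is logged
        logs.append("error: producer failed")

    thread = threading.Thread(target=producer)
    thread.start()

    # With a short timeout, `join` may return while the producer still runs.
    # With `timeout=None`, `join` blocks until the producer has finished.
    thread.join(timeout=join_timeout)
    seen = list(logs)

    thread.join()  # always wait for the thread before returning
    return seen
```

Calling `run_producer(0.01)` typically returns an empty list, which is the condition that made the assertion flaky, while `run_producer(None)` always sees the logged error.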

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
* update changelog
* update release note
* update version string
…lkit#1153)

- Ticket no. 120785
- Replace the streaming import logic with `DatumPageMapper`, which is implemented in Rust

| Before | After |
| :-: | :-: |
| ![image](https://github.com/openvinotoolkit/datumaro/assets/26541465/0a06ddc0-5256-45b4-af03-e9299b8e61b8) | ![image](https://github.com/openvinotoolkit/datumaro/assets/26541465/af76210b-8fb5-4b30-aec1-2b5a22856ef7) |

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>

### Summary
Color values in the labelmap.txt should be separated by commas, not
colons.
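
A short parsing sketch shows why the separator matters. It assumes each `labelmap.txt` line has the colon-separated layout `label:color_rgb:parts:actions`; that layout and the helper name `parse_labelmap_line` are assumptions made for illustration, not code taken from Datumaro.

```python
from typing import Optional, Tuple

Color = Tuple[int, int, int]


def parse_labelmap_line(line: str) -> Optional[Tuple[str, Optional[Color]]]:
    """Parse one labelmap line into (label, color); return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None

    # Fields are separated by colons, so the color itself cannot contain one.
    fields = line.split(":")
    name = fields[0]
    color_field = fields[1] if len(fields) > 1 else ""
    if not color_field:
        return name, None

    # Color components are separated by commas, e.g. "128,0,0".
    r, g, b = (int(c) for c in color_field.split(","))
    return name, (r, g, b)
```

Under this layout, a color written as `128:0:0` would be split into three separate fields, which is why the components need commas.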


### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [x] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [x] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

### License

- [ ] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```
- Apply fixes to 1.5.1 from openvinotoolkit#1159 and openvinotoolkit#1161

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Co-authored-by: Matěj Šmíd <m@matejsmid.cz>
Co-authored-by: Daniil Pastukhov <plus79222222238@gmail.com>
…it#1172)

- Update our CI OS from `windows-2019` to `windows-2022`
- This is because our CI fails only on `windows-2019` while
cleaning up the test directory:
https://github.com/openvinotoolkit/datumaro/actions/runs/6549100531/job/17785292362
and
https://github.com/openvinotoolkit/datumaro/actions/runs/6557126906/job/17808184193?pr=1169

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
…#1169)

- Ticket no. 122601
- Bump the Arrow data format version for export/import from 1.0 to 2.0 to
make both operations memory-bounded

|   | Before | After |
| :-: | :-: | :-: |
| export | ![image](https://github.com/openvinotoolkit/datumaro/assets/26541465/d5641aa7-5c2d-4f3d-899d-01f81cc0a7d1) | ![image](https://github.com/openvinotoolkit/datumaro/assets/26541465/b0b246a5-9f7a-449a-82d5-2c9893f6bbba) |
| import | ![image](https://github.com/openvinotoolkit/datumaro/assets/26541465/2c395306-5e8f-4813-a60e-afcbd954a66e) | ![image](https://github.com/openvinotoolkit/datumaro/assets/26541465/f38e1e73-e304-4586-a0c4-ad6891bbe37f) |

The following scripts were used for the above experiment.
<details>
<summary>1. Synthetic data preparation (10000 items with a 224x224 image
and a label are exported to Datumaro data format)</summary>

```python
import numpy as np
from datumaro.components.media import Image
from datumaro.components.project import Dataset
import os
from datumaro.components.dataset_base import DatasetItem
from datumaro.components.annotation import Label

from datumaro.util.image import encode_image

from tempfile import TemporaryDirectory
from datumaro.components.progress_reporting import TQDMProgressReporter


def fxt_large(test_dir, n=5000) -> Dataset:
    items = []
    for i in range(n):
        media = None
        if i % 3 == 0:
            media = Image.from_numpy(data=np.random.randint(0, 255, (224, 224, 3)))
        elif i % 3 == 1:
            media = Image.from_bytes(
                data=encode_image(np.random.randint(0, 255, (224, 224, 3)), ".png")
            )
        elif i % 3 == 2:
            Image.from_numpy(data=np.random.randint(0, 255, (224, 224, 3))).save(
                os.path.join(test_dir, f"test{i}.jpg")
            )
            media = Image.from_file(path=os.path.join(test_dir, f"test{i}.jpg"))

        items.append(
            DatasetItem(
                id=i,
                subset="test",
                media=media,
                annotations=[Label(np.random.randint(0, 3))],
            )
        )

    source_dataset = Dataset.from_iterable(
        items,
        categories=["label"],
        media_type=Image,
    )

    return source_dataset


if __name__ == "__main__":
    source_dir = "source"
    os.makedirs(source_dir, exist_ok=True)
    with TemporaryDirectory() as test_dir:
        source = fxt_large(test_dir, n=10000)
        reporter = TQDMProgressReporter()
        source.export(
            source_dir,
            format="datumaro",
            save_media=True,
            progress_reporter=reporter,
        )
```

</details>

<details>
  <summary>2. Export 10000 items to Arrow data format</summary>

```python
import shutil
import os
from datumaro.components.progress_reporting import TQDMProgressReporter

from datumaro.components.dataset import StreamDataset

if __name__ == "__main__":
    source_dir = "source"

    source = StreamDataset.import_from(source_dir, format="datumaro")

    export_dir = "export"
    if os.path.exists(export_dir):
        shutil.rmtree(export_dir)

    reporter = TQDMProgressReporter()
    source.export(
        export_dir,
        format="arrow",
        save_media=True,
        max_shard_size=1000,
        progress_reporter=reporter,
    )
```

</details>

<details>
  <summary>3. Import 10000 items in the Arrow data format </summary>

```python
import pyarrow as pa
from random import shuffle
from datumaro.components.progress_reporting import TQDMProgressReporter
from time import time
from datumaro.components.dataset import Dataset
import memory_profiler
import shutil

if __name__ == "__main__":
    source_dir = "source"
    dst_dir = "source.backup"
    shutil.move(source_dir, dst_dir)

    export_dir = "export"
    reporter = TQDMProgressReporter()

    start = time()
    dataset = Dataset.import_from(export_dir, format="arrow", progress_reporter=reporter)
    keys = [(item.id, item.subset) for item in dataset]

    shuffle(keys)

    for item_id, subset in keys:
        item = dataset.get(item_id, subset)
        img_data = item.media.data

    dt = time() - start
    print(f"dt={dt:.2f}")
    print(memory_profiler.memory_usage()[0])
    print(pa.total_allocated_bytes())

    shutil.move(dst_dir, source_dir)
```

</details>
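
The idea behind a memory-bounded export with a `max_shard_size` can be sketched independently of Arrow. The snippet below is only an illustration of the general technique, not the exporter's implementation; `iter_shards` is a hypothetical helper, and the parameter name simply mirrors the one used in the export script above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_shards(items: Iterable[T], max_shard_size: int) -> Iterator[List[T]]:
    """Yield consecutive shards of at most `max_shard_size` items each."""
    if max_shard_size <= 0:
        raise ValueError("max_shard_size must be positive")

    iterator = iter(items)
    while True:
        # Only one shard is materialized at a time, so peak memory depends on
        # the shard size rather than on the total number of items.
        shard = list(islice(iterator, max_shard_size))
        if not shard:
            return
        yield shard
```

A writer that consumes `iter_shards(stream, 1000)` and flushes each shard to its own file before requesting the next one never holds more than 1000 items at once.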

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
- Currently, there is a discrepancy between the returned image data types:
`ImageFromBytes.data` (`np.float32`), `ImageFromNumpy.data`
(`np.float32`), and `ImageFromFile.data` (`np.uint8`).
- This makes the data loader based on the Arrow data format (which uses
`ImageFromBytes.data`) slower, since image preprocessing is
performed on `np.float32` data (4x larger than `np.uint8`).
- This PR forces all `Image` classes to return `np.uint8` data.
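
The 4x figure follows directly from the element sizes of the two dtypes. The sketch below checks it with NumPy for a 224x224 RGB image, the size used in the benchmark scripts in this PR; `image_nbytes` is a hypothetical helper written for this illustration.

```python
import numpy as np


def image_nbytes(dtype, shape=(224, 224, 3)) -> int:
    """Return the buffer size in bytes of an image array with this dtype."""
    return np.zeros(shape, dtype=dtype).nbytes


# One byte per channel value for uint8, four bytes for float32.
uint8_size = image_nbytes(np.uint8)
float32_size = image_nbytes(np.float32)
print(uint8_size, float32_size, float32_size // uint8_size)
```

Every preprocessing step that copies or scans the array pays that factor in memory traffic, which is consistent with the slowdown described above.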

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>

### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [x] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [x] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly

### License

- [ ] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```
@yunchu yunchu requested review from a team as code owners October 23, 2023 04:59
@yunchu yunchu requested review from jihyeonyi and removed request for a team October 23, 2023 04:59

@codecov

codecov bot commented Oct 23, 2023

Codecov Report

Attention: 56 lines in your changes are missing coverage. Please review.

Comparison is base (ba6f0ed) 80.11% compared to head (8faca62) 80.08%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1181      +/-   ##
===========================================
- Coverage    80.11%   80.08%   -0.03%     
===========================================
  Files          268      267       -1     
  Lines        30093    29828     -265     
  Branches      5916     5846      -70     
===========================================
- Hits         24108    23889     -219     
+ Misses        4622     4603      -19     
+ Partials      1363     1336      -27     
| Flag | Coverage Δ |
| :-: | :-: |
| macos-11_Python-3.8 | 79.21% <83.57%> (?) |
| ubuntu-20.04_Python-3.8 | 80.08% <83.57%> (-0.02%) ⬇️ |
| windows-2019_Python-3.8 | ? |
| windows-2022_Python-3.8 | 80.06% <83.57%> (?) |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
| :-: | :-: |
| src/datumaro/components/dataset_base.py | 96.77% <100.00%> (ø) |
| src/datumaro/components/format_detection.py | 93.68% <ø> (ø) |
| src/datumaro/components/media.py | 79.67% <100.00%> (-0.33%) ⬇️ |
| .../plugins/data_formats/arrow/mapper/dataset_item.py | 100.00% <100.00%> (+3.70%) ⬆️ |
| ...datumaro/plugins/data_formats/datumaro/exporter.py | 95.13% <100.00%> (+0.19%) ⬆️ |
| src/datumaro/plugins/framework_converter.py | 89.32% <ø> (-0.21%) ⬇️ |
| src/datumaro/plugins/data_formats/arrow/format.py | 82.35% <90.00%> (+9.01%) ⬆️ |
| ...atumaro/plugins/data_formats/arrow/mapper/media.py | 84.82% <93.75%> (+11.02%) ⬆️ |
| src/datumaro/plugins/data_formats/datumaro/base.py | 88.70% <87.50%> (+0.66%) ⬆️ |
| ...rc/datumaro/plugins/data_formats/arrow/importer.py | 89.47% <88.00%> (-5.13%) ⬇️ |

... and 3 more

... and 6 files with indirect coverage changes


Contributor

@vinnamkim vinnamkim left a comment


LGTM

@yunchu yunchu merged commit c20dc3e into openvinotoolkit:develop Oct 23, 2023
6 checks passed
@yunchu yunchu deleted the mergeback-1.5.1rc2 branch October 23, 2023 07:09
3 participants