
[WIP] Add dataset format Yolov8 #44

Closed · wants to merge 8 commits
Conversation

@ChanBong commented Jun 3, 2024

Summary

This PR introduces the capability to export datasets in the YOLOv8 format. It is still a WIP, with only a fraction of the planned features implemented.

  • Adds support for bounding box export
  • Adds converter and extractor for yolov8
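
For context, a minimal usage sketch of the intended round trip, assuming the plugin registers itself under the `yolo_detection` format name used in the tests below (paths are placeholders, not part of the PR):

```python
# Hedged sketch, not the PR's test code: convert an existing COCO dataset
# to the new YOLOv8 detection layout and read it back with Datumaro.
from datumaro.components.dataset import Dataset

dataset = Dataset.import_from("path/to/coco_dataset", "coco_instances")
dataset.export("path/to/output", "yolo_detection", save_media=True)
roundtripped = Dataset.import_from("path/to/output", "yolo_detection")
```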

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
```python
# Copyright (C) 2022 CVAT.ai Corporation
#
# SPDX-License-Identifier: MIT
```

Summary by CodeRabbit

  • New Features
    • Added support for converting datasets to various YOLOv8 formats, including detection, segmentation, pose, and oriented box.
  • Documentation
    • Updated the user manual to include format specifications, dataset examples, and documentation links for the new YOLO formats.
  • Tests
    • Introduced test cases for the YOLO detection format converter, extractor, and importer to ensure functionality and correctness.

@nmanovic changed the title from "Add dataset format Yolov8" to "[WIP] Add dataset format Yolov8" on Jun 7, 2024
@@ -0,0 +1,213 @@
# Copyright (C) 2019-2022 Intel Corporation

Please update all copyright headers. See how it is done in other files.

META_FILE = "data.yaml"

@staticmethod
def _parse_config(path: str) -> Dict[str, str]:


Why did you start the method name with an underscore?

@ChanBong (author) replied:

This function is only used internally within this plugin folder. I also took inspiration from the existing yolo_format implementation in datumaro/plugins/yolo_format. Should I remove the underscore?


config = {}

for line in config_lines:


Please use pyyaml to parse the YAML file. Don't reinvent the wheel: https://pyyaml.org/wiki/PyYAMLDocumentation

Please use the safe_load method.

@ChanBong (author) replied:

Resolved. Used yaml.safe_load

@@ -0,0 +1,33 @@
# Copyright (C) 2024 CVAT.ai Corporation


Please avoid code cloning. I see the same code in 4 format.py files.



class YoloDetectionImporter(Importer):
META_FILE = YoloDetectionPath.META_FILE


This is a pain point. Please find a way to redefine the default config file name; we should always be able to treat it as a configuration parameter. If it isn't specified, I recommend looking for *.yaml files inside the dataset directory. If only one file is found, you can proceed. If several *.yaml files are found, you need to report an error.

Algorithm and recommendations (a sketch of this lookup follows the list):

  • please add a CLI argument to specify the config file name
  • if you don't have a hint from the command line, please try to find all *.yaml files
  • if you find only one yaml file, it is the config
  • if you find several yaml files, check whether data.yaml exists; if it does, use it
  • otherwise, please report an error
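
A minimal sketch of that lookup under the stated rules; the function name, the `config_name` CLI parameter, and the error type are illustrative, not part of the PR:

```python
import glob
import os.path as osp
from typing import Optional

def resolve_config_file(path: str, config_name: Optional[str] = None) -> str:
    # 1. An explicit name passed via the CLI wins.
    if config_name:
        return osp.join(path, config_name)
    # 2. No hint from the command line: collect all *.yaml files in the root.
    candidates = glob.glob(osp.join(path, "*.yaml"))
    # 3. Exactly one yaml file: it is the config.
    if len(candidates) == 1:
        return candidates[0]
    # 4. Several yaml files: fall back to data.yaml if it is among them.
    default = osp.join(path, "data.yaml")
    if default in candidates:
        return default
    # 5. Otherwise report an error.
    raise FileNotFoundError(f"Can't choose a config file among: {candidates}")
```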

# # classes = 2
# # train = data/train.txt
# # valid = data/test.txt
# # names = data/obj.names


Please remove irrelevant comments. Don't copy and paste. Try to rethink each line of code.


@classmethod
def find_sources(cls, path: str) -> List[Dict[str, Any]]:
# Check obj.names first


Do you actually check obj.names here?

@nmanovic commented:

I tried to convert coco8 to COCO and back. There are some differences in coordinates. In general, that is normal, but could you please try to minimize them?

```
$ wdiff  coco8/labels/train/000000000009.txt test3/labels/train/000000000009.txt
45 0.479492 0.688771 0.955609 [-0.5955-] {+0.595500+}
45 0.736516 0.247188 0.498875 0.476417
50 0.637063 0.732938 0.494125 0.510583
45 0.339438 0.418896 0.678875 [-0.7815-] {+0.781500+}
49 0.646836 0.132552 0.118047 [-0.0969375-] {+0.096937+}
49 0.773148 0.129802 [-0.0907344 0.0972292-] {+0.090734 0.097229+}
49 0.668297 0.226906 0.131281 0.146896
49 0.642859 [-0.0792187 0.148063-] {+0.079219+} 0.148062 {+0.148063+}
```
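
The differences above look like pure formatting: the converter emits a different number of decimal places than the original files (e.g. `0.5955` vs `0.595500`, `0.0969375` vs `0.096937`). A hedged sketch of one way to minimize the diff, assuming the writer controls the output line format:

```python
# Emit YOLO bbox lines with a fixed six-decimal format, matching the
# ultralytics convention, instead of relying on str()/repr() of floats.
def format_yolo_bbox_line(label_id: int, cx: float, cy: float, w: float, h: float) -> str:
    return "%d %.6f %.6f %.6f %.6f" % (label_id, cx, cy, w, h)

print(format_yolo_bbox_line(45, 0.479492, 0.688771, 0.955609, 0.5955))
# -> "45 0.479492 0.688771 0.955609 0.595500"
```

Note that values with more than six decimals (`0.0969375` → `0.096937` above) are lossy either way; matching the upstream precision only keeps the round trip stable.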

@nmanovic commented:

Please run all linters locally and check the results. You can see how to do that in the https://github.com/cvat-ai/datumaro/blob/develop/.github/workflows/linter.yml file.

To convert a YOLO-OrientedBox dataset to other formats, use the following commands:

```bash
datum convert -if yolo_orientedbox -i <path/to/dataset> -f coco_instances -o <path/to/dataset>
```
@mdacoca commented Jun 20, 2024:

Did not work:

(screenshot)

And with the default renamed to one of the proposed names, it looks like the command completes, but the folder with results is empty.

or

```bash
datum convert -if yolo_detection -i <path/to/dataset> -f coco_instances -o <path/to/dataset>
```
And with the default renamed to one of the proposed names too, it looks like the command completes, but the folder with results is empty.

@nmanovic commented:

(screenshot)

@@ -0,0 +1,674 @@
GNU GENERAL PUBLIC LICENSE


Why do you need this file here?

sonarcloud bot commented Jun 23, 2024

Quality Gate failed

Failed conditions
3 Security Hotspots
61.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@nmanovic commented:

@ChanBong, the linters have failed.

Please look at comments from SonarCloud. It reports a huge amount of code duplication. Need to fix that.

@nmanovic commented:

@CodeRabbit review

@nmanovic commented:

@coderabbitai review

coderabbitai bot commented Jun 23, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot commented Jun 23, 2024

Walkthrough

This update introduces comprehensive support for various YOLO formats (detection, segmented, pose, and oriented box) within the Datumaro framework. It includes new converters, extractors, tests, and user documentation. The changes focus on enabling datasets to be exported, imported, and tested in YOLOv8-compatible formats, significantly enhancing Datumaro's dataset handling capabilities.

Changes

| File(s) | Summary |
| --- | --- |
| `datumaro/plugins/yolo_detection_format/converter.py` | Added new functions and methods for converting datasets to YOLO detection format. |
| `datumaro/plugins/yolo_orientedbox_format/converter.py` | Introduced functionality to convert dataset annotations to YOLOv8 oriented box format. |
| `datumaro/plugins/yolo_pose_format/converter.py` | Added YOLO Pose format converter to handle datasets with keypoints and bounding boxes. |
| `datumaro/plugins/yolo_segmentation_format/converter.py` | Added converter for transforming datasets to YOLOv8 segmentation format. |
| `datumaro/plugins/yolo_orientedbox_format/extractor.py` | Introduced YoloOrientedboxExtractor class for extracting data in YOLO oriented box format. |
| `site/content/en/docs/user-manual/supported_formats.md` | Updated documentation to include support for multiple YOLO formats and their specifications. |
| `tests/test_yolo_detection_format.py` | Introduced test cases for YOLO detection converter, extractor, and importer functionalities. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Converter
    participant Extractor
    participant YOLOFormatHandler

    User->>+Converter: Apply conversion
    Converter->>+YOLOFormatHandler: Convert datasets to YOLO format
    YOLOFormatHandler-->>-Converter: Converted dataset
    Converter-->>-User: Converted dataset

    User->>+Extractor: Extract dataset
    Extractor->>+YOLOFormatHandler: Handle extraction
    YOLOFormatHandler-->>-Extractor: Extracted data
    Extractor-->>-User: Extracted data
```

Poem

In fields of code where rabbits play,
New YOLO formats come our way.
Converters, tests, and extractors bright,
Make datasets seamless, just right!
From detection to keypoints and more,
Datumaro's hopping like never before!
🌼🐇🚀✨



@coderabbitai bot left a comment

Actionable comments posted: 23

Outside diff range and nitpick comments (2)
site/content/en/docs/formats/yolo_orientedbox.md (1)

88-88: Consider adding a comma for clarity.

In the sentence discussing the normalization of bounding box coordinates, a comma after "image width" might improve readability and prevent potential confusion.

Tools
LanguageTool

[uncategorized] ~88-~88: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...`image width` and `y` by `image height`. Internally datumaro processes these oriented bound...

site/content/en/docs/user-manual/supported_formats.md (1)

Line range hint 204-204: Suggesting More Expressive Language in Documentation

Consider using a more engaging and informative phrase instead of "To do this," which might appear too direct and less informative.

- To do this, use `dataset_meta.json`.
+ To achieve this, refer to the `dataset_meta.json` configuration as follows.
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d0456d1 and 97cc020.

Files ignored due to path filters (1)
  • tests/assets/yolo_detection_dataset/images/train/1.jpg is excluded by !**/*.jpg
Files selected for processing (27)
  • datumaro/plugins/yolo_detection_format/converter.py (1 hunks)
  • datumaro/plugins/yolo_detection_format/extractor.py (1 hunks)
  • datumaro/plugins/yolo_detection_format/format.py (1 hunks)
  • datumaro/plugins/yolo_detection_format/importer.py (1 hunks)
  • datumaro/plugins/yolo_orientedbox_format/converter.py (1 hunks)
  • datumaro/plugins/yolo_orientedbox_format/extractor.py (1 hunks)
  • datumaro/plugins/yolo_orientedbox_format/format.py (1 hunks)
  • datumaro/plugins/yolo_orientedbox_format/importer.py (1 hunks)
  • datumaro/plugins/yolo_pose_format/converter.py (1 hunks)
  • datumaro/plugins/yolo_pose_format/extractor.py (1 hunks)
  • datumaro/plugins/yolo_pose_format/format.py (1 hunks)
  • datumaro/plugins/yolo_pose_format/importer.py (1 hunks)
  • datumaro/plugins/yolo_segmentation_format/converter.py (1 hunks)
  • datumaro/plugins/yolo_segmentation_format/extractor.py (1 hunks)
  • datumaro/plugins/yolo_segmentation_format/format.py (1 hunks)
  • datumaro/plugins/yolo_segmentation_format/importer.py (1 hunks)
  • datumaro/util/os_util.py (1 hunks)
  • site/content/en/docs/formats/yolo_detection.md (1 hunks)
  • site/content/en/docs/formats/yolo_orientedbox.md (1 hunks)
  • site/content/en/docs/formats/yolo_pose.md (1 hunks)
  • site/content/en/docs/formats/yolo_segmentation.md (1 hunks)
  • site/content/en/docs/user-manual/supported_formats.md (1 hunks)
  • tests/assets/yolo_detection_dataset/LICENSE (1 hunks)
  • tests/assets/yolo_detection_dataset/README.md (1 hunks)
  • tests/assets/yolo_detection_dataset/data.yaml (1 hunks)
  • tests/assets/yolo_detection_dataset/labels/train/1.txt (1 hunks)
  • tests/test_yolo_detection_format.py (1 hunks)
Files not reviewed due to errors (2)
  • datumaro/plugins/yolo_pose_format/converter.py (no review received)
  • datumaro/plugins/yolo_segmentation_format/extractor.py (no review received)
Files skipped from review due to trivial changes (6)
  • datumaro/plugins/yolo_orientedbox_format/format.py
  • site/content/en/docs/formats/yolo_pose.md
  • site/content/en/docs/formats/yolo_segmentation.md
  • tests/assets/yolo_detection_dataset/LICENSE
  • tests/assets/yolo_detection_dataset/README.md
  • tests/assets/yolo_detection_dataset/labels/train/1.txt
Additional context used
yamllint
tests/assets/yolo_detection_dataset/data.yaml

[error] 19-19: no new line character at the end of file (new-line-at-end-of-file)

LanguageTool
site/content/en/docs/formats/yolo_orientedbox.md

[uncategorized] ~88-~88: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...`image width` and `y` by `image height`. Internally datumaro processes these oriented bound...

site/content/en/docs/user-manual/supported_formats.md

[grammar] ~58-~58: This phrase is duplicated. You should probably use “Dataset example” only once. (PHRASE_REPETITION)
Context: ...et (classification, detection) - Dataset example - [Dataset example (txt for classification)](https://githu...


[style] ~204-~204: Consider a more expressive alternative. (DO_ACHIEVE)
Context: ...that are not original to the format. To do this, use dataset_meta.json. ```json...

Additional comments not posted (8)
site/content/en/docs/formats/yolo_detection.md (1)

1-178: Ensure documentation completeness and clarity.

  1. Consistency: Ensure that the documentation style and format are consistent with other format documentation in Datumaro.
  2. Error Handling: Include common errors and troubleshooting steps to help users resolve issues they might encounter while using the YOLO-Detection format.
  3. Examples: The examples provided should be tested to ensure they work as expected, especially commands and code snippets.
datumaro/plugins/yolo_detection_format/converter.py (2)

159-181: Optimize annotation export logic.

The method _export_item_annotation processes each item's annotations. Consider optimizing by pre-calculating values that don't change within the loop, such as image width and height.

- for bbox in item.annotations:
+ height, width = item.media.size
+ for bbox in item.annotations:

Likely invalid or redundant comment.


94-158: Verify dataset integrity and error handling.

The apply method orchestrates the dataset conversion process. It's crucial to ensure that all expected errors are handled gracefully and that the dataset's integrity is verified before proceeding with the conversion.

Verification successful

Verify dataset integrity and error handling.

The apply method in datumaro/plugins/yolo_detection_format/converter.py includes several error-handling mechanisms and dataset integrity checks:

  • Dataset Integrity Check: The method _check_dataset() is called within a try-except block to ensure the dataset's integrity. If a DatumaroError is raised, it is handled by the error policy.
  • Error Handling:
    • Raises MediaTypeError if the media type is not an image.
    • Raises DatasetExportError if the subset name is reserved.
    • Handles exceptions in _export_item_annotation method and reports item errors.

The code appears to handle errors gracefully and checks dataset integrity before proceeding with the conversion.

Scripts executed

The following scripts were executed for the analysis:


Script:

```bash
#!/bin/bash
# Description: Verify dataset integrity and handling in the apply method.

# Test: Check if all necessary checks and error handlings are implemented.
rg --type python 'apply' datumaro/plugins/yolo_detection_format/converter.py
```

Length of output: 107



Script:

```bash
#!/bin/bash
# Description: Verify dataset integrity and handling in the apply method.

# Test: Check if all necessary checks and error handling are implemented.
rg 'def apply' -A 50 datumaro/plugins/yolo_detection_format/converter.py
```

Length of output: 1976



Script:

```bash
#!/bin/bash
# Description: Verify dataset integrity and handling in the apply method.

# Test: Check if all necessary checks and error handling are implemented.
rg 'def apply' -A 100 datumaro/plugins/yolo_detection_format/converter.py
```

Length of output: 4134

datumaro/plugins/yolo_orientedbox_format/converter.py (2)

171-191: Optimize annotation export logic for oriented boxes.

The method _export_item_annotation processes each item's annotations specifically for oriented boxes. Consider optimizing by pre-calculating values that don't change within the loop, such as image width and height.

- for bbox in item.annotations:
+ height, width = item.media.size
+ for bbox in item.annotations:

Likely invalid or redundant comment.


106-170: Verify dataset integrity and error handling in the apply method.

Ensure that all expected errors are handled gracefully and that the dataset's integrity is verified before proceeding with the conversion.

Verification successful

The apply method in datumaro/plugins/yolo_orientedbox_format/converter.py includes integrity checks and error handling mechanisms.

The following error handling and dataset integrity checks are present:

  • raise MediaTypeError("Media type is not an image")
  • self._check_dataset()
  • self._ctx.error_policy.fail(e)
  • raise DatasetExportError(...)
  • self._ctx.error_policy.report_item_error(e, item_id=(item.id, item.subset))

These cover various aspects such as media type validation, dataset checking, and handling exceptions during the export process.

Scripts executed

The following scripts were executed for the analysis:


Script:

```bash
#!/bin/bash
# Description: Verify dataset integrity and handling in the apply method.

# Test: Check if all necessary checks and error handlings are implemented.
rg --type python 'apply' datumaro/plugins/yolo_orientedbox_format/converter.py
```

Length of output: 109



Script:

```bash
#!/bin/bash
# Description: Verify dataset integrity and error handling in the apply method.

# Fixing the command to search within the file for error handling and dataset integrity checks.
rg 'def apply' -A 100 datumaro/plugins/yolo_orientedbox_format/converter.py | rg 'raise|fail|error|check|verify'
```

Length of output: 384

datumaro/plugins/yolo_orientedbox_format/extractor.py (2)

45-65: Review Initialization of YoloOrientedboxExtractor

The constructor of YoloOrientedboxExtractor correctly initializes various instance variables and performs essential checks such as ensuring the provided config path is a directory and that URLs are provided. These checks are crucial for ensuring that the extractor is set up with valid configurations.


103-123: Error Handling in Data Iteration

The implementation of the __iter__ method in YoloOrientedboxExtractor includes robust error handling. By using a progress reporter and handling exceptions for each item, the method ensures that errors in individual items do not halt the entire import process. This approach maintains the integrity of the import operation while providing detailed error reporting.
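
A minimal sketch of that per-item pattern; the class and attribute names are illustrative, not the PR's extractor, though `report_item_error` mirrors the error-policy call shown elsewhere in this review:

```python
# Hedged sketch: iterate items and report per-item failures via the error
# policy instead of aborting the whole import.
class SketchExtractor:
    def __init__(self, item_ids, parse_item, error_policy):
        self._item_ids = item_ids
        self._parse_item = parse_item
        self._error_policy = error_policy

    def __iter__(self):
        for item_id in self._item_ids:
            try:
                # Parse one item; failures are reported and skipped, so a
                # single bad item does not halt the entire import.
                yield self._parse_item(item_id)
            except Exception as e:
                self._error_policy.report_item_error(e, item_id=item_id)
```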

site/content/en/docs/user-manual/supported_formats.md (1)

164-180: Documentation for YOLO Formats

The addition of YOLO formats (detection, segmentation, pose, oriented box) to the supported formats documentation is clear and well-structured. Each format type is linked to its specification, example, and documentation, providing a comprehensive resource for users.

[APPROVED]

Comment on lines +230 to +289
```python
def _parse_annotations(
    self, anno_path: str, image: Image, *, item_id: Tuple[str, str]
) -> List[Annotation]:
    lines = []
    with open(anno_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                lines.append(line)

    annotations = []

    if lines:
        # Use image info as late as possible to avoid unnecessary image loading
        if image.size is None:
            raise DatasetImportError(
                f"Can't find image info for '{self.localize_path(image.path)}'"
            )
        image_height, image_width = image.size

        for idx, line in enumerate(lines):
            try:
                parts = line.split()
                if len(parts) != 9:
                    raise InvalidAnnotationError(
                        f"Unexpected field count {len(parts)} in the oriented bbox description. "
                        "Expected 9 fields (label, x1, y1, x2, y2, x3, y3, x4, y4)."
                    )
                label_id, x1, y1, x2, y2, x3, y3, x4, y4 = parts

                label_id = self._parse_field(label_id, int, "oriented bbox label id")
                if label_id not in self._categories[AnnotationType.label]:
                    raise UndeclaredLabelError(str(label_id))

                x1 = self._parse_field(x1, float, "oriented bbox x1")
                y1 = self._parse_field(y1, float, "oriented bbox y1")
                x2 = self._parse_field(x2, float, "oriented bbox x2")
                y2 = self._parse_field(y2, float, "oriented bbox y2")
                x3 = self._parse_field(x3, float, "oriented bbox x3")
                y3 = self._parse_field(y3, float, "oriented bbox y3")
                x4 = self._parse_field(x4, float, "oriented bbox x4")
                y4 = self._parse_field(y4, float, "oriented bbox y4")

                (x, y), (w, h), r = xyxyxyxy2xywhr(
                    np.array(
                        [
                            [
                                [x1 * image_width, y1 * image_height],
                                [x2 * image_width, y2 * image_height],
                                [x3 * image_width, y3 * image_height],
                                [x4 * image_width, y4 * image_height],
                            ]
                        ]
                    )
                )

                annotations.append(
                    Bbox(
                        x,
                        y,
                        w,
                        h,
                        label=label_id,
                        id=idx,
                        group=idx,
                        attributes={"angle": r},
                    )
                )
            except Exception as e:
                self._ctx.error_policy.report_annotation_error(e, item_id=item_id)

    return annotations
```
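
For readers unfamiliar with the `xyxyxyxy2xywhr` helper used above: the PR does not show its body here, but a plausible sketch (an assumption, not the PR's implementation) maps the four corner points to OpenCV's minimum-area rectangle:

```python
import cv2
import numpy as np

def xyxyxyxy2xywhr(corners: np.ndarray):
    # corners: shape (1, 4, 2), the oriented box corners in pixel coordinates.
    # cv2.minAreaRect returns ((cx, cy), (w, h), angle_in_degrees).
    (cx, cy), (w, h), angle = cv2.minAreaRect(corners.reshape(-1, 2).astype(np.float32))
    return (cx, cy), (w, h), angle
```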
Complex Logic in Annotation Parsing

The method _parse_annotations contains complex logic for parsing oriented bounding box annotations. While the method is comprehensive, it could benefit from further modularization or simplification to enhance readability and maintainability. Consider breaking down this method into smaller, more focused sub-methods, each handling a specific part of the annotation parsing process.

Comment on lines +295 to +313
```python
def _load_categories(self, names_path: str) -> LabelCategories:
    if has_meta_file(osp.dirname(names_path)):
        return LabelCategories.from_iterable(parse_meta_file(osp.dirname(names_path)).keys())

    label_categories = LabelCategories()

    with open(names_path, "r") as fp:
        loaded = yaml.safe_load(fp.read())
        if isinstance(loaded["names"], list):
            label_names = loaded["names"]
        elif isinstance(loaded["names"], dict):
            label_names = list(loaded["names"].values())
        else:
            raise DatasetImportError(f"Can't read dataset category file '{names_path}'")

    for label_name in label_names:
        label_categories.add(label_name)

    return label_categories
```
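
For reference, a hedged sketch of both accepted shapes of the `names` field, shown with `yaml.safe_load` (label values mirror the test asset's `person`/`bicycle` classes):

```python
import yaml

# "names" as a list ...
list_style = yaml.safe_load("names: [person, bicycle]")
# ... or as an index-to-name mapping, as in the test asset's data.yaml.
dict_style = yaml.safe_load("names: {0: person, 1: bicycle}")

assert list_style["names"] == ["person", "bicycle"]
assert list(dict_style["names"].values()) == ["person", "bicycle"]
```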

Improvement in Category Loading

The method _load_categories effectively loads label categories from a YAML file. However, the error handling could be improved by providing more specific messages related to the expected format of the 'names' field in the YAML file. This would help users diagnose issues with their category files more effectively.

-    raise DatasetImportError(f"Can't read dataset category file '{names_path}'")
+    raise DatasetImportError(f"Expected 'names' to be a list or a dictionary in the dataset category file '{names_path}', but got {type(loaded['names']).__name__}")

```yaml
# Classes
names:
  0: person
  1: bicycle
```

Add a newline at the end of the file.

The YAML file should end with a newline to comply with common UNIX text file conventions and to ensure proper parsing by tools that read the file.

+ 
Tools
yamllint

[error] 19-19: no new line character at the end of file (new-line-at-end-of-file)

Comment on lines +18 to +25
```python
def _parse_config(path: str) -> Dict[str, str]:
    with open(path, "r") as fp:
        loaded = yaml.safe_load(fp.read())

    if not isinstance(loaded, dict):
        raise Exception("Invalid config format")

    return loaded
```

Ensure robust error handling in YAML parsing.

The method _parse_config uses yaml.safe_load which is good for security. However, consider enhancing the error handling to provide more specific feedback about what went wrong if the YAML content is not a dictionary.

-        if not isinstance(loaded, dict):
-            raise Exception("Invalid config format")
+        if not isinstance(loaded, dict):
+            raise ValueError("Expected the configuration to be a dictionary. Please check the YAML structure.")

Comment on lines +18 to +33
```python
def _parse_config(path: str) -> Dict[str, str]:
    with open(path, "r", encoding="utf-8") as f:
        config_lines = f.readlines()

    config = {}

    for line in config_lines:
        match = re.match(r"^\s*(\w+)\s*=\s*(.+)$", line)
        if not match:
            continue

        key = match.group(1)
        value = match.group(2)
        config[key] = value

    return config
```

Refactor configuration parsing to use YAML directly.

The manual parsing of the configuration file line by line using regex is error-prone and hard to maintain. Consider using yaml.safe_load for parsing, which is already used in other parts of the project.

-        config = {}
-        for line in config_lines:
-            match = re.match(r"^\s*(\w+)\s*=\s*(.+)$", line)
-            if not match:
-                continue
-            key = match.group(1)
-            value = match.group(2)
-            config[key] = value
-        return config
+        with open(path, "r", encoding="utf-8") as f:
+            return yaml.safe_load(f)

Comment on lines +193 to +218
```python
@classmethod
def patch(cls, dataset, patch, save_dir, **kwargs):
    conv = cls(dataset, save_dir=save_dir, **kwargs)
    conv._patch = patch
    conv.apply()

    for (item_id, subset), status in patch.updated_items.items():
        if status != ItemStatus.removed:
            item = patch.data.get(item_id, subset)
        else:
            item = DatasetItem(item_id, subset=subset)

        if not (status == ItemStatus.removed or not item.media):
            continue

        if subset == DEFAULT_SUBSET_NAME:
            subset = YoloOrientedboxPath.DEFAULT_SUBSET_NAME
        subset_dir = osp.join(save_dir, "obj_%s_data" % subset)

        image_path = osp.join(subset_dir, conv._make_image_filename(item))
        if osp.isfile(image_path):
            os.remove(image_path)

        ann_path = osp.join(subset_dir, "%s.txt" % item.id)
        if osp.isfile(ann_path):
            os.remove(ann_path)
```

Handle file deletion errors in the patch method.

When removing files in the patch method, there's currently no error handling. It's good practice to handle potential exceptions that can occur during file deletion.

- os.remove(image_path)
+ try:
+     os.remove(image_path)
+ except OSError as e:
+     log.error(f"Failed to delete image file {image_path}: {e}")

Comment on lines +68 to +86
```python
def _check_dataset(self):
    if self._extractor.media_type() and not issubclass(self._extractor.media_type(), Image):
        raise MediaTypeError("Media type is not an image")

    subset_names = set(self._extractor.subsets().keys())

    for subset in subset_names:
        if subset not in YoloOrientedboxPath.ALLOWED_SUBSET_NAMES:
            raise DatasetExportError(
                f"The allowed subset name is in {YoloOrientedboxPath.ALLOWED_SUBSET_NAMES}, "
                f'so that subset "{subset}" is not allowed.'
            )

    for must_name in YoloOrientedboxPath.MUST_SUBSET_NAMES:
        if must_name not in subset_names:
            raise DatasetExportError(
                f'Subset "{must_name}" is not in {subset_names}, '
                "but Yolov8 requires both of them."
            )
```

Improve error messaging for subset name validation.

The error messages in _check_dataset could be more descriptive and actionable. They currently do not guide the user on how to resolve the issues.

- f"The allowed subset name is in {YoloOrientedboxPath.ALLOWED_SUBSET_NAMES}, "
- f'so that subset "{subset}" is not allowed.'
+ f"Invalid subset name '{subset}'. Allowed subset names are: {YoloOrientedboxPath.ALLOWED_SUBSET_NAMES}. Please rename the subset or adjust the ALLOWED_SUBSET_NAMES in YoloOrientedboxPath."

Committable suggestion was skipped due to low confidence.

Comment on lines +88 to +103
```python
def _export_media(self, item: DatasetItem, subset_img_dir: str) -> str:
    try:
        if not item.media or not (item.media.has_data or item.media.has_size):
            raise DatasetExportError(
                "Failed to export item '%s': item has no image info" % item.id
            )

        image_name = self._make_image_filename(item)
        image_fpath = osp.join(subset_img_dir, image_name)

        if self._save_media:
            self._save_image(item, image_fpath)

        return image_fpath

    except Exception as e:
```

Ensure proper error handling for media export failures.

Similar to the previous file, the error handling for media export failures is broad and catches all exceptions. Handling specific expected errors separately can provide more targeted feedback or recovery options.

- except Exception as e:
+ except (FileNotFoundError, PermissionError) as e:

Committable suggestion was skipped due to low confidence.

Comment on lines +25 to +43
```python
def _make_yolo_obbox(img_size, box, angle):
    # https://github.com/pjreddie/darknet/blob/master/scripts/voc_label.py
    # <x> <y> <width> <height> - values relative to width and height of image
    # <x> <y> - are center of rectangle
    x = (box[0] + box[2]) / 2
    y = (box[1] + box[3]) / 2
    w = box[2] - box[0]
    h = box[3] - box[1]

    rect = ((x, y), (w, h), angle)
    box = cv2.boxPoints(rect)

    for corner in box:
        corner[0] = corner[0] / img_size[0]
        corner[1] = corner[1] / img_size[1]

    rotated_corners = box.flatten()

    return rotated_corners
```

Refactor the oriented bounding box calculation to a method inside the class.

The function _make_yolo_obbox is used extensively within the YoloOrientedboxConverter class. It would improve encapsulation and maintainability to make this a method of the class.

- def _make_yolo_obbox(img_size, box, angle):
+ class YoloOrientedboxConverter(Converter):
+     def _make_yolo_obbox(self, img_size, box, angle):

Committable suggestion was skipped due to low confidence.

Comment on lines +32 to +372
```python
        source = Dataset.import_from(DUMMY_DATASET_DIR, format="yolo_detection")

        parsed = pickle.loads(pickle.dumps(source))  # nosec

        compare_datasets_strict(self, source, parsed)


class YoloDetectionExtractorTest(TestCase):
    def _prepare_dataset(self, path: str) -> Dataset:
        dataset = Dataset.from_iterable(
            [
                DatasetItem(
                    id="a",
                    subset="train",
                    media=Image(data=np.ones((8, 8, 3))),
                    annotations=[Bbox(0, 2, 4, 2, label=0)],
                )
            ],
            categories=["test"],
        )
        dataset.export(path, "yolo_detection", save_images=True)

        return dataset

    @mark_requirement(Requirements.DATUM_GENERAL_REQ)
    def test_can_parse(self):
        with TestDir() as test_dir:
            expected = self._prepare_dataset(test_dir)

            actual = Dataset.import_from(test_dir, "yolo_detection")
            compare_datasets(self, expected, actual)

    @mark_requirement(Requirements.DATUM_ERROR_REPORTING)
    def test_can_report_invalid_data_file(self):
        with TestDir() as test_dir:
            with self.assertRaisesRegex(DatasetImportError, f"Can't find data.yaml in {test_dir}"):
                YoloDetectionExtractor(test_dir)

    @mark_requirement(Requirements.DATUM_ERROR_REPORTING)
    def test_can_report_invalid_ann_line_format(self):
        with TestDir() as test_dir:
            self._prepare_dataset(test_dir)
            with open(osp.join(test_dir, "labels", "train", "a.txt"), "w") as f:
                f.write("1 2 3\n")

            with self.assertRaises(AnnotationImportError) as capture:
                Dataset.import_from(test_dir, "yolo_detection").init_cache()
            self.assertIsInstance(capture.exception.__cause__, InvalidAnnotationError)
            self.assertIn("Unexpected field count", str(capture.exception.__cause__))

    @mark_requirement(Requirements.DATUM_ERROR_REPORTING)
    def test_can_report_invalid_label(self):
        with TestDir() as test_dir:
            self._prepare_dataset(test_dir)
            with open(osp.join(test_dir, "labels", "train", "a.txt"), "w") as f:
                f.write("10 0.5 0.5 0.5 0.5\n")

            with self.assertRaises(AnnotationImportError) as capture:
                Dataset.import_from(test_dir, "yolo_detection").init_cache()
            self.assertIsInstance(capture.exception.__cause__, UndeclaredLabelError)
            self.assertEqual(capture.exception.__cause__.id, "10")

    @mark_requirement(Requirements.DATUM_ERROR_REPORTING)
    def test_can_report_invalid_field_type(self):
        for field, field_name in [
            (1, "bbox center x"),
            (2, "bbox center y"),
            (3, "bbox width"),
            (4, "bbox height"),
        ]:
            with self.subTest(field_name=field_name):
                with TestDir() as test_dir:
                    self._prepare_dataset(test_dir)
                    with open(osp.join(test_dir, "labels", "train", "a.txt"), "w") as f:
                        values = [0, 0.5, 0.5, 0.5, 0.5]
                        values[field] = "a"
                        f.write(" ".join(str(v) for v in values))

                    with self.assertRaises(AnnotationImportError) as capture:
                        Dataset.import_from(test_dir, "yolo_detection").init_cache()
                    self.assertIsInstance(capture.exception.__cause__, InvalidAnnotationError)
                    self.assertIn(field_name, str(capture.exception.__cause__))

    @mark_requirement(Requirements.DATUM_ERROR_REPORTING)
    def test_can_report_missing_image_info(self):
        with TestDir() as test_dir:
            self._prepare_dataset(test_dir)
            os.remove(osp.join(test_dir, "images", "train", "a.jpg"))

            with self.assertRaises(ItemImportError) as capture:
                Dataset.import_from(test_dir, "yolo_detection").init_cache()

    @mark_requirement(Requirements.DATUM_ERROR_REPORTING)
    def test_can_report_missing_subset_info(self):
        with TestDir() as test_dir:
            self._prepare_dataset(test_dir)
            os.remove(osp.join(test_dir, "train.txt"))

            with self.assertRaisesRegex(InvalidAnnotationError, "subset list file"):
                Dataset.import_from(test_dir, "yolo_detection").init_cache()
```

Enhance test coverage and readability.

  1. Test Coverage: Ensure that all edge cases, such as handling of unusual or corrupt data files, are covered by the tests.
  2. Readability and Maintenance: Refactor common test setup tasks into helper methods to reduce redundancy and improve readability. This includes dataset creation, file setup, and assertions.
  3. Documentation: Add docstrings to each test method to describe what each test aims to verify.
-    def test_can_save_and_load(self):
+    def test_can_save_and_load(self):
+        """
+        Test the ability to save a dataset in YOLO detection format and load it back, ensuring data integrity.
+        """
+        source_dataset = self._create_simple_dataset()
+        with TestDir() as test_dir:
+            self._test_save_and_load(source_dataset, test_dir)

+    def _create_simple_dataset(self):
+        return Dataset.from_iterable(
+            [
+                DatasetItem(id=1, subset="train", media=Image(data=np.ones((8, 8, 3))),
+                            annotations=[Bbox(0, 2, 4, 2, label=2), Bbox(0, 1, 2, 3, label=4)]),
+                DatasetItem(id=2, subset="valid", media=Image(data=np.ones((8, 8, 3))),
+                            annotations=[Bbox(0, 1, 5, 2, label=2), Bbox(0, 2, 3, 2, label=5),
+                                         Bbox(0, 2, 4, 2, label=6), Bbox(0, 7, 3, 2, label=7)]),
+            ],
+            categories=["label_" + str(i) for i in range(10)],
+        )

+    def _test_save_and_load(self, source_dataset, test_dir):
+        YoloDetectionConverter.convert(source_dataset, test_dir, save_media=True)
+        parsed_dataset = Dataset.import_from(test_dir, "yolo_detection")
+        compare_datasets(self, source_dataset, parsed_dataset)

@nmanovic mentioned this pull request on Jul 25, 2024
@zhiltsov-max (Collaborator) commented:

Closed in favor of #50
