Introduce a new mechanism for dataset format detection #531

IRDonch · 2021-10-28T17:24:50Z

Summary

The new mechanism allows importers to define requirements for dataset formats using simple method calls, with any unmet requirements automatically reported. This PR updates one format to use this mechanism (icdar_text_localization) as a proof of concept. More will follow afterwards. :-)

See the commit messages for more details.

How to test

Checklist

I submit my changes into the develop branch
I have added description of my changes into CHANGELOG
I have updated the documentation accordingly
I have added tests to cover my changes
I have linked related issues

License

I submit my code changes under the same MIT License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below).

# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

datumaro/components/format_detection.py

datumaro/plugins/icdar_format/extractor.py

datumaro/plugins/coco_format/importer.py

tests/test_ade20k2017_format.py

I'm going to revamp the dataset detection mechanism and make the `detect` method not as easy to call directly. Therefore, change the detection tests to instead use `Environment.detect_dataset` and to make sure that the expected format is one of the detected formats. There is another benefit to this. The revamped detection mechanism will allow for more specific checks, and should result in most test datasets being uniquely detected as belonging to a single format. With the new test structure, we can easily change the check from `assertIn(format, detected_formats)` to `assertEqual([format], detected_formats)` in order to ensure that no other detectors match the dataset.

Currently, format detection uses the `Importer.find_sources` method. (Technically, it uses the `Importer.detect` method, but that only has two implementations, and both of them use `find_sources`.) There are a couple of problems with that: 1. A lot of `find_sources` implementations aren't very picky and simply define a source with the initial directory as the location. As a result, autodetection will always return many formats, most of which have nothing to do with the dataset. 2. The interface of `find_sources` doesn't allow for easy reporting of _why_ the dataset was not detected as a certain format (and AFAIK, none of the implementations report it). This information could be useful to users. As a solution, remove the old `detect` method and introduce a new one, which receives an object that is tailor-made for making checks against a dataset and reporting any failures. Currently, there is only a generic implementation that uses `find_sources`, but now we can add custom `detect` implementations that perform specific checks. The reporting currently only goes to the debug log. We might want to make it more prominent in the future.

…s a file To check whether a dataset conforms to a format we generally need to check that certain files exist at a certain path relative to the root. However, when the source is a file, we don't know where the root is, so it's an inherently ambiguous situation.

* Fix the return type of `FormatDetectionContext.fail`; * Ensure in code that all confidence levels are positive.

zhiltsov-max

Probably, there is room to improve the design, but as there are no strong objections, I think we can merge it and continue adding other formats.

IRDonch · 2021-11-02T15:05:24Z

Probably, there is room to improve the design, but as there are no strong objections, I think we can merge it and continue adding other formats.

Okay. I think the main unresolved question here is about the return type of apply_format_detector; but since that really affects only apply_format_detector itself and Environment.detect_dataset, we can put it on hold. I'll come back to it when it becomes necessary to expose the detection errors to API clients.

IRDonch · 2021-11-02T15:21:25Z

Oh, I inadvertently dismissed your review. Could you approve again?

IRDonch requested a review from zhiltsov-max October 28, 2021 17:41