Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a new mechanism for dataset format detection #531

Merged
merged 6 commits into from
Nov 2, 2021
Merged

Introduce a new mechanism for dataset format detection #531

merged 6 commits into from
Nov 2, 2021

Conversation

IRDonch
Copy link

@IRDonch IRDonch commented Oct 28, 2021

Summary

The new mechanism allows importers to define requirements for dataset formats using simple method calls, with any unmet requirements automatically reported. This PR updates one format to use this mechanism (icdar_text_localization) as a proof of concept. More will follow afterwards. :-)

See the commit messages for more details.

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

Roman Donchenko added 6 commits November 1, 2021 18:13
I'm going to revamp the dataset detection mechanism and make the `detect`
method not as easy to call directly. Therefore, change the detection tests
to instead use `Environment.detect_dataset` and to make sure that the
expected format is one of the detected formats.

There is another benefit to this. The revamped detection mechanism will
allow for more specific checks, and should result in most test datasets
being uniquely detected as belonging to a single format. With the new test
structure, we can easily change the check from
`assertIn(format, detected_formats)` to
`assertEqual([format], detected_formats)` in order to ensure that no other
detectors match the dataset.
Currently, format detection uses the `Importer.find_sources` method.
(Technically, it uses the `Importer.detect` method, but that only has
two implementations, and both of them use `find_sources`.) There are
a couple of problems with that:

1. A lot of `find_sources` implementations aren't very picky and simply
   define a source with the initial directory as the location. As a result,
   autodetection will always return many formats, most of which have nothing
   to do with the dataset.

2. The interface of `find_sources` doesn't allow for easy reporting of
   _why_ the dataset was not detected as a certain format (and AFAIK,
   none of the implementations report it). This information could be useful
   to users.

As a solution, remove the old `detect` method and introduce a new one, which
receives an object that is tailor-made for making checks against a dataset
and reporting any failures. Currently, there is only a generic implementation
that uses `find_sources`, but now we can add custom `detect` implementations
that perform specific checks.

The reporting currently only goes to the debug log. We might want to make it
more prominent in the future.
…s a file

To check whether a dataset conforms to a format we generally need to check
that certain files exist at a certain path relative to the root. However,
when the source is a file, we don't know where the root is, so it's an
inherently ambiguous situation.
* Fix the return type of `FormatDetectionContext.fail`;

* Ensure in code that all confidence levels are positive.
zhiltsov-max
zhiltsov-max previously approved these changes Nov 2, 2021
Copy link
Contributor

@zhiltsov-max zhiltsov-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, there is room to improve the design, but as there are no strong objections, I think we can merge it and continue adding other formats.

@IRDonch
Copy link
Author

IRDonch commented Nov 2, 2021

Probably, there is room to improve the design, but as there are no strong objections, I think we can merge it and continue adding other formats.

Okay. I think the main unresolved question here is about the return type of apply_format_detector; but since that really affects only apply_format_detector itself and Environment.detect_dataset, we can put it on hold. I'll come back to it when it becomes necessary to expose the detection errors to API clients.

@IRDonch
Copy link
Author

IRDonch commented Nov 2, 2021

Oh, I inadvertently dismissed your review. Could you approve again?

@IRDonch IRDonch merged commit 257bb38 into openvinotoolkit:develop Nov 2, 2021
@IRDonch IRDonch deleted the precise-detection branch November 2, 2021 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants