-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a new mechanism for dataset format detection #531
Conversation
I'm going to revamp the dataset detection mechanism and make the `detect` method not as easy to call directly. Therefore, change the detection tests to instead use `Environment.detect_dataset` and to make sure that the expected format is one of the detected formats. There is another benefit to this. The revamped detection mechanism will allow for more specific checks, and should result in most test datasets being uniquely detected as belonging to a single format. With the new test structure, we can easily change the check from `assertIn(format, detected_formats)` to `assertEqual([format], detected_formats)` in order to ensure that no other detectors match the dataset.
Currently, format detection uses the `Importer.find_sources` method. (Technically, it uses the `Importer.detect` method, but that only has two implementations, and both of them use `find_sources`.) There are a couple of problems with that: 1. A lot of `find_sources` implementations aren't very picky and simply define a source with the initial directory as the location. As a result, autodetection will always return many formats, most of which have nothing to do with the dataset. 2. The interface of `find_sources` doesn't allow for easy reporting of _why_ the dataset was not detected as a certain format (and AFAIK, none of the implementations report it). This information could be useful to users. As a solution, remove the old `detect` method and introduce a new one, which receives an object that is tailor-made for making checks against a dataset and reporting any failures. Currently, there is only a generic implementation that uses `find_sources`, but now we can add custom `detect` implementations that perform specific checks. The reporting currently only goes to the debug log. We might want to make it more prominent in the future.
…s a file To check whether a dataset conforms to a format we generally need to check that certain files exist at a certain path relative to the root. However, when the source is a file, we don't know where the root is, so it's an inherently ambiguous situation.
* Fix the return type of `FormatDetectionContext.fail`; * Ensure in code that all confidence levels are positive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, there is room to improve the design, but as there are no strong objections, I think we can merge it and continue adding other formats.
Okay. I think the main unresolved question here is about the return type of |
Oh, I inadvertently dismissed your review. Could you approve again? |
Summary
The new mechanism allows importers to define requirements for dataset formats using simple method calls, with any unmet requirements automatically reported. This PR updates one format to use this mechanism (
icdar_text_localization
) as a proof of concept. More will follow afterwards. :-)See the commit messages for more details.
How to test
Checklist
develop
branchLicense
Feel free to contact the maintainers if that's a concern.