Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

imagenet_txt_format: add support for a custom labels file path #434

Merged
merged 5 commits into from
Aug 31, 2021
Merged

imagenet_txt_format: add support for a custom labels file path #434

merged 5 commits into from
Aug 31, 2021

Conversation

IRDonch
Copy link

@IRDonch IRDonch commented Aug 26, 2021

Summary

This will be needed to enable Accuracy Checker integration, since Accuracy Checker is separately configured with the annotation and class file paths.

As a prerequisite, also change the Importer interface so that importers can access extra parameters when searching for source files. I introduced a new find_sources_with_params function instead of adding a new parameter to find_sources, so that compatibility with old importers could be preserved.

How to test

N/A

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

Copy link
Contributor

@zhiltsov-max zhiltsov-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check if it works through CLI? I suspect, it is the first real application for import parameters.

@@ -81,10 +87,13 @@ def _load_items(self, path):

class ImagenetTxtImporter(Importer):
Copy link
Contributor

@zhiltsov-max zhiltsov-max Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class ImagenetTxtImporter(Importer):
class ImagenetTxtImporter(Importer, CliPlugin):
@classmethod
def build_cmdline_parser(cls, **kwargs):
parser = super().build_cmdline_parser(**kwargs)
parser.add_argument('-l', '--labelmap', dest='labels_file',
help="Path to the labelmap (synsets) file")
return parser

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should the key be called --labelmap when the parameter is labels_file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call it how you see fits better. Label map is kind of a general term for such things.

Copy link
Author

@IRDonch IRDonch Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at existing occurrences of the term "label map" in the codebase, and it seems like that's usually used to describe maps of labels to colors (or vice versa). Whereas in this case we just have a list of labels, so I ended up going with --labels_file. I also didn't add a short option, since I don't expect this option to be frequently used.

labels = self._parse_labels(labels)
if not labels_file:
labels_file = osp.join(osp.dirname(path), ImagenetTxtPath.LABELS_FILE)
labels = self._parse_labels(labels_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, it would be useful to support passing just the filename instead of the full path.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented (with the caveat that it's a relative path and not just the filename).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest labels_file or ImagenetTxtPath.LABELS_FILE.

How about checking if there is just a file at the given path, then, if no, falling back to prepending a dataset directory?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking if there is just a file at the given path, then, if no, falling back to prepending a dataset directory?

I don't think that would lead to useful semantics, because then you would not be able to reliably use a dataset-relative path as the value of the parameter. If you wrote ImagenetTxtExtractor(..., labels_file='foobar.txt'), then it could unexpectedly load foobar.txt from the current directory if such a file existed there.

The semantics should be either "path relative to the current directory" or "path relative to the dataset directory", not both. In fact, I'm considering changing the code back to use the former semantics, since it's more consistent with how paths are usually handled in APIs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I'm considering changing the code back to use the former semantics, since it's more consistent with how paths are usually handled in APIs.

Did that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assess which variant you like better:

datum add -f imagenet_txt /full/path/to/dataset -- --labels-file /full/path/to/dataset/labels.txt
datum add -f imagenet_txt /full/path/to/dataset -- --labels-file labels.txt

My preference is the second. It also allows to relocate the dataset without breaking import.

Imagine, there is an URL instead of the path.
datum add -f imagenet_txt https://my.storage/path/to/dataset/ -- --labels-file ???

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assess which variant you like better:

I don't think there's a clear winner here, really. Sure, it's nice to not repeat the dataset path, but then you can have situations like this:

datum add -f imagenet_txt a/b/c -- --labels-file d/e/f.txt

where there are two paths on the command line, and the first one is relative to the current directory, while the second one is relative to the first one. That seems like surprising behavior to me. I could see a user trying to load a labels file from the current directory and failing because of this.

It also allows to relocate the dataset without breaking import.

What do you mean? If the dataset is moved, then the import will be broken regardless, since the path to the dataset will be out of date, won't it?

Imagine, there is an URL instead of the path.

It is a compelling use case, but how would that work in the first place? You need to be able to get the list of files to load ImageNet, and you can't get it over HTTP.


All in all, I'm not married to either approach, so if you think that interpreting the path relative to the dataset path is the better option, I can do that. I just wanted to air my concerns first.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I changed my mind. I just saw that the image directory is already interpreted relative to the dataset. In that case, it obviously makes sense to do the same for the labels file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where there are two paths on the command line, and the first one is relative to the current directory, while the second one is relative to the first one. That seems like surprising behavior to me. I could see a user trying to load a labels file from the current directory and failing because of this.

That's why I suggested to do 2 file checks above - one for the local/absolute path, and another for the dataset-relative one. Actually, I don't see much sense in pointing to a file outside the dataset directory.

What do you mean? If the dataset is moved, then the import will be broken regardless, since the path to the dataset will be out of date, won't it?

I meant it in the scope of copying the dataset inside the project directory. Once it is copied, it is relocated, and then it can be relocated further (together with the project). The dataset is also going to be in an arbitrary places after it is downloaded, or materialized in the project cache.

It is a compelling use case, but how would that work in the first place? You need to be able to get the list of files to load ImageNet, and you can't get it over HTTP.

OK, it is a good point. Let's change HTTP to any CSP then - s3, azure, etc.

I added a separate `find_sources_with_params` function instead of just
adding `extra_params` to `find_sources`, because this way we don't have to
update every single importer with the new parameters (and also any
third-party importers that might already exist don't have to be rewritten).
Roman Donchenko added 2 commits August 30, 2021 20:25
This will be needed to enable Accuracy Checker integration, since Accuracy
Checker is separately configured with the annotation and class file paths.
@IRDonch IRDonch merged commit 675826c into openvinotoolkit:develop Aug 31, 2021
@IRDonch IRDonch deleted the imagenet-custom-labels-file branch August 31, 2021 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants