imagenet_txt_format: add support for a custom labels file path #434

IRDonch · 2021-08-26T17:25:26Z

Summary

This will be needed to enable Accuracy Checker integration, since Accuracy Checker is separately configured with the annotation and class file paths.

As a prerequisite, also change the Importer interface so that importers can access extra parameters when searching for source files. I introduced a new find_sources_with_params function instead of adding a new parameter to find_sources, so that compatibility with old importers could be preserved.

How to test

N/A

Checklist

I submit my changes into the develop branch
I have added description of my changes into CHANGELOG
I have updated the documentation accordingly
I have added tests to cover my changes
I have linked related issues

License

I submit my code changes under the same MIT License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below)

# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

tests/test_imagenet_txt_format.py

datumaro/plugins/imagenet_txt_format.py

zhiltsov-max

Could you check if it works through CLI? I suspect, it is the first real application for import parameters.

tests/assets/imagenet_txt_dataset_custom_labels/train.txt

datumaro/plugins/imagenet_txt_format.py

tests/test_imagenet_txt_format.py

zhiltsov-max · 2021-08-30T09:47:23Z

datumaro/plugins/imagenet_txt_format.py

@@ -81,10 +87,13 @@ def _load_items(self, path):

 class ImagenetTxtImporter(Importer):


Suggested change

class ImagenetTxtImporter(Importer):

class ImagenetTxtImporter(Importer, CliPlugin):

@classmethod

def build_cmdline_parser(cls, **kwargs):

parser = super().build_cmdline_parser(**kwargs)

parser.add_argument('-l', '--labelmap', dest='labels_file',

help="Path to the labelmap (synsets) file")

return parser

Why should the key be called --labelmap when the parameter is labels_file?

Call it how you see fits better. Label map is kind of a general term for such things.

I looked at existing occurrences of the term "label map" in the codebase, and it seems like that's usually used to describe maps of labels to colors (or vice versa). Whereas in this case we just have a list of labels, so I ended up going with --labels_file. I also didn't add a short option, since I don't expect this option to be frequently used.

zhiltsov-max · 2021-08-30T09:48:47Z

datumaro/plugins/imagenet_txt_format.py

-            labels = self._parse_labels(labels)
+            if not labels_file:
+                labels_file = osp.join(osp.dirname(path), ImagenetTxtPath.LABELS_FILE)
+            labels = self._parse_labels(labels_file)


Probably, it would be useful to support passing just the filename instead of the full path.

Implemented (with the caveat that it's a relative path and not just the filename).

I suggest labels_file or ImagenetTxtPath.LABELS_FILE.

How about checking if there is just a file at the given path, then, if no, falling back to prepending a dataset directory?

How about checking if there is just a file at the given path, then, if no, falling back to prepending a dataset directory?

I don't think that would lead to useful semantics, because then you would not be able to reliably use a dataset-relative path as the value of the parameter. If you wrote ImagenetTxtExtractor(..., labels_file='foobar.txt'), then it could unexpectedly load foobar.txt from the current directory if such a file existed there.

The semantics should be either "path relative to the current directory" or "path relative to the dataset directory", not both. In fact, I'm considering changing the code back to use the former semantics, since it's more consistent with how paths are usually handled in APIs.

In fact, I'm considering changing the code back to use the former semantics, since it's more consistent with how paths are usually handled in APIs.

Did that.

Assess which variant you like better:

datum add -f imagenet_txt /full/path/to/dataset -- --labels-file /full/path/to/dataset/labels.txt
datum add -f imagenet_txt /full/path/to/dataset -- --labels-file labels.txt

My preference is the second. It also allows to relocate the dataset without breaking import.

Imagine, there is an URL instead of the path.
datum add -f imagenet_txt https://my.storage/path/to/dataset/ -- --labels-file ???

Assess which variant you like better:

I don't think there's a clear winner here, really. Sure, it's nice to not repeat the dataset path, but then you can have situations like this:

datum add -f imagenet_txt a/b/c -- --labels-file d/e/f.txt

where there are two paths on the command line, and the first one is relative to the current directory, while the second one is relative to the first one. That seems like surprising behavior to me. I could see a user trying to load a labels file from the current directory and failing because of this.

It also allows to relocate the dataset without breaking import.

What do you mean? If the dataset is moved, then the import will be broken regardless, since the path to the dataset will be out of date, won't it?

Imagine, there is an URL instead of the path.

It is a compelling use case, but how would that work in the first place? You need to be able to get the list of files to load ImageNet, and you can't get it over HTTP.

All in all, I'm not married to either approach, so if you think that interpreting the path relative to the dataset path is the better option, I can do that. I just wanted to air my concerns first.

Okay, I changed my mind. I just saw that the image directory is already interpreted relative to the dataset. In that case, it obviously makes sense to do the same for the labels file.

where there are two paths on the command line, and the first one is relative to the current directory, while the second one is relative to the first one. That seems like surprising behavior to me. I could see a user trying to load a labels file from the current directory and failing because of this.

That's why I suggested to do 2 file checks above - one for the local/absolute path, and another for the dataset-relative one. Actually, I don't see much sense in pointing to a file outside the dataset directory.

What do you mean? If the dataset is moved, then the import will be broken regardless, since the path to the dataset will be out of date, won't it?

I meant it in the scope of copying the dataset inside the project directory. Once it is copied, it is relocated, and then it can be relocated further (together with the project). The dataset is also going to be in an arbitrary places after it is downloaded, or materialized in the project cache.

It is a compelling use case, but how would that work in the first place? You need to be able to get the list of files to load ImageNet, and you can't get it over HTTP.

OK, it is a good point. Let's change HTTP to any CSP then - s3, azure, etc.

I added a separate `find_sources_with_params` function instead of just adding `extra_params` to `find_sources`, because this way we don't have to update every single importer with the new parameters (and also any third-party importers that might already exist don't have to be rewritten).

datumaro/plugins/imagenet_txt_format.py

This will be needed to enable Accuracy Checker integration, since Accuracy Checker is separately configured with the annotation and class file paths.

datumaro/plugins/imagenet_txt_format.py