-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
imagenet_txt_format: add support for a custom labels file path #434
imagenet_txt_format: add support for a custom labels file path #434
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check if it works through CLI? I suspect, it is the first real application for import parameters.
@@ -81,10 +87,13 @@ def _load_items(self, path): | |||
|
|||
class ImagenetTxtImporter(Importer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class ImagenetTxtImporter(Importer): | |
class ImagenetTxtImporter(Importer, CliPlugin): | |
@classmethod | |
def build_cmdline_parser(cls, **kwargs): | |
parser = super().build_cmdline_parser(**kwargs) | |
parser.add_argument('-l', '--labelmap', dest='labels_file', | |
help="Path to the labelmap (synsets) file") | |
return parser | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should the key be called --labelmap
when the parameter is labels_file
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call it how you see fits better. Label map is kind of a general term for such things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at existing occurrences of the term "label map" in the codebase, and it seems like that's usually used to describe maps of labels to colors (or vice versa). Whereas in this case we just have a list of labels, so I ended up going with --labels_file
. I also didn't add a short option, since I don't expect this option to be frequently used.
labels = self._parse_labels(labels) | ||
if not labels_file: | ||
labels_file = osp.join(osp.dirname(path), ImagenetTxtPath.LABELS_FILE) | ||
labels = self._parse_labels(labels_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, it would be useful to support passing just the filename instead of the full path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented (with the caveat that it's a relative path and not just the filename).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest labels_file or ImagenetTxtPath.LABELS_FILE
.
How about checking if there is just a file at the given path, then, if no, falling back to prepending a dataset directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about checking if there is just a file at the given path, then, if no, falling back to prepending a dataset directory?
I don't think that would lead to useful semantics, because then you would not be able to reliably use a dataset-relative path as the value of the parameter. If you wrote ImagenetTxtExtractor(..., labels_file='foobar.txt')
, then it could unexpectedly load foobar.txt
from the current directory if such a file existed there.
The semantics should be either "path relative to the current directory" or "path relative to the dataset directory", not both. In fact, I'm considering changing the code back to use the former semantics, since it's more consistent with how paths are usually handled in APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I'm considering changing the code back to use the former semantics, since it's more consistent with how paths are usually handled in APIs.
Did that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assess which variant you like better:
datum add -f imagenet_txt /full/path/to/dataset -- --labels-file /full/path/to/dataset/labels.txt
datum add -f imagenet_txt /full/path/to/dataset -- --labels-file labels.txt
My preference is the second. It also allows to relocate the dataset without breaking import.
Imagine, there is an URL instead of the path.
datum add -f imagenet_txt https://my.storage/path/to/dataset/ -- --labels-file ???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assess which variant you like better:
I don't think there's a clear winner here, really. Sure, it's nice to not repeat the dataset path, but then you can have situations like this:
datum add -f imagenet_txt a/b/c -- --labels-file d/e/f.txt
where there are two paths on the command line, and the first one is relative to the current directory, while the second one is relative to the first one. That seems like surprising behavior to me. I could see a user trying to load a labels file from the current directory and failing because of this.
It also allows to relocate the dataset without breaking import.
What do you mean? If the dataset is moved, then the import will be broken regardless, since the path to the dataset will be out of date, won't it?
Imagine, there is an URL instead of the path.
It is a compelling use case, but how would that work in the first place? You need to be able to get the list of files to load ImageNet, and you can't get it over HTTP.
All in all, I'm not married to either approach, so if you think that interpreting the path relative to the dataset path is the better option, I can do that. I just wanted to air my concerns first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I changed my mind. I just saw that the image directory is already interpreted relative to the dataset. In that case, it obviously makes sense to do the same for the labels file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where there are two paths on the command line, and the first one is relative to the current directory, while the second one is relative to the first one. That seems like surprising behavior to me. I could see a user trying to load a labels file from the current directory and failing because of this.
That's why I suggested to do 2 file checks above - one for the local/absolute path, and another for the dataset-relative one. Actually, I don't see much sense in pointing to a file outside the dataset directory.
What do you mean? If the dataset is moved, then the import will be broken regardless, since the path to the dataset will be out of date, won't it?
I meant it in the scope of copying the dataset inside the project directory. Once it is copied, it is relocated, and then it can be relocated further (together with the project). The dataset is also going to be in an arbitrary places after it is downloaded, or materialized in the project cache.
It is a compelling use case, but how would that work in the first place? You need to be able to get the list of files to load ImageNet, and you can't get it over HTTP.
OK, it is a good point. Let's change HTTP to any CSP then - s3, azure, etc.
I added a separate `find_sources_with_params` function instead of just adding `extra_params` to `find_sources`, because this way we don't have to update every single importer with the new parameters (and also any third-party importers that might already exist don't have to be rewritten).
This will be needed to enable Accuracy Checker integration, since Accuracy Checker is separately configured with the annotation and class file paths.
Summary
This will be needed to enable Accuracy Checker integration, since Accuracy Checker is separately configured with the annotation and class file paths.
As a prerequisite, also change the
Importer
interface so that importers can access extra parameters when searching for source files. I introduced a newfind_sources_with_params
function instead of adding a new parameter tofind_sources
, so that compatibility with old importers could be preserved.How to test
N/A
Checklist
develop
branchLicense
Feel free to contact the maintainers if that's a concern.