
Add validator for classification & detection tasks #160

Merged: 8 commits merged from add-validator-classification-detection into develop on Mar 19, 2021

Conversation

seongjun-park-dl
Contributor

Summary

This PR includes:

  • Validator for classification and detection tasks.
  • CLI support for validator.

How to test

python -m unittest -v tests/test_validator.py

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@zhiltsov-max
Contributor

  1. Generally, looks ok. I've tried to validate VOC and this is what I got:
"summary": {
        "errors": 319372,
        "warnings": 81169
    },
  2. The other thing I found is:
{
            "anomaly_type": "MissingBboxAnnotation",
            "description": "Item needs one or more bounding box annotations, but not found.",
            "item_id": "2007_000032",
            "severity": "warning"
        },

I think it would be useful to add the item subset as well (or group results by subset).

  3. This check
{
            "anomaly_type": "FarFromLabelMean",
            "description": "Bounding box annotation '11' in the item has a value of bounding box 'y' that is too far from the label average. (mean of 'person' label: 108.82, got '286.0').",
            "item_id": "2008_004948",
            "severity": "warning"
        },

looks quite strange for the 'x' and 'y' bbox attributes - why should images in a dataset have anything in common there? 'width' and 'height' are more or less understandable, but it might be better to check against clusters instead of a single average value. Another thought is to allow setting acceptable boundaries.
Also, 'long' and 'short' are hard to understand in a bbox context.

  4. This check
{
            "anomaly_type": "UndefinedAttribute",
            "description": "Item has the attribute 'difficult' for the label 'person' which is not defined in metadata.",
            "item_id": "2011_002650",
            "severity": "error"
        },

should consider attributes in LabelCategories.attributes as general, i.e. applicable to all labels. Label-specific ones are written in each label.

  5. As a further step, I'd recommend thinking about providing an option to control which checks are enabled.
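To illustrate the cluster idea from the FarFromLabelMean point above: a single global mean penalizes datasets whose bbox properties are naturally multimodal, while flagging a value only when it is far from every cluster avoids that. This is just a sketch, not Datumaro's actual implementation; the function name, the tolerance parameter, and the toy data are all made up.

```python
from statistics import mean

def far_from_clusters(value, cluster_centers, tolerance):
    """Flag a value only if it is far from *every* cluster center.

    `tolerance` stands in for the user-set "acceptable boundaries"
    suggested in the review; `cluster_centers` could come from any
    clustering of the per-label values (e.g. k-means).
    """
    return all(abs(value - c) > tolerance for c in cluster_centers)

# Toy example: 'person' box widths form two clusters, close-up (~50 px)
# and distant (~300 px). A global mean (~177) would flag both groups.
small = [48, 52, 55]
large = [290, 305, 310]
centers = [mean(small), mean(large)]

assert not far_from_clusters(60, centers, tolerance=20)   # near the small cluster
assert far_from_clusters(170, centers, tolerance=20)      # far from both clusters
```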

@jihyeonyi
Contributor

For the 2nd comment (grouping missing bbox cases), your idea seems really nice, but I don't think we can apply the grouping function in this PR.
For now, the concept of the validator is just to list all the warnings verbosely, so the user has to group or filter items by parsing the validation results file.
I can't think of a nice interface (or process) for that yet.
We could implement it as a filtering function, a transformer, or a utility function; I don't know which is best.
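Until such a helper exists, users can do the grouping themselves when parsing the results file. A minimal sketch, assuming anomaly entries shaped like the examples pasted in this thread and carrying the proposed "subset" field (the helper name is hypothetical):

```python
from collections import defaultdict

def group_by_subset(anomalies):
    """Group anomaly entries by their 'subset' field.

    A hypothetical post-processing helper; entries lacking the
    field fall into a 'default' bucket.
    """
    grouped = defaultdict(list)
    for anomaly in anomalies:
        grouped[anomaly.get("subset", "default")].append(anomaly)
    return dict(grouped)

# Entries shaped like the examples in this conversation:
anomalies = [
    {"anomaly_type": "MissingBboxAnnotation", "item_id": "2007_000032",
     "subset": "train", "severity": "warning"},
    {"anomaly_type": "UndefinedAttribute", "item_id": "2011_002650",
     "subset": "val", "severity": "error"},
]
by_subset = group_by_subset(anomalies)
# by_subset["train"] holds one warning, by_subset["val"] one error
```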

@jihyeonyi
Contributor

For the 3rd comment, I agree with you.
As for short (= min(w, h)) and long (= max(w, h)), some researchers consider them important for the detection task, so we included those terms.

For the 4th comment, we didn't know that. Seongjun will fix it.

And for the last comment, we'll consider it later, because it is a design issue.

@seongjun-park-dl
Contributor Author

I've addressed the 3rd and 4th comments in my most recent commit.

@zhiltsov-max
Contributor

zhiltsov-max commented Mar 18, 2021

> For now, the concept of validator is just listing all the warnings verbosely, so the user should group or filter items by parsing the validation results file.

I mean, it would simplify/clarify things if error messages contained a "subset" field, because a single id can (theoretically) be found in several subsets. And even if subsets do not share item ids, it still makes error localization simpler.

{
    "anomaly_type": "...",
    "description": "...",
    "item_id": "2007_000032",
    "subset": "train",
    "severity": "..."
},
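With a "subset" field in the entry, resolving an anomaly to its dataset item becomes a direct lookup instead of a search across subsets. A toy sketch of the ambiguity (plain dictionaries standing in for a real Datumaro dataset):

```python
# The same id can (theoretically) appear in several subsets, so the
# id alone would match two items here; the (id, subset) pair is unique.
items = {
    ("2007_000032", "train"): {"annotations": []},
    ("2007_000032", "val"): {"annotations": ["bbox"]},
}

def locate(anomaly, items):
    """Resolve an anomaly entry to its dataset item via (id, subset)."""
    return items[(anomaly["item_id"], anomaly["subset"])]

anomaly = {"anomaly_type": "MissingBboxAnnotation",
           "item_id": "2007_000032", "subset": "train",
           "severity": "warning"}
assert locate(anomaly, items)["annotations"] == []  # the offending train item
```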

@jihyeonyi
Contributor

> For now, the concept of validator is just listing all the warnings verbosely, so the user should group or filter items by parsing the validation results file.
>
> I mean, it would simplify/clarify things, if error messages contained a "subset" field, because a single id can (theoretically) be found in several subsets. And even if subsets do not have same item ids, it still makes the process of error localization simpler.
>
> {
>     "anomaly_type": "...",
>     "description": "...",
>     "item_id": "2007_000032",
>     "subset": "train",
>     "severity": "..."
> },

Oh, I understand now. It's not difficult to add the "subset" information, and it seems useful.

@seongjun-park-dl
Contributor Author

seongjun-park-dl commented Mar 19, 2021

> For now, the concept of validator is just listing all the warnings verbosely, so the user should group or filter items by parsing the validation results file.
>
> I mean, it would simplify/clarify things, if error messages contained a "subset" field, because a single id can (theoretically) be found in several subsets. And even if subsets do not have same item ids, it still makes the process of error localization simpler.
>
> {
>     "anomaly_type": "...",
>     "description": "...",
>     "item_id": "2007_000032",
>     "subset": "train",
>     "severity": "..."
> },

Added the subset field to the error messages!

zhiltsov-max previously approved these changes Mar 19, 2021
@zhiltsov-max zhiltsov-max merged commit cdd5184 into develop Mar 19, 2021
@zhiltsov-max zhiltsov-max deleted the add-validator-classification-detection branch March 24, 2021 11:04