Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Sampler Plugin #115

Merged
merged 7 commits into from
Mar 2, 2021
Merged

Add Sampler Plugin #115

merged 7 commits into from
Mar 2, 2021

Conversation

harimkang
Copy link

@harimkang harimkang commented Feb 25, 2021

Summary

This PR includes

  • Adding Sampler plugin that analyzes inference result from the given dataset and selects the ‘best’ and the ‘least amount of’ samples for annotation
  • Sampling with entropy based algorithm method
  • Supporting CLI for sampler

How to test

Unittest

python3 -m unittest -v tests/test_sampler.py

Testing sampling with dataset (After obtaining the inference result)

Notes: DatasetItem is assumed to always have an annotations, each annotation has a 'score' key in its attributes, and the value of that key must have a probability list for all classes in the data.

$ pip install .
$ datum project create -o proj
$ datum source add path <path-to-source> -f <dataset-format> -p proj
$ datum model add -l openvino -p proj -- -d <path-to-model.xml> -w <path-to-model.bin> -i <interpreter-file-path>
$ datum model run -p proj -m <model-name>
$ datum transform -p proj-inference -t sampler -- 
               -algo <algorithm-name>
               -subset_name <subset-name> 
               -sample_name <sampled-set-name> 
               -unsample_name <unsampled-set-name> 
               -m <sampling-method> 
               -k <num-of-samples>

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

Copy link
Contributor

@zhiltsov-max zhiltsov-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the contribution!
Please try to keep lines no longer than 80 characters. Consider using enums instead of string constants.

temp_rank = temp_rank[-k:]
elif method == "randk":
return self.data.sample(n=k).reset_index(drop=True)
elif method in ["mixk", "randtopk"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using an enum for instead of string constants.

def build_cmdline_parser(cls, **kwargs):
parser = super().build_cmdline_parser(**kwargs)
parser.add_argument(
"-algo",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using only -a and --long-name variants.

raise Exception(msg)

# 1. check the empty of the data set
if len(extractor) < 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't consider this an error.


# 2. Import data into a subset name and convert it
# to a format that will be used in the sampler algorithm with the inference result.
data_df, infer_df = self._load_inference_from_subset(extractor, subset_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make the operation to be traversing the input dataset only when __iter__ is called.

for data in subset:
data_df["ImageID"].append(data.id)

if data.image is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use item.has_image.

if data.image is None:
msg = f"Invalid data, some data.image is None"
raise Exception(msg)
width, height = data.image.size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size can return None in some cases (only path provided, but the image file is not available).

# Checking and creating algorithms
algorithms = ["entropy"]
if algorithm == "entropy":
from datumaro.plugins.sampler.algorithm.entropy import SampleEntropy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use relative imports in plugins for intra-plugin imports.

@harimkang
Copy link
Author

harimkang commented Feb 26, 2021

@zhiltsov-max Thank you for your review.

I modified the contents in commit 'code review update #1' (#f43b64f). If you have any questions after checking, please comment.

And CI checking is currently failing because the sampler uses pandas. If there is a guide for the external library, please reply. Currently, it is only added to requirements.txt.

@zhiltsov-max
Copy link
Contributor

Plugin dependencies are considered optional for Datumaro, so putting them to the requirements.txt is the right solution. To test you can modify travis.yml and add corresponding commands to install plugin dependencies and test it. Make tests in this plugin skipped if they lack a dependency.
The same way we did it for TensorFlow. OpenVINO and Accuracy Checker plugins aren't covered by tests.


# check the existence of "ImageID" in data & inference
if "ImageID" not in data:
msg = "Invalid Data, ImageID not found in data"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to avoid this pattern, because it spreads reader's attention. Just do raise Exception("some text")

- Requesting a sample larger than the number of all images will return all images.|n
|n
Example:|n
|s|s%(prog)s -algo entropy -subset_name train -sample_name sample -unsampled_name unsampled -m topk -k 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this line.

"--algorithm",
type=str,
default="entropy",
choices=["entropy"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also introduce an enum for this.

infer_df = defaultdict(list)

# 2. Fill the data_df and infer_df to fit the sampler algorithm input format.
for data in subset:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest calling it item.

data_df["ImageID"].append(data.id)

if not data.has_image or data.image.size is None:
msg = "Invalid data, the image file is not available"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the error messages here it would be nice to also print data.id.

- `randtopk`: First, select 3 times the number of k randomly, and return the topk among them.

``` bash
datum transform -t sampler -- \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the example.

@@ -0,0 +1,501 @@
ImageID,ClassProbability1,ClassProbability2,ClassProbability3,Uncertainty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reduce the example to few lines?

@harimkang
Copy link
Author

harimkang commented Mar 2, 2021

@zhiltsov-max Thank you for review.
I modified the contents in commit code review update #2 (#ccc460f).

  1. Modify the pattern of exception messages
  2. Update multiple examples.
  3. Modify the algorithm input parameters String to Enum.
  4. Rename Variable (data->item)
  5. In case of data image check exception, modify to print data.id together
  6. Reduce the number of data in the inference.csv used for the test. (500->30) Correct the test code accordingly and complete the test verification
  7. As you advised, modify travis.yml

If you have any questions, please comment.
Have a nice day :)

@zhiltsov-max zhiltsov-max merged commit 3bbf056 into openvinotoolkit:develop Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants