Add Sampler Plugin #115

harimkang · 2021-02-25T07:20:53Z

Summary

This PR includes

Adding a Sampler plugin that analyzes the inference results for the given dataset and selects the ‘best’ and the ‘least amount of’ samples for annotation
Sampling with an entropy-based algorithm
Supporting a CLI for the sampler
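
As a rough illustration of the entropy-based idea, here is a minimal plain-Python sketch. It is not the plugin's code (the plugin works on pandas data frames); the function names and the sample data are hypothetical. The `rand_top_k` helper follows the `randtopk` description from the docs: draw 3 times k items at random, then keep the top k among them.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (natural log) of a class-probability list."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_k(scores, k):
    """Return up to k item ids with the highest uncertainty."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def rand_top_k(scores, k, seed=None):
    """Randomly draw up to 3*k ids, then keep the k most uncertain of them."""
    rng = random.Random(seed)
    ids = sorted(scores)
    pool = rng.sample(ids, min(3 * k, len(ids)))
    return sorted(pool, key=scores.get, reverse=True)[:k]


# Hypothetical inference results: item id -> class probabilities.
inference = {
    "img_a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "img_b": [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    "img_c": [0.70, 0.20, 0.10],
}
uncertainty = {item_id: entropy(p) for item_id, p in inference.items()}
print(top_k(uncertainty, 2))
```

Asking for more samples than there are items simply returns every item, since the slice stops at the end of the ranking.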

How to test

Unittest

python3 -m unittest -v tests/test_sampler.py

Testing sampling with dataset (After obtaining the inference result)

Note: each DatasetItem is assumed to have annotations, each annotation is assumed to have a 'score' key in its attributes, and the value of that key must be a list of probabilities for all classes in the data.

$ pip install .
$ datum project create -o proj
$ datum source add path <path-to-source> -f <dataset-format> -p proj
$ datum model add -l openvino -p proj -- -d <path-to-model.xml> -w <path-to-model.bin> -i <interpreter-file-path>
$ datum model run -p proj -m <model-name>
$ datum transform -p proj-inference -t sampler -- \
               -algo <algorithm-name> \
               -subset_name <subset-name> \
               -sample_name <sampled-set-name> \
               -unsample_name <unsampled-set-name> \
               -m <sampling-method> \
               -k <num-of-samples>

Checklist

I submit my changes into the develop branch
I have added description of my changes into CHANGELOG
I have updated the documentation accordingly
I have added tests to cover my changes
I have linked related issues

License

I submit my code changes under the same MIT License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below)

# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

zhiltsov-max

Hi, thanks for the contribution!
Please try to keep lines no longer than 80 characters. Consider using enums instead of string constants.

zhiltsov-max · 2021-02-25T13:33:28Z

datumaro/plugins/sampler/algorithm/entropy.py

+                temp_rank = temp_rank[-k:]
+            elif method == "randk":
+                return self.data.sample(n=k).reset_index(drop=True)
+            elif method in ["mixk", "randtopk"]:


Consider using an enum instead of string constants.
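
A minimal sketch of what this suggestion could look like. The member names come from the method strings used in this PR; the class name `SamplingMethod` and the helper `parse_method` are hypothetical and not taken from the merged code.

```python
from enum import Enum, auto


class SamplingMethod(Enum):
    # Member names mirror the string constants from the diff.
    topk = auto()
    randk = auto()
    mixk = auto()
    randtopk = auto()


def parse_method(name):
    """Convert a user-supplied string into a SamplingMethod, failing early."""
    try:
        return SamplingMethod[name]
    except KeyError:
        valid = ", ".join(m.name for m in SamplingMethod)
        raise ValueError(
            f"Unknown method '{name}', expected one of: {valid}"
        ) from None


method = parse_method("randk")
if method in (SamplingMethod.mixk, SamplingMethod.randtopk):
    print("combined strategy")
elif method is SamplingMethod.randk:
    print("random sampling")
```

With an enum, a typo in a method name fails at the lookup instead of silently falling through an if/elif chain.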

zhiltsov-max · 2021-02-25T13:37:17Z

datumaro/plugins/sampler/sampler.py

+    def build_cmdline_parser(cls, **kwargs):
+        parser = super().build_cmdline_parser(**kwargs)
+        parser.add_argument(
+            "-algo",


Consider using only -a and --long-name variants.
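
For illustration, an argparse sketch with one short flag and one long name per option. `-m`, `-k` and `--algorithm` appear in this PR; the long names `--sampling-method` and `--count` are assumptions made for the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="sampler")
    # Each option gets a single-dash short flag and a double-dash long name.
    parser.add_argument("-a", "--algorithm", default="entropy",
                        choices=["entropy"], help="Sampling algorithm")
    parser.add_argument("-m", "--sampling-method", default="topk",
                        choices=["topk", "randk", "mixk", "randtopk"],
                        help="How to pick samples from the ranking")
    parser.add_argument("-k", "--count", type=int, required=True,
                        help="Number of samples to select")
    return parser


args = build_parser().parse_args(["-a", "entropy", "-m", "topk", "-k", "20"])
print(args.algorithm, args.sampling_method, args.count)
```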

zhiltsov-max · 2021-02-25T13:39:10Z

datumaro/plugins/sampler/sampler.py

+                raise Exception(msg)
+
+        # 1. check the empty of the data set
+        if len(extractor) < 1:


I wouldn't consider this an error.

zhiltsov-max · 2021-02-25T13:42:34Z

datumaro/plugins/sampler/sampler.py

+
+        # 2. Import data into a subset name and convert it
+        # to a format that will be used in the sampler algorithm with the inference result.
+        data_df, infer_df = self._load_inference_from_subset(extractor, subset_name)


Please make the operation traverse the input dataset only when __iter__ is called.
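
A small sketch of the lazy pattern being asked for. `LazySampler` and `CountingSource` are hypothetical stand-ins, not Datumaro classes; the point is only that the constructor stores its inputs and the pass over the data happens inside `__iter__`.

```python
class LazySampler:
    """Hypothetical transform wrapper that defers all work to iteration."""

    def __init__(self, extractor, k):
        # Only remember the inputs here; no pass over the dataset yet.
        self._extractor = extractor
        self._k = k

    def __iter__(self):
        # The dataset is traversed only now, when a consumer iterates.
        items = list(self._extractor)
        ranked = sorted(items, key=lambda item: item["uncertainty"],
                        reverse=True)
        yield from ranked[: self._k]


class CountingSource:
    """Stand-in dataset that records how many times it was traversed."""

    def __init__(self, items):
        self._items = items
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self._items)


source = CountingSource([
    {"id": "a", "uncertainty": 0.1},
    {"id": "b", "uncertainty": 0.9},
])
sampler = LazySampler(source, k=1)
passes_before = source.passes  # still zero: construction read nothing
selected = [item["id"] for item in sampler]
print(passes_before, selected, source.passes)
```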

zhiltsov-max · 2021-02-25T13:43:51Z

datumaro/plugins/sampler/sampler.py

+        for data in subset:
+            data_df["ImageID"].append(data.id)
+
+            if data.image is None:


Please use item.has_image.

zhiltsov-max · 2021-02-25T13:45:12Z

datumaro/plugins/sampler/sampler.py

+            if data.image is None:
+                msg = f"Invalid data, some data.image is None"
+                raise Exception(msg)
+            width, height = data.image.size


size can return None in some cases (only path provided, but the image file is not available).

zhiltsov-max · 2021-02-25T13:46:06Z

datumaro/plugins/sampler/sampler.py

+        # Checking and creating algorithms
+        algorithms = ["entropy"]
+        if algorithm == "entropy":
+            from datumaro.plugins.sampler.algorithm.entropy import SampleEntropy


Use relative imports in plugins for intra-plugin imports.

harimkang · 2021-02-26T04:52:57Z

@zhiltsov-max Thank you for your review.

I modified the contents in commit 'code review update #1' (#f43b64f). If you have any questions after checking, please comment.

Also, the CI check is currently failing because the sampler uses pandas. If there is a guideline for adding external libraries, please let me know. For now, pandas is only added to requirements.txt.

zhiltsov-max · 2021-02-26T09:41:24Z

Plugin dependencies are considered optional for Datumaro, so putting them in requirements.txt is the right solution. To test the plugin, you can modify travis.yml and add the corresponding commands to install the plugin dependencies and run the tests. Make the tests in this plugin skip themselves if a dependency is missing, the same way we did it for TensorFlow.
The OpenVINO and Accuracy Checker plugins aren't covered by tests.
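
One possible way to skip tests when an optional dependency is missing, sketched with the standard `unittest.skipIf` decorator. The test class and its contents are hypothetical and do not reproduce tests/test_sampler.py.

```python
import unittest

try:
    import pandas  # optional plugin dependency
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False


@unittest.skipIf(not HAS_PANDAS, "pandas is not installed")
class SamplerTest(unittest.TestCase):
    def test_inference_table_has_expected_rows(self):
        frame = pandas.DataFrame(
            {"ImageID": ["a", "b"], "Uncertainty": [0.1, 0.9]}
        )
        self.assertEqual(len(frame), 2)
```

Run through `python3 -m unittest`, the class is reported as skipped rather than failed on machines without pandas.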

zhiltsov-max · 2021-02-26T16:39:58Z

datumaro/plugins/sampler/algorithm/entropy.py

+
+        # check the existence of "ImageID" in data & inference
+        if "ImageID" not in data:
+            msg = "Invalid Data, ImageID not found in data"


I'd suggest avoiding this pattern, because it scatters the reader's attention. Just do raise Exception("some text")

zhiltsov-max · 2021-02-26T16:44:05Z

datumaro/plugins/sampler/sampler.py

+    - Requesting a sample larger than the number of all images will return all images.|n
+    |n
+    Example:|n
+    |s|s%(prog)s -algo entropy -subset_name train -sample_name sample -unsampled_name unsampled -m topk -k 20


Please update this line.

zhiltsov-max · 2021-02-26T16:45:01Z

datumaro/plugins/sampler/sampler.py

+            "--algorithm",
+            type=str,
+            default="entropy",
+            choices=["entropy"],


I'd also introduce an enum for this.

zhiltsov-max · 2021-02-26T16:47:02Z

datumaro/plugins/sampler/sampler.py

+        infer_df = defaultdict(list)
+
+        # 2. Fill the data_df and infer_df to fit the sampler algorithm input format.
+        for data in subset:


I suggest calling it item.

zhiltsov-max · 2021-02-26T16:48:31Z

datumaro/plugins/sampler/sampler.py

+            data_df["ImageID"].append(data.id)
+
+            if not data.has_image or data.image.size is None:
+                msg = "Invalid data, the image file is not available"


In the error messages here it would be nice to also print data.id.
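
A sketch of such a check with the item id in the message. `FakeItem` and `FakeImage` are stand-ins that model only `id`, `has_image` and `image.size`; they are not Datumaro's classes, and the message wording is an assumption.

```python
class FakeImage:
    """Stand-in for an image whose size may be unknown (None)."""

    def __init__(self, size=None):
        self.size = size


class FakeItem:
    """Stand-in for a dataset item; only the fields used below are modeled."""

    def __init__(self, id, image=None):
        self.id = id
        self.image = image

    @property
    def has_image(self):
        return self.image is not None


def image_size(item):
    """Return the image size, naming the offending item on failure."""
    if not item.has_image or item.image.size is None:
        raise Exception(
            f"Invalid data, the image file is not available "
            f"for item '{item.id}'"
        )
    return item.image.size


print(image_size(FakeItem("img_1", FakeImage((4, 3)))))
```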

zhiltsov-max · 2021-02-26T16:49:45Z

docs/user_manual.md

+    - `randtopk`: First, select 3 times the number of k randomly, and return the topk among them.
+
+``` bash
+datum transform -t sampler -- \


Please update the example.

zhiltsov-max · 2021-02-26T16:50:39Z

tests/assets/sampler/inference.csv

@@ -0,0 +1,501 @@
+ImageID,ClassProbability1,ClassProbability2,ClassProbability3,Uncertainty


Can we reduce the example to few lines?

harimkang · 2021-03-02T01:41:15Z

@zhiltsov-max Thank you for the review.
I modified the contents in commit 'code review update #2' (#ccc460f).

Changed the pattern of exception messages
Updated multiple examples
Changed the algorithm input parameters from strings to enums
Renamed a variable (data -> item)
Made the image check exception also print data.id
Reduced the number of rows in the inference.csv used for the test (500 -> 30), corrected the test code accordingly, and verified that the tests pass
Modified travis.yml, as you advised

If you have any questions, please comment.
Have a nice day :)

…into harim/sampler

sampler initial commit

305d775

zhiltsov-max suggested changes Feb 25, 2021

harimkang added 3 commits February 25, 2021 23:18

update CHANGELOG.md

18bd3fc

update documentations

eff78a2

Adding pandas update requirements.txt

caeed04

code review update #1

f43b64f

zhiltsov-max suggested changes Feb 26, 2021

harimkang added 2 commits March 2, 2021 16:17

Merge branch 'develop' of https://github.com/openvinotoolkit/datumaro …

6c94db8

…into harim/sampler

code review update #2

ccc460f

zhiltsov-max merged commit 3bbf056 into openvinotoolkit:develop Mar 2, 2021
