Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kate/splitter cli #81

Merged
merged 7 commits into from
Jan 14, 2021
Merged

Kate/splitter cli #81

merged 7 commits into from
Jan 14, 2021

Conversation

jihyeonyi
Copy link
Contributor

@jihyeonyi jihyeonyi commented Jan 12, 2021

Summary

This PR includes

  • supporting CLI for task-specific split
  • Revise re-identification split
  • Update documentation regarding the task-specific split

How to test

Unittest

$ python -m unittest -v tests/test_splitter.py

Testing classification split with imagenet dataset.

Notes: Imagenet doesn't support subsets but, checking subsets at the project level is enough here.

$ pip install .
$ datum project create -o imagenet
$ datum source add path <path-to-source> -f imagenet -p imagenet/
$ datum project transform -t classification_split -p imagenet/ -- --subset train:.5 --subset val:.2 --subset test:.3
$ datum project info -p imagenet-classification_split

Testing detection split with voc dataset

$ pip install .
$ datum project import -i <path-to-voc> -f voc
$ cd voc/
$ datum project transform -t detection_split -- --subset train:.5 --subset val:.2 --subset test:.3
$ datum project info -p voc-detection_split

Testing re-identification split with imagenet dataset.

Notes: Datumaro doesn't support re-id dataset now, so the classification dataset is used instead.

$ pip install .
$ datum project create -o imagenet
$ datum source add path <path-to-imagenet> -f imagenet -p imagenet/
$ datum project transform -t reidentification_split -p imagenet/ -- --subset train:.5 --subset val:.2 --subset test:.3 --query .5
$ datum project info -p imagenet-reidentification_split

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

zhiltsov-max
zhiltsov-max previously approved these changes Jan 13, 2021
datumaro/plugins/splitter.py Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
tests/test_splitter.py Outdated Show resolved Hide resolved
Copy link
Contributor

@zhiltsov-max zhiltsov-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the updated class descriptions for correctness.

Future updates could include:

  • ignoring attributes in classification split (for captions, descriptions and other technical attributes)
  • splitting using an attribute as label in classification split
  • using polygons and masks in detection split

README.md Show resolved Hide resolved
Comment on lines +232 to +233
Produces a split with a specified ratio of images, avoiding having same
labels in different subsets.|n
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we avoid having the same person id or object id. It could be label or attribute if attr_for_id is specified.

Copy link
Contributor Author

@jihyeonyi jihyeonyi Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing is, actually train and val set share person id or object id. (Most person re-identification data doesn't have val set though). But they do not share IDs with test set.
I'm not sure how accurate the explanation should be.
If you feel the current explanation is sufficient, please leave it as it is.

@jihyeonyi
Copy link
Contributor Author

Please check the updated class descriptions for correctness.

Future updates could include:

  • ignoring attributes in classification split (for captions, descriptions and other technical attributes)
  • splitting using an attribute as label in classification split
  • using polygons and masks in detection split

Thank you for revising the descriptions.
And for future updates,

  1. Would you like to remove the attribute-based splitting or just make it optional?
    I think the latter is better.
  2. When you say 'splitting using an attribute as label', do you mean splitting using only attributes, regardless of labels?
  3. Does the detection task have polygons or masks? I thought it is for the segmentation task. Maybe I'm wrong.
    For your information, I'll add a splitter for the segmentation task. So why don't you add polygons or masks later?

@zhiltsov-max
Copy link
Contributor

  1. Would you like to remove the attribute-based splitting or just make it optional?

Optional, enabled by default.

  1. When you say 'splitting using an attribute as label', do you mean splitting using only attributes, regardless of labels?

I mean using a single attribute, like in re-id. Maybe, using some subset of them / ignoring some attributes.

  1. Does the detection task have polygons or masks?

In Mask R-CNN they are intermixed with segmentation task. I, personally, consider these types of annotations more or less interchangeable, because all these types can be used for training a segmentation and a detection algorithm.

@zhiltsov-max zhiltsov-max merged commit 1ee908f into develop Jan 14, 2021
@zhiltsov-max zhiltsov-max deleted the kate/splitter-cli branch February 16, 2021 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants