
Implement public dataset downloading via TFDS #582

Merged: 10 commits merged into openvinotoolkit:develop on Dec 23, 2021
Conversation

@IRDonch commented on Dec 14, 2021

Summary

This adds a new download command, which downloads public datasets. In terms of syntax & semantics, it's comparable to the convert command, except that instead of a source directory and source format, it accepts a "dataset ID".

Currently, the downloading is done through an external library, TensorFlow Datasets. However, I expect that this might not always be the case: we might implement native downloading later, or add other download backends. Therefore, the dataset ID must begin with a namespace (currently, the only such namespace is tfds:) that signifies the download backend.

We could probably make use of the Datumaro plugin manager functionality to implement download backends, but I don't want to bother with it until there are at least two of them.

The way TFDS represents dataset items is much more flexible than the way Datumaro does it; an item can have features with arbitrary names and types. To map those features onto Datumaro's fixed annotation types, this code uses adapters (which are basically just sequences of predefined callbacks that each convert a certain feature or a metadata element into the Datumaro representation).
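As an illustration of the adapter mechanism described above, here is a minimal sketch; the class, function, and feature names below are assumptions for illustration, not the PR's actual code:

from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for the Datumaro item being built; the real code
# produces Datumaro's own DatasetItem and annotation objects.
@dataclass
class ItemBuilder:
    attributes: Dict[str, object] = field(default_factory=dict)
    annotations: List[object] = field(default_factory=list)

# A transform is a callback that maps one TFDS feature value onto the item.
FeatureTransform = Callable[[ItemBuilder, object], None]

@dataclass
class TfdsAdapterSketch:
    # feature name -> callback converting that feature into Datumaro terms
    transforms: Dict[str, FeatureTransform]

    def convert_example(self, tfds_example: Dict[str, object]) -> ItemBuilder:
        item = ItemBuilder()
        for name, value in tfds_example.items():
            transform = self.transforms.get(name)
            if transform is not None:
                transform(item, value)
        return item

# Example: an MNIST-like adapter that records the "label" feature as an attribute.
def set_label(item: ItemBuilder, value: object) -> None:
    item.attributes["label"] = value

mnist_adapter = TfdsAdapterSketch(transforms={"label": set_label})
item = mnist_adapter.convert_example({"label": 7})
assert item.attributes == {"label": 7}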

How to test

Try the download command. :-)
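As a rough sketch of such a manual test through the CLI entry point (the same one the tests import); the exact flag names of the new command are assumptions here, not taken from the diff:

# A sketch of exercising the new command programmatically; the "-i"/"-o"
# flag names and the output directory are assumptions for illustration.
from datumaro.cli.__main__ import main

exit_code = main(["download", "-i", "tfds:mnist", "-o", "mnist-dataset"])
assert exit_code == 0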

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

@IRDonch (Author) commented on Dec 16, 2021

Tests are now implemented.

This will be used by a future `download` command.

Note that TFDS does not explicitly depend on TensorFlow, and neither does
this extra. This is because there are multiple TF distributions
(`tensorflow`, `tensorflow-gpu`, `tf-nightly`), and not having an explicit
dependency allows the user to decide which distribution to install.
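For reference, such an extra can be declared roughly like this in setup.py; the extra's name, the package metadata, and the absence of version pins below are assumptions, not the PR's actual declaration:

# A sketch of an extras_require entry that pulls in TFDS without TensorFlow.
from setuptools import setup

setup(
    name="example-package",  # placeholder metadata, not the real project
    version="0.0.0",
    extras_require={
        # Depend on TFDS only; the user separately installs whichever TF
        # distribution they prefer (tensorflow, tensorflow-gpu, tf-nightly).
        "tf": ["tensorflow-datasets"],
    },
)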
@IRDonch marked this pull request as ready for review on December 17, 2021 at 14:17
@zhiltsov-max (Contributor) commented:

I failed to download the mnist dataset on Windows from behind a proxy unless I set the proxy environment variable. The same applies to pip. Maybe this should be documented somewhere, because it is not typical for Windows users to set environment variables manually.

Yes, I see that there is now a user cache directory for TFDS.
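For anyone hitting the same proxy issue, the standard proxy variables can also be set from Python before the download runs; this is a sketch with a placeholder proxy address, and it assumes the HTTP client used by TFDS honors these variables, as most Python HTTP stacks do:

import os

# Placeholder proxy address; replace with the real proxy host and port.
os.environ.setdefault("HTTP_PROXY", "http://proxy.example.com:3128")
os.environ.setdefault("HTTPS_PROXY", "http://proxy.example.com:3128")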

Roman Donchenko added 5 commits on December 20, 2021 at 18:52
@IRDonch (Author) commented on Dec 20, 2021

> I failed to download the mnist dataset on Windows from behind a proxy unless I set the proxy environment variable. The same applies to pip. Maybe this should be documented somewhere, because it is not typical for Windows users to set environment variables manually.

I guess that's fair, but where?

(Interestingly, TFDS itself doesn't seem to document it at all.)

@zhiltsov-max (Contributor) commented:

> I guess that's fair, but where?

I'll vote for just a comment in the online command docs.

@IRDonch (Author) commented on Dec 21, 2021

> > I guess that's fair, but where?
>
> I'll vote for just a comment in the online command docs.

Okay, I added one.

Use the format of the original dataset by default.

Rename `_TfdsAdapter.metadata_transformers` to `category_transformers` to
avoid confusion with `metadata`.
@@ -263,3 +265,37 @@ def compare_dirs(test, expected: str, actual: str):
def run_datum(test, *args, expected_code=0):
    from datumaro.cli.__main__ import main
    test.assertEqual(expected_code, main(args), str(args))


@contextlib.contextmanager
def mock_tfds_data(example=None):
A Contributor commented on this code:
We should probably place such functionality (mocks, fixtures, pytest-dependent helpers, etc.) in the tests/ directory. run_datum is also a good candidate for moving. This is not a blocker, but we should do it at some point.
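One possible shape for that move, sketched below; the file name, the fixture, and the placeholder body are assumptions, not part of this PR:

# tests/conftest.py -- a hypothetical home for shared test helpers.
import contextlib

import pytest


@contextlib.contextmanager
def mock_tfds_data(example=None):
    # Placeholder body; the real helper (added in this PR) mocks TFDS so that
    # tests do not hit the network.
    yield example


@pytest.fixture
def tfds_data():
    # Expose the helper as a pytest fixture for tests that prefer fixtures.
    with mock_tfds_data() as example:
        yield example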

@zhiltsov-max merged commit a28b32d into openvinotoolkit:develop on Dec 23, 2021
@IRDonch deleted the tfds-download branch on September 9, 2022